Patents by Inventor Lacky Shah
Lacky Shah has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12248788
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Grant
Filed: March 10, 2022
Date of Patent: March 11, 2025
Assignee: NVIDIA Corporation
Inventors: Prakash Bangalore Prabhakar, Gentaro Hirota, Ronny Krashinsky, Ze Long, Brian Pharris, Rajballav Dash, Jeff Tuckey, Jerome F. Duluk, Jr., Lacky Shah, Luke Durant, Jack Choquette, Eric Werness, Naman Govil, Manan Patel, Shayani Deb, Sandeep Navada, John Edmondson, Greg Palmer, Wish Gandhi, Ravi Manyam, Apoorv Parle, Olivier Giroux, Shirish Gadre, Steve Heinrich
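The mechanism in this abstract amounts to an address-mapping layer that resolves a (peer block, offset) pair into another core's local scratchpad without going through global memory. Below is a minimal host-side C++ model of that idea; the names (SmBlock, Cluster, map_shared) are illustrative and are not taken from the patent or from any NVIDIA API.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// Illustrative model: each "processing core" owns a local scratchpad (its
// shared-memory block), and a cluster object maps (rank, offset) pairs to a
// reference inside any member's scratchpad, bypassing the global-memory path.
struct SmBlock {
    std::array<int, 256> scratch{};   // local shared-memory block
};

struct Cluster {
    std::vector<SmBlock> blocks;      // blocks distributed across cores
    explicit Cluster(int n) : blocks(n) {}

    // Resolve a remote shared-memory reference: the distributed-shared-memory
    // analogue of one core peeking into a peer core's local block.
    int& map_shared(int rank, int offset) { return blocks[rank].scratch[offset]; }
};

int main() {
    Cluster cluster(4);                            // 4 cooperating cores
    cluster.map_shared(0, 0) = 42;                 // core 0 writes its own block
    int seen_by_core3 = cluster.map_shared(0, 0);  // core 3 reads core 0's block
    std::printf("core 3 observed %d from core 0's block\n", seen_by_core3);
    return 0;
}
```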
-
Publication number: 20240289132
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Application
Filed: May 10, 2024
Publication date: August 29, 2024
Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
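The tracking circuitry described here can be pictured as a table keyed by memory address that records which destination blocks subscribed to a request, so the line is fetched from the L2 once and then delivered to every subscriber. The following C++ sketch is a simplified model of that bookkeeping, with invented names (MulticastTracker, request, fill); it is not the patented hardware design.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Illustrative model of the tracking circuitry: one requester issues a single
// read on behalf of several destination blocks; the table remembers which
// destinations subscribed so the line is pulled from the L2 only once.
struct MulticastTracker {
    std::map<unsigned long, std::vector<int>> pending;  // address -> subscriber ranks
    int l2_reads = 0;

    void request(unsigned long addr, const std::vector<int>& destinations) {
        bool first = pending.find(addr) == pending.end();
        auto& subs = pending[addr];
        subs.insert(subs.end(), destinations.begin(), destinations.end());
        if (first) ++l2_reads;                  // only the first request hits L2
    }

    void fill(unsigned long addr) {             // data returns from memory/L2
        for (int rank : pending[addr])
            std::printf("deliver line 0x%lx to block %d\n", addr, rank);
        pending.erase(addr);
    }
};

int main() {
    MulticastTracker t;
    t.request(0x1000, {0, 1, 2, 3});  // one thread requests on behalf of four blocks
    t.fill(0x1000);
    std::printf("L2 reads issued: %d\n", t.l2_reads);  // 1, not 4
    return 0;
}
```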
-
Patent number: 12020035
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Grant
Filed: March 10, 2022
Date of Patent: June 25, 2024
Assignee: NVIDIA CORPORATION
Inventors: Apoorv Parle, Ronny Krashinsky, John Edmondson, Jack Choquette, Shirish Gadre, Steve Heinrich, Manan Patel, Prakash Bangalore Prabhakar, Jr., Ravi Manyam, Wish Gandhi, Lacky Shah, Alexander L. Minkin
-
Patent number: 11954518
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to lack of resources to consume.
Type: Grant
Filed: December 20, 2019
Date of Patent: April 9, 2024
Assignee: Nvidia Corporation
Inventors: Jonathon Evans, Lacky Shah, Phil Johnson, Jonah Alben, Brian Pharris, Greg Palmer, Brian Fahs
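One way to read this abstract is as a watermark-driven priority boost: when a dependent task is about to run out of inputs, the task it depends on is temporarily raised above it. The C++ sketch below illustrates that policy under simplifying assumptions; the struct and field names are invented for the example.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative model: if a consumer task's input queue is close to empty, the
// producer it depends on gets a priority boost so the consumer never idles.
struct Task {
    const char* name;
    int priority;        // higher runs first
    int input_backlog;   // work items waiting to be consumed by this task
};

void rebalance(Task& producer, const Task& consumer, int low_watermark) {
    // If the consumer is about to starve, boost its producer so the consumer
    // is not forced into an idle state waiting for resources to consume.
    if (consumer.input_backlog < low_watermark)
        producer.priority = std::max(producer.priority, consumer.priority + 1);
}

int main() {
    Task producer{"producer", 1, 0};
    Task consumer{"consumer", 5, 2};   // only 2 items left to consume
    rebalance(producer, consumer, /*low_watermark=*/4);
    std::printf("producer priority raised to %d\n", producer.priority);  // 6
    return 0;
}
```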
-
Publication number: 20230289215
Abstract: A new level of hierarchy—Cooperative Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
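The co-scheduling guarantee can be illustrated as an all-or-nothing placement check: a CGA launches only if every CTA in it can be placed concurrently within one hardware partition. The C++ sketch below models that check; the SM slot counts and the placement loop are simplifications, not the hardware scheduler.

```cpp
#include <cstdio>
#include <vector>

// Illustrative model: a CGA (a grid of CTAs) is launched only when all of its
// CTAs can be placed at once on SMs inside one hardware partition; otherwise
// nothing is launched. All-or-nothing placement is what guarantees concurrency.
bool try_launch_cga(std::vector<int> free_slots, int num_ctas) {
    std::vector<int> placement;                            // SM index chosen per CTA
    for (int sm = 0; sm < (int)free_slots.size(); ++sm) {
        while (free_slots[sm] > 0 && (int)placement.size() < num_ctas) {
            --free_slots[sm];
            placement.push_back(sm);
        }
    }
    if ((int)placement.size() < num_ctas) return false;    // would break concurrency
    for (int cta = 0; cta < num_ctas; ++cta)
        std::printf("CTA %d co-scheduled on SM %d\n", cta, placement[cta]);
    return true;
}

int main() {
    std::printf("launched: %d\n", try_launch_cga({2, 1, 1}, 4));  // fits: 1
    std::printf("launched: %d\n", try_launch_cga({1, 1, 1}, 4));  // rejected: 0
    return 0;
}
```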
-
Publication number: 20230289189
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
-
Publication number: 20230289190
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
-
Patent number: 11367160
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Grant
Filed: August 2, 2018
Date of Patent: June 21, 2022
Assignee: NVIDIA CORPORATION
Inventors: Rajballav Dash, Gregory Palmer, Gentaro Hirota, Lacky Shah, Jack Choquette, Emmett Kilgariff, Sriharsha Niverty, Milton Lei, Shirish Gadre, Omkar Paranjape, Lei Yang, Rouslan Dimitrov
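The arbiter's behaviour can be summarised as: stay with the favoured queue while its work keeps the unit busy, and admit the other queue once a monitored utilization metric falls below a user-configured threshold. The following C++ sketch models that decision; the metric, threshold value, and names are illustrative only.

```cpp
#include <cstdio>

// Illustrative model of the hardware arbiter: each SM favours one queue
// ("greedy" mode), but when the monitored utilization of the favoured work
// drops below a user-configured threshold, items from the other queue are
// allowed to run simultaneously.
enum class Mode { GraphicsGreedy, ComputeGreedy };
enum class Queue { Graphics, Compute };

Queue arbitrate(Mode mode, double favoured_utilization, double threshold) {
    bool yield = favoured_utilization < threshold;  // favoured work under-uses the SM
    if (mode == Mode::GraphicsGreedy)
        return yield ? Queue::Compute : Queue::Graphics;
    return yield ? Queue::Graphics : Queue::Compute;
}

int main() {
    Queue q = arbitrate(Mode::GraphicsGreedy, /*favoured_utilization=*/0.45,
                        /*threshold=*/0.60);
    std::printf("issue from %s queue\n", q == Queue::Compute ? "compute" : "graphics");
    return 0;
}
```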
-
Publication number: 20210191754
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to lack of resources to consume.
Type: Application
Filed: December 20, 2019
Publication date: June 24, 2021
Inventors: Jonathon Evans, Lacky Shah, Phil Johnson, Jonah Alben, Brian Pharris, Greg Palmer, Brian Fahs
-
Publication number: 20200043123
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Application
Filed: August 2, 2018
Publication date: February 6, 2020
Inventors: Rajballav DASH, Gregory PALMER, Gentaro HIROTA, Lacky SHAH, Jack CHOQUETTE, Emmett KILGARIFF, Sriharsha NIVERTY, Milton LEI, Shirish GADRE, Omkar PARANJAPE, Lei YANG, Rouslan DIMITROV
-
Patent number: 10402323
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Grant
Filed: October 28, 2015
Date of Patent: September 3, 2019
Assignee: NVIDIA CORPORATION
Inventors: Praveen Krishnamurthy, Peter B. Holmquist, Wishwesh Gandhi, Timothy Purcell, Karan Mehra, Lacky Shah
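The layer of indirection can be modelled as a directory that maps each virtual tile to a (physical tile, offset) pair, so several compressed virtual tiles share one physical tile at adjacent offsets and a single read of that physical tile covers all of them. The C++ sketch below is a toy version of that directory; the tile sizes and names are made up.

```cpp
#include <cstdio>
#include <initializer_list>
#include <map>
#include <utility>

// Illustrative model of the virtual-tile indirection: when the data for
// several virtual tiles compresses well, they are packed at adjacent offsets
// inside one physical tile, so a single read of that physical tile returns
// the compressed data for all of them.
struct TileMap {
    // virtual tile id -> (physical tile id, byte offset within it)
    std::map<int, std::pair<int, int>> dir;

    void pack_compressed(std::initializer_list<int> vtiles, int ptile, int bytes_each) {
        int offset = 0;
        for (int v : vtiles) { dir[v] = {ptile, offset}; offset += bytes_each; }
    }

    int read(int vtile) const {      // returns the physical tile actually fetched
        return dir.at(vtile).first;
    }
};

int main() {
    TileMap tm;
    tm.pack_compressed({0, 1, 2, 3}, /*ptile=*/7, /*bytes_each=*/64);  // 4:1 packing
    int fetched = tm.read(2);        // one physical read covers all four virtual tiles
    std::printf("virtual tile 2 served from physical tile %d\n", fetched);
    return 0;
}
```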
-
Patent number: 9934145
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Grant
Filed: October 28, 2015
Date of Patent: April 3, 2018
Assignee: NVIDIA Corporation
Inventors: Praveen Krishnamurthy, Peter B. Holmquist, Wishwesh Gandhi, Timothy Purcell, Karan Mehra, Lacky Shah
-
Publication number: 20170123977
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Application
Filed: October 28, 2015
Publication date: May 4, 2017
Inventors: Praveen KRISHNAMURTHY, Peter B. HOLMQUIST, Wishwesh GANDHI, Timothy PURCELL, Karan MEHRA, Lacky SHAH
-
Publication number: 20170123978
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Application
Filed: October 28, 2015
Publication date: May 4, 2017
Inventors: Praveen KRISHNAMURTHY, Peter B. HOLMQUIST, Wishwesh GANDHI, Timothy PURCELL, Karan MEHRA, Lacky SHAH
-
Publication number: 20150149733
Abstract: Method and system for supporting speculative modification in a data cache are provided and described. In one embodiment, a speculative cache buffer includes a plurality of cache lines and a plurality of state indicators. At least one of the cache lines is operable to receive an evicted cache line from a cache. The at least one of the cache lines is operable to return the evicted cache line to the cache if the cache requests the evicted cache line. Further, the plurality of state indicators is operable to indicate a state of a corresponding cache line of the cache lines.
Type: Application
Filed: January 14, 2011
Publication date: May 28, 2015
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
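A simple way to picture the speculative cache buffer: it accepts lines evicted from the main cache, remembers a state for each, and hands a line back if the cache asks for it again. The C++ model below captures only that behaviour; the state names and interface are invented for illustration.

```cpp
#include <cstdio>
#include <map>

// Illustrative model of the speculative cache buffer: lines evicted from the
// main cache land here with a state indicator, and are handed back if the
// cache requests them again before the speculation is resolved.
enum class LineState { Clean, SpeculativeDirty };

struct SpeculativeBuffer {
    struct Entry { int data; LineState state; };
    std::map<long, Entry> lines;                     // tag -> buffered line

    void accept_eviction(long tag, int data, LineState s) { lines[tag] = {data, s}; }

    bool return_to_cache(long tag, int& data_out) {  // cache re-requests the line
        auto it = lines.find(tag);
        if (it == lines.end()) return false;
        data_out = it->second.data;
        lines.erase(it);
        return true;
    }
};

int main() {
    SpeculativeBuffer buf;
    buf.accept_eviction(0x80, 1234, LineState::SpeculativeDirty);
    int data;
    if (buf.return_to_cache(0x80, data))
        std::printf("line 0x80 returned to cache with data %d\n", data);
    return 0;
}
```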
-
Patent number: 8893249
Abstract: In a system that partitions an application program into page segments, a minimal portion of the application program is installed on a client system. The client prefetches page segments from the application server or the application server pushes additional page segments to the client. The application server begins streaming the requested page segments to the client when it receives a valid access token from the client. The client performs server load balancing across a plurality of application servers. If the client observes a non-response or slow response condition from an application server or license server, it switches to another application or license server.
Type: Grant
Filed: November 26, 2012
Date of Patent: November 18, 2014
Assignee: Numecent Holdings, Inc.
Inventors: Daniel T. Arai, Sameer Panwar, Manuel E. Benitez, Anne M. Holler, Lacky Shah
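The client-side logic described here boils down to: present the access token, measure server responsiveness, and move on to the next server on a non-response or slow response. The C++ sketch below is a rough model of that failover loop; the server names, threshold, and token check are placeholders, not the patented protocol.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative model of the client behaviour: streaming of page segments
// starts only once a server accepts the access token, and the client fails
// over to another application server when it sees a non-response or a slow
// response condition.
struct Server { std::string name; int response_ms; bool reachable; };

const Server* pick_server(const std::vector<Server>& servers,
                          const std::string& token, int slow_threshold_ms) {
    bool token_ok = !token.empty();              // stand-in for token validation
    for (const Server& s : servers) {
        if (token_ok && s.reachable && s.response_ms < slow_threshold_ms)
            return &s;                           // stream from this server
        std::printf("switching away from %s\n", s.name.c_str());
    }
    return nullptr;
}

int main() {
    std::vector<Server> servers = {{"app-1", 900, true}, {"app-2", 40, true}};
    const Server* chosen = pick_server(servers, "valid-token", /*slow_threshold_ms=*/250);
    if (chosen) std::printf("streaming page segments from %s\n", chosen->name.c_str());
    return 0;
}
```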
-
Patent number: 8438298
Abstract: An intelligent network streaming and execution system for conventionally coded applications provides a system that partitions an application program into page segments by observing the manner in which the application program is conventionally installed. A minimal portion of the application program is installed on a client system and the user launches the application in the same ways that applications on other client file systems are started. An application program server streams the page segments to the client as the application program executes on the client and the client stores the page segments in a cache. Page segments are requested by the client from the application server whenever a page fault occurs from the cache for the application program. The client prefetches page segments from the application server or the application server pushes additional page segments to the client based on the pattern of page segment requests for that particular application.
Type: Grant
Filed: June 13, 2006
Date of Patent: May 7, 2013
Assignee: Endeavors Technologies, Inc.
Inventors: Daniel T. Arai, Sameer Panwar, Manuel E. Benitez, Anne M. Holler, Lacky Shah
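Demand streaming of this kind can be modelled as a segment cache on the client: a miss (page fault) triggers a fetch of the missing segment from the application server, and the observed access pattern drives prefetching of further segments. The C++ sketch below uses a trivial sequential prefetch as a stand-in for the pattern-based prediction the abstract describes.

```cpp
#include <cstdio>
#include <set>

// Illustrative model of demand streaming: a page fault against the local
// segment cache triggers a fetch from the application server, and a simple
// next-segment prefetch stands in for the pattern-based predictor.
struct SegmentCache {
    std::set<int> resident;     // page segments already streamed to the client
    int server_fetches = 0;

    void touch(int segment) {
        if (!resident.count(segment)) {          // page fault on the segment cache
            fetch(segment);
            fetch(segment + 1);                  // placeholder sequential prefetch
        }
    }

    void fetch(int segment) {
        if (resident.insert(segment).second) {
            ++server_fetches;
            std::printf("streamed segment %d from application server\n", segment);
        }
    }
};

int main() {
    SegmentCache cache;
    cache.touch(10);   // fault: fetches 10 and prefetches 11
    cache.touch(11);   // already resident thanks to the prefetch
    std::printf("server fetches: %d\n", cache.server_fetches);   // 2
    return 0;
}
```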
-
Patent number: 7873793
Abstract: Method and system for supporting speculative modification in a data cache are provided and described. A data cache comprises a plurality of cache lines. Each cache line includes a state indicator for indicating any one of a plurality of states, wherein the plurality of states includes a speculative state to enable keeping track of speculative modification to data in the respective cache line. The speculative state enables a speculative modification to the data in the respective cache line to be made permanent in response to a first operation performed upon reaching a particular instruction boundary during speculative execution of instructions. Further, the speculative state enables the speculative modification to the data in the respective cache line to be undone in response to a second operation performed upon failing to reach the particular instruction boundary during speculative execution of instructions.
Type: Grant
Filed: May 29, 2007
Date of Patent: January 18, 2011
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
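The commit/rollback behaviour tied to instruction boundaries can be illustrated with a per-line state machine: speculative stores save the old value, a commit at the boundary makes them ordinary dirty data, and a rollback restores the saved value. The C++ model below shows those transitions; it is a sketch, not the cache design in the patent.

```cpp
#include <cstdio>
#include <vector>

// Illustrative model: each cache line carries a state indicator; reaching the
// instruction boundary commits speculative modifications (they become plain
// dirty data), while failing to reach it rolls them back to the saved copy.
enum class State { Clean, Dirty, Speculative };

struct Line { int data; int saved; State state; };

void spec_store(Line& l, int value) {
    if (l.state != State::Speculative) { l.saved = l.data; l.state = State::Speculative; }
    l.data = value;
}

void commit(std::vector<Line>& lines) {                 // boundary reached
    for (Line& l : lines)
        if (l.state == State::Speculative) l.state = State::Dirty;
}

void rollback(std::vector<Line>& lines) {               // boundary not reached
    for (Line& l : lines)
        if (l.state == State::Speculative) { l.data = l.saved; l.state = State::Clean; }
}

int main() {
    std::vector<Line> cache = {{5, 5, State::Clean}};
    spec_store(cache[0], 99);
    rollback(cache);
    std::printf("after rollback: %d\n", cache[0].data);  // 5
    return 0;
}
```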
-
Patent number: 7725885
Abstract: The present invention relates to a mechanism for adaptive run-time compilation of traces of program code to achieve efficiency in compilation and overall execution. The mechanism uses a combination of interpretation and compilation in executing byte code. The mechanism selects code segments for compilation based upon frequency of execution and ease of compilation. The inventive mechanism is able to compile small code segments, which can be a subset of a method or subroutine and comprise only single-path execution. The mechanism thereby achieves efficiency in compilation by having less total code to compile, as well as only straight-line (single-path) code to compile.
Type: Grant
Filed: May 9, 2000
Date of Patent: May 25, 2010
Assignee: Hewlett-Packard Development Company, L.P.
Inventors: Salil Pradhan, Lacky Shah
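The policy in this abstract is essentially: interpret everything, count executions, and hand a trace to the compiler only when it is both hot and straight-line. The C++ sketch below models that selection rule; the hotness threshold and trace labels are arbitrary.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Illustrative model of the interpret-then-compile policy: traces are
// interpreted while an execution counter accumulates, and a trace is handed to
// the compiler only once it is both hot and a straight-line (single-path)
// sequence, which keeps the compiler's work small and cheap.
struct TraceJit {
    std::map<std::string, int> hits;
    int hot_threshold = 3;

    void execute(const std::string& trace, bool single_path) {
        if (++hits[trace] >= hot_threshold && single_path) {
            std::printf("compiling hot single-path trace: %s\n", trace.c_str());
            hits[trace] = 0;                     // compiled; stop counting
        } else {
            std::printf("interpreting: %s\n", trace.c_str());
        }
    }
};

int main() {
    TraceJit jit;
    for (int i = 0; i < 3; ++i) jit.execute("loop-body@0x40", /*single_path=*/true);
    jit.execute("branchy-block@0x80", /*single_path=*/false);  // never compiled
    return 0;
}
```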
-
Method and system for conservatively managing store capacity available to a processor issuing stores
Patent number: 7606979
Abstract: Method and system for conservatively managing store capacity available to a processor issuing stores are provided and described. In particular, a counter mechanism is utilized, wherein the counter mechanism is incremented or decremented based on the occurrence of particular events.
Type: Grant
Filed: December 12, 2006
Date of Patent: October 20, 2009
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
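Conservative store-capacity management can be pictured as a counter standing between the processor and the store buffer: issuing a store increments it, a drained store decrements it, and a new store is held back whenever the counter has reached the assumed capacity. The following C++ sketch models that counter; the capacity value and names are illustrative.

```cpp
#include <cstdio>

// Illustrative model of the conservative store-capacity counter: the counter
// moves up when a store is issued and down when one drains, and new stores are
// refused whenever the counter would exceed the conservatively assumed
// capacity of the downstream store buffer.
struct StoreCapacity {
    int outstanding = 0;
    int capacity;                     // conservative bound, not the true size

    explicit StoreCapacity(int cap) : capacity(cap) {}

    bool issue_store() {
        if (outstanding >= capacity) return false;  // hold the store back
        ++outstanding;
        return true;
    }
    void store_drained() { if (outstanding > 0) --outstanding; }
};

int main() {
    StoreCapacity sc(2);
    bool a = sc.issue_store();
    bool b = sc.issue_store();
    bool c = sc.issue_store();                        // capacity reached: refused
    std::printf("issued: %d %d %d\n", a, b, c);       // 1 1 0
    sc.store_drained();
    std::printf("after drain: %d\n", sc.issue_store());  // accepted again
    return 0;
}
```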