Patents by Inventor Lacky Shah
Lacky Shah has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12248788
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Grant
Filed: March 10, 2022
Date of Patent: March 11, 2025
Assignee: NVIDIA Corporation
Inventors: Prakash Bangalore Prabhakar, Gentaro Hirota, Ronny Krashinsky, Ze Long, Brian Pharris, Rajballav Dash, Jeff Tuckey, Jerome F. Duluk, Jr., Lacky Shah, Luke Durant, Jack Choquette, Eric Werness, Naman Govil, Manan Patel, Shayani Deb, Sandeep Navada, John Edmondson, Greg Palmer, Wish Gandhi, Ravi Manyam, Apoorv Parle, Olivier Giroux, Shirish Gadre, Steve Heinrich
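The mechanism in this abstract amounts to an address-mapping layer that resolves a (peer block, offset) pair into another core's local scratchpad without going through global memory. Below is a minimal host-side C++ model of that idea; the names (SmBlock, Cluster, map_shared) are illustrative and are not taken from the patent or from any NVIDIA API.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// Illustrative model: each "processing core" owns a local scratchpad (its
// shared-memory block), and a cluster object maps (rank, offset) pairs to a
// reference inside any member's scratchpad, bypassing the global-memory path.
struct SmBlock {
    std::array<int, 256> scratch{};   // local shared-memory block
};

struct Cluster {
    std::vector<SmBlock> blocks;      // blocks distributed across cores
    explicit Cluster(int n) : blocks(n) {}

    // Resolve a remote shared-memory reference: the distributed-shared-memory
    // analogue of one core peeking into a peer core's local block.
    int& map_shared(int rank, int offset) { return blocks[rank].scratch[offset]; }
};

int main() {
    Cluster cluster(4);                            // 4 cooperating cores
    cluster.map_shared(0, 0) = 42;                 // core 0 writes its own block
    int seen_by_core3 = cluster.map_shared(0, 0);  // core 3 reads core 0's block
    std::printf("core 3 observed %d from core 0's block\n", seen_by_core3);
    return 0;
}
```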
-
Publication number: 20240289132
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Application
Filed: May 10, 2024
Publication date: August 29, 2024
Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
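The tracking circuitry described here can be pictured as a table keyed by memory address that records which destination blocks subscribed to a request, so the line is fetched from the L2 once and then delivered to every subscriber. The following C++ sketch is a simplified model of that bookkeeping, with invented names (MulticastTracker, request, fill); it is not the patented hardware design.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Illustrative model of the tracking circuitry: one requester issues a single
// read on behalf of several destination blocks; the table remembers which
// destinations subscribed so the line is pulled from the L2 only once.
struct MulticastTracker {
    std::map<unsigned long, std::vector<int>> pending;  // address -> subscriber ranks
    int l2_reads = 0;

    void request(unsigned long addr, const std::vector<int>& destinations) {
        bool first = pending.find(addr) == pending.end();
        auto& subs = pending[addr];
        subs.insert(subs.end(), destinations.begin(), destinations.end());
        if (first) ++l2_reads;                  // only the first request hits L2
    }

    void fill(unsigned long addr) {             // data returns from memory/L2
        for (int rank : pending[addr])
            std::printf("deliver line 0x%lx to block %d\n", addr, rank);
        pending.erase(addr);
    }
};

int main() {
    MulticastTracker t;
    t.request(0x1000, {0, 1, 2, 3});  // one thread requests on behalf of four blocks
    t.fill(0x1000);
    std::printf("L2 reads issued: %d\n", t.l2_reads);  // 1, not 4
    return 0;
}
```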
-
Patent number: 12020035
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Grant
Filed: March 10, 2022
Date of Patent: June 25, 2024
Assignee: NVIDIA CORPORATION
Inventors: Apoorv Parle, Ronny Krashinsky, John Edmondson, Jack Choquette, Shirish Gadre, Steve Heinrich, Manan Patel, Prakash Bangalore Prabhakar, Jr., Ravi Manyam, Wish Gandhi, Lacky Shah, Alexander L. Minkin
-
Patent number: 11954518
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to lack of resources to consume.
Type: Grant
Filed: December 20, 2019
Date of Patent: April 9, 2024
Assignee: Nvidia Corporation
Inventors: Jonathon Evans, Lacky Shah, Phil Johnson, Jonah Alben, Brian Pharris, Greg Palmer, Brian Fahs
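One way to read this abstract is as a watermark-driven priority boost: when a dependent task is about to run out of inputs, the task it depends on is temporarily raised above it. The C++ sketch below illustrates that policy under simplifying assumptions; the struct and field names are invented for the example.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative model: if a consumer task's input queue is close to empty, the
// producer it depends on gets a priority boost so the consumer never idles.
struct Task {
    const char* name;
    int priority;        // higher runs first
    int input_backlog;   // work items waiting to be consumed by this task
};

void rebalance(Task& producer, const Task& consumer, int low_watermark) {
    // If the consumer is about to starve, boost its producer so the consumer
    // is not forced into an idle state waiting for resources to consume.
    if (consumer.input_backlog < low_watermark)
        producer.priority = std::max(producer.priority, consumer.priority + 1);
}

int main() {
    Task producer{"producer", 1, 0};
    Task consumer{"consumer", 5, 2};   // only 2 items left to consume
    rebalance(producer, consumer, /*low_watermark=*/4);
    std::printf("producer priority raised to %d\n", producer.priority);  // 6
    return 0;
}
```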
-
Publication number: 20230289215
Abstract: A new level of hierarchy—Cooperative Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
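The co-scheduling guarantee can be illustrated as an all-or-nothing placement check: a CGA launches only if every CTA in it can be placed concurrently within one hardware partition. The C++ sketch below models that check; the SM slot counts and the placement loop are simplifications, not the hardware scheduler.

```cpp
#include <cstdio>
#include <vector>

// Illustrative model: a CGA (a grid of CTAs) is launched only when all of its
// CTAs can be placed at once on SMs inside one hardware partition; otherwise
// nothing is launched. All-or-nothing placement is what guarantees concurrency.
bool try_launch_cga(std::vector<int> free_slots, int num_ctas) {
    std::vector<int> placement;                            // SM index chosen per CTA
    for (int sm = 0; sm < (int)free_slots.size(); ++sm) {
        while (free_slots[sm] > 0 && (int)placement.size() < num_ctas) {
            --free_slots[sm];
            placement.push_back(sm);
        }
    }
    if ((int)placement.size() < num_ctas) return false;    // would break concurrency
    for (int cta = 0; cta < num_ctas; ++cta)
        std::printf("CTA %d co-scheduled on SM %d\n", cta, placement[cta]);
    return true;
}

int main() {
    std::printf("launched: %d\n", try_launch_cga({2, 1, 1}, 4));  // fits: 1
    std::printf("launched: %d\n", try_launch_cga({1, 1, 1}, 4));  // rejected: 0
    return 0;
}
```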
-
Publication number: 20230289189
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
-
Publication number: 20230289190
Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
-
Patent number: 11367160
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Grant
Filed: August 2, 2018
Date of Patent: June 21, 2022
Assignee: NVIDIA CORPORATION
Inventors: Rajballav Dash, Gregory Palmer, Gentaro Hirota, Lacky Shah, Jack Choquette, Emmett Kilgariff, Sriharsha Niverty, Milton Lei, Shirish Gadre, Omkar Paranjape, Lei Yang, Rouslan Dimitrov
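The arbiter's behaviour can be summarised as: stay with the favoured queue while its work keeps the unit busy, and admit the other queue once a monitored utilization metric falls below a user-configured threshold. The following C++ sketch models that decision; the metric, threshold value, and names are illustrative only.

```cpp
#include <cstdio>

// Illustrative model of the hardware arbiter: each SM favours one queue
// ("greedy" mode), but when the monitored utilization of the favoured work
// drops below a user-configured threshold, items from the other queue are
// allowed to run simultaneously.
enum class Mode { GraphicsGreedy, ComputeGreedy };
enum class Queue { Graphics, Compute };

Queue arbitrate(Mode mode, double favoured_utilization, double threshold) {
    bool yield = favoured_utilization < threshold;  // favoured work under-uses the SM
    if (mode == Mode::GraphicsGreedy)
        return yield ? Queue::Compute : Queue::Graphics;
    return yield ? Queue::Graphics : Queue::Compute;
}

int main() {
    Queue q = arbitrate(Mode::GraphicsGreedy, /*favoured_utilization=*/0.45,
                        /*threshold=*/0.60);
    std::printf("issue from %s queue\n", q == Queue::Compute ? "compute" : "graphics");
    return 0;
}
```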
-
Publication number: 20210191754
Abstract: Apparatuses, systems, and techniques to optimize processor resources at a user-defined level. In at least one embodiment, the priority of one or more tasks is adjusted to prevent one or more other dependent tasks from entering an idle state due to lack of resources to consume.
Type: Application
Filed: December 20, 2019
Publication date: June 24, 2021
Inventors: Jonathon Evans, Lacky Shah, Phil Johnson, Jonah Alben, Brian Pharris, Greg Palmer, Brian Fahs
-
Publication number: 20200043123
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Application
Filed: August 2, 2018
Publication date: February 6, 2020
Inventors: Rajballav DASH, Gregory PALMER, Gentaro HIROTA, Lacky SHAH, Jack CHOQUETTE, Emmett KILGARIFF, Sriharsha NIVERTY, Milton LEI, Shirish GADRE, Omkar PARANJAPE, Lei YANG, Rouslan DIMITROV
-
Patent number: 10402323
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Grant
Filed: October 28, 2015
Date of Patent: September 3, 2019
Assignee: NVIDIA CORPORATION
Inventors: Praveen Krishnamurthy, Peter B. Holmquist, Wishwesh Gandhi, Timothy Purcell, Karan Mehra, Lacky Shah
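The layer of indirection can be modelled as a directory that maps each virtual tile to a (physical tile, offset) pair, so several compressed virtual tiles share one physical tile at adjacent offsets and a single read of that physical tile covers all of them. The C++ sketch below is a toy version of that directory; the tile sizes and names are made up.

```cpp
#include <cstdio>
#include <initializer_list>
#include <map>
#include <utility>

// Illustrative model of the virtual-tile indirection: when the data for
// several virtual tiles compresses well, they are packed at adjacent offsets
// inside one physical tile, so a single read of that physical tile returns
// the compressed data for all of them.
struct TileMap {
    // virtual tile id -> (physical tile id, byte offset within it)
    std::map<int, std::pair<int, int>> dir;

    void pack_compressed(std::initializer_list<int> vtiles, int ptile, int bytes_each) {
        int offset = 0;
        for (int v : vtiles) { dir[v] = {ptile, offset}; offset += bytes_each; }
    }

    int read(int vtile) const {      // returns the physical tile actually fetched
        return dir.at(vtile).first;
    }
};

int main() {
    TileMap tm;
    tm.pack_compressed({0, 1, 2, 3}, /*ptile=*/7, /*bytes_each=*/64);  // 4:1 packing
    int fetched = tm.read(2);        // one physical read covers all four virtual tiles
    std::printf("virtual tile 2 served from physical tile %d\n", fetched);
    return 0;
}
```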
-
Patent number: 9934145
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Grant
Filed: October 28, 2015
Date of Patent: April 3, 2018
Assignee: NVIDIA Corporation
Inventors: Praveen Krishnamurthy, Peter B. Holmquist, Wishwesh Gandhi, Timothy Purcell, Karan Mehra, Lacky Shah
-
Publication number: 20170123977
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Application
Filed: October 28, 2015
Publication date: May 4, 2017
Inventors: Praveen KRISHNAMURTHY, Peter B. HOLMQUIST, Wishwesh GANDHI, Timothy PURCELL, Karan MEHRA, Lacky SHAH
-
Publication number: 20170123978
Abstract: In one embodiment of the present invention a cache unit organizes data stored in an attached memory to optimize accesses to compressed data. In operation, the cache unit introduces a layer of indirection between a physical address associated with a memory access request and groups of blocks in the attached memory. The layer of indirection—virtual tiles—enables the cache unit to selectively store compressed data that would conventionally be stored in separate physical tiles included in a group of blocks in a single physical tile. Because the cache unit stores compressed data associated with multiple physical tiles in a single physical tile and, more specifically, in adjacent locations within the single physical tile, the cache unit coalesces the compressed data into contiguous blocks. Subsequently, upon performing a read operation, the cache unit may retrieve the compressed data conventionally associated with separate physical tiles in a single read operation.
Type: Application
Filed: October 28, 2015
Publication date: May 4, 2017
Inventors: Praveen KRISHNAMURTHY, Peter B. HOLMQUIST, Wishwesh GANDHI, Timothy PURCELL, Karan MEHRA, Lacky SHAH
-
Publication number: 20150149733
Abstract: Method and system for supporting speculative modification in a data cache are provided and described. In one embodiment, a speculative cache buffer includes a plurality of cache lines and a plurality of state indicators. At least one of the cache lines is operable to receive an evicted cache line from a cache. The at least one of the cache lines is operable to return the evicted cache line to the cache if the cache requests the evicted cache line. Further, the plurality of state indicators is operable to indicate a state of a corresponding cache line of the cache lines.
Type: Application
Filed: January 14, 2011
Publication date: May 28, 2015
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
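A simple way to picture the speculative cache buffer: it accepts lines evicted from the main cache, remembers a state for each, and hands a line back if the cache asks for it again. The C++ model below captures only that behaviour; the state names and interface are invented for illustration.

```cpp
#include <cstdio>
#include <map>

// Illustrative model of the speculative cache buffer: lines evicted from the
// main cache land here with a state indicator, and are handed back if the
// cache requests them again before the speculation is resolved.
enum class LineState { Clean, SpeculativeDirty };

struct SpeculativeBuffer {
    struct Entry { int data; LineState state; };
    std::map<long, Entry> lines;                     // tag -> buffered line

    void accept_eviction(long tag, int data, LineState s) { lines[tag] = {data, s}; }

    bool return_to_cache(long tag, int& data_out) {  // cache re-requests the line
        auto it = lines.find(tag);
        if (it == lines.end()) return false;
        data_out = it->second.data;
        lines.erase(it);
        return true;
    }
};

int main() {
    SpeculativeBuffer buf;
    buf.accept_eviction(0x80, 1234, LineState::SpeculativeDirty);
    int data;
    if (buf.return_to_cache(0x80, data))
        std::printf("line 0x80 returned to cache with data %d\n", data);
    return 0;
}
```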
-
Patent number: 8893249
Abstract: In a system that partitions an application program into page segments, a minimal portion of the application program is installed on a client system. The client prefetches page segments from the application server or the application server pushes additional page segments to the client. The application server begins streaming the requested page segments to the client when it receives a valid access token from the client. The client performs server load balancing across a plurality of application servers. If the client observes a non-response or slow response condition from an application server or license server, it switches to another application or license server.
Type: Grant
Filed: November 26, 2012
Date of Patent: November 18, 2014
Assignee: Numecent Holdings, Inc.
Inventors: Daniel T. Arai, Sameer Panwar, Manuel E. Benitez, Anne M. Holler, Lacky Shah
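The client-side logic described here boils down to: present the access token, measure server responsiveness, and move on to the next server on a non-response or slow response. The C++ sketch below is a rough model of that failover loop; the server names, threshold, and token check are placeholders, not the patented protocol.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative model of the client behaviour: streaming of page segments
// starts only once a server accepts the access token, and the client fails
// over to another application server when it sees a non-response or a slow
// response condition.
struct Server { std::string name; int response_ms; bool reachable; };

const Server* pick_server(const std::vector<Server>& servers,
                          const std::string& token, int slow_threshold_ms) {
    bool token_ok = !token.empty();              // stand-in for token validation
    for (const Server& s : servers) {
        if (token_ok && s.reachable && s.response_ms < slow_threshold_ms)
            return &s;                           // stream from this server
        std::printf("switching away from %s\n", s.name.c_str());
    }
    return nullptr;
}

int main() {
    std::vector<Server> servers = {{"app-1", 900, true}, {"app-2", 40, true}};
    const Server* chosen = pick_server(servers, "valid-token", /*slow_threshold_ms=*/250);
    if (chosen) std::printf("streaming page segments from %s\n", chosen->name.c_str());
    return 0;
}
```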
-
Patent number: 8438298
Abstract: An intelligent network streaming and execution system for conventionally coded applications provides a system that partitions an application program into page segments by observing the manner in which the application program is conventionally installed. A minimal portion of the application program is installed on a client system and the user launches the application in the same ways that applications on other client file systems are started. An application program server streams the page segments to the client as the application program executes on the client and the client stores the page segments in a cache. Page segments are requested by the client from the application server whenever a page fault occurs from the cache for the application program. The client prefetches page segments from the application server or the application server pushes additional page segments to the client based on the pattern of page segment requests for that particular application.
Type: Grant
Filed: June 13, 2006
Date of Patent: May 7, 2013
Assignee: Endeavors Technologies, Inc.
Inventors: Daniel T. Arai, Sameer Panwar, Manuel E. Benitez, Anne M. Holler, Lacky Shah
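Demand streaming of this kind can be modelled as a segment cache on the client: a miss (page fault) triggers a fetch of the missing segment from the application server, and the observed access pattern drives prefetching of further segments. The C++ sketch below uses a trivial sequential prefetch as a stand-in for the pattern-based prediction the abstract describes.

```cpp
#include <cstdio>
#include <set>

// Illustrative model of demand streaming: a page fault against the local
// segment cache triggers a fetch from the application server, and a simple
// next-segment prefetch stands in for the pattern-based predictor.
struct SegmentCache {
    std::set<int> resident;     // page segments already streamed to the client
    int server_fetches = 0;

    void touch(int segment) {
        if (!resident.count(segment)) {          // page fault on the segment cache
            fetch(segment);
            fetch(segment + 1);                  // placeholder sequential prefetch
        }
    }

    void fetch(int segment) {
        if (resident.insert(segment).second) {
            ++server_fetches;
            std::printf("streamed segment %d from application server\n", segment);
        }
    }
};

int main() {
    SegmentCache cache;
    cache.touch(10);   // fault: fetches 10 and prefetches 11
    cache.touch(11);   // already resident thanks to the prefetch
    std::printf("server fetches: %d\n", cache.server_fetches);   // 2
    return 0;
}
```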
-
Patent number: 7873793
Abstract: Method and system for supporting speculative modification in a data cache are provided and described. A data cache comprises a plurality of cache lines. Each cache line includes a state indicator for indicating any one of a plurality of states, wherein the plurality of states includes a speculative state to enable keeping track of speculative modification to data in the respective cache line. The speculative state enables a speculative modification to the data in the respective cache line to be made permanent in response to a first operation performed upon reaching a particular instruction boundary during speculative execution of instructions. Further, the speculative state enables the speculative modification to the data in the respective cache line to be undone in response to a second operation performed upon failing to reach the particular instruction boundary during speculative execution of instructions.
Type: Grant
Filed: May 29, 2007
Date of Patent: January 18, 2011
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
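The commit/rollback behaviour tied to instruction boundaries can be illustrated with a per-line state machine: speculative stores save the old value, a commit at the boundary makes them ordinary dirty data, and a rollback restores the saved value. The C++ model below shows those transitions; it is a sketch, not the cache design in the patent.

```cpp
#include <cstdio>
#include <vector>

// Illustrative model: each cache line carries a state indicator; reaching the
// instruction boundary commits speculative modifications (they become plain
// dirty data), while failing to reach it rolls them back to the saved copy.
enum class State { Clean, Dirty, Speculative };

struct Line { int data; int saved; State state; };

void spec_store(Line& l, int value) {
    if (l.state != State::Speculative) { l.saved = l.data; l.state = State::Speculative; }
    l.data = value;
}

void commit(std::vector<Line>& lines) {                 // boundary reached
    for (Line& l : lines)
        if (l.state == State::Speculative) l.state = State::Dirty;
}

void rollback(std::vector<Line>& lines) {               // boundary not reached
    for (Line& l : lines)
        if (l.state == State::Speculative) { l.data = l.saved; l.state = State::Clean; }
}

int main() {
    std::vector<Line> cache = {{5, 5, State::Clean}};
    spec_store(cache[0], 99);
    rollback(cache);
    std::printf("after rollback: %d\n", cache[0].data);  // 5
    return 0;
}
```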
-
Patent number: 7725885
Abstract: The present invention relates to a mechanism for adaptive run-time compilation of traces of program code to achieve efficiency in compilation and overall execution. The mechanism uses a combination of interpretation and compilation in executing byte code. The mechanism selects code segments for compilation based upon frequency of execution and ease of compilation. The inventive mechanism is able to compile small code segments, which can be a subset of a method or subroutine and comprise only single-path execution. The mechanism thereby achieves efficiency in compilation by having less total code to compile, as well as only straight-line (single-path) code to compile.
Type: Grant
Filed: May 9, 2000
Date of Patent: May 25, 2010
Assignee: Hewlett-Packard Development Company, L.P.
Inventors: Salil Pradhan, Lacky Shah
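The policy in this abstract is essentially: interpret everything, count executions, and hand a trace to the compiler only when it is both hot and straight-line. The C++ sketch below models that selection rule; the hotness threshold and trace labels are arbitrary.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Illustrative model of the interpret-then-compile policy: traces are
// interpreted while an execution counter accumulates, and a trace is handed to
// the compiler only once it is both hot and a straight-line (single-path)
// sequence, which keeps the compiler's work small and cheap.
struct TraceJit {
    std::map<std::string, int> hits;
    int hot_threshold = 3;

    void execute(const std::string& trace, bool single_path) {
        if (++hits[trace] >= hot_threshold && single_path) {
            std::printf("compiling hot single-path trace: %s\n", trace.c_str());
            hits[trace] = 0;                     // compiled; stop counting
        } else {
            std::printf("interpreting: %s\n", trace.c_str());
        }
    }
};

int main() {
    TraceJit jit;
    for (int i = 0; i < 3; ++i) jit.execute("loop-body@0x40", /*single_path=*/true);
    jit.execute("branchy-block@0x80", /*single_path=*/false);  // never compiled
    return 0;
}
```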
-
Method and system for conservatively managing store capacity available to a processor issuing stores
Patent number: 7606979
Abstract: Method and system for conservatively managing store capacity available to a processor issuing stores are provided and described. In particular, a counter mechanism is utilized, wherein the counter mechanism is incremented or decremented based on the occurrence of particular events.
Type: Grant
Filed: December 12, 2006
Date of Patent: October 20, 2009
Inventors: Guillermo Rozas, Alexander Klaiber, David Dunn, Paul Serris, Lacky Shah
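Conservative store-capacity management can be pictured as a counter standing between the processor and the store buffer: issuing a store increments it, a drained store decrements it, and a new store is held back whenever the counter has reached the assumed capacity. The following C++ sketch models that counter; the capacity value and names are illustrative.

```cpp
#include <cstdio>

// Illustrative model of the conservative store-capacity counter: the counter
// moves up when a store is issued and down when one drains, and new stores are
// refused whenever the counter would exceed the conservatively assumed
// capacity of the downstream store buffer.
struct StoreCapacity {
    int outstanding = 0;
    int capacity;                     // conservative bound, not the true size

    explicit StoreCapacity(int cap) : capacity(cap) {}

    bool issue_store() {
        if (outstanding >= capacity) return false;  // hold the store back
        ++outstanding;
        return true;
    }
    void store_drained() { if (outstanding > 0) --outstanding; }
};

int main() {
    StoreCapacity sc(2);
    bool a = sc.issue_store();
    bool b = sc.issue_store();
    bool c = sc.issue_store();                        // capacity reached: refused
    std::printf("issued: %d %d %d\n", a, b, c);       // 1 1 0
    sc.store_drained();
    std::printf("after drain: %d\n", sc.issue_store());  // accepted again
    return 0;
}
```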