Patents by Inventor Jack Choquette

Jack Choquette has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11966737
    Abstract: Systems and methods for an efficient and robust multiprocessor-coprocessor interface that may be used between a streaming multiprocessor and an acceleration coprocessor in a GPU are provided. According to an example implementation, in order to perform an acceleration of a particular operation using the coprocessor, the multiprocessor: issues a series of write instructions to write input data for the operation into coprocessor-accessible storage locations; issues an operation instruction to cause the coprocessor to execute the particular operation; and then issues a series of read instructions to read result data of the operation from coprocessor-accessible storage locations to multiprocessor-accessible storage locations.
    Type: Grant
    Filed: September 2, 2021
    Date of Patent: April 23, 2024
    Assignee: NVIDIA CORPORATION
    Inventors: Ronald Charles Babich, Jr., John Burgess, Jack Choquette, Tero Karras, Samuli Laine, Ignacio Llamas, Gregory Muthler, William Parsons Newhall, Jr.
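    A minimal sketch, in CUDA-style code, of the write/operate/read sequence this abstract describes. The coprocessor-accessible input, output, and trigger slots are hypothetical stand-ins; the patented interface issues dedicated instructions rather than the plain stores, loads, and polling used here.
    ```cuda
    // Hypothetical coprocessor-accessible storage: input slots, result slots,
    // and a trigger word. Real hardware would expose dedicated instructions.
    __device__ void coprocessor_op(volatile unsigned* copro_in,
                                   volatile unsigned* copro_out,
                                   volatile unsigned* copro_go,
                                   const unsigned* args, unsigned* results, int n)
    {
        for (int i = 0; i < n; ++i)      // 1) series of writes: stage the input data
            copro_in[i] = args[i];
        __threadfence();                 // make the inputs visible before triggering
        *copro_go = 1u;                  // 2) operation instruction: start the coprocessor
        while (*copro_go != 0u) { }      // completion poll stands in for hardware scoreboarding
        for (int i = 0; i < n; ++i)      // 3) series of reads: copy results back
            results[i] = copro_out[i];
    }
    ```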
  • Publication number: 20240118899
    Abstract: Apparatuses, systems, and techniques to adapt instructions in a SIMT architecture for execution on serial execution units. In at least one embodiment, a set of one or more threads is selected from a group of active threads associated with an instruction and the instruction is executed for the set of one or more threads on a serial execution unit.
    Type: Application
    Filed: February 3, 2023
    Publication date: April 11, 2024
    Inventors: Aditya Avinash Atluri, Jack Choquette, Carter Edwards, Olivier Giroux, Praveen Kumar Kaushik, Ronny Krashinsky, Rishkul Kulkarni, Konstantinos Kyriakopoulos
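    A sketch of the idea using standard warp intrinsics: the active threads for an instruction are peeled off one at a time and the work is executed serially for each. The function expensive_serial_unit is a hypothetical stand-in for an operation dispatched to a serial execution unit.
    ```cuda
    __device__ int expensive_serial_unit(int x) { return x * x; }  // stand-in for the serial unit's work

    __device__ int serialized_op(int value)
    {
        unsigned active = __activemask();          // set of active threads for this instruction
        int result = 0;
        while (active) {
            int leader = __ffs(active) - 1;        // select one thread from the set
            if ((threadIdx.x & 31) == leader)
                result = expensive_serial_unit(value);
            active &= active - 1;                  // remove it and continue with the rest
        }
        return result;
    }
    ```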
  • Patent number: 11907717
    Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
    Type: Grant
    Filed: February 8, 2023
    Date of Patent: February 20, 2024
    Assignee: NVIDIA Corporation
    Inventors: Andrew Kerr, Jack Choquette, Xiaogang Qiu, Omkar Paranjape, Poornachandra Rao, Shirish Gadre, Steven J. Heinrich, Manan Patel, Olivier Giroux, Alan Kaatz
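    CUDA's public memcpy_async API exposes a comparable capability: on hardware that supports it, the global-to-shared copy bypasses the calling threads' registers. A usage sketch (not the patented implementation itself), assuming the kernel is launched with 256 threads per block:
    ```cuda
    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    __global__ void tile_consumer(const float* global_data, float* out)
    {
        __shared__ float tile[256];                 // on-chip shared memory destination
        auto block = cg::this_thread_block();

        // Global -> shared copy with no register staging on supporting hardware.
        cg::memcpy_async(block, tile, global_data + blockIdx.x * 256, sizeof(float) * 256);
        cg::wait(block);                            // wait for the asynchronous copy to complete

        // Threads may now read the tile from shared memory multiple times.
        out[blockIdx.x * 256 + threadIdx.x] = 2.0f * tile[threadIdx.x];
    }
    ```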
  • Patent number: 11847508
    Abstract: Convergence of threads executing common code sections is facilitated using instructions inserted at strategic locations in computer code sections. The inserted instructions enable the threads in a warp or other group to cooperate with a thread scheduler to promote thread convergence.
    Type: Grant
    Filed: August 11, 2022
    Date of Patent: December 19, 2023
    Assignee: NVIDIA CORP.
    Inventors: Daniel Robert Johnson, Jack Choquette, Olivier Giroux, Michael Patrick McKeown, Mark Stephenson, Sana Damani
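    For illustration, __syncwarp() is the closest public CUDA primitive to an inserted convergence point: placed after a divergent region, it tells the scheduler to reconverge the named threads before the common code that follows. A sketch, assuming a fully active warp:
    ```cuda
    __global__ void converge_example(int* data)
    {
        unsigned mask = __activemask();      // threads entering the divergent region
        if ((threadIdx.x & 31) < 16)
            data[threadIdx.x] *= 2;          // divergent section A
        else
            data[threadIdx.x] += 1;          // divergent section B
        __syncwarp(mask);                    // inserted convergence point before the common section
        // Common code that relies on the warp having reconverged:
        data[threadIdx.x] += __shfl_sync(mask, data[threadIdx.x], 0);
    }
    ```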
  • Patent number: 11803380
    Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier provides for certain synchronization mechanisms such as, for example, “Arrive” and “Wait” to be split to allow for greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows for hardware components such as, for example, dedicated copy or direct-memory-access (DMA) engines to be synchronized with software-based threads.
    Type: Grant
    Filed: December 12, 2019
    Date of Patent: October 31, 2023
    Assignee: NVIDIA Corporation
    Inventors: Olivier Giroux, Jack Choquette, Ronny Krashinsky, Steve Heinrich, Xiaogang Qiu, Shirish Gadre
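    The cuda::barrier type in libcu++ exposes this split directly: arrive() and wait() are separate calls, so a thread can perform independent work between signalling and blocking. A device-side sketch (compute capability 7.0 or newer):
    ```cuda
    #include <cuda/barrier>
    #include <cuda/std/utility>
    #include <cooperative_groups.h>

    __global__ void split_barrier_example(float* data)
    {
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        auto block = cooperative_groups::this_thread_block();
        if (block.thread_rank() == 0)
            init(&bar, block.size());            // one barrier for the whole block
        block.sync();

        data[threadIdx.x] += 1.0f;               // phase-1 work other threads will read
        auto token = bar.arrive();               // "Arrive": signal that phase-1 work is done
        // ... independent work that does not touch other threads' data ...
        bar.wait(cuda::std::move(token));        // "Wait": block only when the data is needed
        data[threadIdx.x] += data[(threadIdx.x + 1) % blockDim.x];
    }
    ```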
  • Publication number: 20230315655
    Abstract: A new synchronization system synchronizes data exchanges between producer processes and consumer processes which may be on the same or different processors in a multiprocessor system. The synchronization incurs less than one roundtrip of latency (in some implementations, approximately 0.5 roundtrip times). A key aspect of the fast synchronization is that the producer's data store is followed without delay by the updating of a barrier on which the consumer is waiting.
    Type: Application
    Filed: March 10, 2022
    Publication date: October 5, 2023
    Inventors: Jack CHOQUETTE, Ronny KRASHINSKY, Timothy GUO, Carter EDWARDS, Steve HEINRICH, John EDMONDSON, Prakash Bangalore PRABHAKAR, Apoorv PARLE, JR., Manan PATEL, Olivier GIROUX, Michael PELLAUER
  • Publication number: 20230289398
    Abstract: This specification describes techniques for implementing matrix multiply and add (MMA) operations in graphics processing units (GPUs) and other processors. The implementations provide for a plurality of warps of threads to collaborate in generating the result matrix by enabling each thread to share its respective register files to be accessed by the datapaths associated with other threads in the group of warps. A state machine circuit controls a MMA execution among the warps executing on asynchronous computation units. A group MMA (GMMA) instruction provides for a descriptor to be provided as a parameter, where the descriptor may include information regarding size and formats of input data to be loaded into shared memory and/or the datapath.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Jack CHOQUETTE, Manan PATEL, Matt TYRLIK, Ronny KRASHINSKY
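    As a point of reference, the public warp-level WMMA API below shows how an MMA instruction consumes matrix tiles; the group MMA described in this publication goes further by coordinating a group of warps and passing a shared-memory descriptor, which has no direct equivalent in this API. A standard 16x16x16 example:
    ```cuda
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_16x16x16(const half* A, const half* B, float* C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // D = A * B + C, computed warp-wide
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }
    ```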
  • Publication number: 20230289189
    Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as backed by an L2 cache.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
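    CUDA exposes this capability through thread block clusters: a block can map a pointer into another block's shared memory and read or write it directly. A sketch assuming a cluster launch on hardware that supports clusters (compute capability 9.0) and 32 threads per block:
    ```cuda
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void __cluster_dims__(2, 1, 1) dsmem_example(int* out)
    {
        __shared__ int local[32];
        cg::cluster_group cluster = cg::this_cluster();

        local[threadIdx.x] = threadIdx.x;
        cluster.sync();                                       // make the stores visible cluster-wide

        unsigned peer = (cluster.block_rank() + 1) % cluster.num_blocks();
        int* remote = cluster.map_shared_rank(local, peer);   // pointer into the peer block's shared memory
        out[blockIdx.x * 32 + threadIdx.x] = remote[threadIdx.x];
        cluster.sync();                                       // keep peer blocks resident until reads complete
    }
    ```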
  • Publication number: 20230289190
    Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
  • Publication number: 20230289215
    Abstract: A new level of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
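    On the host side, CUDA exposes the co-scheduled grouping described here as thread block clusters, requested through a launch attribute. A sketch of such a launch (the kernel itself and the cluster size of 4 are illustrative; the grid dimension must be divisible by the cluster dimension):
    ```cuda
    #include <cuda_runtime.h>

    __global__ void clustered_kernel(float* data) { /* cooperating CTAs go here */ }

    cudaError_t launch_with_cluster(float* data, dim3 grid, dim3 block)
    {
        cudaLaunchConfig_t config = {};
        config.gridDim  = grid;
        config.blockDim = block;

        cudaLaunchAttribute attr = {};
        attr.id = cudaLaunchAttributeClusterDimension;
        attr.val.clusterDim.x = 4;                 // co-schedule blocks in groups of 4
        attr.val.clusterDim.y = 1;
        attr.val.clusterDim.z = 1;
        config.attrs    = &attr;
        config.numAttrs = 1;

        return cudaLaunchKernelEx(&config, clustered_kernel, data);
    }
    ```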
  • Publication number: 20230289304
    Abstract: A parallel processing unit comprises a plurality of processors each being coupled to a memory access hardware circuitry. Each memory access hardware circuitry is configured to receive, from the coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, wherein the memory access hardware circuit is one of a plurality of memory access circuitry each coupled to a respective one of the processors; and, in response to the memory access request, translate the coordinate of the multidimensional data structure into plural memory addresses for the multidimensional data structure and using the plural memory addresses, asynchronously transfer at least a portion of the multidimensional data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or an external memory.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
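    A conceptual sketch of the coordinate-to-address translation the abstract describes, for a row-major two-dimensional array; the hardware performs this translation (and the asynchronous transfer of the addressed region) itself, so the arithmetic below is purely illustrative:
    ```cuda
    // Translate a (row, col) coordinate into an element address for a
    // row-major 2-D array with the given row stride (in elements).
    __device__ const float* element_address(const float* base,
                                            size_t row_stride_elems,
                                            int row, int col)
    {
        return base + (size_t)row * row_stride_elems + (size_t)col;
    }
    ```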
  • Publication number: 20230289242
    Abstract: A new transaction barrier synchronization primitive enables executing threads and asynchronous transactions to synchronize across parallel processors. The asynchronous transactions may include transactions resulting from, for example, hardware data movement units such as direct memory access (DMA) units. A hardware synchronization circuit may provide for the synchronization primitive to be stored in a cache memory so that barrier operations may be accelerated by the circuit. A new wait mechanism reduces software overhead associated with waiting on a barrier.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Timothy GUO, Jack CHOQUETTE, Shirish GADRE, Olivier GIROUX, Carter EDWARDS, John EDMONDSON, Manan PATEL, Raghavan MADHAVAN, JR., Jessie HUANG, Peter NELSON, Ronny KRASHINSKY
  • Publication number: 20230289292
    Abstract: A parallel processing unit comprises a plurality of processors each being coupled to a memory access hardware circuitry. Each memory access hardware circuitry is configured to receive, from the coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, wherein the memory access hardware circuit is one of a plurality of memory access circuitry each coupled to a respective one of the processors; and, in response to the memory access request, translate the coordinate of the multidimensional data structure into plural memory addresses for the multidimensional data structure and using the plural memory addresses, asynchronously transfer at least a portion of the multidimensional data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or an external memory.
    Type: Application
    Filed: March 10, 2022
    Publication date: September 14, 2023
    Inventors: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
  • Publication number: 20230185570
    Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
    Type: Application
    Filed: February 8, 2023
    Publication date: June 15, 2023
    Inventors: Andrew KERR, Jack Choquette, Xiaogang Qiu, Omkar Paranjape, Poornachandra Rao, Shirish Gadre, Steven J. Heinrich, Manan Patel, Olivier Giroux, Alan Kaatz
  • Patent number: 11604649
    Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
    Type: Grant
    Filed: June 30, 2021
    Date of Patent: March 14, 2023
    Assignee: NVIDIA Corporation
    Inventors: Andrew Kerr, Jack Choquette, Xiaogang Qiu, Omkar Paranjape, Poornachandra Rao, Shirish Gadre, Steven J. Heinrich, Manan Patel, Olivier Giroux, Alan Kaatz
  • Publication number: 20230038061
    Abstract: Convergence of threads executing common code sections is facilitated using instructions inserted at strategic locations in computer code sections. The inserted instructions enable the threads in a warp or other group to cooperate with a thread scheduler to promote thread convergence.
    Type: Application
    Filed: August 11, 2022
    Publication date: February 9, 2023
    Applicant: NVIDIA Corp.
    Inventors: Daniel Robert Johnson, Jack Choquette, Olivier Giroux, Michael Patrick McKeown, Mark Stephenson, Sana Damani
  • Patent number: 11442795
    Abstract: Convergence of threads executing common code sections is facilitated using instructions inserted at strategic locations in computer code sections. The inserted instructions enable the threads in a warp or other group to cooperate with a thread scheduler to promote thread convergence.
    Type: Grant
    Filed: September 11, 2019
    Date of Patent: September 13, 2022
    Assignee: NVIDIA Corp.
    Inventors: Daniel Robert Johnson, Jack Choquette, Olivier Giroux, Michael Patrick McKeown, Mark Stephenson, Sana Damani
  • Patent number: 11392829
    Abstract: Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments enforce sparsity constraints for performing sparse matrix multiply-add instruction (MMA) operations. Deep neural networks can exhibit significant sparsity in the data used in operations, both in the activations and weights. The computational load can be reduced by excluding zero-valued data elements. A sparsity constraint is applied across all submatrices of a sparse matrix, providing fine-grained structured sparsity that is evenly distributed across the matrix. The matrix may then be compressed since a minimum number of elements of the matrix are known to have zero value. Matrix operations are then performed using these matrices.
    Type: Grant
    Filed: April 2, 2019
    Date of Patent: July 19, 2022
    Assignee: NVIDIA Corporation
    Inventors: Jeff Pool, Ganesh Venkatesh, Jorge Albericio Latorre, Jack Choquette, Ronny Krashinsky, John Tran, Feng Xie, Ming Y. Siu, Manan Patel
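    A host-side sketch of the fine-grained structured-sparsity constraint (often called 2:4 sparsity): within every group of four consecutive elements at most two are non-zero, so each group can be stored as two values plus two 2-bit position indices. The function below only illustrates the compressed layout; the hardware consumes such compressed operands directly in sparse MMA operations.
    ```cuda
    #include <cstdint>

    // Compress a dense array that already satisfies the 2:4 constraint.
    // n must be divisible by 4; groups with fewer than two non-zeros are
    // padded with zeros (their metadata slots are left as index 0).
    void compress_2of4(const float* dense, int n, float* values, uint8_t* meta)
    {
        for (int g = 0; g < n / 4; ++g) {
            int kept = 0;
            uint8_t m = 0;
            for (int i = 0; i < 4 && kept < 2; ++i) {
                float v = dense[4 * g + i];
                if (v != 0.0f) {
                    values[2 * g + kept] = v;
                    m |= (uint8_t)(i << (2 * kept));   // record which of the 4 lanes held the value
                    ++kept;
                }
            }
            for (; kept < 2; ++kept)
                values[2 * g + kept] = 0.0f;
            meta[g] = m;
        }
    }
    ```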
  • Patent number: 11367160
    Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
    Type: Grant
    Filed: August 2, 2018
    Date of Patent: June 21, 2022
    Assignee: NVIDIA CORPORATION
    Inventors: Rajballav Dash, Gregory Palmer, Gentaro Hirota, Lacky Shah, Jack Choquette, Emmett Kilgariff, Sriharsha Niverty, Milton Lei, Shirish Gadre, Omkar Paranjape, Lei Yang, Rouslan Dimitrov
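    A conceptual sketch of the arbitration decision only; the names, the single utilization metric, and the mode flag are all hypothetical, and the real mechanism is a hardware arbiter rather than software:
    ```cuda
    struct ArbiterState {
        bool  graphics_greedy;   // current mode of the processing unit
        float utilization;       // monitored performance/utilization metric
        float threshold;         // user-configured threshold
    };

    enum class Queue { Graphics, Compute };

    // While graphics-greedy, admit compute work when the metric drops below
    // the threshold; while compute-greedy, do the opposite.
    Queue select_next_work(const ArbiterState& s)
    {
        if (s.graphics_greedy)
            return (s.utilization < s.threshold) ? Queue::Compute : Queue::Graphics;
        return (s.utilization < s.threshold) ? Queue::Graphics : Queue::Compute;
    }
    ```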
  • Patent number: 11347668
    Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.
    Type: Grant
    Filed: July 6, 2020
    Date of Patent: May 31, 2022
    Assignee: NVIDIA Corporation
    Inventors: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
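    A conceptual sketch of the two pathways the abstract describes; every type and helper below is an illustrative stub, since the routing happens in hardware:
    ```cuda
    enum class Target { SharedMemory, GlobalOrLocal };

    struct Transaction { Target target; unsigned long long address; };

    bool tag_lookup_hits(unsigned long long) { return false; }  // stub for the tag processing pipeline
    void access_data_memory(unsigned long long) {}              // stub for the direct pathway
    void push_to_miss_fifo(const Transaction&) {}               // stub for the miss FIFO

    void route(const Transaction& t)
    {
        if (t.target == Target::SharedMemory)
            access_data_memory(t.address);     // shared-memory transactions use the direct pathway
        else if (tag_lookup_hits(t.address))
            access_data_memory(t.address);     // cache hit: reroute onto the direct pathway
        else
            push_to_miss_fifo(t);              // cache miss: wait until data returns from external memory
    }
    ```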