Patents by Inventor Jack Choquette

Jack Choquette has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

TECHNIQUES FOR EFFICIENTLY TRANSFERRING DATA TO A PROCESSOR

Publication number: 20210326137

Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.

Type: Application

Filed: June 30, 2021

Publication date: October 21, 2021

Inventors: Andrew KERR, Jack CHOQUETTE, Xiaogang QIU, Omkar PARANJAPE, Poornachandra RAO, Shirish GADRE, Steven J. HEINRICH, Manan PATEL, Olivier GIROUX, Alan KAATZ
Robust, efficient multiprocessor-coprocessor interface

Patent number: 11138009

Abstract: Systems and methods for an efficient and robust multiprocessor-coprocessor interface that may be used between a streaming multiprocessor and an acceleration coprocessor in a GPU are provided. According to an example implementation, in order to perform an acceleration of a particular operation using the coprocessor, the multiprocessor: issues a series of write instructions to write input data for the operation into coprocessor-accessible storage locations, issues an operation instruction to cause the coprocessor to execute the particular operation; and then issues a series of read instructions to read result data of the operation from coprocessor-accessible storage locations to multiprocessor-accessible storage locations.

Type: Grant

Filed: August 10, 2018

Date of Patent: October 5, 2021

Assignee: NVIDIA CORPORATION

Inventors: Ronald Charles Babich, Jr., John Burgess, Jack Choquette, Tero Karras, Samuli Laine, Ignacio Llamas, Gregory Muthler, William Parsons Newhall, Jr.
Techniques for efficiently transferring data to a processor

Patent number: 11080051

Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.

Type: Grant

Filed: December 12, 2019

Date of Patent: August 3, 2021

Assignee: NVIDIA Corporation

Inventors: Andrew Kerr, Jack Choquette, Xiaogang Qiu, Omkar Paranjape, Poornachandra Rao, Shirish Gadre, Steven J. Heinrich, Manan Patel, Olivier Giroux, Alan Kaatz
TECHNIQUES FOR EFFICIENTLY TRANSFERRING DATA TO A PROCESSOR

Publication number: 20210124582

Abstract: A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.

Type: Application

Filed: December 12, 2019

Publication date: April 29, 2021

Inventors: Andrew Kerr, Jack Choquette, Xiaogang Qiu, Omkar Paranjape, Poornachandra Rao, Shirish Gadre, Steven J. Heinrich, Manan Patel, Olivier Giroux, Alan Kaatz
HIGH PERFORMANCE SYNCHRONIZATION MECHANISMS FOR COORDINATING OPERATIONS ON A COMPUTER SYSTEM

Publication number: 20210124627

Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier provides for certain synchronization mechanisms such as, for example, “Arrive” and “Wait” to be split to allow for greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows for hardware components such as, for example, dedicated copy or direct-memory-access (DMA) engines to be synchronized with software-based threads.

Type: Application

Filed: December 12, 2019

Publication date: April 29, 2021

Inventors: Olivier GIROUX, Jack CHOQUETTE, Ronny KRASHINSKY, Steve HEINRICH, Xiaogang QIU, Shirish GADRE
Techniques for comprehensively synchronizing execution threads

Patent number: 10977037

Abstract: In one embodiment, a synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

Type: Grant

Filed: October 7, 2019

Date of Patent: April 13, 2021

Assignee: NVIDIA Corporation

Inventors: Ajay Sudarshan Tirumala, Olivier Giroux, Peter Nelson, Jack Choquette
Binding constants at runtime for improved resource utilization

Patent number: 10877757

Abstract: A just-in-time (JIT) compiler binds constants to specific memory locations at runtime. The JIT compiler parses program code derived from a multithreaded application and identifies an instruction that references a uniform constant. The JIT compiler then determines a chain of pointers that originates within a root table specified in the multithreaded application and terminates at the uniform constant. The JIT compiler generates additional instructions for traversing the chain of pointers and inserts these instructions into the program code. A parallel processor executes this compiled code and, in doing so, causes a thread to traverse the chain of pointers and bind the uniform constant to a uniform register at runtime. Each thread in a group of threads executing on the parallel processor may then access the uniform constant.

Type: Grant

Filed: February 14, 2018

Date of Patent: December 29, 2020

Assignee: NVIDIA Corporation

Inventors: Ajay Tirumala, Jack Choquette, Manan Patel, Shirish Gadre, Praveen Kaushik, Amanpreet Grewal, Shekhar Divekar, Andrei Khodakovsky
UNIFIED CACHE FOR DIVERSE MEMORY TRAFFIC

Publication number: 20200401541

Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.

Type: Application

Filed: July 6, 2020

Publication date: December 24, 2020

Inventors: Xiaogang QIU, Ronny KRASHINSKY, Steven HEINRICH, Shirish GADRE, John EDMONDSON, Jack CHOQUETTE, Mark GEBHART, Ramesh JANDHYALA, Poornachandra RAO, Omkar PARANJAPE, Michael SIU
Uniform register file for improved resource utilization

Patent number: 10866806

Abstract: A compiler parses a multithreaded application into cohesive blocks of instructions. Cohesive blocks include instructions that do not diverge or converge. Each cohesive block is associated with one or more uniform registers. When a set of threads executes the instructions in a given cohesive block, each thread in the set may access the uniform register independently of the other threads in the set. Accordingly, the uniform register may store a single copy of data on behalf of all threads in the set of threads, thereby conserving resources.

Type: Grant

Filed: February 14, 2018

Date of Patent: December 15, 2020

Assignee: NVIDIA Corporation

Inventors: Ajay Tirumala, Jack Choquette, Manan Patel, Shirish Gadre, Praveen Kaushik
Thread-level sleep in a multithreaded architecture

Patent number: 10817295

Abstract: A streaming multiprocessor (SM) includes a nanosleep (NS) unit configured to cause individual threads executing on the SM to sleep for a programmer-specified interval of time. For a given thread, the NS unit parses a NANOSLEEP instruction and extracts a sleep time. The NS unit then maps the sleep time to a single bit of a timer and causes the thread to sleep. When the timer bit changes, the sleep time expires, and the NS unit awakens the thread. The thread may then continue executing. The SM also includes a nanotrap (NT) unit configured to issue traps using a similar timing mechanism to that described above. For a given thread, the NT unit parses a NANOTRAP instruction and extracts a trap time. The NT unit then maps the trap time to a single bit of a timer. When the timer bit changes, the NT unit issues a trap.

Type: Grant

Filed: April 28, 2017

Date of Patent: October 27, 2020

Assignee: NVIDIA Corporation

Inventors: Olivier Giroux, Peter Nelson, Jack Choquette, Ajay Sudarshan Tirumala
Unified cache for diverse memory traffic

Patent number: 10705994

Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.

Type: Grant

Filed: May 4, 2017

Date of Patent: July 7, 2020

Assignee: NVIDIA Corporation

Inventors: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
CONVERGENCE AMONG CONCURRENTLY EXECUTING THREADS

Publication number: 20200081748

Abstract: Convergence of threads executing common code sections is facilitated using instructions inserted at strategic locations in computer code sections. The inserted instructions enable the threads in a warp or other group to cooperate with a thread scheduler to promote thread convergence.

Type: Application

Filed: September 11, 2019

Publication date: March 12, 2020

Applicant: NVIDIA Corp.

Inventors: Daniel Robert Johnson, Jack Choquette, Oliver Giroux, Michael Patrick McKeown, Mark Stephenson, Sana Damani
ROBUST, EFFICIENT MULTIPROCESSOR-COPROCESSOR INTERFACE

Publication number: 20200050451

Abstract: Systems and methods for an efficient and robust multiprocessor-coprocessor interface that may be used between a streaming multiprocessor and an acceleration coprocessor in a GPU are provided. According to an example implementation, in order to perform an acceleration of a particular operation using the coprocessor, the multiprocessor: issues a series of write instructions to write input data for the operation into coprocessor-accessible storage locations, issues an operation instruction to cause the coprocessor to execute the particular operation; and then issues a series of read instructions to read result data of the operation from coprocessor-accessible storage locations to multiprocessor-accessible storage locations.

Type: Application

Filed: August 10, 2018

Publication date: February 13, 2020

Inventors: Ronald Babich, John BURGESS, Jack CHOQUETTE, Tero KARRAS, Samuli LAINE, Ignacio LLAMAS, Gregory MUTHLER, William Parsons NEWHALL, JR.
SIMULTANEOUS COMPUTE AND GRAPHICS SCHEDULING

Publication number: 20200043123

Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.

Type: Application

Filed: August 2, 2018

Publication date: February 6, 2020

Inventors: Rajballav DASH, Gregory PALMER, Gentaro HIROTA, Lacky SHAH, Jack CHOQUETTE, Emmett KILGARIFF, Sriharsha NIVERTY, Milton LEI, Shirish GADRE, Omkar PARANJAPE, Lei YANG, Rouslan DIMITROV
TECHNIQUES FOR COMPREHENSIVELY SYNCHRONIZING EXECUTION THREADS

Publication number: 20200034143

Abstract: In one embodiment, a synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

Type: Application

Filed: October 7, 2019

Publication date: January 30, 2020

Inventors: Ajay Sudarshan Tirumala, Olivier Giroux, Peter Nelson, Jack Choquette
Unified cache for diverse memory traffic

Patent number: 10459861

Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.

Type: Grant

Filed: September 26, 2017

Date of Patent: October 29, 2019

Assignee: NVIDIA CORPORATION

Inventors: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
Techniques for maintaining atomicity and ordering for pixel shader operations

Patent number: 10453168

Abstract: A tile coalescer within a graphics processing pipeline coalesces coverage data into tiles. The coverage data indicates, for a set of XY positions, whether a graphics primitive covers those XY positions. The tile indicates, for a larger set of XY positions, whether one or more graphics primitives cover those XY positions. The tile coalescer includes coverage data in the tile only once for each XY position, thereby allowing the API ordering of the graphics primitives covering each XY position to be preserved. The tile is then distributed to a set of streaming multiprocessors for shading and blending operations. The different streaming multiprocessors execute thread groups to process the tile. In doing so, those thread groups may perform read-modify-write operations with data stored in memory. Each such thread group is scheduled to execute via atomic operations, and according to the API order of the associated graphics primitives.

Type: Grant

Filed: August 17, 2018

Date of Patent: October 22, 2019

Assignee: NVIDIA CORPORATION

Inventors: Ziyad Hakura, Eric Lum, Dale Kirkland, Jack Choquette, Patrick R. Brown, Yury Y. Uralsky, Jeffrey Bolz
Techniques for comprehensively synchronizing execution threads

Patent number: 10437593

Abstract: A synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

Type: Grant

Filed: April 27, 2017

Date of Patent: October 8, 2019

Assignee: NVIDIA CORPORATION

Inventors: Ajay Sudarshan Tirumala, Olivier Giroux, Peter Nelson, Jack Choquette
UNIFORM REGISTER FILE FOR IMPROVED RESOURCE UTILIZATION

Publication number: 20190146796

Abstract: A compiler parses a multithreaded application into cohesive blocks of instructions. Cohesive blocks include instructions that do not diverge or converge. Each cohesive block is associated with one or more uniform registers. When a set of threads executes the instructions in a given cohesive block, each thread in the set may access the uniform register independently of the other threads in the set. Accordingly, the uniform register may store a single copy of data on behalf of all threads in the set of threads, thereby conserving resources.

Type: Application

Filed: February 14, 2018

Publication date: May 16, 2019

Inventors: Ajay TIRUMALA, Jack CHOQUETTE, Manan PATEL, Shirish GADRE, Praveen KAUSHIK
BINDING CONSTANTS AT RUNTIME FOR IMPROVED RESOURCE UTILIZATION

Publication number: 20190146817

Abstract: A just-in-time (JIT) compiler binds constants to specific memory locations at runtime. The JIT compiler parses program code derived from a multithreaded application and identifies an instruction that references a uniform constant. The JIT compiler then determines a chain of pointers that originates within a root table specified in the multithreaded application and terminates at the uniform constant. The JIT compiler generates additional instructions for traversing the chain of pointers and inserts these instructions into the program code. A parallel processor executes this compiled code and, in doing so, causes a thread to traverse the chain of pointers and bind the uniform constant to a uniform register at runtime. Each thread in a group of threads executing on the parallel processor may then access the uniform constant.

Type: Application

Filed: February 14, 2018

Publication date: May 16, 2019

Inventors: Ajay TIRUMALA, Jack CHOQUETTE, Manan PATEL, Shirish GADRE, Praveen KAUSHIK, Amanpreet GREWAL, Shekhar DIVEKAR, Andrei KHODAKOVSKY

prev 1 2 3 4 next