Patents by Inventor Luke Durant

Luke Durant has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Distributed Shared Memory

Publication number: 20230289189

Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to also access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as hacked by an L2 cache.

Type: Application

Filed: March 10, 2022

Publication date: September 14, 2023

Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
Cooperative Group Arrays

Publication number: 20230289215

Abstract: A new level(s) of hierarchy—Cooperate Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model is described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.

Type: Application

Filed: March 10, 2022

Publication date: September 14, 2023

Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
Dynamic partitioning of execution resources

Patent number: 11307903

Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.

Type: Grant

Filed: January 31, 2018

Date of Patent: April 19, 2022

Assignee: NVIDIA Corporation

Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
Dynamic partitioning of execution resources

Patent number: 10817338

Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.

Type: Grant

Filed: January 31, 2018

Date of Patent: October 27, 2020

Assignee: NVIDIA Corporation

Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
DYNAMIC PARTITIONING OF EXECUTION RESOURCES

Publication number: 20190235928

Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.

Type: Application

Filed: January 31, 2018

Publication date: August 1, 2019

Inventors: Jerome F. Duluk, JR., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
DYNAMIC PARTITIONING OF EXECUTION RESOURCES

Publication number: 20190235924

Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.

Type: Application

Filed: January 31, 2018

Publication date: August 1, 2019

Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
Cooperative thread array granularity context switch during trap handling

Patent number: 10289418

Abstract: Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The execution units perform this context switch for at least one of the execution units as part of the trap handling routine while allowing the remaining execution units to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors.

Type: Grant

Filed: December 27, 2012

Date of Patent: May 14, 2019

Assignee: NVIDIA CORPORATION

Inventors: Gerald F. Luiz, Philip Alexander Cuadra, Luke Durant, Shirish Gadre, Robert Ohannessian, Lacky V. Shah, Nicholas Wang, Arthur Danskin
Technique for saving and restoring thread group operating state

Patent number: 10235208

Abstract: A streaming multiprocessor (SM) included within a parallel processing unit (PPU) is configured to suspend a thread group executing on the SM and to save the operating state of the suspended thread group. A load-store unit (LSU) within the SM re-maps local memory associated with the thread group to a location in global memory. Subsequently, the SM may re-launch the suspended thread group. The LSU may then perform local memory access operations on behalf of the re-launched thread group with the re-mapped local memory that resides in global memory.

Type: Grant

Filed: December 11, 2012

Date of Patent: March 19, 2019

Assignee: NVIDIA CORPORATION

Inventors: Nicholas Wang, Lacky V. Shah, Gerald F. Luiz, Philip Alexander Cuadra, Luke Durant, Shirish Gadre
Cooperative thread array granularity context switch during trap handling

Patent number: 10095542

Abstract: Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.

Type: Grant

Filed: October 30, 2017

Date of Patent: October 9, 2018

Assignee: NVIDIA CORPORATION

Inventors: Gerald F. Luiz, Philip Alexander Cuadra, Luke Durant, Shirish Gadre, Robert Ohannessian, Lacky V. Shah, Nicholas Wang, Arthur Merlin Danskin
Method and system for processing nested stream events

Patent number: 9928109

Abstract: One embodiment of the present disclosure sets forth a technique for enforcing cross stream dependencies in a parallel processing subsystem such as a graphics processing unit. The technique involves queuing waiting events to create cross stream dependencies and signaling events to indicated completion to the waiting events. A scheduler kernel examines a task status data structure from a corresponding stream and updates dependency counts for tasks and events within the stream. When each task dependency for a waiting event is satisfied, an associated task may execute.

Type: Grant

Filed: May 9, 2012

Date of Patent: March 27, 2018

Assignee: NVIDIA Corporation

Inventor: Luke Durant
COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING

Publication number: 20180052707

Abstract: Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.

Type: Application

Filed: October 30, 2017

Publication date: February 22, 2018

Inventors: Gerald F. LUIZ, Philip Alexander CUADRA, Luke DURANT, Shirish GADRE, Robert OHANNESSIAN, Lacky V. SHAH, Nicholas Wang, Arthur Merlin DANSKIN
System and method for runtime scheduling of GPU tasks

Patent number: 9891949

Abstract: A method for scheduling work for processing by a GPU is disclosed. The method includes accessing a work completion data structure and accessing a work tracking data structure. Dependency logic analysis is then performed using work completion data and work tracking data. Work items that have dependencies are then launched into the GPU by using a software work item launch interface.

Type: Grant

Filed: March 6, 2013

Date of Patent: February 13, 2018

Assignee: Nvidia Corporation

Inventors: Timothy Paul Lottes, Daniel Wexler, Craig Duttweiler, Sean Treichler, Luke Durant, Philip Cuadra
Cooperative thread array granularity context switch during trap handling

Patent number: 9804885

Abstract: Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.

Type: Grant

Filed: September 20, 2016

Date of Patent: October 31, 2017

Assignee: NVIDIA Corporation

Inventors: Gerald F. Luiz, Philip Alexander Cuadra, Luke Durant, Shirish Gadre, Robert Ohannessian, Lacky V. Shah, Nicholas Wang, Arthur Merlin Danskin
COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING

Publication number: 20170010914

Abstract: Techniques are provided for restoring threads within a processing core. The techniques include, for a first thread group included in a plurality of thread groups, executing a context restore routine to restore from a memory a first portion of a context associated with the first thread group, determining whether the first thread group completed an assigned function, and, if the first thread group completed the assigned function, then exiting the context restore routine, or if the first thread group did not complete the assigned function, then executing one or more operations associated with a trap handler routine.

Type: Application

Filed: September 20, 2016

Publication date: January 12, 2017

Inventors: Gerald F. LUIZ, Philip Alexander CUADRA, Luke DURANT, Shirish GADRE, Robert OHANNESSIAN, Lacky V. SHAH, Nicholas Wang, Arthur Merlin DANSKIN
Compute work distribution reference counters

Patent number: 9507638

Abstract: One embodiment of the present invention sets forth a technique for managing the allocation and release of resources during multi-threaded program execution. Programmable reference counters are initialized to values that limit the amount of resources for allocation to tasks that share the same reference counter. Resource parameters are specified for each task to define the amount of resources allocated for consumption by each array of execution threads that is launched to execute the task. The resource parameters also specify the behavior of the array for acquiring and releasing resources. Finally, during execution of each thread in the array, an exit instruction may be configured to override the release of the resources that were allocated to the array. The resources may then be retained for use by a child task that is generated during execution of a thread.

Type: Grant

Filed: November 8, 2011

Date of Patent: November 29, 2016

Assignee: NVIDIA Corporation

Inventors: Philip Alexander Cuadra, Karim M. Abdalla, Jerome F. Duluk, Jr., Luke Durant, Gerald F. Luiz, Timothy John Purcell, Lacky V. Shah
Cooperative thread array granularity context switch during trap handling

Patent number: 9448837

Abstract: Techniques are provided for restoring thread groups in a cooperative thread array (CTA) within a processing core. Each thread group in the CTA is launched to execute a context restore routine. Each thread group, executes the context restore routine to restore from a memory a first portion of context associated with the thread group, and determines whether the thread group completed an assigned function prior to executing the context restore routine. If the thread group completed an assigned function prior to executing the context restore routine, then the thread group exits the context restore routine. If the thread group did not complete the assigned function prior to executing the context restore routine, then the thread group executes one or more operations associated with a trap handler routine. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors.

Type: Grant

Filed: April 15, 2013

Date of Patent: September 20, 2016

Assignee: NVIDIA Corporation

Inventors: Gerald F. Luiz, Philip Alexander Cuadra, Luke Durant, Shirish Gadre, Robert Ohannessian, Lacky V. Shah, Nicholas Wang, Arthur Merlin Danskin
Techniques for managing the execution order of multiple nested tasks executing on a parallel processor

Patent number: 9436504

Abstract: One embodiment of the present disclosure sets forth an enhanced way for GPUs to queue new computational tasks into a task metadata descriptor queue (TMDQ). Specifically, memory for context data is pre-allocated when a new TMDQ is created. A new TMDQ may be integrated with an existing TMDQ, where computational tasks within that TMDQ include task from each of the original TMDQs. A scheduling operation is executed on completion of each computational task in order to preserve sequential execution of tasks without the use of atomic locking operations. One advantage of the disclosed technique is that GPUs are enabled to queue computational tasks within TMDQs, and also create an arbitrary number of new TMDQs to any arbitrary nesting level, without intervention by the CPU. Processing efficiency is enhanced where the GPU does not wait while the CPU creates and queues tasks.

Type: Grant

Filed: May 9, 2012

Date of Patent: September 6, 2016

Assignee: NVIDIA Corporation

Inventor: Luke Durant
System and method for launching callable functions

Patent number: 9086933

Abstract: A system and method are provided for launching a callable function. A processing system includes a host processor, a graphics processing unit, and a driver for launching a callable function. The driver is adapted to recognize at load time of a program that a first function within the program is a callable function. The driver is further adapted to generate a second function. The second function is adapted to receive arguments and translate the arguments from a calling convention for launching a function into a calling convention for calling a callable function. The second function is further adapted to call the first function using the translated arguments. The driver is also adapted to receive from the host processor or the GPU a procedure call representing a launch of the first function and, in response, launch the second function.

Type: Grant

Filed: October 1, 2012

Date of Patent: July 21, 2015

Assignee: NVIDIA CORPORATION

Inventors: Bastiaan Aarts, Luke Durant, Girish Bharambe, Vinod Grover
Virtual memory structure for coprocessors having memory allocation limitations

Patent number: 8904068

Abstract: One embodiment sets forth a technique for dynamically allocating memory during multi-threaded program execution for a coprocessor that does not support dynamic memory allocation, memory paging, or memory swapping. The coprocessor allocates an amount of memory to a program as a put buffer before execution of the program begins. If, during execution of the program by the coprocessor, a request presented by a thread to store data in the put buffer cannot be satisfied because the put buffer is full, the thread notifies a worker thread. The worker thread processes a notification generated by the thread by dynamically allocating a swap buffer within a memory that cannot be accessed by the coprocessor. The worker thread then pages the put buffer into the swap buffer during execution of the program to empty the put buffer, thereby enabling threads executing on the coprocessor to dynamically receive memory allocations during execution of the program.

Type: Grant

Filed: May 9, 2012

Date of Patent: December 2, 2014

Assignee: NVIDIA Corporation

Inventors: Luke Durant, Ze Long
SYSTEM AND METHOD FOR RUNTIME SCHEDULING OF GPU TASKS

Publication number: 20140259016

Abstract: A method for scheduling work for processing by a GPU is disclosed. The method includes accessing a work completion data structure and accessing a work tracking data structure. Dependency logic analysis is then performed using work completion data and work tracking data. Work items that have dependencies are then launched into the GPU by using a software work item launch interface.

Type: Application

Filed: March 6, 2013

Publication date: September 11, 2014

Applicant: NVIDIA CORPORATION

Inventors: Timothy Paul LOTTES, Daniel WEXLER, Craig DUTTWEILER, Sean TREICHLER, Luke DURANT, Philip CUADRA

1 2 next