Patents by Inventor Jeff Tuckey
Jeff Tuckey has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20230289189
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory, such as memory backed by an L2 cache.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
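In CUDA terms, this corresponds to thread block clusters on Hopper-class GPUs, where one block can read another block's shared memory through the cooperative groups cluster API. A minimal sketch, assuming CUDA 12+ and an sm_90 target; the kernel name and sizes are illustrative:

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block fills its own shared-memory tile, then reads the tile of the
// next block in the cluster: shared memory accessed across cores (DSMEM).
__global__ void __cluster_dims__(2, 1, 1) dsmem_demo(int *out) {
    __shared__ int tile[32];
    cg::cluster_group cluster = cg::this_cluster();

    tile[threadIdx.x] = blockIdx.x * 1000 + threadIdx.x;
    cluster.sync();  // all blocks in the cluster have written their tiles

    // Map the neighbor block's shared-memory tile into this block's view.
    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *peer_tile = cluster.map_shared_rank(tile, peer);
    out[blockIdx.x * 32 + threadIdx.x] = peer_tile[threadIdx.x];

    cluster.sync();  // keep tiles alive until all remote reads complete
}

int main() {
    int *out;
    cudaMallocManaged(&out, 2 * 32 * sizeof(int));
    dsmem_demo<<<2, 32>>>(out);
    cudaDeviceSynchronize();
    printf("block 0 read from block 1: %d\n", out[0]);  // expect 1000
    cudaFree(out);
    return 0;
}
```

The final cluster.sync() matters: it keeps each block's tile resident until every remote read has completed, matching the abstract's point that remote access must not race with the owning core.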
-
Publication number: 20230289212
Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Jerome F. DULUK, JR., Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
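The core idea, a stable logical namespace mapped onto whichever physical units are currently enabled, can be sketched in host code. This is a hypothetical illustration of such a façade, not NVIDIA's implementation; every name in it is invented:

```cuda
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical facade: software addresses stable logical unit IDs; this
// table maps them to whichever physical units are currently enabled.
// Disabling a unit or migrating work only rewrites the table, so the
// logical view (and the programming interface) never changes.
struct UnitFacade {
    std::vector<std::optional<uint32_t>> logical_to_physical;

    explicit UnitFacade(uint32_t num_physical) {
        for (uint32_t p = 0; p < num_physical; ++p)
            logical_to_physical.emplace_back(p);
    }

    // Take the physical unit behind a logical ID out of service; the
    // logical ID keeps existing and can later be re-pointed elsewhere.
    void disable(uint32_t logical) { logical_to_physical[logical].reset(); }

    // Migrate a logical unit's work onto a different physical unit,
    // without resetting the hardware or disturbing other logical IDs.
    void migrate(uint32_t logical, uint32_t new_physical) {
        logical_to_physical[logical] = new_physical;
    }

    std::optional<uint32_t> resolve(uint32_t logical) const {
        return logical_to_physical[logical];
    }
};

int main() {
    UnitFacade facade(8);
    facade.disable(3);     // e.g., unit powered down or floorswept
    facade.migrate(5, 3);  // later reuse physical unit 3 for logical unit 5
    return facade.resolve(5).value() == 3 ? 0 : 1;
}
```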
-
Publication number: 20230289211
Abstract: A processor supports new thread group hierarchies by centralizing work distribution to provide hardware-guaranteed concurrent execution of thread groups in a thread group array through speculative launch and load balancing across processing cores. Efficiencies are realized by distributing grid rasterization among the processing cores.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Gentaro HIROTA, Tanmoy MANDAL, Jeff TUCKEY, Kevin STEPHANO, Chen MEI, Shayani DEB, Naman GOVIL, Rajballav DASH, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS
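To software, hardware-guaranteed concurrency appears as a co-scheduling contract: a thread group array either gets all of its thread groups running together or does not launch. A minimal sketch, assuming CUDA 12+ and a Hopper-class device, that queries how many such clusters the hardware can co-schedule for a given kernel (the kernel and dimensions are illustrative):

```cuda
#include <cstdio>

__global__ void worker() { /* illustrative kernel body */ }

int main() {
    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(8, 1, 1);
    config.blockDim = dim3(128, 1, 1);

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 4;  // 4 thread blocks per co-scheduled cluster
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs = &attr;
    config.numAttrs = 1;

    // How many clusters of this shape can the hardware guarantee to run
    // concurrently for this kernel?
    int max_clusters = 0;
    cudaError_t err =
        cudaOccupancyMaxActiveClusters(&max_clusters, worker, &config);
    printf("co-schedulable clusters: %d (%s)\n", max_clusters,
           cudaGetErrorString(err));
    return 0;
}
```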
-
Publication number: 20230288471
Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Jerome F. DULUK, Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
-
Publication number: 20230289215
Abstract: A new level of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
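CGAs surface in the CUDA runtime as thread block clusters whose shape is set at launch time. A minimal sketch, assuming CUDA 12+ and a Hopper-class device; the kernel and dimensions are illustrative:

```cuda
#include <cstdio>

__global__ void cga_kernel() {
    // Blocks in the same cluster are co-scheduled and may cooperate
    // (e.g., via distributed shared memory or cluster-wide sync).
}

int main() {
    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(16, 1, 1);   // total thread blocks (CTAs)
    config.blockDim = dim3(256, 1, 1);

    // Group every 4 consecutive blocks into one co-scheduled cluster (CGA).
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 4;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs = &attr;
    config.numAttrs = 1;

    cudaError_t err = cudaLaunchKernelEx(&config, cga_kernel);
    printf("launch: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();
    return 0;
}
```

The launch either places all four blocks of a cluster concurrently or fails, which is the hardware concurrency guarantee the abstract refers to.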
-
Patent number: 11182207
Abstract: Techniques are disclosed for reducing the latency between the completion of a producer task and the launch of a consumer task dependent on the producer task. Such latency exists when the information needed to launch the consumer task is unavailable when the producer task completes. Thus, various techniques are disclosed, where a task management unit initiates the retrieval of the information needed to launch the consumer task from memory in parallel with the producer task being launched. Because the retrieval of such information is initiated in parallel with the launch of the producer task, the information is often available when the producer task completes, thus allowing for the consumer task to be launched without delay. The disclosed techniques, therefore, enable the latency between completing the producer task and launching the consumer task to be reduced.
Type: Grant
Filed: June 24, 2019
Date of Patent: November 23, 2021
Assignee: NVIDIA CORPORATION
Inventors: Gentaro Hirota, Brian Pharris, Jeff Tuckey, Robert Overman, Stephen Jones
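The scheduling idea, fetching the consumer's launch metadata from memory while the producer is still running so it is already on hand at completion, can be modeled in software. A hypothetical host-side sketch (the patent describes a hardware task management unit; all names below are invented):

```cuda
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

struct TaskDescriptor { int grid_size; int block_size; };

// Stand-in for reading a task's launch metadata out of (slow) memory.
TaskDescriptor fetch_descriptor_from_memory(int task_id) {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return {task_id * 64, 128};
}

// Stand-in for the producer task's execution time.
void run_producer_task() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

int main() {
    // Start fetching the consumer's launch information in parallel with
    // launching the producer, rather than waiting for the producer to end.
    auto consumer_meta = std::async(std::launch::async,
                                    fetch_descriptor_from_memory, 7);
    run_producer_task();

    // The metadata is typically ready by the time the producer completes,
    // so the consumer launches without a memory-latency stall.
    TaskDescriptor d = consumer_meta.get();
    printf("launch consumer: grid=%d block=%d\n", d.grid_size, d.block_size);
    return 0;
}
```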
-
Publication number: 20200401444
Abstract: Techniques are disclosed for reducing the latency between the completion of a producer task and the launch of a consumer task dependent on the producer task. Such latency exists when the information needed to launch the consumer task is unavailable when the producer task completes. Thus, various techniques are disclosed, where a task management unit initiates the retrieval of the information needed to launch the consumer task from memory in parallel with the producer task being launched. Because the retrieval of such information is initiated in parallel with the launch of the producer task, the information is often available when the producer task completes, thus allowing for the consumer task to be launched without delay. The disclosed techniques, therefore, enable the latency between completing the producer task and launching the consumer task to be reduced.
Type: Application
Filed: June 24, 2019
Publication date: December 24, 2020
Inventors: Gentaro HIROTA, Brian PHARRIS, Jeff TUCKEY, Robert OVERMAN, Stephen JONES
-
Patent number: 10423424
Abstract: Techniques are disclosed for performing an auxiliary operation via a compute engine associated with a host computing device. The method includes determining that the auxiliary operation is directed to the compute engine, and determining that the auxiliary operation is associated with a first context comprising a first set of state parameters. The method further includes determining a first subset of state parameters related to the auxiliary operation based on the first set of state parameters. The method further includes transmitting the first subset of state parameters to the compute engine, and transmitting the auxiliary operation to the compute engine. One advantage of the disclosed technique is that surface area and power consumption are reduced within the processor by utilizing copy engines that have no context switching capability.
Type: Grant
Filed: September 28, 2012
Date of Patent: September 24, 2019
Assignee: NVIDIA CORPORATION
Inventors: Lincoln G. Garlick, Philip Browning Johnson, Rafal Zboinski, Jeff Tuckey, Samuel H. Duncan, Peter C. Mills
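The technique amounts to extracting only the copy-relevant subset of a context and shipping it inline with the command, so the copy engine never needs to store or switch contexts. A hypothetical sketch; all structures and names are invented for illustration:

```cuda
#include <cstdint>
#include <cstdio>

struct FullContext {            // large per-process state kept by the host
    uint64_t page_table_base;
    uint32_t address_space_id;
    uint32_t compression_mode;
    uint32_t shader_state[64];  // irrelevant to a copy operation
};

struct CopyCommand {            // only the subset a copy engine needs
    uint64_t page_table_base;
    uint32_t address_space_id;
    uint64_t src, dst, bytes;
};

// Select the copy-relevant subset of the context and send it with the
// command; everything else (e.g., shader state) never reaches the engine,
// so the engine itself can remain stateless.
CopyCommand make_copy(const FullContext &ctx, uint64_t src, uint64_t dst,
                      uint64_t bytes) {
    return {ctx.page_table_base, ctx.address_space_id, src, dst, bytes};
}

int main() {
    FullContext ctx{0x1000, 42, 1, {}};
    CopyCommand cmd = make_copy(ctx, 0x2000, 0x3000, 4096);
    printf("copy in ASID %u, %llu bytes\n", cmd.address_space_id,
           (unsigned long long)cmd.bytes);
    return 0;
}
```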
-
Patent number: 9606808
Abstract: A computing device detects divergences between threads in a thread group executing on a parallel processing unit. The computing device includes an address divergence unit that identifies a subset of non-divergent threads included in the thread group. The address divergence unit stores instructions related to the subset of non-divergent threads in a multi-issue queue. The address divergence unit causes the instructions related to the subset of non-divergent threads to be retrieved from the multi-issue queue when the parallel processing unit is available. The address divergence unit causes the subset of non-divergent threads to be issued for execution on the parallel processing unit. The address divergence unit repeats the identifying, storing, and causing steps for the remaining threads in the thread group that are not included in the subset of non-divergent threads.
Type: Grant
Filed: January 11, 2012
Date of Patent: March 28, 2017
Assignee: NVIDIA Corporation
Inventors: Jack Choquette, Xiaogang Qiu, Jeff Tuckey, Michael (Ming Yiu) Siu, Robert J. Stoll, Olivier Giroux
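The loop the abstract describes, peeling off the subset of threads that agree, issuing it, and repeating on the remainder, can be modeled in a few lines. A hypothetical software model of what is, in the patent, a hardware unit; names and types are invented:

```cuda
#include <cstdio>
#include <utility>
#include <vector>

struct Thread { int id; int target; };  // target: e.g., branch destination

// Repeatedly identify the subset of threads that agree with the first
// pending thread (the non-divergent subset), issue it, and requeue the
// rest, mirroring the identify/store/issue loop in the abstract.
void issue_thread_group(std::vector<Thread> pending) {
    while (!pending.empty()) {
        int leader = pending.front().target;
        std::vector<Thread> issue_now, divergent;
        for (const Thread &t : pending)
            (t.target == leader ? issue_now : divergent).push_back(t);

        printf("issuing %zu threads at target %d\n",
               issue_now.size(), leader);
        pending = std::move(divergent);  // handled on a later pass
    }
}

int main() {
    issue_thread_group({{0, 10}, {1, 10}, {2, 20}, {3, 10}, {4, 20}});
    return 0;
}
```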
-
Publication number: 20140095759
Abstract: Techniques are disclosed for performing an auxiliary operation via a compute engine associated with a host computing device. The method includes determining that the auxiliary operation is directed to the compute engine, and determining that the auxiliary operation is associated with a first context comprising a first set of state parameters. The method further includes determining a first subset of state parameters related to the auxiliary operation based on the first set of state parameters. The method further includes transmitting the first subset of state parameters to the compute engine, and transmitting the auxiliary operation to the compute engine. One advantage of the disclosed technique is that surface area and power consumption are reduced within the processor by utilizing copy engines that have no context switching capability.
Type: Application
Filed: September 28, 2012
Publication date: April 3, 2014
Applicant: NVIDIA CORPORATION
Inventors: Lincoln G. GARLICK, Philip Browning JOHNSON, Rafal ZBOINSKI, Jeff TUCKEY, Samuel H. DUNCAN, Peter C. MILLS
-
Publication number: 20130179662
Abstract: An address divergence unit detects divergence between threads in a thread group and then separates those threads into a subset of non-divergent threads and a subset of divergent threads. In one embodiment, the address divergence unit causes instructions associated with the subset of non-divergent threads to be issued for execution on a parallel processing unit, while causing the instructions associated with the subset of divergent threads to be re-fetched and re-issued for execution.
Type: Application
Filed: January 11, 2012
Publication date: July 11, 2013
Inventors: Jack CHOQUETTE, Xiaogang QIU, Jeff TUCKEY, Michael (Ming Yiu) SIU, Robert J. STOLL, Olivier GIROUX
-
Patent number: 6591316
Abstract: A packet memory interface. The interface includes an input mechanism which receives related data. The interface includes an output mechanism which transmits the data. The interface includes a mechanism for transferring at least a plurality of bytes of the data in each burst of a plurality of bursts from the input mechanism to the output mechanism without fragmentation loss in each burst. A method for transferring data through a packet memory interface. The method includes the steps of receiving data of the packet at an input mechanism of the interface. Then there is the step of transferring at least a plurality of bytes of data of the packet to an output mechanism in bursts without any fragmentation loss in the bursts.
Type: Grant
Filed: May 20, 1999
Date of Patent: July 8, 2003
Assignee: Marconi Communications, Inc.
Inventors: Peter Roman, Jeff Tuckey, Parthiban Kandappan
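The claimed property, that every burst carries a full payload so no burst capacity is lost to fragmentation, reduces to accumulating packet bytes until a complete burst is ready. A hypothetical sketch; the real interface is hardware and all names here are invented:

```cuda
#include <cstdio>
#include <vector>

constexpr size_t kBurstBytes = 64;  // illustrative burst size

// Accumulate incoming packet bytes and emit only full bursts, so no burst
// is padded or partially used (no fragmentation loss). A trailing partial
// burst is held until more data arrives.
class BurstBuffer {
    std::vector<unsigned char> pending_;
public:
    void receive(const unsigned char *data, size_t len) {
        pending_.insert(pending_.end(), data, data + len);
        while (pending_.size() >= kBurstBytes) {
            transmit_burst(pending_.data());
            pending_.erase(pending_.begin(), pending_.begin() + kBurstBytes);
        }
    }
    void transmit_burst(const unsigned char *burst) {
        printf("burst of %zu bytes sent\n", kBurstBytes);
    }
};

int main() {
    BurstBuffer buf;
    unsigned char pkt[100] = {};
    buf.receive(pkt, sizeof(pkt));  // one full 64-byte burst; 36 bytes held
    return 0;
}
```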