Patents by Inventor Gentaro Hirota
Gentaro Hirota has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12248788
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Grant
Filed: March 10, 2022
Date of Patent: March 11, 2025
Assignee: NVIDIA Corporation
Inventors: Prakash Bangalore Prabhakar, Gentaro Hirota, Ronny Krashinsky, Ze Long, Brian Pharris, Rajballav Dash, Jeff Tuckey, Jerome F. Duluk, Jr., Lacky Shah, Luke Durant, Jack Choquette, Eric Werness, Naman Govil, Manan Patel, Shayani Deb, Sandeep Navada, John Edmondson, Greg Palmer, Wish Gandhi, Ravi Manyam, Apoorv Parle, Olivier Giroux, Shirish Gadre, Steve Heinrich
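The cross-core shared-memory access this abstract describes is visible to software through CUDA's documented thread block cluster / distributed shared memory programming model on Hopper-class GPUs. The sketch below illustrates that programming model only, not the patented hardware mechanism; the kernel name, sizes, and launch shape are illustrative assumptions.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two thread blocks form one cluster; each block can map the other's shared memory.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int* out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    if (threadIdx.x == 0)
        smem[0] = (int)rank;          // each block publishes its rank in its own shared memory
    cluster.sync();                   // every block in the cluster has written

    // Distributed shared memory: obtain a pointer into the peer block's shared memory.
    int* peer_smem = cluster.map_shared_rank(smem, rank ^ 1);
    if (threadIdx.x == 0)
        out[rank] = peer_smem[0];     // read the value the other block wrote
    cluster.sync();                   // keep peer shared memory alive until all reads finish
}

// Example launch: 2 blocks of 32 threads, i.e. exactly one cluster.
// dsmem_exchange<<<2, 32>>>(d_out);
```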
-
Publication number: 20230288471
Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Jerome F. DULUK, Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
-
Publication number: 20230289189
Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as memory backed by an L2 cache.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
-
Publication number: 20230289215
Abstract: A new level of hierarchy—Cooperative Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
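In the CUDA toolkit, co-scheduled groups of CTAs surface to programmers as thread block clusters, which can be requested at launch time. A minimal host-side sketch under that assumption (kernel name, grid shape, and cluster size are arbitrary), not a description of the patented work-distribution hardware:

```cuda
#include <cuda_runtime.h>

__global__ void cluster_kernel(float* data)
{
    // CTAs in the same cluster are co-scheduled and can cooperate.
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Group the grid's thread blocks into clusters of 4 CTAs each.
void launch_with_clusters(float* d_data)
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(64, 1, 1);   // total CTAs (must be a multiple of the cluster size)
    config.blockDim = dim3(128, 1, 1);

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 4;          // 4 CTAs per cluster
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs    = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, cluster_kernel, d_data);
}
```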
-
Publication number: 20230289212
Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Jerome F. DULUK, JR., Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
-
Publication number: 20230289211
Abstract: A processor supports new thread group hierarchies by centralizing work distribution to provide hardware-guaranteed concurrent execution of thread groups in a thread group array through speculative launch and load balancing across processing cores. Efficiencies are realized by distributing grid rasterization among the processing cores.
Type: Application
Filed: March 10, 2022
Publication date: September 14, 2023
Inventors: Gentaro HIROTA, Tanmoy MANDAL, Jeff TUCKEY, Kevin STEPHANO, Chen MEI, Shayani DEB, Naman GOVIL, Rajballav DASH, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS
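The concurrency guarantee for a thread group array is something software can check before launching: the CUDA runtime can report how many clusters of a given shape could be co-resident on the device. A sketch using the public occupancy query (the kernel and all dimensions are illustrative assumptions, not part of the patent):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void cga_kernel(int* out) { out[blockIdx.x] = blockIdx.x; }

// Ask the runtime how many clusters of this shape can run concurrently, i.e.
// whether the hardware can honor the concurrency guarantee for this launch.
void query_cluster_occupancy()
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(8, 1, 1);
    config.blockDim = dim3(256, 1, 1);

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 8;          // all 8 CTAs in one cluster
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs    = &attr;
    config.numAttrs = 1;

    int maxClusters = 0;
    cudaOccupancyMaxActiveClusters(&maxClusters, cga_kernel, &config);
    printf("co-resident clusters supported: %d\n", maxClusters);
}
```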
-
Patent number: 11720440
Abstract: Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
Type: Grant
Filed: July 12, 2021
Date of Patent: August 8, 2023
Assignee: NVIDIA Corporation
Inventors: Naveen Cherukuri, Saurabh Hukerikar, Paul Racunas, Nirmal Raj Saxena, David Charles Patrick, Yiyang Feng, Abhijeet Ghadge, Steven James Heinrich, Adam Hendrickson, Gentaro Hirota, Praveen Joginipally, Vaishali Kulkarni, Peter C. Mills, Sandeep Navada, Manan Patel, Liang Yin
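The error containment itself happens inside the GPU; from the host side, the closest publicly visible signal is the ECC error counters that the driver exposes through NVML. A monitoring sketch under that assumption (device index and counter class chosen arbitrarily), unrelated to the patented stall-and-poison mechanism:

```cuda
#include <nvml.h>
#include <cstdio>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Uncorrected device-memory ECC errors are the class of fault that
    // containment mechanisms of this kind are designed to keep from spreading.
    unsigned long long uncorrected = 0;
    nvmlDeviceGetMemoryErrorCounter(dev,
                                    NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_VOLATILE_ECC,                   // since last driver reload
                                    NVML_MEMORY_LOCATION_DEVICE_MEMORY,
                                    &uncorrected);
    printf("uncorrected device-memory ECC errors: %llu\n", uncorrected);

    nvmlShutdown();
    return 0;
}
```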
-
Publication number: 20230236878
Abstract: In various embodiments, scheduling dependencies associated with tasks executed on a processor are decoupled from data dependencies associated with the tasks. Before the completion of a first task that is executing in the processor, a scheduling dependency specifying that a second task is dependent on the first task is resolved based on a pre-exit trigger. In response to the resolution of the scheduling dependency, the second task is launched on the processor.
Type: Application
Filed: January 25, 2022
Publication date: July 27, 2023
Inventors: Jack Hilaire CHOQUETTE, Rajballav DASH, Shayani DEB, Gentaro HIROTA, Ronny M. KRASHINSKY, Ze LONG, Chen MEI, Manan PATEL, Ming Y. SIU
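CUDA's programmatic dependent launch exposes a similar decoupling to software: a producer kernel can signal that its dependents may be launched before it exits, while the consumer waits only on the data it actually needs. A device-side sketch of that public feature (kernel names and work are placeholders, and a Hopper-class target such as -arch=sm_90 is assumed), not a description of the patented scheduler:

```cuda
#include <cuda_runtime.h>

// Producer: signals the scheduler that dependents may launch before the kernel exits.
__global__ void producer(float* buf)
{
    buf[threadIdx.x] = (float)threadIdx.x;          // work whose results the consumer needs
    cudaTriggerProgrammaticLaunchCompletion();      // "pre-exit" style launch trigger
    // ... work the consumer does not depend on could continue here ...
}

// Consumer: may begin launching early; it blocks until the producer's
// memory effects are visible before touching the shared buffer.
__global__ void consumer(const float* buf, float* out)
{
    // prologue work that does not read buf can run here
    cudaGridDependencySynchronize();                // wait only for the data dependency
    out[threadIdx.x] = 2.0f * buf[threadIdx.x];
}
```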
-
Publication number: 20230011863
Abstract: Various embodiments include a parallel processing computer system that detects memory errors as a memory client loads data from memory and disables the memory client from storing data to memory, thereby reducing the likelihood that the memory error propagates to other memory clients. The memory client initiates a stall sequence, while other memory clients continue to execute instructions and the memory continues to service memory load and store operations. When a memory error is detected, a specific bit pattern is stored in conjunction with the data associated with the memory error. When the data is copied from one memory to another memory, the specific bit pattern is also copied, in order to identify the data as having a memory error.
Type: Application
Filed: July 12, 2021
Publication date: January 12, 2023
Inventors: Naveen CHERUKURI, Saurabh HUKERIKAR, Paul RACUNAS, Nirmal Raj SAXENA, David Charles PATRICK, Yiyang FENG, Abhijeet GHADGE, Steven James HEINRICH, Adam HENDRICKSON, Gentaro HIROTA, Praveen JOGINIPALLY, Vaishali KULKARNI, Peter C. MILLS, Sandeep NAVADA, Manan PATEL, Liang YIN
-
Patent number: 11367160
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Grant
Filed: August 2, 2018
Date of Patent: June 21, 2022
Assignee: NVIDIA Corporation
Inventors: Rajballav Dash, Gregory Palmer, Gentaro Hirota, Lacky Shah, Jack Choquette, Emmett Kilgariff, Sriharsha Niverty, Milton Lei, Shirish Gadre, Omkar Paranjape, Lei Yang, Rouslan Dimitrov
-
Patent number: 11307903
Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.
Type: Grant
Filed: January 31, 2018
Date of Patent: April 19, 2022
Assignee: NVIDIA Corporation
Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
-
Patent number: 11182207
Abstract: Techniques are disclosed for reducing the latency between the completion of a producer task and the launch of a consumer task dependent on the producer task. Such latency exists when the information needed to launch the consumer task is unavailable when the producer task completes. Thus, various techniques are disclosed, where a task management unit initiates the retrieval of the information needed to launch the consumer task from memory in parallel with the producer task being launched. Because the retrieval of such information is initiated in parallel with the launch of the producer task, the information is often available when the producer task completes, thus allowing for the consumer task to be launched without delay. The disclosed techniques, therefore, enable the latency between completing the producer task and launching the consumer task to be reduced.
Type: Grant
Filed: June 24, 2019
Date of Patent: November 23, 2021
Assignee: NVIDIA Corporation
Inventors: Gentaro Hirota, Brian Pharris, Jeff Tuckey, Robert Overman, Stephen Jones
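A loosely analogous, publicly documented way for software to overlap consumer-launch preparation with producer execution is the host-side programmatic dependent launch attribute, which lets the runtime begin setting up the consumer while the producer still runs (the dependency is then enforced inside the consumer kernel, as in the device-side sketch further above). Kernels, sizes, and names below are illustrative assumptions, not the patented task management unit:

```cuda
#include <cuda_runtime.h>

__global__ void producer(float* buf) { buf[threadIdx.x] = (float)threadIdx.x; }

__global__ void consumer(const float* buf, float* out)
{
    cudaGridDependencySynchronize();               // wait only for the producer data
    out[threadIdx.x] = buf[threadIdx.x] + 1.0f;
}

// Enqueue the consumer so its launch work can overlap the producer's execution.
void launch_overlapped(float* d_buf, float* d_out, cudaStream_t stream)
{
    producer<<<1, 256, 0, stream>>>(d_buf);

    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(1, 1, 1);
    config.blockDim = dim3(256, 1, 1);
    config.stream   = stream;

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;   // allow early consumer launch
    config.attrs    = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, consumer, (const float*)d_buf, d_out);
}
```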
-
Publication number: 20200401444
Abstract: Techniques are disclosed for reducing the latency between the completion of a producer task and the launch of a consumer task dependent on the producer task. Such latency exists when the information needed to launch the consumer task is unavailable when the producer task completes. Thus, various techniques are disclosed, where a task management unit initiates the retrieval of the information needed to launch the consumer task from memory in parallel with the producer task being launched. Because the retrieval of such information is initiated in parallel with the launch of the producer task, the information is often available when the producer task completes, thus allowing for the consumer task to be launched without delay. The disclosed techniques, therefore, enable the latency between completing the producer task and launching the consumer task to be reduced.
Type: Application
Filed: June 24, 2019
Publication date: December 24, 2020
Inventors: Gentaro HIROTA, Brian PHARRIS, Jeff TUCKEY, Robert OVERMAN, Stephen JONES
-
Patent number: 10817338
Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.
Type: Grant
Filed: January 31, 2018
Date of Patent: October 27, 2020
Assignee: NVIDIA Corporation
Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
-
Publication number: 20200043123
Abstract: A parallel processing unit (e.g., a GPU), in some examples, includes a hardware scheduler and hardware arbiter that launch graphics and compute work for simultaneous execution on a SIMD/SIMT processing unit. Each processing unit (e.g., a streaming multiprocessor) of the parallel processing unit operates in a graphics-greedy mode or a compute-greedy mode at respective times. The hardware arbiter, in response to a result of a comparison of at least one monitored performance or utilization metric to a user-configured threshold, can selectively cause the processing unit to run one or more compute work items from a compute queue when the processing unit is operating in the graphics-greedy mode, and cause the processing unit to run one or more graphics work items from a graphics queue when the processing unit is operating in the compute-greedy mode. Associated methods and systems are also described.
Type: Application
Filed: August 2, 2018
Publication date: February 6, 2020
Inventors: Rajballav DASH, Gregory PALMER, Gentaro HIROTA, Lacky SHAH, Jack CHOQUETTE, Emmett KILGARIFF, Sriharsha NIVERTY, Milton LEI, Shirish GADRE, Omkar PARANJAPE, Lei YANG, Rouslan DIMITROV
-
Publication number: 20190235924
Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.
Type: Application
Filed: January 31, 2018
Publication date: August 1, 2019
Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
-
Publication number: 20190235928
Abstract: Embodiments of the present invention set forth techniques for allocating execution resources to groups of threads within a graphics processing unit. A compute work distributor included in the graphics processing unit receives an indication from a process that a first group of threads is to be launched. The compute work distributor determines that a first subcontext associated with the process has at least one processor credit. In some embodiments, CTAs may be launched even when there are no processor credits, if one of the TPCs that was already acquired has sufficient space. The compute work distributor identifies a first processor included in a plurality of processors that has a processing load that is less than or equal to the processor loads associated with all other processors included in the plurality of processors. The compute work distributor launches the first group of threads to execute on the first processor.
Type: Application
Filed: January 31, 2018
Publication date: August 1, 2019
Inventors: Jerome F. Duluk, Jr., Luke Durant, Ramon Matas Navarro, Alan Menezes, Jeffrey Tuckey, Gentaro Hirota, Brian Pharris
-
Patent number: 10217183
Abstract: A system, method, and computer program product are provided for allocating processor resources to process compute workloads and graphics workloads substantially simultaneously. The method includes the steps of allocating a plurality of processing units to process tasks associated with a graphics pipeline, receiving a request to allocate at least one processing unit in the plurality of processing units to process tasks associated with a compute pipeline, and reallocating the at least one processing unit to process tasks associated with the compute pipeline.
Type: Grant
Filed: December 20, 2013
Date of Patent: February 26, 2019
Assignee: NVIDIA Corporation
Inventors: Gregory S. Palmer, Jerome F. Duluk, Jr., Karim Maher Abdalla, Jonathon S. Evans, Adam Clark Weitkemper, Lacky Vasant Shah, Philip Browning Johnson, Gentaro Hirota
-
Patent number: 10114755
Abstract: A system, method, and computer program product for warming a cache for a task launch is described. The method includes the steps of receiving a task data structure that defines a processing task, extracting information stored in a cache warming field of the task data structure, and, prior to executing the processing task, generating a cache warming instruction that is configured to load one or more entries of a cache storage with data fetched from a memory.
Type: Grant
Filed: June 14, 2013
Date of Patent: October 30, 2018
Assignee: NVIDIA Corporation
Inventors: Scott Ricketts, Nicholas Wang, Shirish Gadre, Gentaro Hirota, Robert Ohannessian, Jr.
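The patented mechanism warms the cache from a field in the hardware task descriptor; a loosely analogous, publicly documented software control is marking a kernel's input as persisting in L2 before the launch so its first accesses are more likely to hit. A sketch under that assumption (kernel, names, and sizes are illustrative), not the patented instruction:

```cuda
#include <cuda_runtime.h>

__global__ void task_kernel(const float* hot, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = hot[i] * 2.0f;
}

// Before launching the task, hint that its input should persist in L2.
void launch_with_warm_l2(const float* d_hot, float* d_out, int n, cudaStream_t stream)
{
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = const_cast<float*>(d_hot);
    attr.accessPolicyWindow.num_bytes = n * sizeof(float);   // capped by the device's max policy window
    attr.accessPolicyWindow.hitRatio  = 1.0f;                 // try to keep the whole window resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    task_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_hot, d_out, n);
}
```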
-
Patent number: 9715413
Abstract: One embodiment of the present invention sets forth a technique for selecting a first processor included in a plurality of processors to receive work related to a compute task. The technique involves analyzing state data of each processor in the plurality of processors to identify one or more processors that have already been assigned one compute task and are eligible to receive work related to the one compute task, receiving, from each of the one or more processors identified as eligible, an availability value that indicates the capacity of the processor to receive new work, selecting a first processor to receive work related to the one compute task based on the availability values received from the one or more processors, and issuing, to the first processor via a cooperative thread array (CTA), the work related to the one compute task.
Type: Grant
Filed: January 18, 2012
Date of Patent: July 25, 2017
Assignee: NVIDIA Corporation
Inventors: Karim M. Abdalla, Lacky V. Shah, Jerome F. Duluk, Jr., Timothy John Purcell, Tanmoy Mandal, Gentaro Hirota