Patents by Inventor Ronny KRASHINSKY
Ronny KRASHINSKY has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20240118899Abstract: Apparatuses, systems, and techniques to adapt instructions in a SIMT architecture for execution on serial execution units. In at least one embodiment, a set of one or more threads is selected from a group of active threads associated with an instruction and the instruction is executed for the set of one or more threads on a serial execution unit.Type: ApplicationFiled: February 3, 2023Publication date: April 11, 2024Inventors: Aditya Avinash Atluri, Jack Choquette, Carter Edwards, Olivier Giroux, Praveen Kumar Kaushik, Ronny Krashinsky, Rishkul Kulkarni, Konstantinos Kyriakopoulos
-
Patent number: 11803380Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier provides for certain synchronization mechanisms such as, for example, “Arrive” and “Wait” to be split to allow for greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows for hardware components such as, for example, dedicated copy or direct-memory-access (DMA) engines to be synchronized with software-based threads.Type: GrantFiled: December 12, 2019Date of Patent: October 31, 2023Assignee: NVIDIA CorporationInventors: Olivier Giroux, Jack Choquette, Ronny Krashinsky, Steve Heinrich, Xiaogang Qiu, Shirish Gadre
-
Publication number: 20230315655Abstract: A new synchronization system synchronizes data exchanges between producer processes and consumer processes which may be on the same or different processors in a multiprocessor system. The synchronization incurs less than one roundtrip of latency - in some implementations, in approximately 0.5 roundtrip times. A key aspect of the fast synchronization is that the producer’s data store is followed without delay with the updating of a barrier on which the consumer is waiting.Type: ApplicationFiled: March 10, 2022Publication date: October 5, 2023Inventors: Jack CHOQUETTE, Ronny KRASHINSKY, Timothy GUO, Carter EDWARDS, Steve HEINRICH, John EDMONDSON, Prakash Bangalore PRABHAKAR, Apoorv PARLE, JR., Manan PATEL, Olivier GIROUX, Michael PELLAUER
-
Publication number: 20230289189Abstract: Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to also access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as hacked by an L2 cache.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Prakash BANGALORE PRABHAKAR, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Greg PALMER, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
-
Publication number: 20230289398Abstract: This specification describes techniques for implementing matrix multiply and add (MMA) operations in graphics processing units (GPU)s and other processors. The implementations provide for a plurality of warps of threads to collaborate in generating the result matrix by enabling each thread to share its respective register files to be accessed by the datapaths associated with other threads in the group of warps. A state machine circuit controls a MMA execution among the warps executing on asynchronous computation units. A group MMA (GMMA) instruction provides for a descriptor to be provided as parameter where the descriptor may include information regarding size and formats of input data to be loaded into shared memory and/or the datapath.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Jack CHOQUETTE, Manan PATEL, Matt TYRLIK, Ronny KRASHINSKY
-
Publication number: 20230289212Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Jerome F. DULUK, JR., Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
-
Publication number: 20230289190Abstract: This specification describes a programmatic multicast technique enabling one thread (for example, in a cooperative group array (CGA) on a GPU) to request data on behalf of one or more other threads (for example, executing on respective processor cores of the GPU). The multicast is supported by tracking circuitry that interfaces between multicast requests received from processor cores and the available memory. The multicast is designed to reduce cache (for example, layer 2 cache) bandwidth utilization enabling strong scaling and smaller tile sizes.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Apoorv PARLE, Ronny KRASHINSKY, John EDMONDSON, Jack CHOQUETTE, Shirish GADRE, Steve HEINRICH, Manan PATEL, Prakash Bangalore PRABHAKAR, JR., Ravi MANYAM, Wish GANDHI, Lacky SHAH, Alexander L. Minkin
-
Publication number: 20230289304Abstract: A parallel processing unit comprises a plurality of processors each being coupled to a memory access hardware circuitry. Each memory access hardware circuitry is configured to receive, from the coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, wherein the memory access hardware circuit is one of a plurality of memory access circuitry each coupled to a respective one of the processors; and, in response to the memory access request, translate the coordinate of the multidimensional data structure into plural memory addresses for the multidimensional data structure and using the plural memory addresses, asynchronously transfer at least a portion of the multidimensional data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or an external memory.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Alexander L. Minkin, Alan Kaatz, Oliver Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
-
Publication number: 20230289215Abstract: A new level(s) of hierarchy—Cooperate Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model is described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Greg PALMER, Gentaro HIROTA, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS, Rajballav DASH, Jeff TUCKEY, Jerome F. DULUK, JR., Lacky SHAH, Luke DURANT, Jack CHOQUETTE, Eric WERNESS, Naman GOVIL, Manan PATEL, Shayani DEB, Sandeep NAVADA, John EDMONDSON, Prakash BANGALORE PRABHAKAR, Wish GANDHI, Ravi MANYAM, Apoorv PARLE, Olivier GIROUX, Shirish GADRE, Steve HEINRICH
-
Publication number: 20230289211Abstract: A processor supports new thread group hierarchies by centralizing work distribution to provide hardware-guaranteed concurrent execution of thread groups in a thread group array through speculative launch and load balancing across processing cores. Efficiencies are realized by distributing grid rasterization among the processing cores.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Gentaro HIROTA, Tanmoy MANDAL, Jeff TUCKEY, Kevin STEPHANO, Chen MEI, Shayani DEB, Naman GOVIL, Rajballav DASH, Ronny KRASHINSKY, Ze LONG, Brian PHARRIS
-
Publication number: 20230289242Abstract: A new transaction barrier synchronization primitive enables executing threads and asynchronous transactions to synchronize across parallel processors. The asynchronous transactions may include transactions resulting from, for example, hardware data movement units such as direct memory units, etc. A hardware synchronization circuit may provide for the synchronization primitive to be stored in a cache memory so that barrier operations may be accelerated by the circuit. A new wait mechanism reduces software overhead associated with waiting on a barrier.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Timothy GUO, Jack CHOQUETTE, Shirish GADRE, Olivier GIROUX, Carter EDWARDS, John EDMONDSON, Manan PATEL, Raghavan MADHAVAN, JR., Jessie HUANG, Peter NELSON, Ronny KRASHINSKY
-
Publication number: 20230289292Abstract: A parallel processing unit comprises a plurality of processors each being coupled to a memory access hardware circuitry. Each memory access hardware circuitry is configured to receive, from the coupled processor, a memory access request specifying a coordinate of a multidimensional data structure, wherein the memory access hardware circuit is one of a plurality of memory access circuitry each coupled to a respective one of the processors; and, in response to the memory access request, translate the coordinate of the multidimensional data structure into plural memory addresses for the multidimensional data structure and using the plural memory addresses, asynchronously transfer at least a portion of the multidimensional data structure for processing by at least the coupled processor. The memory locations may be in the shared memory of the coupled processor and/or an external memory.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Alexander L. Minkin, Alan Kaatz, Olivier Giroux, Jack Choquette, Shirish Gadre, Manan Patel, John Tran, Ronny Krashinsky, Jeff Schottmiller
-
Publication number: 20230288471Abstract: Processing hardware of a processor is virtualized to provide a façade between a consistent programming interface and specific hardware instances. Hardware processor components can be permanently or temporarily disabled when not needed to support the consistent programming interface and/or to balance hardware processing across a hardware arrangement such as an integrated circuit. Executing software can be migrated from one hardware arrangement to another without need to reset the hardware.Type: ApplicationFiled: March 10, 2022Publication date: September 14, 2023Inventors: Jerome F. DULUK, Gentaro HIROTA, Ronny KRASHINSKY, Greg PALMER, Jeff TUCKEY, Kaushik NADADHUR, Philip Browning JOHNSON, Praveen JOGINIPALLY
-
Publication number: 20230110438Abstract: Apparatuses, systems, and techniques are presented to perform one or more operations. In at least one embodiment, one or more data values, to be used by one or more neural networks, are caused to be replaced by one or more invalid data values.Type: ApplicationFiled: October 8, 2021Publication date: April 13, 2023Inventors: Pradeep Ramani, Alex Minkin, Alan Kaatz, Yang Xu, Ronny Krashinsky
-
Publication number: 20220365882Abstract: Apparatuses, systems, and techniques to control operation of a memory cache. In at least one embodiment, cache guidance is specified within application source code by associating guidance with declaration of a memory block, and then applying specified guidance to source code statements that access said memory block.Type: ApplicationFiled: August 5, 2021Publication date: November 17, 2022Inventors: Harold Carter Edwards, Luke David Durant, Stephen Jones, Jack H. Choquette, Ronny Krashinsky, Dmitri Vainbrand, Olivier Giroux, Olivier Francois Joseph Harel, Shirish Gadre, Ze Long, Matthieu Tardy, David Dastous St Hilaire, Gokul Ramaswamy Hirisave Chandra Shekhara, Jaydeep Marathe, Jaewook Shin, Jayashree Venkatesh, Girish Bhaskar Bharambe
-
Patent number: 11392829Abstract: Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments enforce sparsity constraints for performing sparse matrix multiply-add instruction (MMA) operations. Deep neural networks can exhibit significant sparsity in the data used in operations, both in the activations and weights. The computational load can be reduced by excluding zero-valued data elements. A sparsity constraint is applied across all submatrices of a sparse matrix, providing fine-grained structured sparsity that is evenly distributed across the matrix. The matrix may then be compressed since a minimum number of elements of the matrix are known to have zero value. Matrix operations are then performed using these matrices.Type: GrantFiled: April 2, 2019Date of Patent: July 19, 2022Assignee: NVIDIA CorporationInventors: Jeff Pool, Ganesh Venkatesh, Jorge Albericio Latorre, Jack Choquette, Ronny Krashinsky, John Tran, Feng Xie, Ming Y. Siu, Manan Patel
-
Patent number: 11347668Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.Type: GrantFiled: July 6, 2020Date of Patent: May 31, 2022Assignee: NVIDIA CorporationInventors: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu
-
Publication number: 20210124627Abstract: To synchronize operations of a computing system, a new type of synchronization barrier is disclosed. In one embodiment, the disclosed synchronization barrier provides for certain synchronization mechanisms such as, for example, “Arrive” and “Wait” to be split to allow for greater flexibility and efficiency in coordinating synchronization. In another embodiment, the disclosed synchronization barrier allows for hardware components such as, for example, dedicated copy or direct-memory-access (DMA) engines to be synchronized with software-based threads.Type: ApplicationFiled: December 12, 2019Publication date: April 29, 2021Inventors: Olivier GIROUX, Jack CHOQUETTE, Ronny KRASHINSKY, Steve HEINRICH, Xiaogang QIU, Shirish GADRE
-
Publication number: 20200401541Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.Type: ApplicationFiled: July 6, 2020Publication date: December 24, 2020Inventors: Xiaogang QIU, Ronny KRASHINSKY, Steven HEINRICH, Shirish GADRE, John EDMONDSON, Jack CHOQUETTE, Mark GEBHART, Ramesh JANDHYALA, Poornachandra RAO, Omkar PARANJAPE, Michael SIU
-
Patent number: 10705994Abstract: A unified cache subsystem includes a data memory configured as both a shared memory and a local cache memory. The unified cache subsystem processes different types of memory transactions using different data pathways. To process memory transactions that target shared memory, the unified cache subsystem includes a direct pathway to the data memory. To process memory transactions that do not target shared memory, the unified cache subsystem includes a tag processing pipeline configured to identify cache hits and cache misses. When the tag processing pipeline identifies a cache hit for a given memory transaction, the transaction is rerouted to the direct pathway to data memory. When the tag processing pipeline identifies a cache miss for a given memory transaction, the transaction is pushed into a first-in first-out (FIFO) until miss data is returned from external memory. The tag processing pipeline is also configured to process texture-oriented memory transactions.Type: GrantFiled: May 4, 2017Date of Patent: July 7, 2020Assignee: NVIDIA CorporationInventors: Xiaogang Qiu, Ronny Krashinsky, Steven Heinrich, Shirish Gadre, John Edmondson, Jack Choquette, Mark Gebhart, Ramesh Jandhyala, Poornachandra Rao, Omkar Paranjape, Michael Siu