Patents by Inventor Ming Y. Siu
Ming Y. Siu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20210311733
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Application
Filed: June 17, 2021
Publication date: October 7, 2021
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
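The three dot-product steps named in this abstract (generate partial products, align them by exponent, accumulate with an adder) can be illustrated with a small software sketch. This is not the patented datapath: the function names, the `MANT_BITS` mantissa width, and the choice to align to the largest product exponent by right-shifting are all assumptions made for illustration.

```python
import math

MANT_BITS = 11  # assumed mantissa width (hypothetical, roughly half precision)


def decompose(x):
    """Split x into an integer mantissa and an exponent so that x ~= mant * 2**exp."""
    frac, exp = math.frexp(x)            # x == frac * 2**exp, with 0.5 <= |frac| < 1
    mant = int(frac * (1 << MANT_BITS))  # keep only MANT_BITS bits of the fraction
    return mant, exp - MANT_BITS


def aligned_dot(a, b, acc=0.0):
    """Dot product of a and b, added to acc, using exponent-aligned integer sums."""
    # Step 1: one partial product per element pair (integer mantissa, summed exponent).
    products = []
    for x, y in zip(a, b):
        mx, ex = decompose(x)
        my, ey = decompose(y)
        products.append((mx * my, ex + ey))
    if not products:
        return acc

    # Step 2: align every partial product to the largest exponent.
    # The right shift discards low-order bits, which is where precision is lost.
    max_exp = max(e for _, e in products)
    aligned = [m >> (max_exp - e) for m, e in products]

    # Step 3: accumulate the aligned values with a plain integer add.
    total = sum(aligned)
    return acc + math.ldexp(total, max_exp)
```

Because all aligned values share one exponent, the accumulation reduces to integer addition; the cost is that small products lose low-order bits when shifted.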
-
Publication number: 20210303302
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Application
Filed: January 4, 2021
Publication date: September 30, 2021
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
-
Patent number: 10884734
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Grant
Filed: July 1, 2019
Date of Patent: January 5, 2021
Assignee: NVIDIA Corporation
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
-
Publication number: 20200373941
Abstract: In artificial neural networks, and other similar applications, there is typically a large amount of data involved that is considered sparse data. Due to the large size of the data involved in such applications, it is helpful to compress the data to save bandwidth resources when transmitting the data and save memory resources when storing the data. Introduced herein is a compression technique that selects elements with significant values from data and restructures them into a structured sparse format. By generating metadata that enforces the structured sparse format and organizing the data according to the metadata, the introduced technique not only reduces the size of the data but also consistently places the data in a particular format. As such, hardware can be simplified and optimized to process the data much faster and much more efficiently than the conventional compression techniques that rely on a non-structured sparsity format.
Type: Application
Filed: May 30, 2019
Publication date: November 26, 2020
Inventors: Jorge Albericio Latorre, Ming Y. Siu
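One way to picture a structured sparse format with accompanying metadata is sketched below. The abstract does not state the pattern used, so the fixed "keep 2 of every 4" layout, the use of magnitude to decide which values are significant, and both function names are assumptions for illustration only.

```python
def compress_structured(values, group=4, keep=2):
    """Keep the `keep` largest-magnitude values in each block of `group` values.

    Returns (data, meta): the surviving values and, for each one, its position
    inside its block. The 2-of-4 default is an assumed pattern, not the patent's.
    """
    if len(values) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    data, meta = [], []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        # Rank positions by magnitude, keep the top ones, then restore position order.
        ranked = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)
        top = sorted(ranked[:keep])
        data.extend(block[i] for i in top)
        meta.extend(top)
    return data, meta


def decompress_structured(data, meta, group=4, keep=2):
    """Rebuild the dense layout, writing zeros wherever a value was dropped."""
    out = [0] * (len(data) // keep * group)
    for k, (value, pos) in enumerate(zip(data, meta)):
        out[(k // keep) * group + pos] = value
    return out
```

Every block compresses to the same size, so a consumer can locate block `n` at a fixed offset without scanning, which is the property that lets hardware be simplified.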
-
Publication number: 20200285618
Abstract: Compressed data is oftentimes beneficial for reducing the computing resources required, for example, to transmit and store data. The compression of data is particularly useful when dealing with sparse data (data that includes numerous zeros or near-zero values) and only non-zero values above a certain threshold have significance. When dealing with compressed data, oftentimes the data needs to be decompressed for processing (e.g., by deep learning networks or other applications configured to operate on sparse, or other uncompressed data). Instructions are disclosed for supporting the decompression of compressed data by a processing unit such as a CPU and GPU.
Type: Application
Filed: March 20, 2019
Publication date: September 10, 2020
Inventors: Jorge Albericio Latorre, Jack H. Choquette, Manan Maheshkumar Patel, Jeffrey Pool, Ming Y. Siu, Ronny Meir Krashinsky, Ganesh Venkatesh
-
Publication number: 20200125363
Abstract: A method, computer readable medium, and processor are described herein for inline data inspection by using a decoder to decode a load instruction, including a signal to cause a circuit in a processor to indicate whether data loaded by a load instruction exceeds a threshold value. Moreover, an indication of whether data loaded by a load instruction exceeds a threshold value may be stored.
Type: Application
Filed: December 9, 2019
Publication date: April 23, 2020
Inventors: Jeffrey Michael Pool, Andrew Kerr, John Tran, Ming Y. Siu, Stuart Oberman
-
Patent number: 10503507
Abstract: A method, computer readable medium, and system are disclosed for inline data inspection. The method includes the steps of receiving, by a load/store unit, a load instruction and obtaining, by an inspection circuit that is coupled to the load/store unit, data specified by the load instruction. Additional steps include determining that the data equals zero and transmitting the data and a predicate signal to the load/store unit, wherein the predicate signal indicates that the data equals zero. Alternative additional steps include computing a predicate value based on a comparison between the data and a threshold value and transmitting the data and the predicate value to the load/store unit, wherein the predicate value is asserted when the data is less than the threshold value and is negated when the data is not less than the threshold value.
Type: Grant
Filed: August 31, 2017
Date of Patent: December 10, 2019
Assignee: NVIDIA Corporation
Inventors: Jeffrey Michael Pool, Andrew Kerr, John Tran, Ming Y. Siu, Stuart Oberman
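The two variants described in this abstract (a zero check, or a less-than comparison against a threshold) can be modeled in software as a load that returns the data together with a predicate. The function name, the list standing in for memory, and the optional `threshold` argument are hypothetical; the real mechanism is a circuit coupled to the load/store unit.

```python
def load_with_predicate(memory, address, threshold=None):
    """Return (data, predicate) for a load, mimicking inline data inspection.

    With no threshold the predicate reports whether the data equals zero;
    with a threshold it is asserted when the data is less than the threshold.
    """
    data = memory[address]
    if threshold is None:
        predicate = (data == 0)
    else:
        predicate = (data < threshold)
    return data, predicate
```

A caller could use the predicate to skip work on insignificant values, for example bypassing a multiply when the loaded operand is zero, without issuing a separate compare instruction.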
-
Patent number: 10503513
Abstract: A subsystem is configured to support a distributed instruction set architecture with primary and secondary execution pipelines. The primary execution pipeline supports the execution of a subset of instructions in the distributed instruction set architecture that are issued frequently. The secondary execution pipeline supports the execution of another subset of instructions in the distributed instruction set architecture that are issued less frequently. Both execution pipelines also support the execution of FFMA instructions as well as a common subset of instructions in the distributed instruction set architecture. When dispatching a requested instruction, an instruction scheduling unit is configured to select between the two execution pipelines based on various criteria. Those criteria may include power efficiency with which the instruction can be executed and availability of execution units to support execution of the instruction.
Type: Grant
Filed: October 23, 2013
Date of Patent: December 10, 2019
Assignee: NVIDIA CORPORATION
Inventors: David Conrad Tannenbaum, Srinivasan (Vasu) Iyer, Stuart F. Oberman, Ming Y. Siu, Michael Alan Fetterman, John Matthew Burgess, Shirish Gadre
-
Publication number: 20190324747
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Application
Filed: July 1, 2019
Publication date: October 24, 2019
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
-
Patent number: 10338919
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Grant
Filed: November 29, 2017
Date of Patent: July 2, 2019
Assignee: NVIDIA Corporation
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
-
Publication number: 20190065195
Abstract: A method, computer readable medium, and system are disclosed for inline data inspection. The method includes the steps of receiving, by a load/store unit, a load instruction and obtaining, by an inspection circuit that is coupled to the load/store unit, data specified by the load instruction. Additional steps include determining that the data equals zero and transmitting the data and a predicate signal to the load/store unit, wherein the predicate signal indicates that the data equals zero. Alternative additional steps include computing a predicate value based on a comparison between the data and a threshold value and transmitting the data and the predicate value to the load/store unit, wherein the predicate value is asserted when the data is less than the threshold value and is negated when the data is not less than the threshold value.
Type: Application
Filed: August 31, 2017
Publication date: February 28, 2019
Inventors: Jeffrey Michael Pool, Andrew Kerr, John Tran, Ming Y. Siu, Stuart Oberman
-
Patent number: 10217184
Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.
Type: Grant
Filed: May 23, 2017
Date of Patent: February 26, 2019
Assignee: NVIDIA CORPORATION
Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
-
Publication number: 20180321938
Abstract: A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
Type: Application
Filed: November 29, 2017
Publication date: November 8, 2018
Inventors: Brent Ralph Boswell, Ming Y. Siu, Jack H. Choquette, Jonah M. Alben, Stuart Oberman
-
Patent number: 9830197
Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
Type: Grant
Filed: August 16, 2016
Date of Patent: November 28, 2017
Assignee: NVIDIA Corporation
Inventors: Brian Fahs, Ming Y Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
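The difference between the scan result (available as each thread arrives) and the reduction result (available only after every thread has arrived) can be shown with a host-side sketch. The `AggregatingBarrier` class, the `run_threads` helper, the use of addition as the aggregation operator, and the exclusive-prefix form of the scan are all assumptions for illustration; the patent describes a hardware instruction, not this code.

```python
import threading


class AggregatingBarrier:
    """Single-use barrier that also sums the values its threads contribute."""

    def __init__(self, parties):
        self._barrier = threading.Barrier(parties)
        self._lock = threading.Lock()
        self._total = 0

    def arrive_and_wait(self, value):
        # Scan: known at arrival time, the sum of values from earlier arrivals.
        with self._lock:
            scan = self._total
            self._total += value
        # Reduction: only valid once every thread has contributed.
        self._barrier.wait()
        return scan, self._total


def run_threads(values):
    """Run one thread per value and return each thread's (scan, reduction)."""
    barrier = AggregatingBarrier(len(values))
    results = [None] * len(values)

    def worker(i):
        results[i] = barrier.arrive_and_wait(values[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every thread sees the same reduction, while the scan values depend on arrival order, so they differ from run to run unless the threads are otherwise ordered.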
-
Patent number: 9829956
Abstract: An approach is provided for enabling power reduction in floating-point operations. In one example, a system receives floating-point numbers of a fused multiply-add instruction. The system determines the fused multiply-add instruction does not require compliance with a standard of precision for floating-point numbers. The system generates gating signals for an integrated circuit that is configured to perform operations of the fused multiply-add instruction. The system then sends the gating signals to the integrated circuit to turn off a plurality of logic gates included in the integrated circuit.
Type: Grant
Filed: November 21, 2012
Date of Patent: November 28, 2017
Assignee: NVIDIA Corporation
Inventors: David Conrad Tannenbaum, Colin Sprinkle, Stuart F. Oberman, Ming Y. Siu, Srinivasan Iyer, Ian-Chi Yan Kwong
-
Patent number: 9798543
Abstract: One embodiment of the present invention sets forth a technique for allocating register file entries included in a register file to a thread group. A request to allocate a number of register file entries to the thread group is received. A required number of mapping table entries included in a register file mapping table (RFMT) is determined based on the request, where each mapping table entry included in the RFMT is associated with a different plurality of register file entries included in the register file. The RFMT is parsed to locate an available mapping table entry in the RFMT for each of the required mapping table entries. For each available mapping table entry, a register file pointer is associated with an address that corresponds to a first register file entry in the plurality of register file entries associated with the available mapping table entry.
Type: Grant
Filed: September 3, 2010
Date of Patent: October 24, 2017
Assignee: NVIDIA Corporation
Inventors: Michael Fiyak, Ming Y. Siu
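The allocation flow in this abstract (compute how many mapping table entries a request needs, scan the RFMT for available entries, then point each one at the first register file entry of its block) might look roughly like the sketch below. The function name, the list of booleans standing in for the RFMT, the `entries_per_slot` block size, and the all-or-nothing failure behavior are assumptions, not details from the patent.

```python
def allocate_registers(rfmt_free, requested, entries_per_slot=4):
    """Allocate register file entries through a mapping table.

    rfmt_free[i] is True when mapping table entry i is available. Each entry
    covers `entries_per_slot` consecutive register file entries (assumed size).
    Returns one pointer per claimed entry, or None if the request cannot be met.
    """
    # Required mapping table entries: ceiling of requested / entries_per_slot.
    needed = -(-requested // entries_per_slot)

    # Parse the table for available entries, stopping once enough are found.
    free = [i for i, is_free in enumerate(rfmt_free) if is_free][:needed]
    if len(free) < needed:
        return None  # leave the table untouched on failure

    pointers = []
    for slot in free:
        rfmt_free[slot] = False
        # Pointer to the first register file entry covered by this table entry.
        pointers.append(slot * entries_per_slot)
    return pointers
```

Because the claimed table entries need not be adjacent, a thread group can be given registers from scattered free blocks rather than waiting for one contiguous region.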
-
Publication number: 20170256022
Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.
Type: Application
Filed: May 23, 2017
Publication date: September 7, 2017
Inventors: John Erik LINDHOLM, Brett W. COON, Stuart F. OBERMAN, Ming Y. SIU, Matthew P. GERLACH
-
Patent number: 9659339
Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.
Type: Grant
Filed: March 25, 2013
Date of Patent: May 23, 2017
Assignee: NVIDIA CORPORATION
Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
-
Publication number: 20160357560
Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
Type: Application
Filed: August 16, 2016
Publication date: December 8, 2016
Inventors: Brian FAHS, Ming Y. SIU, Brett W. Coon, John R. NICKOLLS, Lars NYLAND
-
Publication number: 20160300319
Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.
Type: Application
Filed: March 25, 2013
Publication date: October 13, 2016
Applicant: NVIDIA Corporation
Inventors: John Erik LINDHOLM, Brett W. COON, Stuart F. OBERMAN, Ming Y. SIU, Matthew P. GERLACH