Patents by Inventor Norbert Juffa

Norbert Juffa has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Optimized 3D lighting computations using a logarithmic number system

Patent number: 9304739

Abstract: Embodiments of the present invention set forth a technique for optimizing the performance and efficiency of complex, software-based computations, such as lighting computations. Data entering a graphics application programming interface (API) in a conventional arithmetic representation, such as floating-point or fixed-point, is converted to an internal logarithmic representation for greater computational efficiency. Lighting computations are then performed using logarithmic space arithmetic routines that, on average, execute more efficiently than similar routines performed in a native floating-point format. The lighting computation results, represented as logarithmic space numbers, are converted back to floating-point numbers before being transmitted to a graphics processing unit (GPU) for further processing. Because of efficiencies of logarithmic space arithmetic, performance improvements may be realized relative to prior art approaches to performing software-based floating-point operations.
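
As an illustration of the idea in this abstract, the following is a minimal Python sketch of logarithmic-space arithmetic. It is not the patented method. The function names (`to_lns`, `from_lns`, `lns_mul`, `lns_pow`, `lns_add`, `specular_term`) are hypothetical, and the sketch assumes strictly positive values and a base-2 logarithm. In log space, multiplication becomes addition and exponentiation becomes multiplication, which is why a power-heavy lighting term can be cheap there.

```python
import math


def to_lns(x: float) -> float:
    """Encode a positive float as its base-2 logarithm (assumed representation)."""
    if x <= 0.0:
        raise ValueError("this sketch only encodes strictly positive values")
    return math.log2(x)


def from_lns(l: float) -> float:
    """Convert a log-space value back to an ordinary float."""
    return 2.0 ** l


def lns_mul(a: float, b: float) -> float:
    """Multiplication in log space is an addition."""
    return a + b


def lns_pow(a: float, exponent: float) -> float:
    """Raising to a power in log space is a multiplication."""
    return a * exponent


def lns_add(a: float, b: float) -> float:
    """Addition is the costly case: log2(2^a + 2^b) = hi + log2(1 + 2^(lo - hi))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def specular_term(n_dot_h: float, shininess: float, intensity: float) -> float:
    """Compute intensity * n_dot_h ** shininess entirely in log space."""
    result = lns_mul(to_lns(intensity), lns_pow(to_lns(n_dot_h), shininess))
    return from_lns(result)
```

Calling `specular_term(0.9, 32.0, 0.8)` converts the inputs once, performs one multiply and one add on the logarithms, and converts the result back once, mirroring the convert, compute, convert-back flow the abstract describes.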

Type: Grant

Filed: December 11, 2006

Date of Patent: April 5, 2016

Assignee: NVIDIA Corporation

Inventor: Norbert Juffa

Graphics processor with memory management unit and cache coherent link

Patent number: 8860741

Abstract: In contrast to a conventional computing system in which the graphics processor (graphics processing unit or GPU) is treated as a slave to one or several CPUs, systems and methods are provided that allow the GPU to be treated as a central processing unit (CPU) from the perspective of the operating system. The GPU can access a memory space shared by other CPUs in the computing system. Caches utilized by the GPU may be coherent with caches utilized by other CPUs in the computing system. The GPU may share execution of general-purpose computations with other CPUs in the computing system.

Type: Grant

Filed: December 8, 2006

Date of Patent: October 14, 2014

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Stuart F. Oberman

Efficient matrix multiplication on a parallel processing device

Patent number: 8589468

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
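
The tile-by-tile scheme in this abstract can be sketched sequentially in Python. This is a hedged illustration, not the patented mapping: the function name `tiled_matmul` and the `tile` parameter are hypothetical, each iteration of the two outer loops stands in for one cooperative thread array (CTA), and ordinary Python lists stand in for local memory.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices a (m x k) and b (k x n) one result tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    c = [[0] * n for _ in range(m)]
    for ti in range(0, m, tile):          # each (ti, tj) result tile would map to one CTA
        for tj in range(0, n, tile):
            for tk in range(0, k, tile):
                # Copy the source tiles once into "local" buffers so that every
                # element is fetched from the large matrices a single time per tile.
                a_tile = [row[tk:tk + tile] for row in a[ti:ti + tile]]
                b_tile = [row[tj:tj + tile] for row in b[tk:tk + tile]]
                # Each (i, j) element of the result tile would map to one thread.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p in range(len(a_row)):
                            acc += a_row[p] * b_tile[p][j]
                        c[ti + i][tj + j] += acc
    return c
```

For example, `tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` returns `[[58, 64], [139, 154]]`. On real hardware the tiles would be processed concurrently; the sketch only shows how the work is partitioned.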

Type: Grant

Filed: September 3, 2010

Date of Patent: November 19, 2013

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Radoslav Danilak

Maximized memory throughput on parallel processing devices

Patent number: 8327123

Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

Type: Grant

Filed: March 23, 2011

Date of Patent: December 4, 2012

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Brett W. Coon

Pipelined integer division using floating-point reciprocal

Patent number: 8140608

Abstract: One embodiment of the present invention sets forth a technique for performing fast integer division using commonly available arithmetic operations. The technique may be implemented in a two-stage process using a single-precision floating point reciprocal in conjunction with integer addition and multiplication. Furthermore, the technique may be fully pipelined on many conventional processors for performance that is comparable to the best available high-performance alternatives.
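
To make the general idea concrete, here is a hedged Python sketch of unsigned 32-bit division built from a single-precision reciprocal plus integer multiplication and addition. It is an illustration under stated assumptions, not the patented technique: the helper names `f32` and `udiv32` are hypothetical, single precision is emulated by rounding through the `struct` module, and the final comparison loops are a safety net added here so the result is exact regardless of estimate error.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def udiv32(n: int, d: int) -> int:
    """Return n // d for unsigned 32-bit n and d using a float32 reciprocal."""
    if not (0 <= n < 2 ** 32 and 0 < d < 2 ** 32):
        raise ValueError("operands must be unsigned 32-bit and d must be nonzero")
    recip = f32(1.0 / f32(float(d)))          # single-precision reciprocal of d

    # Stage 1: coarse quotient estimate, then the exact integer remainder.
    q = int(f32(f32(float(n)) * recip))
    r = n - q * d

    # Stage 2: refine the estimate using the remainder and the same reciprocal.
    q += int(f32(f32(float(r)) * recip))
    r = n - q * d

    # Safety net (an addition of this sketch): absorb any residual error.
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q
```

A single-precision value carries only 24 significant bits, so the first estimate can be off by many units for large quotients. The second stage reuses the reciprocal on the much smaller remainder, which is where most of that error is removed.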

Type: Grant

Filed: May 31, 2007

Date of Patent: March 20, 2012

Assignee: NVIDIA Corporation

Inventor: Norbert Juffa

Maximized memory throughput on parallel processing devices

Publication number: 20110173414

Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

Type: Application

Filed: March 23, 2011

Publication date: July 14, 2011

Applicant: NVIDIA Corporation

Inventors: Norbert Juffa, Brett W. Coon

Maximized memory throughput using cooperative thread arrays

Patent number: 7925860

Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

Type: Grant

Filed: May 14, 2007

Date of Patent: April 12, 2011

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Brett W. Coon

Graphics processing unit used for cryptographic processing

Patent number: 7916864

Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.

Type: Grant

Filed: February 8, 2006

Date of Patent: March 29, 2011

Assignee: NVIDIA Corporation

Inventor: Norbert Juffa

Mapping the threads of a CTA to the elements of a tile for efficient matrix multiplication

Patent number: 7912889

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

Type: Grant

Filed: June 16, 2006

Date of Patent: March 22, 2011

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Radoslav Danilak

Efficient matrix multiplication on a parallel processing device

Publication number: 20100325187

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

Type: Application

Filed: September 3, 2010

Publication date: December 23, 2010

Inventors: Norbert Juffa, Radoslav Danilak

Hardware/software-based mapping of CTAs to matrix tiles for efficient matrix multiplication

Patent number: 7836118

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

Type: Grant

Filed: June 16, 2006

Date of Patent: November 16, 2010

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Radoslav Danilak

Efficient matrix multiplication on a parallel processing device

Patent number: 7792895

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

Type: Grant

Filed: June 16, 2006

Date of Patent: September 7, 2010

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Radoslav Danilak

Hardware resource based mapping of cooperative thread arrays (CTA) to result matrix tiles for efficient matrix multiplication in computing system comprising plurality of multiprocessors

Patent number: 7506134

Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.

Type: Grant

Filed: June 16, 2006

Date of Patent: March 17, 2009

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Radoslav Danilak

Matrix multiply with reduced bandwidth requirements

Publication number: 20070271325

Abstract: Systems and methods for reducing the bandwidth needed to read the inputs to a matrix multiply operation may improve system performance. Rather than reading a row of a first input matrix and a column of a second input matrix to produce a column of a product matrix, a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix. Therefore, the number of input matrix elements read to produce each product matrix element is reduced from 2N to N+1, where N is the number of elements in a column of the product matrix.
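
The access pattern in this abstract can be sketched in Python as follows. This is a hedged illustration rather than the claimed implementation: the function name `matmul_column_updates` is hypothetical, and the sketch only shows the ordering of reads, in which one column of the first matrix and one element of the second together update a whole column of partial dot products.

```python
def matmul_column_updates(a, b):
    """Compute a @ b by accumulating column-sized partial dot products."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    c = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        for k in range(inner):
            b_kj = b[k][j]                            # a single element of the second matrix
            a_col = [a[i][k] for i in range(rows)]    # one column of the first matrix
            for i in range(rows):
                # Every entry of product column j receives one partial term.
                c[i][j] += a_col[i] * b_kj
    return c
```

Each inner step reads `rows + 1` input values and advances `rows` partial dot products, instead of reading a full row and a full column for every single output element.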

Type: Application

Filed: May 8, 2006

Publication date: November 22, 2007

Inventors: Norbert Juffa, John Nickolls

Graphics processing unit used for cryptographic processing

Publication number: 20070198412

Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.

Type: Application

Filed: February 8, 2006

Publication date: August 23, 2007

Inventor: Norbert Juffa

Microprocessor including an efficient implementation of extreme value instructions

Patent number: 6557098

Abstract: An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution unit further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operands.

Type: Grant

Filed: January 5, 2000

Date of Patent: April 29, 2003

Assignee: Advanced Micro Devices, Inc.

Inventors: Stuart Oberman, Norbert Juffa

Apparatus and method for superforwarding load operands in a microprocessor

Patent number: 6442677

Abstract: An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.

Type: Grant

Filed: June 10, 1999

Date of Patent: August 27, 2002

Assignee: Advanced Micro Devices, Inc.

Inventors: Derrick R. Meyer, Stephan G. Meier, Norbert Juffa

Method and apparatus for rapid execution of FCOM and FSTSW

Patent number: 6425074

Abstract: A microprocessor configured to rapidly execute floating point store status word (FSTSW) type instructions that are immediately preceded by floating point compare (FCOM) type instructions is disclosed. FCOM-type instructions are modified to store their results to an architectural floating point status word and a temporary destination register. If an FSTSW-type instruction is detected immediately following an FCOM-type instruction, then the FSTSW-type instruction is transformed into a special fast floating point store status word (FSTSWEF) instruction. Unlike the FSTSW-type instruction, which is serializing and negatively impacts performance, the FSTSWEF instruction is not serializing and allows execution to continue without undue serialization. A computer system and method for rapidly executing FSTSW instructions immediately preceded by FCOM-type instructions are also disclosed.

Type: Grant

Filed: September 10, 1999

Date of Patent: July 23, 2002

Assignee: Advanced Micro Devices, Inc.

Inventors: Stephan G. Meier, Norbert Juffa, Frederick D. Weber, Stuart F. Oberman

Apparatus and method for executing floating-point store instructions in a microprocessor

Patent number: 6408379

Abstract: An apparatus and method for executing floating-point store instructions in a microprocessor are provided. If store data of a floating-point store instruction corresponds to a tiny number and an underflow exception is masked, then a trap routine can be executed to generate corrected store data and complete the store operation. In response to detecting that store data corresponds to a tiny number and the underflow exception is masked, the store data, store address information, and opcode information can be stored prior to initiating the trap routine. The trap routine can be configured to access the store data, store address information, and opcode information. The trap routine can be configured to generate corrected store data and complete the store operation using the store data, store address information, and opcode information.

Type: Grant

Filed: June 10, 1999

Date of Patent: June 18, 2002

Assignee: Advanced Micro Devices, Inc.

Inventors: Norbert Juffa, Stephan Meier, Stuart Oberman, Scott White

Rapid execution of floating point load control word instructions

Patent number: 6405305

Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out of program order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instructions that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out of program order context are also disclosed.

Type: Grant

Filed: September 10, 1999

Date of Patent: June 11, 2002

Assignee: Advanced Micro Devices, Inc.

Inventors: Stephan G. Meier, Jeffrey E. Trull, Derrick R. Meyer, Norbert Juffa
