Patents by Inventor Norbert Juffa

Norbert Juffa has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 9304739
    Abstract: Embodiments of the present invention set forth a technique for optimizing the performance and efficiency of complex, software-based computations, such as lighting computations. Data entering a graphics application programming interface (API) in a conventional arithmetic representation, such as floating-point or fixed-point, is converted to an internal logarithmic representation for greater computational efficiency. Lighting computations are then performed using logarithmic space arithmetic routines that, on average, execute more efficiently than similar routines performed in a native floating-point format. The lighting computation results, represented as logarithmic space numbers, are converted back to floating-point numbers before being transmitted to a graphics processing unit (GPU) for further processing. Because of efficiencies of logarithmic space arithmetic, performance improvements may be realized relative to prior art approaches to performing software-based floating-point operations.
    Type: Grant
    Filed: December 11, 2006
    Date of Patent: April 5, 2016
    Assignee: NVIDIA Corporation
    Inventor: Norbert Juffa
  • Patent number: 8860741
    Abstract: In contrast to a conventional computing system in which the graphics processor (graphics processing unit or GPU) is treated as a slave to one or several CPUs, systems and methods are provided that allow the GPU to be treated as a central processing unit (CPU) from the perspective of the operating system. The GPU can access a memory space shared by other CPUs in the computing system. Caches utilized by the GPU may be coherent with caches utilized by other CPUs in the computing system. The GPU may share execution of general-purpose computations with other CPUs in the computing system.
    Type: Grant
    Filed: December 8, 2006
    Date of Patent: October 14, 2014
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Stuart F. Oberman
  • Patent number: 8589468
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Grant
    Filed: September 3, 2010
    Date of Patent: November 19, 2013
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Radoslav Danilak
  • Patent number: 8327123
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Grant
    Filed: March 23, 2011
    Date of Patent: December 4, 2012
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
  • Patent number: 8140608
    Abstract: One embodiment of the present invention sets forth a technique for performing fast integer division using commonly available arithmetic operations. The technique may be implemented in a two-stage process using a single-precision floating point reciprocal in conjunction with integer addition and multiplication. Furthermore, the technique may be fully pipelined on many conventional processors for performance that is comparable to the best available high-performance alternatives.
    Type: Grant
    Filed: May 31, 2007
    Date of Patent: March 20, 2012
    Assignee: NVIDIA Corporation
    Inventor: Norbert Juffa
  • Publication number: 20110173414
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Application
    Filed: March 23, 2011
    Publication date: July 14, 2011
    Applicant: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
  • Patent number: 7925860
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Grant
    Filed: May 14, 2007
    Date of Patent: April 12, 2011
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
  • Patent number: 7916864
    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.
    Type: Grant
    Filed: February 8, 2006
    Date of Patent: March 29, 2011
    Assignee: NVIDIA Corporation
    Inventor: Norbert Juffa
  • Patent number: 7912889
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Grant
    Filed: June 16, 2006
    Date of Patent: March 22, 2011
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Radoslav Danilak
  • Publication number: 20100325187
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Application
    Filed: September 3, 2010
    Publication date: December 23, 2010
    Inventors: Norbert Juffa, Radoslav Danilak
  • Patent number: 7836118
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Grant
    Filed: June 16, 2006
    Date of Patent: November 16, 2010
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Radoslav Danilak
  • Patent number: 7792895
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Grant
    Filed: June 16, 2006
    Date of Patent: September 7, 2010
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Radoslav Danilak
  • Patent number: 7506134
    Abstract: The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    Type: Grant
    Filed: June 16, 2006
    Date of Patent: March 17, 2009
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Radoslav Danilak
  • Publication number: 20070271325
    Abstract: Systems and methods for reducing the bandwidth needed to read the inputs to a matrix multiply operation may improve system performance. Rather than reading a row of a first input matrix and a column of a second input matrix to produce a column of a product matrix, a column of the first input matrix and a single element of the second input matrix are read to produce a column of partial dot products of the product matrix. Therefore, the number of input matrix elements read to produce each product matrix element is reduced from 2N to N+1, where N is the number of elements in a column of the product matrix.
    Type: Application
    Filed: May 8, 2006
    Publication date: November 22, 2007
    Inventors: Norbert Juffa, John Nickolls
  • Publication number: 20070198412
    Abstract: A graphics processing unit is programmed to carry out cryptographic processing so that fast, effective cryptographic processing solutions can be provided without incurring additional hardware costs. The graphics processing unit can efficiently carry out cryptographic processing because it has an architecture that is configured to handle a large number of parallel processes. The cryptographic processing carried out on the graphics processing unit can be further improved by configuring the graphics processing unit to be capable of both floating point and integer operations.
    Type: Application
    Filed: February 8, 2006
    Publication date: August 23, 2007
    Inventor: Norbert Juffa
  • Patent number: 6557098
    Abstract: An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operand.
    Type: Grant
    Filed: January 5, 2000
    Date of Patent: April 29, 2003
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Stuart Oberman, Norbert Juffa
  • Patent number: 6442677
    Abstract: An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.
    Type: Grant
    Filed: June 10, 1999
    Date of Patent: August 27, 2002
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Derrick R. Meyer, Stephan G. Meier, Norbert Juffa
  • Patent number: 6425074
    Abstract: A microprocessor configured to rapidly execute floating point store status word (FSTSW) type instructions that are immediately preceded by floating point compare (FCOM) type instructions is disclosed. FCOM-type instructions are modified to store their results to an architectural floating point status word and a temporary destination register. If an FSTSW-type instruction is detected immediately following an FCOM-type instruction, then the FSTSW-type instruction is transformed into a special fast floating point store status word (FSTSWEF) instruction. Unlike the FSTSW-type instruction, which is serializing and negatively impacts performance, the FSTSWEF instruction is not serializing and allows execution to continue without undue serialization. A computer system and method for rapidly executing FSTSW instructions immediately preceded by FCOM-type instructions are also disclosed.
    Type: Grant
    Filed: September 10, 1999
    Date of Patent: July 23, 2002
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Stephan G. Meier, Norbert Juffa, Frederick D. Weber, Stuart F. Oberman
  • Patent number: 6408379
    Abstract: An apparatus and method for executing floating-point store instructions in a microprocessor is provided. If store data of a floating-point store instruction corresponds to a tiny number and an underflow exception is masked, then a trap routine can be executed to generate corrected store data and complete the store operation. In response to detecting that store data corresponds to a tiny number and the underflow exception is masked, the store data, store address information, and opcode information can be stored prior to initiating the trap routine. The trap routine can be configured to access the store data, store address information, and opcode information. The trap routine can be configured to generate corrected store data and complete the store operation using the store data, store address information, and opcode information.
    Type: Grant
    Filed: June 10, 1999
    Date of Patent: June 18, 2002
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Norbert Juffa, Stephan Meier, Stuart Oberman, Scott White
  • Patent number: 6405305
    Abstract: A microprocessor with a floating point unit configured to rapidly execute floating point load control word (FLDCW) type instructions in an out of program order context is disclosed. The floating point unit is configured to schedule instructions older than the FLDCW-type instruction before the FLDCW-type instruction is scheduled. The FLDCW-type instruction acts as a barrier to prevent instructions occurring after the FLDCW-type instruction in program order from executing before the FLDCW-type instruction. Indicator bits may be used to simplify instruction scheduling, and copies of the floating point control word may be stored for instruction that have long execution cycles. A method and computer configured to rapidly execute FLDCW-type instructions in an out of program order context are also disclosed.
    Type: Grant
    Filed: September 10, 1999
    Date of Patent: June 11, 2002
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Stephan G. Meier, Jeffrey E. Trull, Derrick R. Meyer, Norbert Juffa