Patents by Inventor John R. Nickolls

John R. Nickolls has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 7861060
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: December 28, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
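The CTA model above can be sketched in a few lines of Python (an illustrative simulation, not the patented hardware): each simulated thread uses its unique thread ID to pick the portion of the input it processes and the portion of the output it produces.

```python
def cta_run(program, input_data, num_threads):
    """Simulate a cooperative thread array: each thread's unique ID
    selects which portion of the input data set it processes and which
    portion of the output data set it produces."""
    output = [None] * len(input_data)
    for thread_id in range(num_threads):   # concurrent on real hardware
        # The thread ID strides through the input: thread t handles
        # elements t, t + num_threads, t + 2*num_threads, ...
        for i in range(thread_id, len(input_data), num_threads):
            output[i] = program(input_data[i])
    return output

# Four simulated threads square an 8-element data set.
result = cta_run(lambda x: x * x, list(range(8)), num_threads=4)
```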
  • Patent number: 7836116
    Abstract: A linear transform such as a Fast Fourier Transform (FFT) is performed on an input data set having a number of points using one or more arrays of concurrent threads that are capable of sharing data with each other. Each thread of one thread array reads two or more of the points, performs an appropriate “butterfly” calculation to generate two or more new points, then stores the new points in a memory location that is accessible to other threads of the array. Each thread determines which points it is to read based at least in part on a unique thread identifier assigned thereto. Multiple transform stages can be handled by a single thread array, or different stages can be handled by different thread arrays.
    Type: Grant
    Filed: June 15, 2006
    Date of Patent: November 16, 2010
    Assignee: NVIDIA Corporation
    Inventors: Nolan D. Goodnight, John R. Nickolls, Radoslav Danilak
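The butterfly scheme above can be illustrated with a small iterative FFT in Python. The thread-to-point mapping below is a textbook radix-2 decimation-in-time arrangement chosen for illustration; the names and exact layout are not taken from the patent.

```python
import cmath

def fft_stage(data, stage, num_threads):
    """One radix-2 butterfly stage: each simulated thread derives the two
    points it reads from its thread ID, combines them, and stores the two
    new points back where other threads can read them."""
    half = 1 << stage                 # distance between butterfly partners
    for tid in range(num_threads):    # these run concurrently on hardware
        block = (tid // half) * (half * 2)
        j = tid % half                # position within the butterfly block
        i0, i1 = block + j, block + j + half
        w = cmath.exp(-2j * cmath.pi * j / (half * 2))   # twiddle factor
        a, b = data[i0], data[i1]
        data[i0], data[i1] = a + w * b, a - w * b        # the "butterfly"

def fft(x):
    """Iterative decimation-in-time FFT: bit-reverse the input order, then
    run log2(n) butterfly stages with n/2 simulated threads each."""
    n = len(x)
    bits = n.bit_length() - 1
    data = [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]
    for stage in range(bits):
        fft_stage(data, stage, n // 2)
    return data
```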
  • Patent number: 7809928
    Abstract: One embodiment of an instruction decoder includes an instruction parser configured to process a first non-operative instruction and to generate a first event signal corresponding to the first non-operative instruction, and a first event multiplexer configured to receive the first event signal from the instruction parser, to select the first event signal from one or more event signals and to transmit the first event signal to an event logic block. The instruction decoder may be implemented in a multithreaded processing unit, such as a shader unit, and the occurrences of the first event signal may be tracked when one or more threads are executed within the processing unit. The resulting event signal count may provide a designer with a better understanding of the behavior of a program, such as a shader program, executed within the processing unit, thereby facilitating overall processing unit and program design.
    Type: Grant
    Filed: December 20, 2005
    Date of Patent: October 5, 2010
    Assignee: NVIDIA Corporation
    Inventors: Roger L. Allen, Brett W. Coon, Ian A. Buck, John R. Nickolls
  • Patent number: 7788468
    Abstract: A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: August 31, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew, Brett W. Coon, Peter C. Mills
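Python's `threading.Barrier` can stand in for the patented hardware barrier to show the idea: threads publish intermediate results, all wait at the barrier, and only then read each other's data.

```python
import threading

def cta_with_barrier(num_threads):
    """Each thread writes an intermediate result to shared storage, waits
    at the barrier until all threads arrive, then reads every thread's
    contribution -- safe only because the barrier ordered the two phases."""
    shared = [0] * num_threads
    barrier = threading.Barrier(num_threads)
    totals = [0] * num_threads

    def thread_body(tid):
        shared[tid] = tid + 1      # phase 1: publish intermediate result
        barrier.wait()             # suspend until the other threads arrive
        totals[tid] = sum(shared)  # phase 2: all writes are now visible

    workers = [threading.Thread(target=thread_body, args=(t,))
               for t in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return totals
```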
  • Patent number: 7761697
    Abstract: One embodiment of a computing system configured to manage divergent threads in a thread group includes a stack configured to store at least one token and a multithreaded processing unit. The multithreaded processing unit is configured to perform the steps of fetching a program instruction, determining that the program instruction is an indirect branch instruction, and processing the indirect branch instruction as a sequence of two-way branches to execute an indirect branch instruction with multiple branch addresses. Indirect branch instructions allow greater flexibility because the branch address, or multiple branch addresses, need not be determined at compile time.
    Type: Grant
    Filed: November 6, 2006
    Date of Patent: July 20, 2010
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John Erik Lindholm, Peter C. Mills, John R. Nickolls
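A minimal sketch of reducing a multi-way indirect branch to a sequence of two-way branches (function names hypothetical; the real hardware does this with an execution mask and a token stack rather than Python sets):

```python
def execute_indirect_branch(thread_targets, handlers):
    """Execute one indirect branch for a group of threads with possibly
    different branch addresses, as a sequence of two-way branches: pick a
    pending address, run the threads that take it, and defer the rest."""
    order = []                                  # (address, thread ids) issued
    pending = set(range(len(thread_targets)))   # threads yet to branch
    while pending:
        addr = thread_targets[min(pending)]     # next unresolved address
        taken = sorted(t for t in pending if thread_targets[t] == addr)
        for t in taken:
            handlers[addr](t)                   # run this target's code
        order.append((addr, taken))
        pending.difference_update(taken)        # two-way split: taken/deferred
    return order

log = []
handlers = {10: lambda t: log.append(("f", t)),
            20: lambda t: log.append(("g", t))}
order = execute_indirect_branch([10, 20, 10, 20], handlers)
```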
  • Patent number: 7711990
    Abstract: A system includes a graphics processing unit with a processor responsive to a debug instruction that initiates the storage of execution state information. A memory stores the execution state information. A central processing unit executes a debugging program to analyze the execution state information.
    Type: Grant
    Filed: December 13, 2005
    Date of Patent: May 4, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Roger L. Allen, Brian K. Cabral, Brett W. Coon, Robert C. Keller
  • Patent number: 7680988
    Abstract: A shared memory is usable by concurrent threads in a multithreaded processor, with any addressable storage location in the shared memory being readable and writeable by any of the threads. Processing engines that execute the threads are coupled to the shared memory via an interconnect that transfers data in only one direction (e.g., from the shared memory to the processing engines); the same interconnect supports both read and write operations. The interconnect advantageously supports multiple parallel read or write operations.
    Type: Grant
    Filed: October 30, 2006
    Date of Patent: March 16, 2010
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Brett W. Coon, Ming Y. Siu, Stuart F. Oberman, Samuel Liu
  • Patent number: 7640284
    Abstract: Parallelism in a processor is exploited to permute a data set based on bit reversal of indices associated with data points in the data set. Permuted data can be stored in a memory having entries arranged in banks, where entries in different banks can be accessed in parallel. A destination location in the memory for a particular data point from the data set is determined based on the bit-reversed index associated with that data point. The bit-reversed index can be further modified so that at least some of the destination locations determined by different parallel processes are in different banks, allowing multiple points of the bit-reversed data set to be written in parallel.
    Type: Grant
    Filed: June 15, 2006
    Date of Patent: December 29, 2009
    Assignee: NVIDIA Corporation
    Inventors: Nolan D. Goodnight, John R. Nickolls
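The idea can be sketched in Python. The skew formula below (adding the row number before taking the bank) is one common way to spread bit-reversed indices across banks, offered as an illustration; the patent's exact modification may differ.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i (e.g. 0b001 -> 0b100)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def permuted_locations(n, num_banks):
    """Destination (bank, row) for each of n points. The bank is skewed by
    the row number so that indices written by parallel threads land in
    different banks (illustrative skew; the patent's may differ)."""
    bits = n.bit_length() - 1
    locs = []
    for i in range(n):
        j = bit_reverse(i, bits)          # bit-reversed destination index
        row = j // num_banks
        bank = (j + row) % num_banks      # skew to break bank conflicts
        locs.append((bank, row))
    return locs

# Without the skew, threads 0..3 writing points 0..3 (reversed indices
# 0, 4, 2, 6) would hit banks 0, 0, 2, 2 -- two conflicts per write.
locs = permuted_locations(8, num_banks=4)
```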
  • Patent number: 7634621
    Abstract: Circuits, methods, and apparatus that provide the die area and power savings of a single-ported memory with the performance advantages of a multiported memory. One example provides register allocation methods for storing data in a multiple-bank register file. In a thin register allocation method, data for a process is stored in a single bank. In this way, different processes use different banks to avoid conflicts. In a fat register allocation method, processes store data in each bank. In this way, if one process uses a large number of registers, those registers are spread among the banks, avoiding a situation where one bank is filled and other processes are forced to share a reduced number of banks. In a hybrid register allocation method, processes store data in more than one bank, but fewer than all the banks. Each of these methods may be combined in varying ways.
    Type: Grant
    Filed: November 3, 2006
    Date of Patent: December 15, 2009
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John Erik Lindholm, Gary Tarolli, Svetoslav D. Tzvetkov, John R. Nickolls, Ming Y. Siu
  • Patent number: 7627723
    Abstract: Methods, apparatuses, and systems are presented for updating data in memory while executing multiple threads of instructions. A single instruction is received from one of a plurality of concurrently executing threads of instructions. In response to that single instruction, data is read from a specific memory location, an operation involving the data is performed to generate a result, and the result is stored to the same memory location, without requiring separate load and store instructions. Also in response to the single instruction, another one of the plurality of threads is precluded from altering data at the specific memory location while the data is read, the operation is performed, and the result is stored.
    Type: Grant
    Filed: September 21, 2006
    Date of Patent: December 1, 2009
    Assignee: NVIDIA Corporation
    Inventors: Ian A. Buck, John R. Nickolls, Michael C. Shebanow, Lars S. Nyland
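In Python, a lock around the read-modify-write models the single atomic instruction; on the patented hardware the mutual exclusion is provided by the memory pipeline itself rather than a software lock.

```python
import threading

class AtomicCell:
    """Sketch of a single-instruction read-modify-write: the lock stands
    in for the hardware precluding other threads from touching the
    location between the read, the operation, and the store."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_op(self, op):
        with self._lock:             # other threads are held off here
            old = self._value        # read
            self._value = op(old)    # modify + store, one indivisible step
            return old               # RMW ops typically return the old value

cell = AtomicCell(0)
workers = [threading.Thread(
               target=lambda: [cell.atomic_op(lambda v: v + 1)
                               for _ in range(1000)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# cell._value is exactly 4000: no increments are lost.
```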
  • Patent number: 7600155
    Abstract: A system has a graphics processing unit with a processor to monitor selected criteria and circuitry to initiate the storage of execution state information when the selected criteria reaches a specified state. A memory stores execution state information. A central processing unit executes a debugging program to analyze the execution state information.
    Type: Grant
    Filed: December 13, 2005
    Date of Patent: October 6, 2009
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Roger L. Allen, Brian K. Cabral, Brett W. Coon, Robert C. Keller
  • Publication number: 20090240931
    Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.
    Type: Application
    Filed: March 24, 2008
    Publication date: September 24, 2009
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
  • Publication number: 20090240895
    Abstract: One embodiment of the present invention sets forth a technique for efficiently and flexibly performing coalesced memory accesses for a thread group. For each application read request that services a thread group, the core interface generates one pending request table (PRT) entry and one or more memory access requests. The core interface determines the number of memory access requests and the size of each memory access request based on the spread of the memory access addresses in the application request. Each memory access request specifies the particular threads that the memory access request services. The PRT entry tracks the number of pending memory access requests. As the memory interface completes each memory access request, the core interface uses information in the memory access request and the corresponding PRT entry to route the returned data.
    Type: Application
    Filed: March 24, 2008
    Publication date: September 24, 2009
    Inventors: Lars Nyland, John R. Nickolls, Gentaro Hirota, Tanmoy Mandal
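A sketch of the address-spread calculation: threads whose addresses fall in the same aligned segment share one memory access request, and the number of resulting requests is what a pending request table (PRT) entry would track. The segment size and names here are illustrative assumptions.

```python
def coalesce(addresses, segment_bytes=128):
    """Group one thread group's byte addresses into memory access
    requests: addresses in the same aligned segment share a request, and
    each request records which threads it services."""
    requests = {}                                   # segment base -> threads
    for tid, addr in enumerate(addresses):
        base = addr - (addr % segment_bytes)        # aligned segment start
        requests.setdefault(base, []).append(tid)
    # A PRT entry would track len(requests) outstanding accesses and use
    # the per-request thread lists to route the returned data.
    return sorted(requests.items())

tight = coalesce([0, 4, 8, 12])       # one segment -> one request
spread = coalesce([0, 256, 4, 260])   # two segments -> two requests
```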
  • Publication number: 20090240860
    Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
    Type: Application
    Filed: March 24, 2008
    Publication date: September 24, 2009
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
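The lock-with-immediate-feedback idea can be modeled as a load that returns both data and lock status in one transaction; the internal mutex below merely stands in for the memory controller's serialization and is not part of the patented scheme.

```python
import threading

class LockingMemory:
    """Sketch: a load can atomically acquire a per-word lock, and a store
    can release it. The load returns (data, lock_acquired) together, so
    no separate status-check transaction is needed."""
    def __init__(self, size):
        self._data = [0] * size
        self._locked = [False] * size
        self._mutex = threading.Lock()   # stands in for hardware serialization

    def load_lock(self, addr):
        with self._mutex:
            acquired = not self._locked[addr]
            if acquired:
                self._locked[addr] = True
            return self._data[addr], acquired   # data + immediate status

    def store_unlock(self, addr, value):
        with self._mutex:
            self._data[addr] = value
            self._locked[addr] = False          # unlock during the write

mem = LockingMemory(4)
value, ok = mem.load_lock(0)     # ok is True: lock acquired with the read
_, retry = mem.load_lock(0)      # retry is False: already locked, try later
mem.store_unlock(0, value + 1)   # write back and release in one transaction
```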
  • Patent number: 7584342
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time and that controls various aspects of the thread's processing behavior, such as the portion of the input data set to be processed by each thread, the portion of the output data set to be produced by each thread, and/or sharing of intermediate results among threads. Where groups of threads are executed in SIMD parallelism, thread IDs for threads in the same SIMD group are generated and assigned in parallel, allowing different SIMD groups to be launched in rapid succession.
    Type: Grant
    Filed: December 15, 2005
    Date of Patent: September 1, 2009
    Assignee: NVIDIA Corporation
    Inventors: Bryon S. Nordquist, John R. Nickolls, Luis I. Bacayo
  • Patent number: 7526634
    Abstract: Systems and methods for synchronizing processing work performed by threads, cooperative thread arrays (CTAs), or “sets” of CTAs. A central processing unit can load launch commands for a first set of CTAs and a second set of CTAs in a pushbuffer, and specify a dependency of the second set upon completion of execution of the first set. A parallel or graphics processor (GPU) can autonomously execute the first set of CTAs and delay execution of the second set of CTAs until the first set of CTAs is complete. In some embodiments the GPU may determine that a third set of CTAs is not dependent upon the first set, and may launch the third set of CTAs while the second set of CTAs is delayed. In this manner, the GPU may execute launch commands out of order with respect to the order of the launch commands in the pushbuffer.
    Type: Grant
    Filed: September 27, 2006
    Date of Patent: April 28, 2009
    Assignee: NVIDIA Corporation
    Inventors: Jerome F. Duluk, Jr., Stephen D. Lew, John R. Nickolls
  • Patent number: 7477091
    Abstract: Circuits, methods, and apparatus for using redundant circuitry on integrated circuits in order to increase manufacturing yields. One exemplary embodiment of the present invention provides a circuit configuration wherein functional circuit blocks in a group of circuit blocks are selected by multiplexers. Multiplexers at the input and output of the group of circuit blocks steer input and output signals to and from functional circuit blocks, avoiding circuit blocks found to be defective or nonfunctional. Multiple groups of these circuit blocks may be arranged in series and in parallel. Alternate multiplexer configurations may be used in order to provide a higher level of redundancy. Other embodiments use all functional circuit blocks and sort integrated circuits based on the level of functionality or performance. Other embodiments provide methods of testing integrated circuits having one or more of these circuit configurations.
    Type: Grant
    Filed: April 12, 2005
    Date of Patent: January 13, 2009
    Assignee: NVIDIA Corporation
    Inventor: John R. Nickolls
  • Patent number: 7456835
    Abstract: A graphics processing unit can queue a large number of texture requests to balance out the variability of texture requests without the need for a large texture request buffer. A dedicated texture request buffer queues the relatively small texture commands and parameters. Additionally, for each queued texture command, an associated set of texture arguments, which are typically much larger than the texture command, are stored in a general purpose register. The texture unit retrieves texture commands from the texture request buffer and then fetches the associated texture arguments from the appropriate general purpose register. The texture arguments may be stored in the general purpose register designated as the destination of the final texture value computed by the texture unit. Because the destination register must be allocated for the final texture value as texture commands are queued, storing the texture arguments in this register does not consume any additional registers.
    Type: Grant
    Filed: January 25, 2006
    Date of Patent: November 25, 2008
    Assignee: NVIDIA Corporation
    Inventors: John Erik Lindholm, John R. Nickolls, Simon S. Moy, Brett W. Coon
  • Publication number: 20080184211
    Abstract: A virtual architecture and instruction set support explicit parallel-thread computing. The virtual architecture defines a virtual processor that supports concurrent execution of multiple virtual threads with multiple levels of data sharing and coordination (e.g., synchronization) between different virtual threads, as well as a virtual execution driver that controls the virtual processor. A virtual instruction set architecture for the virtual processor is used to define behavior of a virtual thread and includes instructions related to parallel thread behavior, e.g., data sharing and synchronization. Using the virtual platform, programmers can develop application programs in which virtual threads execute concurrently to process data; virtual translators and drivers adapt the application code to particular hardware on which it is to execute, transparently to the programmer.
    Type: Application
    Filed: January 26, 2007
    Publication date: July 31, 2008
    Applicant: NVIDIA Corporation
    Inventors: John R. Nickolls, Henry P. Moreton, Lars S. Nyland, Ian A. Buck, Richard C. Johnson, Robert S. Glanville, Jayant B. Kolhe
  • Patent number: 7403964
    Abstract: A Galois field multiplier array includes a 1st register, a 2nd register, a 3rd register, and a plurality of multiplier cells. The 1st register stores bits of a 1st operand. The 2nd register stores bits of a 2nd operand. The 3rd register stores bits of a generating polynomial that corresponds to one of a plurality of applications (e.g., FEC, CRC, Reed Solomon, et cetera). The plurality of multiplier cells is arranged in rows and columns. Each of the multiplier cells outputs a sum and a product and each cell includes five inputs. The 1st input receives a preceding cell's multiply output, the 2nd input receives at least one bit of the 2nd operand, the 3rd input receives a preceding cell's sum output, a 4th input receives at least one bit of the generating polynomial, and the 5th input receives a feedback term from a preceding cell in a preceding row. The multiplier cells in the 1st row have the 1st input, 3rd input, and 5th input set to corresponding initialization values in accordance with the 2nd operand.
    Type: Grant
    Filed: June 12, 2003
    Date of Patent: July 22, 2008
    Assignee: Broadcom Corporation
    Inventors: Joshua Porten, Won Kim, Scott D. Johnson, John R. Nickolls
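The cell-by-cell hardware array computes a carry-less multiply reduced by the generating polynomial. A bitwise Python equivalent of that operation (using the AES polynomial 0x11B as an example generating polynomial) is:

```python
def gf_multiply(a, b, poly, width=8):
    """GF(2^w) multiply-and-reduce: XOR-accumulate shifted partial
    products (XOR is addition in GF(2)), then fold in the generating
    polynomial wherever the running product overflows the field."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:
            product ^= a << i            # carry-less partial product
    # Reduce modulo the generating polynomial, high bit down.
    for bit in range(2 * width - 2, width - 1, -1):
        if (product >> bit) & 1:
            product ^= poly << (bit - width)
    return product

# AES field GF(2^8): generating polynomial x^8 + x^4 + x^3 + x + 1 = 0x11B.
# {53} and {CA} are multiplicative inverses in this field.
inv = gf_multiply(0x53, 0xCA, 0x11B)
```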