Patents by Inventor Brett W. Coon

Brett W. Coon has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

COOPERATIVE THREAD ARRAY REDUCTION AND SCAN OPERATIONS

Publication number: 20140019724

Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.

Type: Application

Filed: September 12, 2013

Publication date: January 16, 2014

Applicant: NVIDIA Corporation

Inventors: Brian FAHS, Ming Y. SIU, Brett W. COON, John R. NICKOLLS, Lars NYLAND
Configurable cache for multiple clients

Patent number: 8595425

Abstract: One embodiment of the present invention sets forth a technique for providing a L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces enabling the L1 cache may replace dedicated buffers, caches, and FIFOs in previous architectures. A “direct mapped” storage region that is configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.

Type: Grant

Filed: September 25, 2009

Date of Patent: November 26, 2013

Assignee: NVIDIA Corporation

Inventors: Alexander L. Minkin, Steven James Heinrich, RaJeshwaran Selvanesan, Brett W. Coon, Charles McCarver, Anjana Rajendran, Stewart G. Carlton
Support for non-local returns in parallel thread SIMD engine

Patent number: 8572355

Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.

Type: Grant

Filed: September 13, 2010

Date of Patent: October 29, 2013

Assignee: Nvidia Corporation

Inventors: Guillermo Juan Rozas, Brett W. Coon
Cooperative thread array reduction and scan operations

Patent number: 8539204

Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.

Type: Grant

Filed: September 24, 2010

Date of Patent: September 17, 2013

Assignee: Nvidia Corporation

Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
Trap handler architecture for a parallel processing unit

Patent number: 8522000

Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.

Type: Grant

Filed: September 29, 2009

Date of Patent: August 27, 2013

Assignee: Nvidia Corporation

Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
Programmable graphics processor for multithreaded execution of programs

Patent number: 8405665

Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.

Type: Grant

Filed: May 7, 2012

Date of Patent: March 26, 2013

Assignee: Nvidia Corporation

Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
Lock mechanism to enable atomic updates to shared memory

Patent number: 8375176

Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.

Type: Grant

Filed: October 18, 2011

Date of Patent: February 12, 2013

Assignee: NVIDIA Corporation

Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
Maximized memory throughput on parallel processing devices

Patent number: 8327123

Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.

Type: Grant

Filed: March 23, 2011

Date of Patent: December 4, 2012

Assignee: NVIDIA Corporation

Inventors: Norbert Juffa, Brett W. Coon
Indirect function call instructions in a synchronous parallel thread processor

Patent number: 8312254

Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.

Type: Grant

Filed: March 24, 2008

Date of Patent: November 13, 2012

Assignee: NVIDIA Corporation

Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
Unified addressing and instructions for accessing parallel memory spaces

Patent number: 8271763

Abstract: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.

Type: Grant

Filed: September 25, 2009

Date of Patent: September 18, 2012

Assignee: NVIDIA Corporation

Inventors: John R. Nickolls, Brett W. Coon, Ian A. Buck, Robert Steven Glanville
PROGRAMMABLE GRAPHICS PROCESSOR FOR MULTITHREADED EXECUTION OF PROGRAMS

Publication number: 20120218267

Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.

Type: Application

Filed: May 7, 2012

Publication date: August 30, 2012

Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
SHARED SINGLE-ACCESS MEMORY WITH MANAGEMENT OF MULTIPLE PARALLEL REQUESTS

Publication number: 20120221808

Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.

Type: Application

Filed: May 7, 2012

Publication date: August 30, 2012

Applicant: NVIDIA Corporation

Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
Shader performance registers

Patent number: 8253748

Abstract: One embodiment of a system for collecting performance data for a multithreaded processing unit includes a plurality of independent performance registers, each configured to count hardware-based and/or software-based events. Functional blocks within the multithreaded processing unit are configured to generate various event signals, and subsets of the events are selected and used to generate one or more functions, each of which increments one of the performance registers. By accessing the contents of the performance registers, a user may observe and characterize the behavior of the different functional blocks within the multithreaded processing unit when one or more threads are executed within the processing unit. The contents of the performance registers may also be used to modify the behavior of the program running on the multithreaded processing unit, to modify a global performance register or to trigger an interrupt.

Type: Grant

Filed: November 29, 2005

Date of Patent: August 28, 2012

Assignee: NVIDIA Corporation

Inventors: Roger L. Allen, Brett W. Coon
Hierarchical processor array

Patent number: 8237705

Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.

Type: Grant

Filed: October 10, 2011

Date of Patent: August 7, 2012

Assignee: NVIDIA Corporation

Inventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
Scoreboard having size indicators for tracking sequential destination register usage in a multi-threaded processor

Patent number: 8225076

Abstract: A scoreboard memory for a processing unit has separate memory regions allocated to each of the multiple threads to be processed. For each thread, the scoreboard memory stores register identifiers of registers that have pending writes. When an instruction is added to an instruction buffer, the register identifiers of the registers specified in the instruction are compared with the register identifiers stored in the scoreboard memory for that instruction's thread, and a multi-bit value representing the comparison result is generated. The multi-bit value is stored with the instruction in the instruction buffer and may be updated as instructions belonging to the same thread complete their execution. Before the instruction is issued for execution, this multi-bit value is checked. If this multi-bit value indicates that none of the registers specified in the instruction have pending writes, the instruction is allowed to issue for execution.

Type: Grant

Filed: September 18, 2008

Date of Patent: July 17, 2012

Assignee: NVIDIA Corporation

Inventors: Brett W. Coon, Peter C. Mills, Stuart F. Oberman, Ming Y. Siu
Programmable graphics processor for multithreaded execution of programs

Patent number: 8174531

Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.

Type: Grant

Filed: December 29, 2009

Date of Patent: May 8, 2012

Assignee: NVIDIA Corporation

Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
Shared single-access memory with management of multiple parallel requests

Patent number: 8176265

Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.

Type: Grant

Filed: June 21, 2011

Date of Patent: May 8, 2012

Assignee: NVIDIA Corporation

Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
THREAD GROUP SCHEDULER FOR COMPUTING ON A PARALLEL THREAD PROCESSOR

Publication number: 20120110586

Abstract: A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.

Type: Application

Filed: September 28, 2011

Publication date: May 3, 2012

Inventors: Brett W. Coon, John R. Nickolls, John Erik Lindholm, Robert J. Stoll, Nicholas Wang, Jack Hilaire Choquette, Kathleen Elliott Nickolls
Subdividing a shader program

Patent number: 8159496

Abstract: Methods and apparatus for subdividing a shader program into regions or “phases” of instructions identifiable by phase identifiers (IDs) inserted into the shader program are provided. The phase IDs may be used to constrain execution of the shader program to prohibit texture fetches in later phases from being executed before a texture fetch in a current phase has completed. Other operations (e.g., math operations) within the current phase, however, may be allowed to execute while waiting for the current phase texture fetch to complete.

Type: Grant

Filed: June 1, 2009

Date of Patent: April 17, 2012

Assignee: NVIDIA Corporation

Inventors: John Erik Lindholm, Brett W. Coon, Gary M Tarolli
EFFICIENT IMPLEMENTATION OF ARRAYS OF STRUCTURES ON SIMT AND SIMD ARCHITECTURES

Publication number: 20120089792

Abstract: One embodiment of the present invention sets forth a technique providing an optimized way to allocate and access memory across a plurality of thread/data lanes. Specifically, the device driver receives an instruction targeted to a memory set up as an array of structures of arrays. The device driver computes an address within the memory using information about the number of thread/data lanes and parameters from the instruction itself. The result is a memory allocation and access approach where the device driver properly computes the target address in the memory. Advantageously, processing efficiency is improved where memory in a parallel processing subsystem is internally stored and accessed as an array of structures of arrays, proportional to the SIMT/SIMD group width (the number of threads or lanes per execution group).

Type: Application

Filed: September 28, 2011

Publication date: April 12, 2012

Inventors: Brian FAHS, John R. Nickolls, Kathleen Elliott Nickolls, Henry Packard Moreton, Brett W. Coon

prev 1 2 3 4 5 next