Patents by Inventor Brett W. Coon
Brett W. Coon has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20140019724Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.Type: ApplicationFiled: September 12, 2013Publication date: January 16, 2014Applicant: NVIDIA CorporationInventors: Brian FAHS, Ming Y. SIU, Brett W. COON, John R. NICKOLLS, Lars NYLAND
-
Patent number: 8595425Abstract: One embodiment of the present invention sets forth a technique for providing a L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces enabling the L1 cache may replace dedicated buffers, caches, and FIFOs in previous architectures. A “direct mapped” storage region that is configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.Type: GrantFiled: September 25, 2009Date of Patent: November 26, 2013Assignee: NVIDIA CorporationInventors: Alexander L. Minkin, Steven James Heinrich, RaJeshwaran Selvanesan, Brett W. Coon, Charles McCarver, Anjana Rajendran, Stewart G. Carlton
-
Patent number: 8572355Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.Type: GrantFiled: September 13, 2010Date of Patent: October 29, 2013Assignee: Nvidia CorporationInventors: Guillermo Juan Rozas, Brett W. Coon
-
Patent number: 8539204Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.Type: GrantFiled: September 24, 2010Date of Patent: September 17, 2013Assignee: Nvidia CorporationInventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
-
Patent number: 8522000Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.Type: GrantFiled: September 29, 2009Date of Patent: August 27, 2013Assignee: Nvidia CorporationInventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
-
Patent number: 8405665Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.Type: GrantFiled: May 7, 2012Date of Patent: March 26, 2013Assignee: Nvidia CorporationInventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
-
Patent number: 8375176Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.Type: GrantFiled: October 18, 2011Date of Patent: February 12, 2013Assignee: NVIDIA CorporationInventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
-
Patent number: 8327123Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.Type: GrantFiled: March 23, 2011Date of Patent: December 4, 2012Assignee: NVIDIA CorporationInventors: Norbert Juffa, Brett W. Coon
-
Patent number: 8312254Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.Type: GrantFiled: March 24, 2008Date of Patent: November 13, 2012Assignee: NVIDIA CorporationInventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
-
Patent number: 8271763Abstract: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.Type: GrantFiled: September 25, 2009Date of Patent: September 18, 2012Assignee: NVIDIA CorporationInventors: John R. Nickolls, Brett W. Coon, Ian A. Buck, Robert Steven Glanville
-
Publication number: 20120218267Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.Type: ApplicationFiled: May 7, 2012Publication date: August 30, 2012Inventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
-
Publication number: 20120221808Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.Type: ApplicationFiled: May 7, 2012Publication date: August 30, 2012Applicant: NVIDIA CorporationInventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
-
Patent number: 8253748Abstract: One embodiment of a system for collecting performance data for a multithreaded processing unit includes a plurality of independent performance registers, each configured to count hardware-based and/or software-based events. Functional blocks within the multithreaded processing unit are configured to generate various event signals, and subsets of the events are selected and used to generate one or more functions, each of which increments one of the performance registers. By accessing the contents of the performance registers, a user may observe and characterize the behavior of the different functional blocks within the multithreaded processing unit when one or more threads are executed within the processing unit. The contents of the performance registers may also be used to modify the behavior of the program running on the multithreaded processing unit, to modify a global performance register or to trigger an interrupt.Type: GrantFiled: November 29, 2005Date of Patent: August 28, 2012Assignee: NVIDIA CorporationInventors: Roger L. Allen, Brett W. Coon
-
Patent number: 8237705Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.Type: GrantFiled: October 10, 2011Date of Patent: August 7, 2012Assignee: NVIDIA CorporationInventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
-
Patent number: 8225076Abstract: A scoreboard memory for a processing unit has separate memory regions allocated to each of the multiple threads to be processed. For each thread, the scoreboard memory stores register identifiers of registers that have pending writes. When an instruction is added to an instruction buffer, the register identifiers of the registers specified in the instruction are compared with the register identifiers stored in the scoreboard memory for that instruction's thread, and a multi-bit value representing the comparison result is generated. The multi-bit value is stored with the instruction in the instruction buffer and may be updated as instructions belonging to the same thread complete their execution. Before the instruction is issued for execution, this multi-bit value is checked. If this multi-bit value indicates that none of the registers specified in the instruction have pending writes, the instruction is allowed to issue for execution.Type: GrantFiled: September 18, 2008Date of Patent: July 17, 2012Assignee: NVIDIA CorporationInventors: Brett W. Coon, Peter C. Mills, Stuart F. Oberman, Ming Y. Siu
-
Patent number: 8174531Abstract: A processing unit includes multiple execution pipelines, each of which is coupled to a first input section for receiving input data for pixel processing and a second input section for receiving input data for vertex processing and to a first output section for storing processed pixel data and a second output section for storing processed vertex data. The processed vertex data is rasterized and scan converted into pixel data that is used as the input data for pixel processing. The processed pixel data is output to a raster analyzer.Type: GrantFiled: December 29, 2009Date of Patent: May 8, 2012Assignee: NVIDIA CorporationInventors: John Erik Lindholm, Brett W. Coon, Stuart F. Oberman, Ming Y. Siu, Matthew P. Gerlach
-
Patent number: 8176265Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.Type: GrantFiled: June 21, 2011Date of Patent: May 8, 2012Assignee: NVIDIA CorporationInventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
-
Publication number: 20120110586Abstract: A parallel thread processor executes thread groups belonging to multiple cooperative thread arrays (CTAs). At each cycle of the parallel thread processor, an instruction scheduler selects a thread group to be issued for execution during a subsequent cycle. The instruction scheduler selects a thread group to issue for execution by (i) identifying a pool of available thread groups, (ii) identifying a CTA that has the greatest seniority value, and (iii) selecting the thread group that has the greatest credit value from within the CTA with the greatest seniority value.Type: ApplicationFiled: September 28, 2011Publication date: May 3, 2012Inventors: Brett W. Coon, John R. Nickolls, John Erik Lindholm, Robert J. Stoll, Nicholas Wang, Jack Hilaire Choquette, Kathleen Elliott Nickolls
-
Patent number: 8159496Abstract: Methods and apparatus for subdividing a shader program into regions or “phases” of instructions identifiable by phase identifiers (IDs) inserted into the shader program are provided. The phase IDs may be used to constrain execution of the shader program to prohibit texture fetches in later phases from being executed before a texture fetch in a current phase has completed. Other operations (e.g., math operations) within the current phase, however, may be allowed to execute while waiting for the current phase texture fetch to complete.Type: GrantFiled: June 1, 2009Date of Patent: April 17, 2012Assignee: NVIDIA CorporationInventors: John Erik Lindholm, Brett W. Coon, Gary M Tarolli
-
Publication number: 20120089792Abstract: One embodiment of the present invention sets forth a technique providing an optimized way to allocate and access memory across a plurality of thread/data lanes. Specifically, the device driver receives an instruction targeted to a memory set up as an array of structures of arrays. The device driver computes an address within the memory using information about the number of thread/data lanes and parameters from the instruction itself. The result is a memory allocation and access approach where the device driver properly computes the target address in the memory. Advantageously, processing efficiency is improved where memory in a parallel processing subsystem is internally stored and accessed as an array of structures of arrays, proportional to the SIMT/SIMD group width (the number of threads or lanes per execution group).Type: ApplicationFiled: September 28, 2011Publication date: April 12, 2012Inventors: Brian FAHS, John R. Nickolls, Kathleen Elliott Nickolls, Henry Packard Moreton, Brett W. Coon