Patents by Inventor Brett W. Coon
Brett W. Coon has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20120036329Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.Type: ApplicationFiled: October 18, 2011Publication date: February 9, 2012Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
-
Publication number: 20120026175Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.Type: ApplicationFiled: October 10, 2011Publication date: February 2, 2012Applicant: NVIDIA CorporationInventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
-
Patent number: 8108625Abstract: Concurrent threads in a multithreaded processor share access to a memory, with any location in the shared memory being accessible by any thread. In one embodiment, the shared memory has multiple independently-addressable memory banks, and one location per bank can be accessed in parallel. Parallel processing engines executing the threads generate a group of parallel memory access requests. Address conflict logic determines whether the requests can be satisfied in parallel (e.g., based on bank access constraints) and serializes the requests to the extent needed to avoid conflicts. In some embodiments, data read from one address in the shared memory can be broadcast to multiple processing engines.Type: GrantFiled: October 30, 2006Date of Patent: January 31, 2012Assignee: NVIDIA CorporationInventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
-
Patent number: 8095829Abstract: A global processor operating mode is used select whether a processor stops processing when an error is detected or ignores the error and continues processing while overriding values as needed to recover from the error. When a soldier-on mode is enabled the system attempts to recover from the error while also recording the error state of the first error in on-chip registers for later analysis. When the soldier-on mode is not enabled and an error occurs, the system stops processing and the error is reported up to the operating system.Type: GrantFiled: November 2, 2007Date of Patent: January 10, 2012Assignee: NVIDIA CorporationInventors: Brett W. Coon, Bryon S. Nordquist
-
Patent number: 8077174Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.Type: GrantFiled: November 1, 2007Date of Patent: December 13, 2011Assignee: NVIDIA CorporationInventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
-
Patent number: 8074224Abstract: Embodiments of the present invention facilitate dynamically adapting to state information changes in a graphics processing environment. In one embodiment, a master register holds state information corresponding to units of work (threads) to be performed. The state information in the master register is copied to a per-group state register when a group of threads is to be launched. The per-group state register is coupled to processing engines configured to process the threads, so that the processing engines read state information from the per-group state register rather than the master register. In another embodiment, a number of master registers may be used to store state information for different types of threads.Type: GrantFiled: December 19, 2005Date of Patent: December 6, 2011Assignee: NVIDIA CorporationInventors: Bryon S. Nordquist, Brett W. Coon
-
Patent number: 8055856Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.Type: GrantFiled: March 24, 2008Date of Patent: November 8, 2011Assignee: NVIDIA CorporationInventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
-
Publication number: 20110252204Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.Type: ApplicationFiled: June 21, 2011Publication date: October 13, 2011Applicant: NVIDIA CorporationInventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
-
Publication number: 20110173414Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.Type: ApplicationFiled: March 23, 2011Publication date: July 14, 2011Applicant: NVIDIA CorporationInventors: Norbert Juffa, Brett W. Coon
-
Patent number: 7949855Abstract: A processor buffers asynchronous threads. Instructions requiring operations provided by a plurality of execution units are divided into phases, each phase having at least one computation operation and at least one memory access operation. Instructions within each phase are qualified and prioritized. The instructions may be qualified based on the status of the execution unit needed to execute one or more of the current instructions. The instructions may also be qualified based on an age of each instruction, status of the execution units, a divergence potential, locality, thread diversity, and resource requirements. Qualified instructions may be prioritized based on execution units needed to execute instructions and the execution units in use. One or more of the prioritized instructions is issued per cycle to the plurality of execution units.Type: GrantFiled: April 28, 2008Date of Patent: May 24, 2011Assignee: NVIDIA CorporationInventors: Peter C. Mills, John Erik Lindholm, Brett W. Coon, Gary M. Tarolli, John Matthew Burgess
-
Patent number: 7925860Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.Type: GrantFiled: May 14, 2007Date of Patent: April 12, 2011Assignee: NVIDIA CorporationInventors: Norbert Juffa, Brett W. Coon
-
Publication number: 20110078418Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.Type: ApplicationFiled: September 13, 2010Publication date: March 31, 2011Inventors: Guillermo Juan Rozas, Brett W. Coon
-
Publication number: 20110078406Abstract: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.Type: ApplicationFiled: September 25, 2009Publication date: March 31, 2011Inventors: John R. Nickolls, Brett W. Coon, Ian A. Buck, Robert Steven Glanville
-
Publication number: 20110078427Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.Type: ApplicationFiled: September 29, 2009Publication date: March 31, 2011Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
-
Publication number: 20110078692Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.Type: ApplicationFiled: September 21, 2010Publication date: March 31, 2011Inventors: John R. NICKOLLS, Steven James Heinrich, Brett W. Coon, Michael C. Shebanow
-
Publication number: 20110078367Abstract: One embodiment of the present invention sets forth a technique for providing a L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces enabling the L1 cache may replace dedicated buffers, caches, and FIFOs in previous architectures. A “direct mapped” storage region that is configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.Type: ApplicationFiled: September 25, 2009Publication date: March 31, 2011Inventors: Alexander L. Minkin, Steven James Heinrich, RaJeshwaren Selvanesan, Brett W. Coon, Charles McCarver, Anjana Rajendran, Stewart G. Carlton
-
Publication number: 20110078417Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.Type: ApplicationFiled: September 24, 2010Publication date: March 31, 2011Inventors: Brian FAHS, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
-
Publication number: 20110078381Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method including receiving an instruction that includes a cache operations modifier that identifies a level of the parallel cache hierarchy in which to cache data associated with the instruction; and implementing a cache replacement policy based on the cache operations modifier.Type: ApplicationFiled: September 24, 2010Publication date: March 31, 2011Inventors: Steven James HEINRICH, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
-
Publication number: 20110072244Abstract: One embodiment of the present invention sets forth a technique for ensuring cache access instructions are scheduled for execution in a multi-threaded system to improve cache locality and system performance. A credit-based technique may be used to control instruction by instruction scheduling for each warp in a group so that the group of warps is processed uniformly. A credit is computed for each warp and the credit contributes to a weight for each warp. The weight is used to select instructions for the warps that are issued for execution.Type: ApplicationFiled: September 17, 2010Publication date: March 24, 2011Inventors: John Erik Lindholm, Brett W. Coon, Jered Wierzbicki, Robert J. Stoll, Stuart F. Oberman
-
Publication number: 20110072213Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.Type: ApplicationFiled: September 22, 2010Publication date: March 24, 2011Inventors: John R. NICKOLLS, Brett W. Coon, Michael C. Shebanow