Patents by Inventor Brett W. Coon

Brett W. Coon has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20120036329
    Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
    Type: Application
    Filed: October 18, 2011
    Publication date: February 9, 2012
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
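
    A minimal CUDA sketch of the scheme in the abstract above: a single atomic returns both the stored data and the lock status, so no separate lock-check transaction is needed. The packing of a lock flag into the word's top bit and the names loadAndLock/storeAndUnlock are illustrative assumptions, not the patented encoding.

        // Illustrative only: bit 31 of each word is assumed to be the lock flag.
        __device__ unsigned loadAndLock(unsigned *addr, bool *lockAcquired)
        {
            // One atomic sets the lock bit and returns the previous word, so the
            // caller learns the data and the lock status in a single transaction.
            unsigned old = atomicOr(addr, 0x80000000u);
            *lockAcquired = (old & 0x80000000u) == 0u; // lock won iff bit was clear
            return old & 0x7fffffffu;                  // data without the lock bit
        }

        __device__ void storeAndUnlock(unsigned *addr, unsigned data)
        {
            // Writing the result with the lock bit clear releases the lock.
            atomicExch(addr, data & 0x7fffffffu);
        }

        __global__ void incrementUnderLock(unsigned *cell)
        {
            bool locked;
            unsigned v = loadAndLock(cell, &locked); // read + lock, one request
            if (locked)
                storeAndUnlock(cell, v + 1);         // modify, write + unlock
            // Threads that lost the lock would normally retry in a loop.
        }
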
  • Publication number: 20120026175
    Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.
    Type: Application
    Filed: October 10, 2011
    Publication date: February 2, 2012
    Applicant: NVIDIA Corporation
    Inventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
  • Patent number: 8108625
    Abstract: Concurrent threads in a multithreaded processor share access to a memory, with any location in the shared memory being accessible by any thread. In one embodiment, the shared memory has multiple independently-addressable memory banks, and one location per bank can be accessed in parallel. Parallel processing engines executing the threads generate a group of parallel memory access requests. Address conflict logic determines whether the requests can be satisfied in parallel (e.g., based on bank access constraints) and serializes the requests to the extent needed to avoid conflicts. In some embodiments, data read from one address in the shared memory can be broadcast to multiple processing engines.
    Type: Grant
    Filed: October 30, 2006
    Date of Patent: January 31, 2012
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
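
    The address conflict logic above can be pictured with a short host-side sketch (compilable as CUDA/C++): requests landing in distinct banks proceed together, a repeated address is served once and broadcast, and everything else is deferred to a later pass. The 16-bank count and the function name are assumptions.

        #include <cstdint>
        #include <vector>

        constexpr int kNumBanks = 16;  // assumed count of independently addressable banks

        // Returns the indices of the parallel requests that may proceed this cycle.
        std::vector<int> selectConflictFree(const std::vector<uint32_t> &addrs)
        {
            std::vector<int> granted;
            uint32_t bankBusy = 0;                        // one bit per bank
            uint32_t grantedAddr[kNumBanks] = {};
            for (int i = 0; i < static_cast<int>(addrs.size()); ++i) {
                int bank = addrs[i] % kNumBanks;
                if (!(bankBusy & (1u << bank))) {
                    bankBusy |= 1u << bank;               // first access wins the bank
                    grantedAddr[bank] = addrs[i];
                    granted.push_back(i);
                } else if (grantedAddr[bank] == addrs[i]) {
                    granted.push_back(i);                 // same address: broadcast read
                }                                         // different address: defer
            }
            return granted;
        }
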
  • Patent number: 8095829
    Abstract: A global processor operating mode is used to select whether a processor stops processing when an error is detected or ignores the error and continues processing while overriding values as needed to recover from the error. When soldier-on mode is enabled, the system attempts to recover from the error while also recording the error state of the first error in on-chip registers for later analysis. When soldier-on mode is not enabled and an error occurs, the system stops processing and the error is reported up to the operating system.
    Type: Grant
    Filed: November 2, 2007
    Date of Patent: January 10, 2012
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, Bryon S. Nordquist
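
    A rough sketch of the two policies described above, with the first error latched for later analysis in soldier-on mode and a fatal stop otherwise. All names (ErrorState, soldierOn, handleError) and the register layout are hypothetical.

        #include <cstdio>
        #include <cstdlib>

        struct ErrorState { bool valid; unsigned code; unsigned addr; };

        static bool soldierOn = true;       // global operating mode
        static ErrorState firstError = {};  // stands in for the on-chip error registers

        unsigned handleError(unsigned code, unsigned addr, unsigned fallback)
        {
            if (!soldierOn) {
                // Mode disabled: stop processing and report up to the OS.
                std::fprintf(stderr, "fatal error %u at %#x\n", code, addr);
                std::abort();
            }
            if (!firstError.valid)
                firstError = {true, code, addr}; // record only the first error
            return fallback;                     // override the value and continue
        }
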
  • Patent number: 8077174
    Abstract: Apparatuses and methods are presented for a hierarchical processor. The processor comprises, at a first level of hierarchy, a plurality of similarly structured first level components, wherein each of the plurality of similarly structured first level components includes at least one combined function module capable of performing multiple classes of graphics operations, each of the multiple classes of graphics operations being associated with a different stage of graphics processing.
    Type: Grant
    Filed: November 1, 2007
    Date of Patent: December 13, 2011
    Assignee: NVIDIA Corporation
    Inventors: John Erik Lindholm, John S. Montrym, Emmett M. Kilgariff, Simon S. Moy, Sean Jeffrey Treichler, Brett W. Coon, David Kirk, John Danskin
  • Patent number: 8074224
    Abstract: Embodiments of the present invention facilitate dynamically adapting to state information changes in a graphics processing environment. In one embodiment, a master register holds state information corresponding to units of work (threads) to be performed. The state information in the master register is copied to a per-group state register when a group of threads is to be launched. The per-group state register is coupled to processing engines configured to process the threads, so that the processing engines read state information from the per-group state register rather than the master register. In another embodiment, a number of master registers may be used to store state information for different types of threads.
    Type: Grant
    Filed: December 19, 2005
    Date of Patent: December 6, 2011
    Assignee: NVIDIA Corporation
    Inventors: Bryon S. Nordquist, Brett W. Coon
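
    The snapshot mechanism above is easy to picture: at launch, the master register's state is copied into a per-group register, and the processing engines read only the copy, so later updates to the master cannot disturb groups already in flight. The struct and field names below are purely illustrative.

        struct RenderState { unsigned shaderId; unsigned constantBase; };

        struct GroupSlot { RenderState state; /* plus thread IDs, masks, ... */ };

        RenderState master;           // updated by the host between launches

        void launchGroup(GroupSlot &slot)
        {
            slot.state = master;      // per-group snapshot taken at launch time;
                                      // engines read slot.state, never the master
        }
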
  • Patent number: 8055856
    Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
    Type: Grant
    Filed: March 24, 2008
    Date of Patent: November 8, 2011
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
  • Publication number: 20110252204
    Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.
    Type: Application
    Filed: June 21, 2011
    Publication date: October 13, 2011
    Applicant: NVIDIA Corporation
    Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
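
    One pass of the serialization logic above can be sketched as follows: pick one pending target address, grant every request that matches it, and leave the rest pending for regeneration; repeating the pass touches each distinct address exactly once. The 32-request group size and the function name are assumptions.

        #include <cstdint>

        // One serialization pass: returns the mask of requests granted this cycle.
        uint32_t serializeOnce(const uint32_t addr[32], uint32_t pending)
        {
            if (pending == 0) return 0;
            int leader = 0;
            while (!((pending >> leader) & 1u)) ++leader; // select one target address
            uint32_t grant = 0;
            for (int i = 0; i < 32; ++i)
                if (((pending >> i) & 1u) && addr[i] == addr[leader])
                    grant |= 1u << i;  // same target address: proceed in parallel
            return grant;              // caller clears these bits and regenerates
        }
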
  • Publication number: 20110173414
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus may take relatively little time to compute as compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Application
    Filed: March 23, 2011
    Publication date: July 14, 2011
    Applicant: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
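
    The load-balancing idea above maps naturally onto the familiar grid-stride loop in CUDA, which spreads stream elements evenly across all threads so the memory interface stays saturated; this is a generic illustration, not the specific method claimed.

        __global__ void scaleStream(const float *in, float *out, int n, float k)
        {
            // Each thread walks the stream with a stride of the whole grid, so
            // work stays balanced and memory requests stay evenly distributed.
            for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
                 i += blockDim.x * gridDim.x)
                out[i] = k * in[i];  // trivial compute: throughput-bound
        }
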
  • Patent number: 7949855
    Abstract: A processor buffers asynchronous threads. Instructions requiring operations provided by a plurality of execution units are divided into phases, each phase having at least one computation operation and at least one memory access operation. Instructions within each phase are qualified and prioritized. The instructions may be qualified based on the status of the execution unit needed to execute one or more of the current instructions. The instructions may also be qualified based on an age of each instruction, status of the execution units, a divergence potential, locality, thread diversity, and resource requirements. Qualified instructions may be prioritized based on execution units needed to execute instructions and the execution units in use. One or more of the prioritized instructions is issued per cycle to the plurality of execution units.
    Type: Grant
    Filed: April 28, 2008
    Date of Patent: May 24, 2011
    Assignee: NVIDIA Corporation
    Inventors: Peter C. Mills, John Erik Lindholm, Brett W. Coon, Gary M. Tarolli, John Matthew Burgess
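
    The qualify-then-prioritize step above can be sketched as a two-stage filter: instructions whose execution unit is busy are disqualified, and the oldest survivor is issued. The fields and the single age-based priority are simplifying assumptions; the patent lists several additional criteria.

        struct PendingInstr { int unit; int age; bool ready; };

        // Returns the index of the instruction to issue this cycle, or -1.
        int selectIssue(const PendingInstr instrs[], int n, unsigned unitBusyMask)
        {
            int best = -1;
            for (int i = 0; i < n; ++i) {
                if (!instrs[i].ready || ((unitBusyMask >> instrs[i].unit) & 1u))
                    continue;                         // qualification stage
                if (best < 0 || instrs[i].age > instrs[best].age)
                    best = i;                         // prioritization: oldest first
            }
            return best;
        }
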
  • Patent number: 7925860
    Abstract: In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus may take relatively little time to compute as compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    Type: Grant
    Filed: May 14, 2007
    Date of Patent: April 12, 2011
    Assignee: NVIDIA Corporation
    Inventors: Norbert Juffa, Brett W. Coon
  • Publication number: 20110078418
    Abstract: One embodiment of the present invention sets forth a method for executing a non-local return instruction in a parallel thread processor. The method comprises the steps of receiving, within the thread group, a first long jump instruction and, in response, popping a first token from the execution stack. The method also comprises determining whether the first token is a first long jump token that was pushed onto the execution stack when a first push instruction associated with the first long jump instruction was executed, and when the first token is the first long jump token, jumping to the second instruction based on the address specified by the first long jump token, or, when the first token is not the first long jump token, disabling the active thread until the first long jump token is popped from the execution stack.
    Type: Application
    Filed: September 13, 2010
    Publication date: March 31, 2011
    Inventors: Guillermo Juan Rozas, Brett W. Coon
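
    The unwinding step above reads roughly as below: tokens are popped until the matching long-jump token (pushed by the paired push instruction) surfaces, and threads are held disabled in the meantime. The token layout and names are illustrative.

        enum TokenType { TOK_DIVERGE, TOK_CALL, TOK_LONGJMP };

        struct Token { TokenType type; unsigned targetPc; };

        // One step of long-jump execution; returns true once the jump lands.
        bool longJumpStep(Token stack[], int &top, unsigned &pc, bool &threadEnabled)
        {
            Token t = stack[--top];      // pop the top token
            if (t.type == TOK_LONGJMP) {
                pc = t.targetPc;         // jump to the address in the token
                threadEnabled = true;
                return true;
            }
            threadEnabled = false;       // wait until the long-jump token pops
            return false;
        }
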
  • Publication number: 20110078406
    Abstract: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.
    Type: Application
    Filed: September 25, 2009
    Publication date: March 31, 2011
    Inventors: John R. Nickolls, Brett W. Coon, Ian A. Buck, Robert Steven Glanville
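
    CUDA's generic addressing is this idea in shipped form: one pointer type and one load instruction serve shared, local, and global memory, with hardware mapping windows of the unified address space onto each physical space. A small device-code example:

        __device__ float loadAny(const float *p) // p may point into any memory space
        {
            return *p;  // a single generic load; no per-space instruction needed
        }

        __global__ void demo(float *gmem)  // assumes <= 256 threads per block
        {
            __shared__ float smem[256];
            smem[threadIdx.x] = gmem[threadIdx.x];
            __syncthreads();
            // The same helper services both spaces through unified addresses.
            gmem[threadIdx.x] = loadAny(&gmem[threadIdx.x]) + loadAny(&smem[threadIdx.x]);
        }
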
  • Publication number: 20110078427
    Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.
    Type: Application
    Filed: September 29, 2009
    Publication date: March 31, 2011
    Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
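
    The stated invariant is all-or-nothing: at any instant, every thread group on a streaming multiprocessor is in its own code segment, or every one is in the trap handler. A direct encoding of that check (names assumed):

        enum class Segment { UserCode, TrapHandler };

        // True iff all groups on the SM are on the same side of the invariant.
        bool trapInvariantHolds(const Segment seg[], int numGroups)
        {
            for (int i = 1; i < numGroups; ++i)
                if (seg[i] != seg[0])
                    return false;  // mixed user/trap state is disallowed
            return true;
        }
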
  • Publication number: 20110078692
    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
    Type: Application
    Filed: September 21, 2010
    Publication date: March 31, 2011
    Inventors: John R. Nickolls, Steven James Heinrich, Brett W. Coon, Michael C. Shebanow
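
    The three commit levels described above correspond closely to CUDA's fence intrinsics. A minimal producer sketch using the middle (device-wide) level:

        __device__ int payload;
        __device__ volatile int flag;

        __global__ void producer()
        {
            payload = 42;
            __threadfence();  // commit to all threads sharing global memory
            // __threadfence_block() would commit only within the L1-sharing block;
            // __threadfence_system() would commit to the whole system (CPU, peers).
            flag = 1;         // consumers polling flag now observe payload
        }
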
  • Publication number: 20110078367
    Abstract: One embodiment of the present invention sets forth a technique for providing an L1 cache that is a central storage resource. The L1 cache services multiple clients with diverse latency and bandwidth requirements. The L1 cache may be reconfigured to create multiple storage spaces, enabling the L1 cache to replace the dedicated buffers, caches, and FIFOs of previous architectures. A “direct mapped” storage region configured within the L1 cache may replace dedicated buffers, FIFOs, and interface paths, allowing clients of the L1 cache to exchange attribute and primitive data. The direct mapped storage region may be used as a global register file. A “local and global cache” storage region configured within the L1 cache may be used to support load/store memory requests to multiple spaces. These spaces include global, local, and call-return stack (CRS) memory.
    Type: Application
    Filed: September 25, 2009
    Publication date: March 31, 2011
    Inventors: Alexander L. Minkin, Steven James Heinrich, Rajeshwaran Selvanesan, Brett W. Coon, Charles McCarver, Anjana Rajendran, Stewart G. Carlton
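
    The reconfigurable L1 described above surfaced in CUDA as the per-kernel L1/shared-memory split; the host-side call below is the public knob for it (a simplified usage sketch).

        #include <cuda_runtime.h>

        __global__ void kernel(float *data) { data[threadIdx.x] *= 2.0f; }

        int main()
        {
            float *d;
            cudaMalloc(&d, 32 * sizeof(float));
            // Prefer a larger L1 cache over shared memory for this kernel.
            cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1);
            kernel<<<1, 32>>>(d);
            return cudaDeviceSynchronize() != cudaSuccess;
        }
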
  • Publication number: 20110078417
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where, in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction, the thread contributes to a scan or reduction result and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction, and a scan result is communicated to each thread as that thread executes the barrier aggregation instruction.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
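
    CUDA exposes exactly this kind of aggregating barrier as __syncthreads_count: every thread contributes a predicate, and the reduction result is returned to all threads once the barrier completes.

        __global__ void countPositives(const float *x, int *blockCounts)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            // Barrier and reduction in one instruction: all threads wait here,
            // and all receive the count of threads whose predicate was true.
            int total = __syncthreads_count(x[i] > 0.0f);
            if (threadIdx.x == 0)
                blockCounts[blockIdx.x] = total;
        }
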
  • Publication number: 20110078381
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier that identifies a level of the parallel cache hierarchy in which to cache data associated with the instruction, and implementing a cache replacement policy based on the cache operations modifier.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
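
    Cache-operation modifiers of this kind are visible in recent CUDA toolkits as the cache-hint load/store intrinsics, which compile to PTX ld/st cache operators; the streaming hints below mark data as unlikely to be reused.

        __global__ void streamCopy(const float *in, float *out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = __ldcs(&in[i]); // load with the "cache streaming" hint
                __stcs(&out[i], v);       // matching streaming store hint
            }
        }
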
  • Publication number: 20110072244
    Abstract: One embodiment of the present invention sets forth a technique for ensuring cache access instructions are scheduled for execution in a multi-threaded system to improve cache locality and system performance. A credit-based technique may be used to control instruction-by-instruction scheduling for each warp in a group so that the group of warps is processed uniformly. A credit is computed for each warp, and the credit contributes to a weight for each warp. The weight is used to select the instructions of each warp that are issued for execution.
    Type: Application
    Filed: September 17, 2010
    Publication date: March 24, 2011
    Inventors: John Erik Lindholm, Brett W. Coon, Jered Wierzbicki, Robert J. Stoll, Stuart F. Oberman
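
    A toy model of the credit scheme above: eligible warps compete on weight, waiting warps accumulate credit, and the winner's credit resets, which keeps a group of warps progressing uniformly. The field names and the weighting are assumptions.

        struct WarpSlot { int credit; bool eligible; };

        // Pick the warp to issue from this cycle; returns -1 if none is eligible.
        int pickWarp(WarpSlot warps[], int n)
        {
            int best = -1;
            for (int i = 0; i < n; ++i)
                if (warps[i].eligible &&
                    (best < 0 || warps[i].credit > warps[best].credit))
                    best = i;                  // highest weight issues next
            for (int i = 0; i < n; ++i)
                if (i != best)
                    ++warps[i].credit;         // waiting warps gain credit
            if (best >= 0)
                warps[best].credit = 0;        // issued warp starts over
            return best;
        }
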
  • Publication number: 20110072213
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
    Type: Application
    Filed: September 22, 2010
    Publication date: March 24, 2011
    Inventors: John R. Nickolls, Brett W. Coon, Michael C. Shebanow