Patents by Inventor John R. Nickolls

John R. Nickolls has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20110252204
    Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location can be accessed at a time. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.
    Type: Application
    Filed: June 21, 2011
    Publication date: October 13, 2011
    Applicant: NVIDIA Corporation
    Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
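
The serialization scheme above is what a CUDA programmer observes when multiple threads target the same shared-memory address: colliding requests are replayed until each distinct address is serviced once, while requests to distinct addresses proceed in parallel. A minimal sketch (my example, not from the patent), assuming a 256-bin histogram:

```cuda
// Shared-memory histogram: threads that target distinct bins update in
// parallel; threads that collide on the same bin are serialized by the
// hardware, mirroring the request-replay scheme the abstract describes.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* out) {
    __shared__ unsigned int bins[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) bins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);   // same-address requests replay

    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);     // merge block result into global bins
}
```
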
  • Publication number: 20110238955
    Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across the processing cores available in a particular system implementation.
    Type: Application
    Filed: May 2, 2011
    Publication date: September 29, 2011
    Applicant: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
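
The hierarchy described above maps directly onto CUDA's thread/block/grid model. A minimal sketch (my example, not from the filing): each thread solves the lowest-level sub-problem (one element), each thread array (CUDA block) covers a tile, and the grid of independent blocks covers the whole problem:

```cuda
// Each thread solves one element; each thread array (block) covers a tile;
// the grid of independent blocks covers the entire problem.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Launch: the grid size is derived from the problem size, not from the
// number of processing cores, so the same code scales across GPUs.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
```
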
  • Patent number: 7937567
    Abstract: Parallelism in a parallel processing subsystem is exploited in a scalable manner. A problem to be solved can be hierarchically decomposed into at least two levels of sub-problems. Individual threads of program execution are defined to solve the lowest-level sub-problems. The threads are grouped into one or more thread arrays, each of which solves a higher-level sub-problem. The thread arrays are executable by processing cores, each of which can execute at least one thread array at a time. Thread arrays can be grouped into grids of independent thread arrays, which solve still higher-level sub-problems or an entire problem. Thread arrays within a grid, or entire grids, can be distributed across the processing cores available in a particular system implementation.
    Type: Grant
    Filed: November 1, 2006
    Date of Patent: May 3, 2011
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
  • Publication number: 20110087860
    Abstract: Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described.
    Type: Application
    Filed: December 17, 2010
    Publication date: April 14, 2011
    Applicant: NVIDIA Corporation
    Inventors: John R. Nickolls, Stephen D. Lew
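
As a concrete illustration of a thread ID selecting input and output portions and of sharing intermediate results within a CTA, here is a standard shared-memory block reduction (my example, assuming a 256-thread, power-of-two CTA):

```cuda
// Each thread's ID selects its slice of the input; intermediate results are
// shared through on-chip memory with barrier synchronization, as the
// abstract describes for a CTA.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[256];              // one slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;      // thread ID -> input portion
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // intra-CTA barrier

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0]; // thread ID -> output portion
}
```
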
  • Publication number: 20110082961
    Abstract: The invention sets forth an L1 cache architecture that includes a crossbar unit configured to transmit data associated with both read data requests and write data requests. Data associated with read data requests is retrieved from a cache memory and transmitted to the client subsystems. Similarly, data associated with write data requests is transmitted from the client subsystems to the cache memory. To allow for the transmission of both read and write data on the crossbar unit, an arbiter is configured to schedule the crossbar unit transmissions as well as to arbitrate between data requests received from the client subsystems.
    Type: Application
    Filed: September 28, 2010
    Publication date: April 7, 2011
    Inventors: Alexander L. Minkin, Steven L. Heinrich, Rajeshwaran Selvanesan, Stewart Glenn Carlton, John R. Nickolls
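
The following is a toy host-side model, not the patented hardware design (all names are mine): a round-robin arbiter granting one client per cycle access to a crossbar shared by read returns and write traffic:

```cuda
// Toy model of the arbiter: one crossbar data path is shared by read
// returns and writes, and a round-robin scan decides which client's
// pending request is granted the path each cycle.
int arbitrate(const bool pending[], int numClients, int lastGrant) {
    for (int k = 1; k <= numClients; ++k) {      // scan after last grant
        int c = (lastGrant + k) % numClients;
        if (pending[c]) return c;                // grant the crossbar slot
    }
    return -1;                                   // idle cycle
}
```
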
  • Publication number: 20110078415
    Abstract: The invention set forth herein describes a mechanism for predicated execution of instructions within a parallel processor executing multiple threads or data lanes. Each thread or data lane executing within the parallel processor is associated with a predicate register that stores a set of 1-bit predicates. Each of these predicates can be set using different types of predicate-setting instructions, where each predicate setting instruction specifies one or more source operands, at least one operation to be performed on the source operands, and one or more destination predicates for storing the result of the operation. An instruction can be guarded by a predicate that may influence whether the instruction is executed for a particular thread or data lane or how the instruction is executed for a particular thread or data lane.
    Type: Application
    Filed: September 27, 2010
    Publication date: March 31, 2011
    Inventors: Richard Craig Johnson, John R. Nickolls, Robert Steven Glanville
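
From the programming side, the CUDA compiler lowers short conditionals to exactly this form: a predicate-setting comparison followed by instructions guarded per thread, rather than a branch. A minimal sketch (my example):

```cuda
// A short conditional like this typically compiles to predicated
// instructions: a setp comparison writes a 1-bit predicate per thread,
// and the guarded store executes only in lanes whose predicate is true.
__global__ void clampNegatives(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)   // predicate-setting comparison
        x[i] = 0.0f;            // store guarded by the per-lane predicate
}
```
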
  • Publication number: 20110078427
    Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.
    Type: Application
    Filed: September 29, 2009
    Publication date: March 31, 2011
    Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
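
CUDA exposes a software-initiated entry into the trap handler via the __trap() intrinsic (a failed device-side assert takes the same route); the coordinated all-or-nothing entry of thread groups described above is internal to the hardware. A minimal sketch (my example):

```cuda
// A device-side error can enter the GPU's trap handler explicitly via the
// __trap() intrinsic; the kernel aborts, and the host sees a launch error.
__global__ void checkedDivide(float* out, const float* num,
                              const float* den, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (den[i] == 0.0f) __trap();   // raise a trap on invalid input
        out[i] = num[i] / den[i];
    }
}
```
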
  • Publication number: 20110074802
    Abstract: One embodiment of the present invention sets forth a technique for a program to access multi-dimensional formatted graphics surface memory. Multi-dimensional memory objects called “surfaces,” stored in a user-specified data or pixel format and arranged in a graphics-optimized layout, are accessed by programs using surface instructions. A set of memory access instructions (e.g., load, store, reduce, and atomic), referred to as surface instructions, may be used to access the surfaces. Coordinate bounds checking is performed with configurable clamping. Caching behavior may also be specified by the surface instructions. Data format conversion and packing to a specified storage format is supported for store, reduction, and atomic surface instructions. Data format conversion and unpacking from a specified storage format is supported for load and atomic surface instructions.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: John R. Nickolls, Brian Fahs, Lars Nyland, John Erik Lindholm, Richard Craig Johnson
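
CUDA's surface intrinsics expose these instructions, including the configurable clamping: the boundary-mode argument selects trap, clamp, or zero behavior, and the x coordinate is given in bytes. A minimal sketch (my example), assuming a 2-D surface of floats:

```cuda
// Read-modify-write through a surface object; cudaBoundaryModeClamp selects
// the configurable coordinate clamping the abstract mentions.
__global__ void brighten(cudaSurfaceObject_t surf, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v;
        surf2Dread(&v, surf, x * (int)sizeof(float), y, cudaBoundaryModeClamp);
        surf2Dwrite(v * 1.1f, surf, x * (int)sizeof(float), y,
                    cudaBoundaryModeClamp);
    }
}
```
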
  • Publication number: 20110078689
    Abstract: A method for thread address mapping in a parallel thread processor. The method includes receiving a thread address associated with a first thread in a thread group; computing an effective address based on a location of the thread address within a local window of a thread address space; computing a thread group address in an address space associated with the thread group based on the effective address and a thread identifier associated with the first thread; and computing a virtual address associated with the first thread based on the thread group address and a thread group identifier, where the virtual address is used to access a location in a memory associated with the thread address to load or store data.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: Michael C. Shebanow, Yan Yan Tang, John R. Nickolls
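
A hypothetical sketch of the kind of interleaved mapping the claim outlines; the parameter names and the exact formula are mine, not the patent's. The idea is that each per-thread local word is placed adjacent to the same word of the other threads in the group, so the group's simultaneous accesses land in consecutive locations:

```cuda
// Hypothetical address computation: effective word within the local window,
// plus thread ID, plus group ID, yield the virtual address. Words are
// interleaved across the thread group so that accesses coalesce.
__host__ __device__ unsigned long long
localToVirtual(unsigned wordInWindow,    // effective address in the window
               unsigned threadId,        // thread within the group
               unsigned groupId,         // thread group identifier
               unsigned threadsPerGroup,
               unsigned wordsPerThread,
               unsigned long long base) {
    unsigned long long groupBytes =
        (unsigned long long)threadsPerGroup * wordsPerThread * 4ull;
    return base + groupId * groupBytes                         // group region
                + (unsigned long long)wordInWindow * threadsPerGroup * 4ull
                + threadId * 4ull;                             // lane offset
}
```
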
  • Publication number: 20110078692
    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact on the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
    Type: Application
    Filed: September 21, 2010
    Publication date: March 31, 2011
    Inventors: John R. Nickolls, Steven James Heinrich, Brett W. Coon, Michael C. Shebanow
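
The three barrier scopes correspond to CUDA's fence intrinsics, in order of increasing latency: __threadfence_block() (threads sharing an L1), __threadfence() (all threads on the device), and __threadfence_system() (all threads in the system, CPU included). A minimal sketch (my example) of publishing data through a flag:

```cuda
// Publish a value, then a flag: the fence commits the data write before the
// flag write becomes visible at the chosen scope.
__device__ void publish(volatile int* data, volatile int* flag, int v) {
    *data = v;
    // __threadfence_block();  // commit for the CTA / L1-sharing threads
    __threadfence();           // commit for all threads on the device
    // __threadfence_system(); // commit for the whole system (CPU + GPUs)
    *flag = 1;                 // readers that see the flag also see the data
}
```
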
  • Publication number: 20110078406
    Abstract: One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces.
    Type: Application
    Filed: September 25, 2009
    Publication date: March 31, 2011
    Inventors: John R. Nickolls, Brett W. Coon, Ian A. Buck, Robert Steven Glanville
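
In CUDA this surfaces as generic addressing: a plain pointer in device code is a unified-space address, a single load instruction serves every window, and intrinsics such as __isShared() query which space an address falls in. A minimal sketch (my example, assuming a 256-thread block):

```cuda
// One function, one load instruction: the same generic load works whether
// the pointer refers to shared or global memory.
__device__ float loadAnywhere(const float* p) {
    return *p;   // compiles to a generic (unified address space) load
}

__global__ void demo(const float* g, float* out, int n) {
    __shared__ float s[256];
    s[threadIdx.x] = 1.0f;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = loadAnywhere(g + i)            // global address
               + loadAnywhere(&s[threadIdx.x]); // shared address
}
```
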
  • Publication number: 20110078417
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction, the thread contributes to a scan or reduction result and waits to execute any further instructions until all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction, and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
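
CUDA exposes barrier aggregation directly: __syncthreads_count(), __syncthreads_and(), and __syncthreads_or() perform the barrier and reduce a per-thread predicate in a single instruction. A minimal sketch (my example):

```cuda
// Barrier-with-reduction: every thread supplies a predicate at the barrier,
// and the aggregated count is returned to all threads once the whole block
// has arrived.
__global__ void countPositives(const float* x, int n, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n && x[i] > 0.0f);
    int total = __syncthreads_count(pred);  // barrier + reduction in one
    if (threadIdx.x == 0) atomicAdd(out, total);
}
```
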
  • Publication number: 20110078225
    Abstract: The invention set forth herein describes a mechanism for efficiently performing extended precision operations on multi-word source operands. Corresponding data words of the source operands are processed together via each instruction of a cascading sequence of instructions. State information generated when each instruction is processed is stored in condition code flags. The state information is optionally used in the processing of subsequent instructions in the sequence and/or accumulated with previously set state information.
    Type: Application
    Filed: September 23, 2010
    Publication date: March 31, 2011
    Inventors: Richard Craig Johnson, John R. Nickolls
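
At the PTX level the condition-code flags are visible as the add.cc/addc carry chain, which is exactly the cascading sequence described. A minimal sketch (my example) of a 96-bit add built from 32-bit words:

```cuda
// Each instruction writes the carry condition-code flag and the next one
// consumes it, cascading state across the multi-word operand.
__device__ void add96(unsigned r[3], const unsigned a[3], const unsigned b[3]) {
    asm("add.cc.u32  %0, %3, %6;\n\t"   // low word: add, set carry
        "addc.cc.u32 %1, %4, %7;\n\t"   // middle word: add + carry, set carry
        "addc.u32    %2, %5, %8;"       // high word: add + carry
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]));
}
```
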
  • Publication number: 20110078381
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier identifying a level of the parallel cache hierarchy in which to cache data associated with the instruction, and implementing a cache replacement policy based on the cache operations modifier.
    Type: Application
    Filed: September 24, 2010
    Publication date: March 31, 2011
    Inventors: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
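
The per-instruction modifiers correspond to PTX cache operators (.ca, .cg, .cs, .lu, .cv). A minimal sketch (my example) marking streaming data evict-first so it does not displace resident cache lines:

```cuda
// ld.global.cs marks the data "evict first," biasing the replacement policy
// away from keeping streamed-once data resident in the cache.
__device__ float loadStreaming(const float* p) {
    float v;
    asm("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}
```
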
  • Publication number: 20110072249
    Abstract: One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch.
    Type: Application
    Filed: June 14, 2010
    Publication date: March 24, 2011
    Inventors: John R. Nickolls, Richard Craig Johnson, Robert Steven Glanville, Guillermo Juan Rozas
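
A software analogue (my example, not the patented instructions) can be built from warp vote intrinsics, assuming the block size is a multiple of the 32-lane warp: take a branch only when every lane agrees, or when at least one lane does, so the warp branches as a unit:

```cuda
// __all_sync mimics the unanimous branch (taken only if all lanes agree);
// __any_sync mimics branch-any (taken if at least one lane agrees).
__global__ void process(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool small = (i < n) && (fabsf(x[i]) < 1e-6f);
    if (__all_sync(0xffffffffu, small)) {
        if (i < n) x[i] = 0.0f;          // whole warp takes the fast path
    } else if (__any_sync(0xffffffffu, small)) {
        if (i < n && small) x[i] = 0.0f; // at least one lane needs the branch
    }
}
```
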
  • Publication number: 20110072248
    Abstract: One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch.
    Type: Application
    Filed: June 14, 2010
    Publication date: March 24, 2011
    Inventors: John R. Nickolls, Richard Craig Johnson, Robert Steven Glanville, Guillermo Juan Rozas
  • Publication number: 20110072213
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
    Type: Application
    Filed: September 22, 2010
    Publication date: March 24, 2011
    Inventors: John R. Nickolls, Brett W. Coon, Michael C. Shebanow
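
Newer CUDA toolkits and GPU architectures also expose these modifiers as load/store intrinsics. A minimal sketch (my example; intrinsic availability depends on GPU generation):

```cuda
// __ldcg caches the load at L2 only (bypassing L1); __stcs marks the store
// as streaming / evict-first.
__global__ void copyStreaming(const float* __restrict__ in,
                              float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __ldcg(in + i);  // ld.global.cg: cache globally, not in L1
        __stcs(out + i, v);        // st.global.cs: streaming store
    }
}
```
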
  • Patent number: 7877585
    Abstract: One embodiment of a computing system configured to manage divergent threads in a SIMD thread group includes a stack configured to store state information for processing control instructions. A parallel processing unit is configured to determine whether one or more threads diverge during execution of a conditional control instruction. A disable mask allows for the use of conditional return and break instructions in a multithreaded SIMD architecture. Additional control instructions are used to set up thread processing target addresses for synchronization, breaks, and returns.
    Type: Grant
    Filed: August 27, 2007
    Date of Patent: January 25, 2011
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John R. Nickolls, John Erik Lindholm, Svetoslav D. Tzvetkov
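
From the programmer's side, divergence arises whenever lanes of a warp take different paths; the hardware masks off inactive lanes, executes each side in turn, and reconverges at a synchronization point. A minimal sketch (my example) with an explicit reconvergence point:

```cuda
// Lanes with odd and even values take different paths; the hardware runs
// each path with the other lanes disabled, and __syncwarp() makes one
// reconvergence point explicit.
__global__ void divergent(int* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] & 1)
            x[i] = 3 * x[i] + 1;   // odd-valued lanes execute here
        else
            x[i] >>= 1;            // even-valued lanes execute here
        __syncwarp();              // paths reconverge before continuing
    }
}
```
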
  • Patent number: 7864185
    Abstract: A graphics processing unit can queue a large number of texture requests to balance out the variability of texture requests without the need for a large texture request buffer. A dedicated texture request buffer queues the relatively small texture commands and parameters. Additionally, for each queued texture command, an associated set of texture arguments, which are typically much larger than the texture command, are stored in a general purpose register. The texture unit retrieves texture commands from the texture request buffer and then fetches the associated texture arguments from the appropriate general purpose register. The texture arguments may be stored in the general purpose register designated as the destination of the final texture value computed by the texture unit. Because the destination register must be allocated for the final texture value as texture commands are queued, storing the texture arguments in this register does not consume any additional registers.
    Type: Grant
    Filed: October 23, 2008
    Date of Patent: January 4, 2011
    Assignee: NVIDIA Corporation
    Inventors: John Erik Lindholm, John R. Nickolls, Simon S. Moy, Brett W. Coon
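
A toy host-side model (all types and sizes are mine, not the hardware design) of the queuing trick: only the small command is queued, while the bulky arguments are parked in the general-purpose register already reserved for the result:

```cuda
// Model: the request buffer holds only {opcode, destination register}; the
// texture arguments live in the destination GPR, so queuing a request
// consumes no extra registers.
struct TexCommand { int op; int destReg; };

struct TexRequestBuffer {
    TexCommand cmds[64];
    int head = 0, tail = 0;
    float regfile[256][4];                 // modeled GPRs, 4 floats each

    void enqueue(int op, int destReg, const float args[4]) {
        for (int k = 0; k < 4; ++k)
            regfile[destReg][k] = args[k]; // arguments parked in dest GPR
        cmds[tail % 64] = {op, destReg};
        ++tail;
    }
    void serviceOne() {                    // texture unit side
        TexCommand c = cmds[head % 64]; ++head;
        float* a = regfile[c.destReg];     // fetch args from the dest GPR
        a[0] = 0.5f * (a[0] + a[1]);       // placeholder filtering math;
    }                                      // result overwrites the same GPR
};
```
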
  • Patent number: 7865894
    Abstract: Embodiments of the present invention facilitate distributing processing tasks within a processor. In one embodiment, processing clusters keep track of resource requirements. If sufficient resources are available within a particular processing cluster, the available processing cluster asserts a ready signal to a dispatch unit. The dispatch unit is configured to pass a processing task (such as a cooperative thread array or CTA) to an available processing cluster that asserted a ready signal. In another embodiment, a processing task is passed around a ring of processing clusters until a processing cluster with sufficient resources available accepts the processing task.
    Type: Grant
    Filed: December 19, 2005
    Date of Patent: January 4, 2011
    Assignee: NVIDIA Corporation
    Inventors: Bryon S. Nordquist, John R. Nickolls
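
A toy host-side model (names are mine, not the hardware design) of the two dispatch schemes described: grant the task to any cluster asserting a ready signal, or pass it around a ring until a cluster with free resources accepts it:

```cuda
// Each cluster tracks its own resources and asserts "ready" when it can
// accept another CTA; the dispatch unit grants accordingly.
struct Cluster { int freeSlots; bool ready() const { return freeSlots > 0; } };

int dispatchByReady(Cluster* c, int n) {
    for (int i = 0; i < n; ++i)
        if (c[i].ready()) { c[i].freeSlots--; return i; }  // CTA accepted
    return -1;                                             // all clusters busy
}

int dispatchByRing(Cluster* c, int n, int start) {
    for (int k = 0; k < n; ++k) {          // task circulates the ring
        int i = (start + k) % n;
        if (c[i].ready()) { c[i].freeSlots--; return i; }
    }
    return -1;
}
```
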