Patents by Inventor John R. Nickolls

John R. Nickolls has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 9952977
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier that identifies a level of the parallel cache hierarchy in which to cache data associated with the instruction; and implementing a cache replacement policy based on the cache operations modifier.
    Type: Grant
    Filed: September 24, 2010
    Date of Patent: April 24, 2018
    Assignee: NVIDIA Corporation
    Inventors: Steven James Heinrich, Alexander L. Minkin, Brett W. Coon, Rajeshwaran Selvanesan, Robert Steven Glanville, Charles McCarver, Anjana Rajendran, Stewart Glenn Carlton, John R. Nickolls, Brian Fahs
  • Patent number: 9830197
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Grant
    Filed: August 16, 2016
    Date of Patent: November 28, 2017
    Assignee: NVIDIA Corporation
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
  • Publication number: 20170235581
    Abstract: A technique for managing a parallel cache hierarchy that includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
    Type: Application
    Filed: May 1, 2017
    Publication date: August 17, 2017
    Inventors: John R. Nickolls, Brett W. Coon, Michael C. Shebanow
  • Patent number: 9639479
    Abstract: A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction from a scheduler unit, where the instruction comprises a load instruction or a store instruction; determining that the instruction includes a cache operations modifier that identifies a policy for caching data associated with the instruction at one or more levels of the parallel cache hierarchy; and executing the instruction and caching the data associated with the instruction based on the cache operations modifier.
    Type: Grant
    Filed: September 22, 2010
    Date of Patent: May 2, 2017
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Brett W. Coon, Michael C. Shebanow
  • Patent number: 9639365
    Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.
    Type: Grant
    Filed: November 12, 2012
    Date of Patent: May 2, 2017
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
  • Patent number: 9519947
    Abstract: One embodiment of the present invention sets forth a technique for a program to access multi-dimensional formatted graphics surface memory. Multi-dimensional memory objects called “surfaces” stored in a user-specified data or pixel format and arranged in a graphics optimized layout are accessed by programs using surface instructions. A set of memory access instructions (e.g., load, store, reduce, and atomic), referred to as surface instructions, may be used to access the surfaces. Coordinate bounds checking is performed with configurable clamping. Caching behavior may also be specified by the surface instructions. Data format conversion and packing to a specified storage format is supported for store, reduction, and atomic surface instructions. Data format conversion and unpacking from a specified storage format is supported for load and atomic surface instructions.
    Type: Grant
    Filed: September 24, 2010
    Date of Patent: December 13, 2016
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Brian Fahs, Lars Nyland, John Erik Lindholm, Richard Craig Johnson
  • Publication number: 20160357560
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Application
    Filed: August 16, 2016
    Publication date: December 8, 2016
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
  • Patent number: 9417875
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Grant
    Filed: September 12, 2013
    Date of Patent: August 16, 2016
    Assignee: NVIDIA Corporation
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
  • Patent number: 9286256
    Abstract: The invention sets forth an L1 cache architecture that includes a crossbar unit configured to transmit data associated with both read data requests and write data requests. Data associated with read data requests is retrieved from a cache memory and transmitted to the client subsystems. Similarly, data associated with write data requests is transmitted from the client subsystems to the cache memory. To allow for the transmission of both read and write data on the crossbar unit, an arbiter is configured to schedule the crossbar unit transmissions as well as to arbitrate between data requests received from the client subsystems.
    Type: Grant
    Filed: September 28, 2010
    Date of Patent: March 15, 2016
    Assignee: NVIDIA Corporation
    Inventors: Alexander L. Minkin, Steven J. Heinrich, Rajeshwaran Selvanesan, Stewart Glenn Carlton, John R. Nickolls
  • Patent number: 9223578
    Abstract: One embodiment of the present invention sets forth a technique for coalescing memory barrier operations across multiple parallel threads. Memory barrier requests from a given parallel thread processing unit are coalesced to reduce the impact to the rest of the system. Additionally, memory barrier requests may specify a level of a set of threads with respect to which the memory transactions are committed. For example, a first type of memory barrier instruction may commit the memory transactions to a level of a set of cooperating threads that share an L1 (level one) cache. A second type of memory barrier instruction may commit the memory transactions to a level of a set of threads sharing a global memory. Finally, a third type of memory barrier instruction may commit the memory transactions to a system level of all threads sharing all system memories. The latency required to execute the memory barrier instruction varies based on the type of memory barrier instruction.
    Type: Grant
    Filed: September 21, 2010
    Date of Patent: December 29, 2015
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Steven James Heinrich, Brett W. Coon, Michael C. Shebanow
  • Patent number: 8700877
    Abstract: A method for thread address mapping in a parallel thread processor. The method includes receiving a thread address associated with a first thread in a thread group; computing an effective address based on a location of the thread address within a local window of a thread address space; computing a thread group address in an address space associated with the thread group based on the effective address and a thread identifier associated with the first thread; and computing a virtual address associated with the first thread based on the thread group address and a thread group identifier, where the virtual address is used to access a location in a memory associated with the thread address to load or store data.
    Type: Grant
    Filed: September 24, 2010
    Date of Patent: April 15, 2014
    Assignee: NVIDIA Corporation
    Inventors: Michael C. Shebanow, Yan Yan Tang, John R. Nickolls
  • Patent number: 8677106
    Abstract: One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch.
    Type: Grant
    Filed: June 14, 2010
    Date of Patent: March 18, 2014
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Richard Craig Johnson, Robert Steven Glanville, Guillermo Juan Rozas
  • Patent number: 8645638
    Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.
    Type: Grant
    Filed: May 7, 2012
    Date of Patent: February 4, 2014
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, Ming Y. Siu, Weizhong Xu, Stuart F. Oberman, John R. Nickolls, Peter C. Mills
  • Publication number: 20140019724
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Application
    Filed: September 12, 2013
    Publication date: January 16, 2014
    Applicant: NVIDIA Corporation
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
  • Patent number: 8615646
    Abstract: One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch.
    Type: Grant
    Filed: June 14, 2010
    Date of Patent: December 24, 2013
    Assignee: NVIDIA Corporation
    Inventors: John R. Nickolls, Richard Craig Johnson, Robert Steven Glanville, Guillermo Juan Rozas
  • Patent number: 8615541
    Abstract: The invention set forth herein describes a mechanism for efficiently performing extended precision operations on multi-word source operands. Corresponding data words of the source operands are processed together via each instruction of a cascading sequence of instructions. State information generated when each instruction is processed is stored in condition code flags. The state information is optionally used in the processing of subsequent instructions in the sequence and/or accumulated with previously set state information.
    Type: Grant
    Filed: September 23, 2010
    Date of Patent: December 24, 2013
    Assignee: NVIDIA Corporation
    Inventors: Richard Craig Johnson, John R. Nickolls
  • Patent number: 8539204
    Abstract: One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread.
    Type: Grant
    Filed: September 24, 2010
    Date of Patent: September 17, 2013
    Assignee: NVIDIA Corporation
    Inventors: Brian Fahs, Ming Y. Siu, Brett W. Coon, John R. Nickolls, Lars Nyland
  • Patent number: 8522000
    Abstract: A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment.
    Type: Grant
    Filed: September 29, 2009
    Date of Patent: August 27, 2013
    Assignee: NVIDIA Corporation
    Inventors: Michael C. Shebanow, Jack Choquette, Brett W. Coon, Steven J. Heinrich, Aravind Kalaiah, John R. Nickolls, Daniel Salinas, Ming Y. Siu, Tommy Thorn, Nicholas Wang
  • Patent number: 8392669
    Abstract: One embodiment of the present invention sets forth a technique for efficiently and flexibly performing coalesced memory accesses for a thread group. For each read application request that services a thread group, the core interface generates one pending request table (PRT) entry and one or more memory access requests. The core interface determines the number of memory access requests and the size of each memory access request based on the spread of the memory access addresses in the application request. Each memory access request specifies the particular threads that the memory access request services. The PRT entry tracks the number of pending memory access requests. As the memory interface completes each memory access request, the core interface uses information in the memory access request and the corresponding PRT entry to route the returned data.
    Type: Grant
    Filed: November 26, 2008
    Date of Patent: March 5, 2013
    Assignee: NVIDIA Corporation
    Inventors: Lars Nyland, John R. Nickolls, Gentaro Hirota, Tanmoy Mandal
  • Patent number: 8375176
    Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
    Type: Grant
    Filed: October 18, 2011
    Date of Patent: February 12, 2013
    Assignee: NVIDIA Corporation
    Inventors: Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills
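
As an illustration only, the serialization scheme summarized in the abstract of patent 8645638 above can be modeled in a few lines of Python. This is a software sketch under stated assumptions, not the patented hardware design: the function name `serialize_requests`, the `(thread_id, target_address)` request format, and the rule of selecting the first pending request's address are all hypothetical choices made for the example. The abstract only says that one target address is selected, that all requests for it proceed together, and that the rest are deferred.

```python
def serialize_requests(requests):
    """Group parallel memory requests into serialized passes.

    requests: list of (thread_id, target_address) tuples (hypothetical format).
    Returns a list of (address, [thread_ids]) passes, one per distinct address.
    """
    pending = list(requests)
    passes = []
    while pending:
        # Assumption: select the address of the first pending request.
        selected = pending[0][1]
        # Every request that targets the selected address proceeds together.
        served = [tid for tid, addr in pending if addr == selected]
        # All other requests are deferred and reconsidered on the next pass.
        pending = [(tid, addr) for tid, addr in pending if addr != selected]
        passes.append((selected, served))
    return passes


# Five threads issue requests to three distinct addresses.
example = [(0, 0x10), (1, 0x20), (2, 0x10), (3, 0x30), (4, 0x20)]
for address, thread_ids in serialize_requests(example):
    print(hex(address), thread_ids)
```

In this model the number of passes equals the number of distinct target addresses, which matches the abstract's statement that each different target address in the group is accessed exactly once.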