Patents by Inventor Lacky V. Shah

Lacky V. Shah has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 9069609
    Abstract: One embodiment of the present invention sets forth a technique for assigning a compute task to a first processor included in a plurality of processors. The technique involves analyzing each compute task in a plurality of compute tasks to identify one or more compute tasks that are eligible for assignment to the first processor, where each compute task is listed in a first table and is associated with a priority value and an allocation order that indicates the relative time at which the compute task was added to the first table. The technique further involves selecting a first compute task from the identified one or more compute tasks based on at least one of the priority value and the allocation order, and assigning the first compute task to the first processor for execution. (A minimal sketch of this selection policy follows this entry.)
    Type: Grant
    Filed: January 18, 2012
    Date of Patent: June 30, 2015
    Assignee: NVIDIA Corporation
    Inventors: Karim M. Abdalla, Lacky V. Shah, Jerome F. Duluk, Jr., Timothy John Purcell, Tanmoy Mandal, Gentaro Hirota
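    The selection policy above reduces to an ordering over (priority, allocation order). Below is a minimal C++ sketch of that policy; the `TaskEntry` record, its field names, and the lower-is-higher priority convention are assumptions for illustration, not the patent's hardware interface.
    ```cpp
    #include <cstdint>
    #include <vector>

    // Hypothetical entry in the "first table" from the abstract.
    struct TaskEntry {
        uint32_t priority;    // lower value = higher priority (assumed convention)
        uint64_t allocOrder;  // monotonically increasing insertion counter
        bool     eligible;    // eligible for assignment to the first processor
    };

    // Pick the eligible task with the best priority, breaking ties by the
    // order in which tasks were added to the table (oldest first).
    const TaskEntry* selectTask(const std::vector<TaskEntry>& table) {
        const TaskEntry* best = nullptr;
        for (const TaskEntry& t : table) {
            if (!t.eligible) continue;
            if (!best || t.priority < best->priority ||
                (t.priority == best->priority && t.allocOrder < best->allocOrder))
                best = &t;
        }
        return best;  // nullptr if no task is eligible
    }
    ```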
  • Patent number: 8984183
    Abstract: One embodiment of the present invention sets forth a technique that enables the insertion of generated tasks into a scheduling pipeline of a multiple processor system, allowing a compute task that is being executed to dynamically generate a dynamic task and notify a scheduling unit of the multiple processor system without intervention by a CPU. A reflected notification signal is generated in response to a write request when data for the dynamic task is written to a queue. Additional reflected notification signals are generated for other events that occur during execution of a compute task, e.g., to invalidate cache entries storing data for the compute task and to enable scheduling of another compute task. (A sketch of the notify-on-write idea follows this entry.)
    Type: Grant
    Filed: December 16, 2011
    Date of Patent: March 17, 2015
    Assignee: NVIDIA Corporation
    Inventors: Timothy John Purcell, Lacky V. Shah, Jerome F. Duluk, Jr., Sean J. Treichler, Karim M. Abdalla, Philip Alexander Cuadra, Brian Pharris
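    A toy model of the notify-on-write mechanism described above, assuming a hypothetical `NotifyingQueue` whose push invokes a scheduler callback; the real reflected-notification signal is a hardware path, so this only models the control flow.
    ```cpp
    #include <functional>
    #include <queue>
    #include <utility>

    // Writing dynamic-task data into the queue synchronously raises a
    // notification to the scheduling unit, with no separate CPU-driven step.
    template <typename Task>
    class NotifyingQueue {
    public:
        using Callback = std::function<void(const Task&)>;
        explicit NotifyingQueue(Callback onWrite) : onWrite_(std::move(onWrite)) {}

        void push(const Task& t) {
            q_.push(t);
            onWrite_(t);  // the write itself generates the notification
        }

    private:
        std::queue<Task> q_;
        Callback onWrite_;
    };
    ```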
  • Patent number: 8976185
    Abstract: One embodiment of the present invention sets forth a technique for executing an operation once work associated with a version of a state object has been completed. The technique includes receiving the version of the state object at a first stage in a processing pipeline, where the version of the state object is associated with a reference count object, determining that the version of the state object is relevant to the first stage, incrementing a counter included in the reference count object, transmitting the version of the state object to a second stage in the processing pipeline, processing work associated with the version of the state object, decrementing the counter, determining that the counter is equal to zero, and in response, executing an operation specified by the reference count object. (A minimal sketch follows this entry.)
    Type: Grant
    Filed: November 11, 2011
    Date of Patent: March 10, 2015
    Assignee: NVIDIA Corporation
    Inventors: Sean J. Treichler, Lacky V. Shah, Daniel Elliot Wexler
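    A minimal C++ analogue of the reference count object: stages acquire on receipt, release after processing, and the operation attached to the object runs when the counter returns to zero. Class and member names are assumptions.
    ```cpp
    #include <atomic>
    #include <functional>
    #include <utility>

    class RefCountObject {
    public:
        explicit RefCountObject(std::function<void()> onZero)
            : count_(0), onZero_(std::move(onZero)) {}

        void acquire() { count_.fetch_add(1, std::memory_order_relaxed); }

        void release() {
            // fetch_sub returns the prior value; 1 means we just hit zero.
            if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1)
                onZero_();  // execute the operation specified by the object
        }

    private:
        std::atomic<int> count_;
        std::function<void()> onZero_;
    };
    ```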
  • Patent number: 8948167
    Abstract: One embodiment of the present invention is a control unit for distributing packets of work to one or more consumers of work. The control unit is configured to assign at least one processing domain from a set of processing domains to each consumer included in the one or more consumers, receive a plurality of packets of work from at least one producer of work, wherein each packet of work is associated with a processing domain from the set of processing domains, and a first packet of work associated with a first processing domain can be processed by the one or more consumers independently of a second packet of work associated with a second processing domain, identify a first consumer that has been assigned the first processing domain, and transmit the first packet of work to the first consumer for processing. (A short routing sketch follows this entry.)
    Type: Grant
    Filed: September 15, 2011
    Date of Patent: February 3, 2015
    Assignee: NVIDIA Corporation
    Inventors: Lacky V. Shah, Sean J. Treichler, Abraham B. de Waal
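    The routing step can be pictured as a domain-to-consumer map, as in this sketch; `Packet`, `ControlUnit`, and the single-owner-per-domain simplification are illustrative assumptions (the patent allows a consumer to be assigned several domains).
    ```cpp
    #include <unordered_map>

    struct Packet { int domain; /* payload elided */ };

    class ControlUnit {
    public:
        // Assign a processing domain to a consumer.
        void assignDomain(int consumerId, int domain) {
            domainToConsumer_[domain] = consumerId;
        }
        // Identify the consumer that owns the packet's domain.
        // Throws std::out_of_range if the domain was never assigned.
        int route(const Packet& p) const { return domainToConsumer_.at(p.domain); }

    private:
        std::unordered_map<int, int> domainToConsumer_;  // domain -> consumer
    };
    ```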
  • Publication number: 20140184617
    Abstract: One embodiment of the present invention sets forth a technique for mid-primitive execution preemption. When preemption is initiated, no new instructions are issued, in-flight instructions progress to an execution unit boundary, and the execution state is unloaded from the processing pipeline. The execution units within the processing pipeline, including the coarse rasterization unit, complete execution of in-flight instructions and become idle. However, rasterization of a triangle may be preempted at a coarse raster region boundary. The amount of context state to be stored is reduced because the execution units are idle. Preempting at the mid-primitive level during rasterization reduces the time from when preemption is initiated to when another process can execute, because the entire triangle is not rasterized. (A toy model of the region-boundary check follows this entry.)
    Type: Application
    Filed: December 27, 2012
    Publication date: July 3, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Gregory Scott PALMER, Ziyad S. HAKURA, Emmett M. KILGARIFF, Dale L. KIRKLAND, Lacky V. SHAH
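    A toy model of preemption at a coarse raster region boundary: the loop over a triangle's coarse regions checks a preempt flag between regions, so rasterization can stop mid-primitive and resume later. All names are illustrative; the patent describes hardware, not software, behavior.
    ```cpp
    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct CoarseRegion { /* screen tile covered by the triangle */ };

    // Returns the index of the next unprocessed region (the resume point),
    // or regions.size() if the triangle was fully rasterized.
    std::size_t rasterize(const std::vector<CoarseRegion>& regions,
                          std::size_t resumeFrom,
                          const std::atomic<bool>& preemptRequested) {
        for (std::size_t i = resumeFrom; i < regions.size(); ++i) {
            if (preemptRequested.load(std::memory_order_acquire))
                return i;  // stop at a region boundary; `i` is the saved context
            // ... emit fine-rasterization work for regions[i] ...
        }
        return regions.size();
    }
    ```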
  • Publication number: 20140189260
    Abstract: A streaming multiprocessor in a parallel processing subsystem processes atomic operations for multiple threads in a multi-threaded architecture. The streaming multiprocessor receives a request from a thread in a thread group to acquire access to a memory location in a lock-protected shared memory, and determines whether an address lock in a plurality of address locks is asserted, where the address lock is associated with the memory location. If the address lock is asserted, then the streaming multiprocessor refuses the request. Otherwise, the streaming multiprocessor asserts the address lock, asserts a thread group lock in a plurality of thread group locks, where the thread group lock is associated with the thread group, and grants the request. One advantage of the disclosed techniques is that acquired locks are released when a thread is preempted. As a result, a preempted thread that has previously acquired a lock does not retain the lock indefinitely. (A sketch of the two lock tables follows this entry.)
    Type: Application
    Filed: December 27, 2012
    Publication date: July 3, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Nicholas WANG, Shirish GADRE, Robert OHANNESSIAN, Lacky V. SHAH, Matthew BROCKMEYER, Stewart Glenn CARLTON
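    The two lock tables can be sketched as maps, as below; tracking which addresses a thread group holds (so preemption can release them all) is an assumed bookkeeping detail, and the types are illustrative.
    ```cpp
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    class LockUnit {
    public:
        // Grant the request only if no address lock is asserted for `addr`.
        bool tryAcquire(int threadGroup, std::uint64_t addr) {
            if (addressLocks_.count(addr)) return false;  // asserted: refuse
            addressLocks_[addr] = threadGroup;            // assert address lock
            groupLocks_[threadGroup].insert(addr);        // assert group lock
            return true;
        }
        // On preemption, release every lock the thread group acquired.
        void releaseAllFor(int threadGroup) {
            for (std::uint64_t a : groupLocks_[threadGroup]) addressLocks_.erase(a);
            groupLocks_.erase(threadGroup);
        }

    private:
        std::unordered_map<std::uint64_t, int> addressLocks_;  // addr -> owner
        std::unordered_map<int, std::unordered_set<std::uint64_t>> groupLocks_;
    };
    ```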
  • Publication number: 20140189329
    Abstract: Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The context switch is performed for at least one of the execution units as part of the trap handling routine, while the remaining execution units are allowed to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors. (A toy version of this control flow follows this entry.)
    Type: Application
    Filed: December 27, 2012
    Publication date: July 3, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Gerald F. LUIZ, Philip Alexander CUADRA, Luke DURANT, Shirish GADRE, Robert OHANNESSIAN, Lacky V. SHAH, Nicholas WANG, Arthur DANSKIN
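    A toy version of the handler's control flow, assuming a shared set that records which threads trapped; units whose threads are not recorded exit the routine before the context switch. Everything here is a software stand-in for hardware behavior.
    ```cpp
    #include <unordered_set>

    struct TrapState {
        std::unordered_set<int> trappedThreads;  // ids recorded when traps occur
    };

    void trapHandler(int threadId, TrapState& state) {
        if (state.trappedThreads.count(threadId) == 0)
            return;  // no trap recorded for this unit's thread: exit early
        // ... save execution state and perform the context switch ...
        state.trappedThreads.erase(threadId);
    }
    ```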
  • Publication number: 20140189711
    Abstract: Techniques are provided for restoring thread groups in a cooperative thread array (CTA) within a processing core. Each thread group in the CTA is launched to execute a context restore routine. Each thread group executes the context restore routine to restore from memory a first portion of context associated with the thread group, and determines whether the thread group completed an assigned function prior to executing the context restore routine. If the thread group completed an assigned function prior to executing the context restore routine, then the thread group exits the context restore routine. If the thread group did not complete the assigned function prior to executing the context restore routine, then the thread group executes one or more operations associated with a trap handler routine. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors. (A sketch of the routine follows this entry.)
    Type: Application
    Filed: April 15, 2013
    Publication date: July 3, 2014
    Inventors: Gerald F. Luiz, Phillip Alexander Cuadra, Luke Durant, Shirish Gadre, Robert Ohannessian, Lacky V. Shah, Nicholas Wang, Arthur Merlin Danskin
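    The restore routine in the abstract has a simple shape, sketched below; `restoreState` and `runTrapHandlerOps` are hypothetical stubs for the unspecified hardware steps.
    ```cpp
    struct ThreadGroupContext { bool completed = false; /* saved state */ };

    // Hypothetical stand-ins for the memory restore and trap-handler work.
    ThreadGroupContext restoreState(int /*threadGroup*/) { return {}; }
    void runTrapHandlerOps(int /*threadGroup*/) {}

    void contextRestoreRoutine(int threadGroup) {
        // Restore the first portion of the thread group's context from memory.
        ThreadGroupContext ctx = restoreState(threadGroup);
        if (ctx.completed)
            return;                      // function already finished: exit
        runTrapHandlerOps(threadGroup);  // otherwise continue via trap handler
    }
    ```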
  • Publication number: 20140168245
    Abstract: A texture processing pipeline can be configured to service memory access requests that represent texture data access operations or generic data access operations. When the texture processing pipeline receives a memory access request that represents a texture data access operation, the texture processing pipeline may retrieve texture data based on texture coordinates. When the memory access request represents a generic data access operation, the texture pipeline extracts a virtual address from the memory access request and then retrieves data based on the virtual address. The texture processing pipeline is also configured to cache generic data retrieved on behalf of a group of threads and to then invalidate that generic data when the group of threads exits. (A dispatch sketch follows this entry.)
    Type: Application
    Filed: December 19, 2012
    Publication date: June 19, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Brian Fahs, Eric T. Anderson, Nick Barrow-Williams, Shirish Gadre, Joel James McCormack, Bryon S. Nordquist, Nirmal Raj Saxena, Lacky V. Shah
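    The head-of-pipeline dispatch reduces to a branch on the request type, as in this sketch; the request layout and the `addressFromCoords` stub are assumptions, since the abstract does not give the texture-addressing math.
    ```cpp
    #include <cstdint>

    struct MemoryAccessRequest {
        bool isTexture;                    // texture vs. generic access
        float u = 0, v = 0;                // texture coordinates (texture path)
        std::uint64_t virtualAddress = 0;  // virtual address (generic path)
    };

    // Stub: real texture addressing involves formats, mip levels, etc.
    std::uint64_t addressFromCoords(float u, float v) {
        return static_cast<std::uint64_t>(u) * 4096 + static_cast<std::uint64_t>(v);
    }

    void fetch(std::uint64_t /*addr*/) { /* read memory */ }

    void service(const MemoryAccessRequest& req) {
        if (req.isTexture)
            fetch(addressFromCoords(req.u, req.v));  // coordinates -> address
        else
            fetch(req.virtualAddress);  // generic path: address arrives directly
    }
    ```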
  • Publication number: 20140173193
    Abstract: A tag unit configured to manage a cache unit includes a coalescer that implements a set hashing function. The set hashing function maps a virtual address to a particular content-addressable memory unit (CAM). The coalescer implements the set hashing function by splitting the virtual address into upper, middle, and lower portions. The upper portion is further divided into even-indexed bits and odd-indexed bits. The even-indexed bits are reduced to a single bit using an XOR tree, and the odd-indexed bits are reduced in like fashion. Those single bits are combined with the middle portion of the virtual address to provide a CAM number that identifies a particular CAM. The identified CAM is queried to determine the presence of a tag portion of the virtual address, indicating a cache hit or cache miss. (A direct sketch of this hash follows this entry.)
    Type: Application
    Filed: December 19, 2012
    Publication date: June 19, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Brian Fahs, Eric T. Anderson, Nick Barrow-Williams, Shirish Gadre, Joel James McCormack, Bryon S. Nordquist, Nirmal Raj Saxena, Lacky V. Shah
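    The set hashing function is concrete enough to transcribe almost directly; in this sketch the split points between the upper, middle, and lower portions are illustrative choices, since the abstract does not give field widths.
    ```cpp
    #include <cstdint>

    // XOR tree: fold all bits of the word down to a single parity bit.
    std::uint32_t xorReduce(std::uint64_t bits) {
        bits ^= bits >> 32; bits ^= bits >> 16; bits ^= bits >> 8;
        bits ^= bits >> 4;  bits ^= bits >> 2;  bits ^= bits >> 1;
        return static_cast<std::uint32_t>(bits & 1);
    }

    std::uint32_t camNumber(std::uint64_t va) {
        std::uint64_t upper  = va >> 16;         // assumed split points:
        std::uint64_t middle = (va >> 12) & 0xF; // 4 "middle" bits, 12 "lower"

        // Separate the upper portion into even-indexed and odd-indexed bits.
        std::uint64_t evenBits = 0, oddBits = 0;
        for (int i = 0; i < 48; ++i)
            ((i & 1) ? oddBits : evenBits) |= ((upper >> i) & 1) << (i / 2);

        // Reduce each half to one bit and combine with the middle portion.
        return (xorReduce(evenBits) << 5) | (xorReduce(oddBits) << 4) |
               static_cast<std::uint32_t>(middle);
    }
    ```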
  • Publication number: 20140173258
    Abstract: A texture processing pipeline can be configured to service memory access requests that represent texture data access operations or generic data access operations. When the texture processing pipeline receives a memory access request that represents a texture data access operation, the texture processing pipeline may retrieve texture data based on texture coordinates. When the memory access request represents a generic data access operation, the texture pipeline extracts a virtual address from the memory access request and then retrieves data based on the virtual address. The texture processing pipeline is also configured to cache generic data retrieved on behalf of a group of threads and to then invalidate that generic data when the group of threads exits.
    Type: Application
    Filed: December 19, 2012
    Publication date: June 19, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Brian FAHS, Eric T. ANDERSON, Nick BARROW-WILLIAMS, Shirish GADRE, Joel James MCCORMACK, Bryon S. NORDQUIST, Nirmal Raj SAXENA, Lacky V. SHAH
  • Publication number: 20140165072
    Abstract: A streaming multiprocessor (SM) included within a parallel processing unit (PPU) is configured to suspend a thread group executing on the SM and to save the operating state of the suspended thread group. A load-store unit (LSU) within the SM re-maps local memory associated with the thread group to a location in global memory. Subsequently, the SM may re-launch the suspended thread group. The LSU may then perform local memory access operations on behalf of the re-launched thread group with the re-mapped local memory that resides in global memory. (A toy model of the re-mapping follows this entry.)
    Type: Application
    Filed: December 11, 2012
    Publication date: June 12, 2014
    Applicant: NVIDIA CORPORATION
    Inventors: Nicholas WANG, Lacky V. SHAH, Gerald F. LUIZ, Philip Alexander CUADRA, Luke DURANT, Shirish GADRE
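    A toy model of the LSU's re-mapping: while a thread group is suspended, its local memory lives in a backing buffer standing in for global memory, and local accesses on behalf of the re-launched group are redirected there. The container choice is purely illustrative.
    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    class LocalMemoryRemapper {
    public:
        // On suspend, spill the group's local memory to the "global" backing.
        void suspend(int threadGroup, std::vector<std::uint8_t> localMem) {
            backing_[threadGroup] = std::move(localMem);
        }
        // After re-launch, local loads are serviced from the backing store.
        std::uint8_t load(int threadGroup, std::size_t localOffset) const {
            return backing_.at(threadGroup)[localOffset];
        }

    private:
        std::unordered_map<int, std::vector<std::uint8_t>> backing_;
    };
    ```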
  • Publication number: 20130298133
    Abstract: One embodiment of the present invention sets forth a technique for performing nested kernel execution within a parallel processing subsystem. The technique involves enabling a parent thread to launch a nested child grid on the parallel processing subsystem, and enabling the parent thread to perform a thread synchronization barrier on the child grid for proper execution semantics between the parent thread and the child grid. This technique advantageously enables the parallel processing subsystem to perform a richer set of programming constructs, such as conditionally executed and nested operations and externally defined library functions, without the additional complexity of CPU involvement. (A host-side analogy follows this entry.)
    Type: Application
    Filed: May 2, 2012
    Publication date: November 7, 2013
    Inventors: Stephen Jones, Philip Alexander Cuadra, Daniel Elliot Wexler, Ignacio Llamas, Lacky V. Shah, Jerome F. Duluk, Jr., Christopher Lamb
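    A host-side analogy (std::async standing in for the GPU launch mechanism): the parent launches a "child grid" of tasks and then blocks on a synchronization barrier until the child completes, with no outside coordinator. This models the control flow only, not the GPU mechanism itself.
    ```cpp
    #include <future>
    #include <vector>

    void childWork(int /*index*/) { /* body of one child-grid thread */ }

    void parentTask(int childGridSize) {
        std::vector<std::future<void>> child;
        child.reserve(childGridSize);
        for (int i = 0; i < childGridSize; ++i)  // nested "child grid" launch
            child.push_back(std::async(std::launch::async, childWork, i));
        for (auto& f : child) f.wait();          // barrier on the child grid
        // ... the parent resumes with the child's work guaranteed complete.
    }
    ```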
  • Publication number: 20130198759
    Abstract: A technique for controlling the distribution of compute task processing in a multi-threaded system encodes each processing task as task metadata (TMD) stored in memory. The TMD includes work distribution parameters specifying how the processing task should be distributed for processing. Scheduling circuitry selects a task for execution when entries of a work queue for the task have been written. The work distribution parameters may define a number of work queue entries needed before a "cooperative thread array" ("CTA") may be launched to process the work queue entries according to the compute task. The work distribution parameters may define a number of CTAs that are launched to process the same work queue entries. Finally, the work distribution parameters may define a step size that is used to update pointers to the work queue entries. (A sketch of these parameters follows this entry.)
    Type: Application
    Filed: January 31, 2012
    Publication date: August 1, 2013
    Inventors: Lacky V. SHAH, Karim M. ABDALLA, Sean J. TREICHLER, Abraham B. DE WAAL
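    The three work-distribution parameters named in the abstract map naturally onto a small struct; field names and the monotonic-counter launch test are assumptions for the sketch.
    ```cpp
    #include <cstdint>

    struct TaskMetadata {
        std::uint32_t entriesPerLaunch;  // queue entries needed before a CTA launches
        std::uint32_t ctasPerLaunch;     // CTAs launched over the same entries
        std::uint32_t stepSize;          // how far the queue pointer advances
    };

    // Launch condition: enough work-queue entries have been written.
    bool readyToLaunch(const TaskMetadata& tmd,
                       std::uint64_t written, std::uint64_t consumed) {
        return written - consumed >= tmd.entriesPerLaunch;
    }

    // After launching tmd.ctasPerLaunch CTAs, advance the work-queue pointer.
    std::uint64_t advance(const TaskMetadata& tmd, std::uint64_t consumed) {
        return consumed + tmd.stepSize;
    }
    ```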
  • Publication number: 20130198760
    Abstract: One embodiment of the present invention sets forth a technique for automatic launching of a dependent task when execution of a first task completes. Automatically launching the dependent task reduces the latency incurred during the transition from the first task to the dependent task. Information associated with the dependent task is encoded as part of the metadata for the first task. When execution of the first task completes, a task scheduling unit is notified and the dependent task is launched without requiring any release or acquisition of a semaphore. The information associated with the dependent task includes an enable flag and a pointer to the dependent task. Once the dependent task is launched, the first task is marked as complete so that memory storing the metadata for the first task may be reused to store metadata for a new task. (A minimal sketch follows this entry.)
    Type: Application
    Filed: January 27, 2012
    Publication date: August 1, 2013
    Inventors: Philip Alexander Cuadra, Lacky V. Shah, Timothy John Purcell, Gerald F. Luiz, Jerome F. Duluk, Jr.
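    The dependent-task metadata reduces to an enable flag plus a pointer, checked at completion time, as in this sketch; names and the `launch` stub are assumptions.
    ```cpp
    struct TaskMeta {
        bool      dependentEnable = false;   // launch a dependent task on completion?
        TaskMeta* dependentTask   = nullptr; // which task to launch
        bool      complete        = false;
    };

    void launch(TaskMeta* /*task*/) { /* hand the task to the scheduling unit */ }

    void onTaskComplete(TaskMeta& finished) {
        if (finished.dependentEnable && finished.dependentTask)
            launch(finished.dependentTask);  // no semaphore release/acquire needed
        finished.complete = true;  // the metadata slot may now be reused
    }
    ```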
  • Publication number: 20130185725
    Abstract: One embodiment of the present invention sets forth a technique for selecting a first processor included in a plurality of processors to receive work related to a compute task. The technique involves analyzing state data of each processor in the plurality of processors to identify one or more processors that have already been assigned one compute task and are eligible to receive work related to the one compute task, receiving, from each of the one or more processors identified as eligible, an availability value that indicates the capacity of the processor to receive new work, selecting a first processor to receive work related to the one compute task based on the availability values received from the one or more processors, and issuing, to the first processor via a cooperative thread array (CTA), the work related to the one compute task. (A selection sketch follows this entry.)
    Type: Application
    Filed: January 18, 2012
    Publication date: July 18, 2013
    Inventors: Karim M. Abdalla, Lacky V. Shah, Jerome F. Duluk, Jr., Timothy John Purcell, Tanmoy Mandal, Gentaro Hirota
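    The selection step can be sketched as a scan over per-processor state; the availability encoding (higher value = more capacity) is an assumed convention.
    ```cpp
    #include <cstdint>
    #include <vector>

    struct ProcessorState {
        bool          hasTask;       // already assigned this compute task
        std::uint32_t availability;  // reported capacity to receive new work
    };

    // Among processors already running the task, pick the most available one.
    int selectProcessor(const std::vector<ProcessorState>& procs) {
        int best = -1;
        for (int i = 0; i < static_cast<int>(procs.size()); ++i)
            if (procs[i].hasTask &&
                (best < 0 || procs[i].availability > procs[best].availability))
                best = i;
        return best;  // -1 if no eligible processor was found
    }
    ```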
  • Publication number: 20130185728
    Abstract: One embodiment of the present invention sets forth a technique for assigning a compute task to a first processor included in a plurality of processors. The technique involves analyzing each compute task in a plurality of compute tasks to identify one or more compute tasks that are eligible for assignment to the first processor, where each compute task is listed in a first table and is associated with a priority value and an allocation order that indicates the relative time at which the compute task was added to the first table. The technique further involves selecting a first compute task from the identified one or more compute tasks based on at least one of the priority value and the allocation order, and assigning the first compute task to the first processor for execution.
    Type: Application
    Filed: January 18, 2012
    Publication date: July 18, 2013
    Inventors: Karim M. Abdalla, Lacky V. Shah, Jerome F. Duluk, Jr., Timothy John Purcell, Tanmoy Mandal, Gentaro Hirota
  • Publication number: 20130160021
    Abstract: One embodiment of the present invention sets forth a technique that enables the insertion of generated tasks into a scheduling pipeline of a multiple processor system, allowing a compute task that is being executed to dynamically generate a dynamic task and notify a scheduling unit of the multiple processor system without intervention by a CPU. A reflected notification signal is generated in response to a write request when data for the dynamic task is written to a queue. Additional reflected notification signals are generated for other events that occur during execution of a compute task, e.g., to invalidate cache entries storing data for the compute task and to enable scheduling of another compute task.
    Type: Application
    Filed: December 16, 2011
    Publication date: June 20, 2013
    Inventors: Timothy John Purcell, Lacky V. Shah, Jerome F. Duluk, Jr., Sean J. Treichler, Karim M. Abdalla, Philip Alexander Cuadra, Brian Pharris
  • Publication number: 20130152093
    Abstract: A time slice group (TSG) is a grouping of different streams of work (referred to herein as "channels") that share the same context information. The set of channels belonging to a TSG are processed in a pre-determined order. However, when a channel stalls while processing, the next channel with independent work can be switched to fully load the parallel processing unit. Importantly, because each channel in the TSG shares the same context information, a context switch operation is not needed when the processing of a particular channel in the TSG stops and the processing of a next channel in the TSG begins. Therefore, multiple independent streams of work are allowed to run concurrently within a single context, increasing utilization of parallel processing units. (A toy scheduler loop follows this entry.)
    Type: Application
    Filed: December 9, 2011
    Publication date: June 13, 2013
    Inventors: Samuel H. DUNCAN, Lacky V. SHAH, Sean J. TREICHLER, Daniel Elliot WEXLER, Jerome F. DULUK, JR., Phillip Browning JOHNSON, Jonathon Stuart Ramsay EVANS
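    A toy scheduler loop for a time slice group: channels are visited in a fixed order, stalled channels are skipped, and because every channel in the TSG shares one context, switching between them needs no context switch. `Channel` and the round-robin policy details are illustrative.
    ```cpp
    #include <vector>

    struct Channel { bool stalled; /* stream of work sharing the TSG context */ };

    // Returns the next runnable channel after `current`, or -1 if all stalled.
    int nextChannel(const std::vector<Channel>& tsg, int current) {
        const int n = static_cast<int>(tsg.size());
        for (int step = 1; step <= n; ++step) {
            int idx = (current + step) % n;  // pre-determined order
            if (!tsg[idx].stalled)
                return idx;  // same context as `current`: no context switch
        }
        return -1;
    }
    ```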
  • Publication number: 20130132711
    Abstract: One embodiment of the present invention sets forth a technique for instruction-level and compute thread array granularity execution preemption. Preempting at the instruction level does not require any draining of the processing pipeline. No new instructions are issued and the context state is unloaded from the processing pipeline. When preemption is performed at a compute thread array boundary, the amount of context state to be stored is reduced because execution units within the processing pipeline complete execution of in-flight instructions and become idle. If the amount of time needed to complete execution of the in-flight instructions exceeds a threshold, then the preemption may dynamically change to be performed at the instruction level instead of at compute thread array granularity. (A sketch of the dynamic fallback follows this entry.)
    Type: Application
    Filed: November 22, 2011
    Publication date: May 23, 2013
    Inventors: Lacky V. Shah, Gregory Scott Palmer, Gernot Schaufler, Samuel H. Duncan, Philip Browning Johnson, Shirish Gadre, Timothy John Purcell
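    The dynamic fallback in the abstract is a threshold test, sketched here; the drain-time estimate and the threshold are inputs the abstract leaves unspecified, so their types are assumptions.
    ```cpp
    #include <chrono>

    enum class PreemptLevel { ThreadArray, Instruction };

    // CTA-boundary preemption stores less state but must wait for in-flight
    // instructions to drain; past the threshold, instruction-level preemption
    // becomes the cheaper option and is selected instead.
    PreemptLevel choosePreemption(std::chrono::microseconds drainEstimate,
                                  std::chrono::microseconds threshold) {
        return drainEstimate <= threshold ? PreemptLevel::ThreadArray
                                          : PreemptLevel::Instruction;
    }
    ```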