Patents by Inventor Michael A. Fetterman

Michael A. Fetterman has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10152329
    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced.
    Type: Grant
    Filed: February 9, 2012
    Date of Patent: December 11, 2018
    Assignee: NVIDIA CORPORATION
    Inventors: Michael Fetterman, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
  • Patent number: 10095548
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Grant
    Filed: May 21, 2012
    Date of Patent: October 9, 2018
    Assignee: NVIDIA CORPORATION
    Inventors: Michael Fetterman, Shirish Gadre, John H. Edmondson, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Rajeshwaran Selvanesan, Charles McCarver, Kevin Mitchell, Steven James Heinrich
  • Patent number: 10007527
    Abstract: One embodiment of the present invention sets forth a technique for processing load instructions for parallel threads of a thread group when a sub-set of the parallel threads request the same memory address. The load/store unit determines if the memory addresses for each sub-set of parallel threads match based on one or more uniform patterns. When a match is achieved for at least one of the uniform patterns, the load/store unit transmits a read request to retrieve data for the sub-set of parallel threads. The number of read requests transmitted is reduced compared with performing a separate read request for each thread in the sub-set. A variety of uniform patterns may be defined based on common access patterns present in program instructions. A variety of uniform patterns may also be defined based on interconnect constraints between the load/store unit and the memory when a full crossbar interconnect is not available.
    Type: Grant
    Filed: March 5, 2012
    Date of Patent: June 26, 2018
    Assignee: NVIDIA CORPORATION
    Inventors: Michael Fetterman, Stewart Glenn Carlton, Douglas J. Hahn, Rajeshwaran Selvanesan, Shirish Gadre, Steven James Heinrich
  • Patent number: 9836325
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Grant
    Filed: May 21, 2012
    Date of Patent: December 5, 2017
    Assignee: NVIDIA Corporation
    Inventors: Michael Fetterman, Shirish Gadre, John H. Edmondson, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Rajeshwaran Selvanesan, Charles McCarver, Kevin Mitchell, Steven James Heinrich
  • Patent number: 9817668
    Abstract: One embodiment of the present invention sets forth an approach for executing replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data which are distributed across multiple cache lines, then the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back.
    Type: Grant
    Filed: December 16, 2011
    Date of Patent: November 14, 2017
    Assignee: NVIDIA Corporation
    Inventors: Michael Fetterman, Jack Hilaire Choquette, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Stewart Glenn Carlton, Rajeshwaran Selvanesan, Douglas J. Hahn, Steven James Heinrich
  • Patent number: 9755994
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Grant
    Filed: May 21, 2012
    Date of Patent: September 5, 2017
    Assignee: NVIDIA Corporation
    Inventors: Michael Fetterman, Shirish Gadre, John H. Edmondson, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Rajeshwaran Selvanesan, Charles McCarver, Kevin Mitchell, Steven James Heinrich
  • Patent number: 9626191
    Abstract: One embodiment of the present invention sets forth a technique for performing a shaped access of a register file that includes a set of N registers, wherein N is greater than or equal to two. The technique involves, for at least one thread included in a group of threads, receiving a request to access a first amount of data from each register in the set of N registers, and configuring a crossbar to allow the at least one thread to access the first amount of data from each register in the set of N registers.
    Type: Grant
    Filed: December 22, 2011
    Date of Patent: April 18, 2017
    Assignee: NVIDIA Corporation
    Inventors: Jack Hilaire Choquette, Michael Fetterman, Shirish Gadre, Xiaogang Qiu, Omkar Paranjape, Anjana Rajendran, Stewart Glenn Carlton, Eric Lyell Hill, Rajeshwaran Selvanesan, Douglas J. Hahn
  • Patent number: 9262174
    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to perform read and write a memory accessing different numbers of bits per bank, e.g., 32-bits per bank, 64-bits per bank, or 128-bits per bank. On each clock cycle an access request may be received from one of the application programs and per processing thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accesses compared with using a single bank mapping for all accesses.
    Type: Grant
    Filed: April 5, 2012
    Date of Patent: February 16, 2016
    Assignee: NVIDIA Corporation
    Inventors: Michael Fetterman, Stewart Glenn Carlton, Douglas J. Hahn, Rajeshwaran Selvanesan, Shirish Gadre, Steven James Heinrich
  • Publication number: 20130311686
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Application
    Filed: May 21, 2012
    Publication date: November 21, 2013
    Inventors: Michael FETTERMAN, Shirish Gadre, John H. Edmondson, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Rajeshwaran Selvanesan, Charles McCarver, Kevin Mitchell, Steven James Heinrich
  • Publication number: 20130311999
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Application
    Filed: May 21, 2012
    Publication date: November 21, 2013
    Inventors: Michael FETTERMAN, Shirish Gadre, John H. Edmondson, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Rajeshwaran Selvanesan, Charles McCarver, Kevin Mitchell, Steven James Heinrich
  • Publication number: 20130311996
    Abstract: One embodiment of the present disclosure sets forth an effective way to maintain fairness and order in the scheduling of common resource access requests related to replay operations. Specifically, a streaming multiprocessor (SM) includes a total order queue (TOQ) configured to schedule the access requests over one or more execution cycles. Access requests are allowed to make forward progress when needed common resources have been allocated to the request. Where multiple access requests require the same common resource, priority is given to the older access request. Access requests may be placed in a sleep state pending availability of certain common resources. Deadlock may be avoided by allowing an older access request to steal resources from a younger resource request. One advantage of the disclosed technique is that older common resource access requests are not repeatedly blocked from making forward progress by newer access requests.
    Type: Application
    Filed: May 21, 2012
    Publication date: November 21, 2013
    Inventors: Michael FETTERMAN, Shirish GADRE, John H. EDMONDSON, Omkar PARANJAPE, Anjana RAJENDRAN, Eric Lyell HILL, Rajeshwaran SELVANESAN, Charles McCARVER, Kevin MITCHELL, Steven James HEINRICH
  • Publication number: 20130268715
    Abstract: One embodiment sets forth a technique for dynamically mapping addresses to banks of a multi-bank memory based on a bank mode. Application programs may be configured to perform read and write a memory accessing different numbers of bits per bank, e.g., 32-bits per bank, 64-bits per bank, or 128-bits per bank. On each clock cycle an access request may be received from one of the application programs and per processing thread addresses of the access request are dynamically mapped based on the bank mode to produce a set of bank addresses. The bank addresses are then used to access the multi-bank memory. Allowing different bank mappings enables each application program to avoid bank conflicts when the memory is accesses compared with using a single bank mapping for all accesses.
    Type: Application
    Filed: April 5, 2012
    Publication date: October 10, 2013
    Inventors: Michael FETTERMAN, Stewart Glenn Carlton, Douglas J. Hahn, Rajeshwaran Selvanesan, Shirish Gadre, Steven James Heinrich
  • Publication number: 20130232322
    Abstract: One embodiment of the present invention sets forth a technique for processing load instructions for parallel threads of a thread group when a sub-set of the parallel threads request the same memory address. The load/store unit determines if the memory addresses for each sub-set of parallel threads match based on one or more uniform patterns. When a match is achieved for at least one of the uniform patterns, the load/store unit transmits a read request to retrieve data for the sub-set of parallel threads. The number of read requests transmitted is reduced compared with performing a separate read request for each thread in the sub-set. A variety of uniform patterns may be defined based on common access patterns present in program instructions. A variety of uniform patterns may also be defined based on interconnect constraints between the load/store unit and the memory when a full crossbar interconnect is not available.
    Type: Application
    Filed: March 5, 2012
    Publication date: September 5, 2013
    Inventors: Michael FETTERMAN, Stewart Glenn Carlton, Douglas J. Hahn, Rajeshwaran Selvanesan, Shirish Gadre, Steven James Heinrich
  • Publication number: 20130212364
    Abstract: One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced.
    Type: Application
    Filed: February 9, 2012
    Publication date: August 15, 2013
    Inventors: Michael FETTERMAN, Stewart Glenn Carlton, Jack Hilaire Choquette, Shirish Gadre, Olivier Giroux, Douglas J. Hahn, Steven James Heinrich, Eric Lyell Hill, Charles McCarver, Omkar Paranjape, Anjana Rajendran, Rajeshwaran Selvanesan
  • Patent number: 8510536
    Abstract: Techniques for vector completion mask (VCM) handling are provided. A data structure includes a mask field for each operand of a particular operation. A processor attempts to execute the operation with multiple operands, which are identified in the data structure by the mask fields. If operands are successfully retrieved for execution with the operation, then the corresponding mask field within the data structure is cleared. The processor can reset if any field remains set within the data structure and can re-process the operation with operands that were not previously handled with the operation.
    Type: Grant
    Filed: June 28, 2012
    Date of Patent: August 13, 2013
    Assignee: Intel Corporation
    Inventors: Stephan Jourdan, Michael Fetterman, Michael Cornaby, Per Hammarlund, Ronak Signhal, Glenn Hinton
  • Publication number: 20130166882
    Abstract: Systems and methods for scheduling instructions without instruction decode. In one embodiment, a multi-core processor includes a scheduling unit in each core for scheduling instructions from two or more threads scheduled for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The scheduling unit includes a macro-scheduler unit for performing a priority sort of the two or more threads and a micro-scheduler arbiter for determining the highest order thread that is ready to execute. The macro-scheduler unit and the micro-scheduler arbiter use pre-decode data to implement the scheduling algorithm. The pre-decode data may be generated by decoding only a small portion of the instruction or received along with the instruction. Once the micro-scheduler arbiter has selected an instruction to dispatch to the execution unit, a decode unit fully decodes the instruction.
    Type: Application
    Filed: December 22, 2011
    Publication date: June 27, 2013
    Inventors: Jack Hilaire CHOQUETTE, Robert J. STOLL, Olivier GIROUX, Michael FETTERMAN, Shirish GADRE, Robert Steven GLANVILLE, Alexandre JOLY
  • Publication number: 20130166877
    Abstract: One embodiment of the present invention sets forth a technique for performing a shaped access of a register file that includes a set of N registers, wherein N is greater than or equal to two. The technique involves, for at least one thread included in a group of threads, receiving a request to access a first amount of data from each register in the set of N registers, and configuring a crossbar to allow the at least one thread to access the first amount of data from each register in the set of N registers.
    Type: Application
    Filed: December 22, 2011
    Publication date: June 27, 2013
    Inventors: Jack Hilaire CHOQUETTE, Michael FETTERMAN, Shirish GADRE, Xiaogang QIU, Omkar PARANJAPE, Anjana RAJENDRAN, Stewart Glenn CARLTON, Eric Lyell HILL, Rajeshwaran SELVANESAN, Douglas J. HAHN
  • Publication number: 20130159684
    Abstract: One embodiment of the present invention sets forth an optimized way to execute replay operations for divergent operations in a parallel processing subsystem. Specifically, the streaming multiprocessor (SM) includes a multistage pipeline configured to batch two or more replay operations for processing via replay loop. A logic element within the multistage pipeline detects whether the current pipeline stage is accessing a shared resource, such as loading data from a shared memory. If the threads are accessing data which are distributed across multiple cache lines, then the multistage pipeline batches two or more replay operations, where the replay operations are inserted into the pipeline back-to-back. Advantageously, divergent operations requiring two or more replay operations operate with reduced latency. Where memory access operations require transfer of more than two cache lines to service all threads, the number of clock cycles required to complete all replay operations is reduced.
    Type: Application
    Filed: December 16, 2011
    Publication date: June 20, 2013
    Inventors: Michael Fetterman, Jack Hilaire Choquette, Omkar Paranjape, Anjana Rajendran, Eric Lyell Hill, Stewart glenn Carlton, Rajeshwaran Selvanesan, Douglas J. Hahn, Steven James Heinrich
  • Publication number: 20120272046
    Abstract: Techniques for vector completion mask (VCM) handling are provided. A data structure includes a mask field for each operand of a particular operation. A processor attempts to execute the operation with multiple operands, which are identified in the data structure by the mask fields. If operands are successfully retrieved for execution with the operation, then the corresponding mask field within the data structure is cleared. The processor can reset if any field remains set within the data structure and can re-process the operation with operands that were not previously handled with the operation.
    Type: Application
    Filed: June 28, 2012
    Publication date: October 25, 2012
    Inventors: Stephan Jourdan, Michael Fetterman, Michael Cornaby, Per Hammarlund, Ronak Signhal, Glenn Hinton
  • Patent number: D853392
    Type: Grant
    Filed: December 7, 2018
    Date of Patent: July 9, 2019
    Assignee: Apple Inc.
    Inventors: Jody Akana, Molly Anderson, Bartley K. Andre, Shota Aoyagi, Anthony Michael Ashcroft, Marine C. Bataille, Jeremy Bataillou, Robert Y. Cao, James M. Cuseo, Daniele De Iuliis, Markus Diebel, Kevin Scott Fetterman, M. Evans Hankey, Julian Hoenig, Richard P. Howarth, Jonathan P. Ive, Julian Jaede, Duncan Robert Kerr, Dinesh Mathew, Sabrina Paseman, Gavin J. Reid, Peter Russell-Clarke, Benjamin Andrew Shaffer, Mikael Silvanto, Brandon S. Smith, Christopher J. Stringer, Joe Sung-Ho Tan, Clement Tissandier, Eugene Antony Whang, Rico Zörkendörfer