Patents by Inventor Michael J. Mantor

Michael J. Mantor has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20210201439
    Abstract: Systems, apparatuses, and methods for implementing a graphics processing unit (GPU) coprocessor are disclosed. The GPU coprocessor includes a SIMD unit with the ability to self-schedule sub-wave procedures based on input data flow events. A host processor sends messages targeting the GPU coprocessor to a queue. In response to detecting a first message in the queue, the GPU coprocessor schedules a first sub-task for execution. The GPU coprocessor includes an inter-lane crossbar and intra-lane biased indexing mechanism for a vector general purpose register (VGPR) file. The VGPR file is split into two files. The first VGPR file is a larger register file with one read port and one write port. The second VGPR file is a smaller register file with multiple read ports and one write port. The second VGPR introduces the ability to co-issue more than one instruction per clock cycle.
    Type: Application
    Filed: February 22, 2021
    Publication date: July 1, 2021
    Inventors: Jiasheng Chen, Timour Paltashev, Alexander Lyashevsky, Carl Kittredge Wakeland, Michael J. Mantor
  • Patent number: 10970081
    Abstract: Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.
    Type: Grant
    Filed: June 29, 2017
    Date of Patent: April 6, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Jiasheng Chen, Bin He, Mohammad Reza Hakami, Timothy Lottes, Justin David Smith, Michael J. Mantor, Derek Carson
  • Publication number: 20210090208
    Abstract: Methods and systems are described. A system includes a redundant shader pipe array that performs rendering calculations on data provided thereto and a shader pipe array that includes a plurality of shader pipes, each of which performs rendering calculations on data provided thereto. The system also includes a circuit that identifies a defective shader pipe of the plurality of shader pipes in the shader pipe array. In response to identifying the defective shader pipe, the circuit generates a signal. The system also includes a redundant shader switch. The redundant shader switch receives the generated signal, and, in response to receiving the generated signal, transfers the data for the defective shader pipe to the redundant shader pipe array.
    Type: Application
    Filed: December 7, 2020
    Publication date: March 25, 2021
    Applicant: Advanced Micro Devices, Inc.
    Inventors: Michael J. Mantor, Jeffrey T. Brady, Angel E. Socarras
  • Patent number: 10929944
    Abstract: Systems, apparatuses, and methods for implementing a graphics processing unit (GPU) coprocessor are disclosed. The GPU coprocessor includes a SIMD unit with the ability to self-schedule sub-wave procedures based on input data flow events. A host processor sends messages targeting the GPU coprocessor to a queue. In response to detecting a first message in the queue, the GPU coprocessor schedules a first sub-task for execution. The GPU coprocessor includes an inter-lane crossbar and intra-lane biased indexing mechanism for a vector general purpose register (VGPR) file. The VGPR file is split into two files. The first VGPR file is a larger register file with one read port and one write port. The second VGPR file is a smaller register file with multiple read ports and one write port. The second VGPR introduces the ability to co-issue more than one instruction per clock cycle.
    Type: Grant
    Filed: November 23, 2016
    Date of Patent: February 23, 2021
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Jiasheng Chen, Timour Paltashev, Alexander Lyashevsky, Carl Kittredge Wakeland, Michael J. Mantor
  • Publication number: 20210011760
    Abstract: Systems, apparatuses, and methods for abstracting tasks in virtual memory identifier (VMID) containers are disclosed. A processor coupled to a memory executes a plurality of concurrent tasks including a first task. Responsive to detecting one or more instructions of the first task which correspond to a first operation, the processor retrieves a first identifier (ID) which is used to uniquely identify the first task, wherein the first ID is transparent to the first task. Then, the processor maps the first ID to a second ID and/or a third ID. The processor completes the first operation by using the second ID and/or the third ID to identify the first task to at least a first data structure. In one implementation, the first operation is a memory access operation and the first data structure is a set of page tables. Also, in one implementation, the second ID identifies a first application of the first task and the third ID identifies a first operating system (OS) of the first task.
    Type: Application
    Filed: July 24, 2020
    Publication date: January 14, 2021
    Inventors: Anirudh R. Acharya, Michael J. Mantor, Rex Eldon McCrary, Anthony Asaro, Jeffrey Gongxian Cheng, Mark Fowler
  • Patent number: 10861122
    Abstract: Methods, systems and non-transitory computer readable media are described. A system includes a shader pipe array, a redundant shader pipe array, a sequencer and a redundant shader switch. The shader pipe array includes multiple shader pipes, each of which perform rendering calculations on data provided thereto. The redundant shader pipe array also performs rendering calculations on data provided thereto. The sequencer identifies at least one defective shader pipe in the shader pipe array, and, in response, generates a signal. The redundant shader switch receives the generated signal, and, in response, transfers the data destined for each shader pipe identified as being defective independently to the redundant shader pipe array.
    Type: Grant
    Filed: May 17, 2016
    Date of Patent: December 8, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Michael J. Mantor, Jeffrey T. Brady, Angel E. Socarras
  • Patent number: 10817302
    Abstract: Systems, apparatuses, and methods for implementing a high bandwidth, low power vector register file for use by a parallel processor are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of processing pipeline. The parallel processing unit includes a vector arithmetic logic unit and a high bandwidth, low power, vector register file. The vector register file includes multi-bank high density random-access memories (RAMs) to satisfy register bandwidth requirements. The parallel processing unit also includes an instruction request queue and an instruction operand buffer to provide enough local bandwidth for VALU instructions and vector I/O instructions. Also, the parallel processing unit is configured to leverage the RAM's output flops as a last level cache to reduce duplicate operand requests between multiple instructions. The parallel processing unit includes a vector destination cache to provide additional R/W bandwidth for the vector register file.
    Type: Grant
    Filed: July 7, 2017
    Date of Patent: October 27, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Jiasheng Chen, Bin He, Mark M. Leather, Michael J. Mantor, Yunxiao Zou
  • Patent number: 10725822
    Abstract: Systems, apparatuses, and methods for abstracting tasks in virtual memory identifier (VMID) containers are disclosed. A processor coupled to a memory executes a plurality of concurrent tasks including a first task. Responsive to detecting one or more instructions of the first task which correspond to a first operation, the processor retrieves a first identifier (ID) which is used to uniquely identify the first task, wherein the first ID is transparent to the first task. Then, the processor maps the first ID to a second ID and/or a third ID. The processor completes the first operation by using the second ID and/or the third ID to identify the first task to at least a first data structure. In one implementation, the first operation is a memory access operation and the first data structure is a set of page tables. Also, in one implementation, the second ID identifies a first application of the first task and the third ID identifies a first operating system (OS) of the first task.
    Type: Grant
    Filed: July 31, 2018
    Date of Patent: July 28, 2020
    Assignees: Advanced Micro Devices, Inc., ATI Technologies ULC
    Inventors: Anirudh R. Acharya, Michael J. Mantor, Rex Eldon McCrary, Anthony Asaro, Jeffrey Gongxian Cheng, Mark Fowler
  • Patent number: 10558489
    Abstract: Systems, apparatuses, and methods for suspending and restoring operations on a processor are disclosed. In one embodiment, a processor includes at least a control unit, multiple execution units, and multiple work creation units. In response to detecting a request to suspend a software application executing on the processor, the control unit sends requests to the plurality of work creation units to stop creating new work. The control unit waits until receiving acknowledgements from the work creation units prior to initiating a suspend operation. Once all work creation units have acknowledged that they have stopped creating new work, the control unit initiates the suspend operation. Also, when a restore operation is initiated, the control unit prevents any work creation units from launching new work-items until all previously in-flight work-items have been restored to the same work creation units and execution units to which they were previously allocated.
    Type: Grant
    Filed: February 21, 2017
    Date of Patent: February 11, 2020
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Alexander Fuad Ashkar, Michael J. Mantor, Randy Wayne Ramsey, Rex Eldon McCrary, Harry J. Wise
  • Publication number: 20200042348
    Abstract: Systems, apparatuses, and methods for abstracting tasks in virtual memory identifier (VMID) containers are disclosed. A processor coupled to a memory executes a plurality of concurrent tasks including a first task. Responsive to detecting one or more instructions of the first task which correspond to a first operation, the processor retrieves a first identifier (ID) which is used to uniquely identify the first task, wherein the first ID is transparent to the first task. Then, the processor maps the first ID to a second ID and/or a third ID. The processor completes the first operation by using the second ID and/or the third ID to identify the first task to at least a first data structure. In one implementation, the first operation is a memory access operation and the first data structure is a set of page tables. Also, in one implementation, the second ID identifies a first application of the first task and the third ID identifies a first operating system (OS) of the first task.
    Type: Application
    Filed: July 31, 2018
    Publication date: February 6, 2020
    Inventors: Anirudh R. Acharya, Michael J. Mantor, Rex Eldon McCrary, Anthony Asaro, Jeffrey Gongxian Cheng, Mark Fowler
  • Patent number: 10474468
    Abstract: Systems, apparatuses, and methods for processing variable wavefront sizes on a processor are disclosed. In one embodiment, a processor includes at least a scheduler, cache, and multiple execution units. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront. In the second mode, when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. The processor determines the operating mode based on one or more conditions.
    Type: Grant
    Filed: February 22, 2017
    Date of Patent: November 12, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Michael J. Mantor, Brian D. Emberling, Mark Fowler, Mark M. Leather
  • Publication number: 20190171448
    Abstract: Systems, apparatuses, and methods for implementing a low power parallel matrix multiply pipeline are disclosed. In one embodiment, a system includes at least first and second vector register files coupled to a matrix multiply pipeline. The matrix multiply pipeline comprises a plurality of dot product units. The dot product units are configured to calculate dot or outer products for first and second sets of operands retrieved from the first vector register file. The results of the dot or outer product operations are written back to the second vector register file. The second vector register file provides the results from the previous dot or outer product operations as inputs to subsequent dot or outer product operations. The dot product units receive the results from previous phases of the matrix multiply operation and accumulate these previous dot or outer product results with the current dot or outer product results.
    Type: Application
    Filed: December 27, 2017
    Publication date: June 6, 2019
    Inventors: Jiasheng Chen, Yunxiao Zou, Michael J. Mantor, Allen Rush
  • Publication number: 20190129718
    Abstract: Systems, apparatuses, and methods for routing traffic between clients and system memory are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.
    Type: Application
    Filed: October 31, 2017
    Publication date: May 2, 2019
    Inventors: Jiasheng Chen, Bin He, Yunxiao Zou, Michael J. Mantor, Radhakrishna Giduthuri, Eric J. Finger, Brian D. Emberling
  • Patent number: 10180789
    Abstract: Systems, apparatuses, and methods for implementing software control of state sets are disclosed. In one embodiment, a processor includes at least an execution unit and a plurality of state registers. The processor is configured to detect a command to allocate a first state set for storing a first state, wherein the command is generated by software, and wherein the first state specifies values for the plurality of state registers. The command is executed on the execution unit while the processor is in a second state, wherein the second state is different from the first state. The first state set of the processor is allocated with the first state responsive to executing the command on the execution unit. The processor is configured to allocate the first state set for the first state prior to the processor entering the first state.
    Type: Grant
    Filed: January 26, 2017
    Date of Patent: January 15, 2019
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Rex Eldon McCrary, Michael J. Mantor, Alexander Fuad Ashkar, Harry J. Wise
  • Publication number: 20190004814
    Abstract: Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.
    Type: Application
    Filed: June 29, 2017
    Publication date: January 3, 2019
    Inventors: Jiasheng Chen, Bin He, Mohammad Reza Hakami, Timothy Lottes, Justin David Smith, Michael J. Mantor, Derek Carson
  • Publication number: 20190004807
    Abstract: Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline.
    Type: Application
    Filed: July 24, 2017
    Publication date: January 3, 2019
    Inventors: Jiasheng Chen, Qingcheng Wang, Yunxiao Zou, Bin He, Jian Yang, Michael J. Mantor, Brian D. Emberling
  • Publication number: 20180357064
    Abstract: Systems, apparatuses, and methods for implementing a high bandwidth, low power vector register file for use by a parallel processor are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of processing pipeline. The parallel processing unit includes a vector arithmetic logic unit and a high bandwidth, low power, vector register file. The vector register file includes multi-bank high density random-access memories (RAMs) to satisfy register bandwidth requirements. The parallel processing unit also includes an instruction request queue and an instruction operand buffer to provide enough local bandwidth for VALU instructions and vector I/O instructions. Also, the parallel processing unit is configured to leverage the RAM's output flops as a last level cache to reduce duplicate operand requests between multiple instructions. The parallel processing unit includes a vector destination cache to provide additional R/W bandwidth for the vector register file.
    Type: Application
    Filed: July 7, 2017
    Publication date: December 13, 2018
    Inventors: Jiasheng Chen, Bin He, Mark M. Leather, Michael J. Mantor, Yunxiao Zou
  • Patent number: 10152434
    Abstract: A system and method for efficient arbitration of memory access requests are described. One or more functional units generate memory access requests for a partitioned memory. An arbitration unit stores the generated requests and selects a given one of the stored requests. The arbitration unit identifies a given partition of the memory which stores a memory location targeted by the selected request. The arbitration unit determines whether one or more other stored requests access memory locations in the given partition. The arbitration unit sends each of the selected memory access request and the identified one or more other memory access requests to the memory to be serviced out of order.
    Type: Grant
    Filed: December 20, 2016
    Date of Patent: December 11, 2018
    Assignees: Advanced Micro Devices, Inc., ATI Technologies ULC
    Inventors: Rostyslav Kyrychynskyi, Anthony Asaro, Kostantinos Danny Christidis, Mark Fowler, Michael J. Mantor, Robert Scott Hartog
  • Patent number: 10140123
    Abstract: A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.
    Type: Grant
    Filed: April 10, 2017
    Date of Patent: November 27, 2018
    Assignee: ADVANCED MICRO DEVICES, INC.
    Inventors: Michael J. Mantor, Brian Emberling
  • Patent number: 10073783
    Abstract: A system and method for efficiently processing access requests for a shared resource are described. Each of many requestors are assigned to a partition of a shared resource. When a controller determines no requestor generates an access request for an unassigned partition, the controller permits simultaneous access to the assigned partitions for active requestors. When the controller determines at least one active requestor generates an access request for an unassigned partition, the controller allows a single active requestor to gain exclusive access to the entire shared resource while stalling access for the other active requestors. The controller alternatives exclusive access among the active requestors. In various embodiments, the shared resource is a local data store in a graphics processing unit and each of the multiple requestors is a single instruction multiple data (SIMD) compute unit.
    Type: Grant
    Filed: November 23, 2016
    Date of Patent: September 11, 2018
    Assignee: Advanced Micro Devices, Inc.
    Inventors: Daniel Clifton, Michael J. Mantor, Hans Burton