Patents by Inventor Brian Emberling

Brian Emberling has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Block data load with transpose into memory

Patent number: 12229570

Abstract: Block data load with transpose techniques are described. In one example, an input is received, at a control unit, specifying an instruction to load a block of data to at least one memory module using a transpose operation. Responsive to the receiving the input by the control unit, the block of data is caused to be loaded to the at least one memory module by transposing the block of data to form a transposed block of data and storing the transposed block of data in the at least one memory.

Type: Grant

Filed: September 25, 2022

Date of Patent: February 18, 2025

Assignee: Advanced Micro Devices, Inc.

Inventors: Bin He, Michael John Mantor, Brian Emberling, Liang Huang, Chao Liu

SYSTEM AND METHOD FOR EXECUTING A TASK

Publication number: 20240355044

Abstract: A method, system, and computer-readable medium for executing a task is disclosed. The method includes receiving input data and computing instructions, launching a workgroup including wavefronts to execute the task, wherein the launching causes the wavefronts to process the input data by sharing intermediate results and resources, and adjusting the operation based on characteristics of the wavefronts. The characteristics include data dependencies, computational load, memory usage, and execution timing requirements. The wavefronts execute the task in stages, where each stage processes portions of input data and data generated by other wavefronts.

Type: Application

Filed: July 2, 2024

Publication date: October 24, 2024

Applicant: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Michael Y. Chow

SOFTWARE-DEFINED COMPUTE UNIT RESOURCE ALLOCATION MODE

Publication number: 20240311199

Abstract: A program code executing on a processing system includes one or more instructions each identifying a workload that includes a plurality of waves and each identifying resource allocations for the plurality of waves of the workgroup. In response to receiving an instruction identifying a workload and resource allocations for the plurality of waves of the workgroup, a processor allocates a first set of processing resources to a compute unit of the processor based on the resource allocations for the plurality of waves. The compute unit then performs operations for the workgroup using the allocated set of processing resources.

Type: Application

Filed: March 13, 2023

Publication date: September 19, 2024

Inventors: Nicolai Haehnle, Mark Leather, Brian Emberling, Michael John Bedy, Daniel Schneider

System and methods for efficient execution of a collaborative task in a shader system

Patent number: 12033275

Abstract: Methods and systems are disclosed for executing a collaborative task in a shader system. Techniques disclosed include receiving, by the system, input data and computing instructions associated with the collaborative task, as well as a configuration setting, causing the system to operate in a takeover mode. The system then launches, exclusively in one workgroup processor, a workgroup including wavefronts configured to execute the collaborative task.

Type: Grant

Filed: September 29, 2021

Date of Patent: July 9, 2024

Assignee: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Michael Y. Chow

Block Data Load with Transpose into Memory

Publication number: 20240103879

Abstract: Block data load with transpose techniques are described. In one example, an input is received, at a control unit, specifying an instruction to load a block of data to at least one memory module using a transpose operation. Responsive to the receiving the input by the control unit, the block of data is caused to be loaded to the at least one memory module by transposing the block of data to form a transposed block of data and storing the transposed block of data in the at least one memory.

Type: Application

Filed: September 25, 2022

Publication date: March 28, 2024

Applicant: Advanced Micro Devices, Inc.

Inventors: Bin He, Michael John Mantor, Brian Emberling, Liang Huang, Chao Liu

Software-based instruction scoreboard for arithmetic logic units

Patent number: 11847462

Abstract: A software-based instruction scoreboard indicates dependencies between closely-issued instructions issued to an arithmetic logic unit (ALU) pipeline. The software-based instruction scoreboard inserts one or more control words into the command stream between the dependent instructions, which is then executed by the ALU pipeline. The control words identify the instruction(s) upon which the dependent instructions depend (parent instructions) so that the GPU hardware can ensure that the ALU pipeline does not stall while the dependent instruction waits for results from the parent instruction.

Type: Grant

Filed: December 15, 2020

Date of Patent: December 19, 2023

Assignee: Advanced Micro Devices, Inc.

Inventor: Brian Emberling

HARDWARE SUPPORTED SPLIT BARRIER

Publication number: 20230205608

Abstract: A disclosed technique includes executing, for a first wavefront, a barrier arrival notification instruction, for a first barrier, indicating arrival at a first barrier point; performing, for the first wavefront, work prior to the first barrier point; executing, for the first wavefront, a barrier check instruction; and executing, for the first wavefront, at a control flow path based on a result of the barrier check instruction.

Type: Application

Filed: December 27, 2021

Publication date: June 29, 2023

Applicant: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Joseph L. Greathouse

Dual vector arithmetic logic unit

Patent number: 11675568

Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general purpose register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

Type: Grant

Filed: December 14, 2020

Date of Patent: June 13, 2023

Assignee: Advanced Micro Devices, Inc.

Inventors: Bin He, Brian Emberling, Mark Leather, Michael Mantor

SYSTEM AND METHODS FOR EFFICIENT EXECUTION OF A COLLABORATIVE TASK IN A SHADER SYSTEM

Publication number: 20230102767

Abstract: Methods and systems are disclosed for executing a collaborative task in a shader system. Techniques disclosed include receiving, by the system, input data and computing instructions associated with the collaborative task, as well as a configuration setting, causing the system to operate in a takeover mode. The system then launches, exclusively in one workgroup processor, a workgroup including wavefronts configured to execute the collaborative task.

Type: Application

Filed: September 29, 2021

Publication date: March 30, 2023

Applicant: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Michael Y. Chow

CONVOLUTIONAL NEURAL NETWORK OPERATIONS

Publication number: 20230097279

Abstract: Methods and systems are disclosed for executing operations on single-instruction-multiple-data (SIMD) units. Techniques disclosed perform a dot product operation on input data during one computer cycle, including convolving the input data, generating intermediate data, and applying one or more transitional operations to the intermediate data to generate output data. Aspects described, wherein the input data is an input to a layer of a convolutional neural network and the generated output data is the output of the layer.

Type: Application

Filed: September 29, 2021

Publication date: March 30, 2023

Applicant: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Michael Mantor, Michael Y. Chow, Bin He

Exception handler for sampling draw dispatch identifiers

Patent number: 11386518

Abstract: The address of the draw or dispatch packet responsible for creating an exception is tied to a shader/wavefront back to the draw command from which it originated. In various embodiments, a method of operating a graphics pipeline and exception handling includes receiving, at a command processor of a graphics processing unit (GPU), an exception signal indicating an occurrence of a pipeline exception at a shader stage of a graphics pipeline. The shader stage generates an exception signal in response to a pipeline exception and transmits the exception signal to the command processor. The command processor determines, based on the exception signal, an address of a command packet responsible for the occurrence of the pipeline exception.

Type: Grant

Filed: September 24, 2019

Date of Patent: July 12, 2022

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael Mantor, Alexander Fuad Ashkar, Randy Ramsey, Mangesh P. Nijasure, Brian Emberling

GENERAL PURPOSE REGISTER HIERARCHY SYSTEM AND METHOD

Publication number: 20220197649

Abstract: A processing unit includes a first memory device and a second memory device. The first memory device includes a first plurality of general purpose registers (GPRs) and the second memory device includes a second plurality of GPRs. The second memory device includes fewer GPRs than the first memory device. Program data is stored at the first memory device and the second memory device based on expected frequency of accesses associated with the program data.

Type: Application

Filed: December 21, 2021

Publication date: June 23, 2022

Inventors: Prasanna Balasundaram, Dipayan Karmakar, Brian Emberling

DUAL VECTOR ARITHMETIC LOGIC UNIT

Publication number: 20220188076

Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general purpose register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

Type: Application

Filed: December 14, 2020

Publication date: June 16, 2022

Inventors: Bin He, Brian Emberling, Mark Leather, Michael Mantor

Selective prefetching in multithreaded processing units

Patent number: 11226819

Abstract: A processing unit includes a plurality of processing elements and one or more caches. A first thread executes a program that includes one or more prefetch instructions to prefetch information into a first cache. Prefetching is selectively enabled when executing the first thread on a first processing element dependent upon whether one or more second threads previously executed the program on the first processing element. The first thread is then dispatched to execute the program on the first processing element. In some cases, a dispatcher receives the first thread for dispatching to the first processing element. The dispatcher modifies the prefetch instruction to disable prefetching into the first cache in response to the one or more second threads having previously executed the program on the first processing element.

Type: Grant

Filed: November 20, 2017

Date of Patent: January 18, 2022

Assignee: Advanced Micro Devices, Inc.

Inventors: Brian Emberling, Michael Mantor

SIMD processing lanes storing input pixel operand data in local register file for thread execution of image processing operations

Patent number: 10140123

Abstract: A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.

Type: Grant

Filed: April 10, 2017

Date of Patent: November 27, 2018

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael J. Mantor, Brian Emberling

SIMD PROCESSING UNIT WITH LOCAL DATA SHARE AND ACCESS TO A GLOBAL DATA SHARE OF A GPU

Publication number: 20170212757

Abstract: A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.

Type: Application

Filed: April 10, 2017

Publication date: July 27, 2017

Applicant: Advanced Micro Devices, Inc.

Inventors: Michael J. Mantor, Brian Emberling

SIMD processing unit with local data share and access to a global data share of a GPU

Patent number: 9619428

Abstract: A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.

Type: Grant

Filed: June 1, 2009

Date of Patent: April 11, 2017

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael J. Mantor, Brian Emberling

Interlocked increment memory allocation and access

Patent number: 9529632

Abstract: A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts.

Type: Grant

Filed: September 3, 2009

Date of Patent: December 27, 2016

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael Mantor, John McCardle, Marcos Zini, Brian Emberling

Dynamic control of SIMDs

Patent number: 9311102

Abstract: Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.

Type: Grant

Filed: July 12, 2011

Date of Patent: April 12, 2016

Assignee: Advanced Micro Devices, Inc.

Inventors: Tushar K. Shah, Michael J. Mantor, Brian Emberling

Efficient state transition among multiple programs on multi-threaded processors by executing cache priming program

Patent number: 9015720

Abstract: A system and method to optimize processor performance and minimize average thread latency by selectively loading a cache when a program state, resources required for execution of a program, or the program itself change is described. An embodiment of the invention supports a “cache priming program” that is selectively executed for a first thread/program/sub-routine of each process. Such a program is optimized for situations when instructions and other program data are not yet resident in cache(s), and/or whenever resources required for program execution or the program itself change. By pre-loading the cache with two resources required for two instructions for only a first thread, average thread latency is reduced because the resources are already present in the cache.

Type: Grant

Filed: January 6, 2009

Date of Patent: April 21, 2015

Assignee: Advanced Micro Devices, Inc.

Inventors: Andrew Brown, Brian Emberling
