Patents by Inventor Brian D. Emberling

Brian D. Emberling has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

REGISTER COMPACTION WITH EARLY RELEASE

Publication number: 20250005705

Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.

Type: Application

Filed: July 5, 2024

Publication date: January 2, 2025

Inventors: Brian D. Emberling, Joseph Lee Greathouse, Anthony Thomas Gutierrez
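
The early-release idea in the abstract above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `RegisterPool` class, its method names, and the register counts are hypothetical, and the compaction step itself is not modeled. It shows a second wavefront becoming launchable once a still-active first wavefront returns part of its allocation.

```python
class RegisterPool:
    """Toy stand-in for the control unit's register bookkeeping (hypothetical)."""

    def __init__(self, total):
        self.free = set(range(total))  # registers currently available
        self.owned = {}                # wavefront id -> set of registers it holds

    def launch(self, wave_id, count):
        """Statically allocate `count` registers when a wavefront launches."""
        if count > len(self.free):
            return False               # not enough free registers; launch must wait
        self.owned[wave_id] = {self.free.pop() for _ in range(count)}
        return True

    def release(self, wave_id, count):
        """Model the release instruction: return `count` registers to the pool
        while the wavefront stays active with the remainder."""
        regs = self.owned[wave_id]
        for _ in range(min(count, len(regs))):
            self.free.add(regs.pop())


pool = RegisterPool(total=8)
pool.launch("wave0", 6)                # wave0 holds 6 of 8 registers
blocked = pool.launch("wave1", 4)      # only 2 free, so wave1 cannot launch yet
pool.release("wave0", 3)               # wave0 no longer needs 3 registers
launched = pool.launch("wave1", 4)     # 5 free now, so wave1 launches
```

In this model `wave0` is never removed from `owned`, which mirrors the abstract's point that the first wavefront is still active when its released registers are reused.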
WAVE LEVEL MATRIX MULTIPLY INSTRUCTIONS

Publication number: 20240329998

Abstract: An apparatus and method for efficiently processing multiply-accumulate operations for matrices in applications are disclosed. In various implementations, a computing system includes a parallel data processing circuit and a memory. The memory stores the instructions (or translated commands) of a parallel data application. The circuitry of the parallel data processing circuit performs a matrix multiplication operation using source operands accessed only once from a vector register file and multiple instantiations of a vector processing circuit capable of performing multiple matrix multiplication operations corresponding to multiple different types of instructions. The multiplier circuit and the adder circuit of the vector processing circuit perform both the fused multiply add (FMA) operation and the dot product (inner product) operation, without requiring independent, dedicated execution pipelines (one execution pipeline for the FMA operation and a separate execution pipeline for the dot product operation).

Type: Application

Filed: March 28, 2024

Publication date: October 3, 2024

Inventors: Bin He, Michael J. Mantor, Brian D. Emberling
Register compaction with early release

Patent number: 12033238

Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.

Type: Grant

Filed: September 24, 2020

Date of Patent: July 9, 2024

Assignee: Advanced Micro Devices, Inc.

Inventors: Brian D. Emberling, Joseph Lee Greathouse, Anthony Thomas Gutierrez
Packed 16 bits instruction pipeline

Patent number: 11880683

Abstract: Systems, apparatuses, and methods for efficiently processing arithmetic operations are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.

Type: Grant

Filed: October 31, 2017

Date of Patent: January 23, 2024

Assignee: Advanced Micro Devices, Inc.

Inventors: Jiasheng Chen, Bin He, Yunxiao Zou, Michael J. Mantor, Radhakrishna Giduthuri, Eric J. Finger, Brian D. Emberling
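
The operand selection described in the abstract above can be sketched as follows. The sketch assumes M = 32 and N = 16, uses an integer add in place of a half-precision floating-point operation to stay short, and the function names `select_half` and `packed_add16` are hypothetical.

```python
def select_half(operand32, use_high):
    """Return the high or low 16-bit portion of a 32-bit register value."""
    if use_high:
        return (operand32 >> 16) & 0xFFFF
    return operand32 & 0xFFFF


def packed_add16(src0, src1, sel0_high, sel1_high):
    """Add two 16-bit operands, each picked from a 32-bit source by a select bit.

    `sel0_high` and `sel1_high` stand in for the per-operand bits that the
    abstract says are stored in fields of the instruction.
    """
    a = select_half(src0, sel0_high)
    b = select_half(src1, sel1_high)
    return (a + b) & 0xFFFF            # wrap to 16 bits (integer stand-in for FP16 math)


src0 = 0x00030004                      # high half = 0x0003, low half = 0x0004
src1 = 0x00100020                      # high half = 0x0010, low half = 0x0020
result = packed_add16(src0, src1, sel0_high=True, sel1_high=False)
```

Here the select bits pick the high half of `src0` and the low half of `src1`, so both full-width registers are read once and the narrower operands are extracted afterwards.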
REGISTER COMPACTION WITH EARLY RELEASE

Publication number: 20220092725

Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.

Type: Application

Filed: September 24, 2020

Publication date: March 24, 2022

Inventors: Brian D. Emberling, Joseph Lee Greathouse, Anthony Thomas Gutierrez
Wait instruction for preventing execution of one or more instructions until a load counter or store counter reaches a specified value

Patent number: 11074075

Abstract: Systems, apparatuses, and methods for maintaining separate pending load and store counters are disclosed herein. In one embodiment, a system includes at least one execution unit, a memory subsystem, and a pair of counters for each thread of execution. In one embodiment, the system implements a software-based approach for managing dependencies between instructions. In one embodiment, the execution unit(s) maintain counters to support the software-based approach for managing dependencies between instructions. The execution unit(s) are configured to execute instructions that are used to manage the dependencies during run-time. In one embodiment, the execution unit(s) execute wait instructions to wait until a given counter is equal to a specified value before continuing to execute the instruction sequence.

Type: Grant

Filed: February 24, 2017

Date of Patent: July 27, 2021

Assignee: Advanced Micro Devices, Inc.

Inventors: Mark Fowler, Brian D. Emberling
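
A minimal sketch of the separate load and store counters described in the abstract above. The `ThreadCounters` class and its method names are hypothetical, and the wait instruction is modeled as a readiness check rather than an actual stall.

```python
class ThreadCounters:
    """Per-thread pair of pending-operation counters (hypothetical model)."""

    def __init__(self):
        self.pending_loads = 0
        self.pending_stores = 0

    def issue_load(self):
        self.pending_loads += 1

    def issue_store(self):
        self.pending_stores += 1

    def complete_load(self):
        self.pending_loads -= 1

    def complete_store(self):
        self.pending_stores -= 1

    def wait_satisfied(self, counter, value):
        """True when the named counter equals the value given to the wait instruction."""
        return getattr(self, counter) == value


t = ThreadCounters()
t.issue_load()
t.issue_load()
t.issue_store()
before = t.wait_satisfied("pending_loads", 0)   # two loads still outstanding
t.complete_load()
t.complete_load()
after = t.wait_satisfied("pending_loads", 0)    # loads done; the store is still pending
```

Because the counters are tracked separately, the wait on loads in this sketch is satisfied even though a store is still outstanding.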
Indicating instruction scheduling mode for processing wavefront portions

Patent number: 10474468

Abstract: Systems, apparatuses, and methods for processing variable wavefront sizes on a processor are disclosed. In one embodiment, a processor includes at least a scheduler, cache, and multiple execution units. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront. In the second mode, when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. The processor determines the operating mode based on one or more conditions.

Type: Grant

Filed: February 22, 2017

Date of Patent: November 12, 2019

Assignee: Advanced Micro Devices, Inc.

Inventors: Michael J. Mantor, Brian D. Emberling, Mark Fowler, Mark M. Leather
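
The two scheduling modes in the abstract above differ only in loop order, which a short sketch can make concrete. The function name `execution_order` and the mode labels `"first"` and `"second"` are hypothetical, and the conditions the processor uses to choose a mode are not modeled.

```python
def execution_order(instructions, num_portions, mode):
    """Return (instruction, portion) pairs in the order they would execute."""
    if mode == "first":
        # Same instruction on every portion before moving to the next instruction.
        return [(ins, p) for ins in instructions for p in range(num_portions)]
    if mode == "second":
        # Whole instruction set on one portion, then the same set on the next portion.
        return [(ins, p) for p in range(num_portions) for ins in instructions]
    raise ValueError(f"unknown mode: {mode}")


first_mode = execution_order(["a", "b"], 2, "first")
second_mode = execution_order(["a", "b"], 2, "second")
```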
PACKED 16 BITS INSTRUCTION PIPELINE

Publication number: 20190129718

Abstract: Systems, apparatuses, and methods for efficiently processing arithmetic operations are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.

Type: Application

Filed: October 31, 2017

Publication date: May 2, 2019

Inventors: Jiasheng Chen, Bin He, Yunxiao Zou, Michael J. Mantor, Radhakrishna Giduthuri, Eric J. Finger, Brian D. Emberling
STREAM PROCESSOR WITH OVERLAPPING EXECUTION

Publication number: 20190004807

Abstract: Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline.

Type: Application

Filed: July 24, 2017

Publication date: January 3, 2019

Inventors: Jiasheng Chen, Qingcheng Wang, Yunxiao Zou, Bin He, Jian Yang, Michael J. Mantor, Brian D. Emberling
SEPARATE TRACKING OF PENDING LOADS AND STORES

Publication number: 20180246724

Abstract: Systems, apparatuses, and methods for maintaining separate pending load and store counters are disclosed herein. In one embodiment, a system includes at least one execution unit, a memory subsystem, and a pair of counters for each thread of execution. In one embodiment, the system implements a software-based approach for managing dependencies between instructions. In one embodiment, the execution unit(s) maintain counters to support the software-based approach for managing dependencies between instructions. The execution unit(s) are configured to execute instructions that are used to manage the dependencies during run-time. In one embodiment, the execution unit(s) execute wait instructions to wait until a given counter is equal to a specified value before continuing to execute the instruction sequence.

Type: Application

Filed: February 24, 2017

Publication date: August 30, 2018

Inventors: Mark Fowler, Brian D. Emberling
VARIABLE WAVEFRONT SIZE

Publication number: 20180239606

Abstract: Systems, apparatuses, and methods for processing variable wavefront sizes on a processor are disclosed. In one embodiment, a processor includes at least a scheduler, cache, and multiple execution units. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront. In the second mode, when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. The processor determines the operating mode based on one or more conditions.

Type: Application

Filed: February 22, 2017

Publication date: August 23, 2018

Inventors: Michael J. Mantor, Brian D. Emberling, Mark Fowler, Mark M. Leather
Method and system for thread monitoring

Patent number: 9311205

Abstract: An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.

Type: Grant

Filed: March 15, 2013

Date of Patent: April 12, 2016

Assignee: Advanced Micro Devices, Inc.

Inventor: Brian D. Emberling
Executing first instructions for smaller set of SIMD threads diverging upon conditional branch instruction

Patent number: 8959319

Abstract: Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.

Type: Grant

Filed: December 2, 2011

Date of Patent: February 17, 2015

Assignee: Advanced Micro Devices, Inc.

Inventors: Mark Leather, Norman Rubin, Brian D. Emberling, Michael Mantor
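
The ordering rule in the abstract above (defer the larger set of threads on a stack, run the smaller set first) can be sketched as below. The function name `run_divergent_branch` is hypothetical, and thread identifiers are plain list indices for illustration.

```python
def run_divergent_branch(conditions):
    """Return thread groups in execution order for one divergent conditional.

    `conditions[t]` is the value thread `t` computed for the branch condition.
    """
    taken = [t for t, c in enumerate(conditions) if c]
    not_taken = [t for t, c in enumerate(conditions) if not c]

    # Identify the smaller and larger sets; on a tie the "taken" set runs first.
    smaller, larger = sorted([taken, not_taken], key=len)

    stack = [larger]                   # push the larger set's identifier for later
    order = [smaller]                  # execute code for the smaller set first
    order.append(stack.pop())          # then resume the larger set from the stack
    return order


order = run_divergent_branch([True, False, False, True, False])
```

In this example threads 0 and 3 take the branch, so they form the smaller set and run before threads 1, 2, and 4.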
Processor with power control via instruction issuance

Patent number: 8862924

Abstract: Methods and apparatuses are provided for power control in a processor. The apparatus comprises a plurality of operational units arranged as a group of operational units. A power consumption monitor determines when cumulative power consumption of the group of operational units exceeds a threshold (e.g., either or both of the cumulative power threshold and the cumulative power rate threshold) during a time interval, after which a filter for issuing instructions to the group of operational units suspends instruction issuance to the group of operational units for the remainder of the time interval. The method comprises monitoring cumulative power consumption by a group of operational units within a processor over a time interval. If the cumulative power consumption of the group of operational units exceeds the threshold, instruction issuance to the group of operational units is suspended for the remainder of the time interval.

Type: Grant

Filed: November 15, 2011

Date of Patent: October 14, 2014

Assignee: Advanced Micro Devices, Inc.

Inventors: Brian D. Emberling, Stephen D. Presant, Seth Hendrickson, Krishna Sitaraman, Ali Ibrahim, Jeff Herman
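
The throttling behavior in the abstract above can be sketched for a single time interval. The function name `issue_with_power_cap`, the per-instruction cost list, and the threshold value are all hypothetical; a real design would measure power in hardware rather than sum assumed costs.

```python
def issue_with_power_cap(instruction_costs, threshold):
    """Issue instructions in order until cumulative power exceeds `threshold`.

    Returns the indices that were issued in this interval and the power consumed.
    Anything not issued is held back until the next interval.
    """
    consumed = 0
    issued = []
    for index, cost in enumerate(instruction_costs):
        issued.append(index)
        consumed += cost
        if consumed > threshold:
            break                      # suspend issuance for the rest of the interval
    return issued, consumed


issued, consumed = issue_with_power_cap([3, 4, 5, 2], threshold=10)
```

The third instruction pushes the running total past the threshold, so the fourth is not issued in this interval.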
Method and system for workitem synchronization

Patent number: 8607247

Abstract: Method, system, and computer program product embodiments for synchronizing a group of workitems on one or more processors are disclosed. The embodiments include executing a barrier skip instruction by a first workitem from the group, and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points.

Type: Grant

Filed: November 3, 2011

Date of Patent: December 10, 2013

Assignee: Advanced Micro Devices, Inc.

Inventors: Lee W. Howes, Benedict R. Gaster, Michael C. Houston, Michael Mantor, Mark Leather, Norman Rubin, Brian D. Emberling
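
A minimal sketch of the barrier-skip behavior described in the abstract above. The `Barrier` class and its methods are hypothetical, and the model is sequential: it tracks which workitems the barrier expects rather than blocking real threads.

```python
class Barrier:
    """Reusable barrier whose expected set shrinks when a workitem skips (hypothetical)."""

    def __init__(self, participants):
        self.expected = set(participants)  # workitems the barrier still waits for
        self.arrived = set()               # workitems that reached the current point

    def skip(self, workitem):
        """Model the barrier skip instruction: stop expecting this workitem
        at this and all later barrier points."""
        self.expected.discard(workitem)
        return self._try_release()

    def arrive(self, workitem):
        """Record an arrival; returns True when the barrier releases."""
        self.arrived.add(workitem)
        return self._try_release()

    def _try_release(self):
        if self.expected <= self.arrived:
            self.arrived.clear()           # reset for the next barrier point
            return True
        return False


b = Barrier({0, 1, 2})
r1 = b.arrive(1)                           # still waiting on workitems 0 and 2
r2 = b.skip(0)                             # workitem 0 opts out; still waiting on 2
r3 = b.arrive(2)                           # barrier releases without workitem 0
r4 = b.arrive(1)                           # next barrier point: waiting on 2 only
r5 = b.arrive(2)                           # releases again, workitem 0 never needed
```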
Handling of extra contexts for shader constants

Patent number: 8593465

Abstract: The present invention provides a system for handling extra contexts for shader constants, and applications thereof. In an embodiment there is provided a computer-based method for executing a series of compute packets in an execution pipeline. The execution pipeline includes a first plurality of registers configured to store state-updates of a first type and a second plurality of registers configured to store state-updates of a second type. A first number of state-updates of the first type and a second number of state-updates of the second type are respectively identified and stored in the first and second plurality of registers. A compute packet is sent to the execution pipeline responsive to the first number and the second number. Then, the compute packet is executed by the execution pipeline.

Type: Grant

Filed: June 13, 2007

Date of Patent: November 26, 2013

Assignee: Advanced Micro Devices, Inc.

Inventors: Mark M. Leather, Brian D. Emberling
PROCESSOR WITH POWER CONTROL VIA INSTRUCTION ISSUANCE

Publication number: 20130124900

Abstract: Methods and apparatuses are provided for power control in a processor. The apparatus comprises a plurality of operational units arranged as a group of operational units. A power consumption monitor determines when cumulative power consumption of the group of operational units exceeds a threshold (e.g., either or both of the cumulative power threshold and the cumulative power rate threshold) during a time interval, after which a filter for issuing instructions to the group of operational units suspends instruction issuance to the group of operational units for the remainder of the time interval. The method comprises monitoring cumulative power consumption by a group of operational units within a processor over a time interval. If the cumulative power consumption of the group of operational units exceeds the threshold, instruction issuance to the group of operational units is suspended for the remainder of the time interval.

Type: Application

Filed: November 15, 2011

Publication date: May 16, 2013

Applicant: Advanced Micro Devices, Inc.

Inventors: Brian D. Emberling, Stephen D. Presant, Seth Hendrickson, Krishna Sitaraman, Ali Ibrahim, Jeff Herman
Method and System for Workitem Synchronization

Publication number: 20130117750

Abstract: Method, system, and computer program product embodiments for synchronizing a group of workitems on one or more processors are disclosed. The embodiments include executing a barrier skip instruction by a first workitem from the group, and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points.

Type: Application

Filed: November 3, 2011

Publication date: May 9, 2013

Applicant: Advanced Micro Devices, Inc.

Inventors: Lee W. Howes, Benedict R. Gaster, Michael C. Houston, Michael Mantor, Mark Leather, Norman Rubin, Brian D. Emberling
Method and system for thread monitoring

Patent number: 8413120

Abstract: An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.

Type: Grant

Filed: October 27, 2008

Date of Patent: April 2, 2013

Assignee: Advanced Micro Devices, Inc.

Inventor: Brian D. Emberling
Systems and Methods for Improving Divergent Conditional Branches

Publication number: 20120204014

Abstract: Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.

Type: Application

Filed: December 2, 2011

Publication date: August 9, 2012

Inventors: Mark Leather, Norman Rubin, Brian D. Emberling, Michael Mantor
