Patents by Inventor Andrew M. Havlir

Andrew M. Havlir has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20200097293
    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
    Type: Application
    Filed: September 26, 2018
    Publication date: March 26, 2020
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
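The two-pointer scheme this abstract describes can be modeled as a toy Python sketch. This is not the patented circuitry: the item encoding, the `memory` backing store, and the class names are all invented here for illustration.

```python
from collections import deque

# Hypothetical item kinds; the abstract names indirect-data-access and
# redirect items but does not define a concrete encoding.
DIRECT, INDIRECT = "direct", "indirect"

class StreamParser:
    """Toy model of a buffer scanned by two pointers: a fetch parse
    pointer that runs ahead issuing indirect fetches, and an execute
    parse pointer that trails it and emits completed items in order."""

    def __init__(self, items, memory):
        self.buf = list(items)           # pre-fetched stream items
        self.memory = memory             # backing store for indirect data
        self.fetch_ptr = 0
        self.exec_ptr = 0
        self.indirect_results = deque()  # FIFO of resolved indirect fetches

    def fetch_parse_step(self):
        # Scan one item ahead; kick off the indirect fetch early so its
        # latency overlaps with execution of earlier items.
        if self.fetch_ptr < len(self.buf):
            kind, payload = self.buf[self.fetch_ptr]
            if kind == INDIRECT:
                self.indirect_results.append(self.memory[payload])
            self.fetch_ptr += 1

    def execute_parse_step(self):
        # Emit the next item in stream order, pairing indirect items with
        # the data the fetch parse pass already requested.
        if self.exec_ptr < self.fetch_ptr:
            kind, payload = self.buf[self.exec_ptr]
            self.exec_ptr += 1
            if kind == INDIRECT:
                return self.indirect_results.popleft()
            return payload
        return None

memory = {0x10: "kernel_args_A", 0x20: "kernel_args_B"}
stream = [(DIRECT, "kern0"), (INDIRECT, 0x10), (INDIRECT, 0x20)]
p = StreamParser(stream, memory)
for _ in stream:                # fetch parse runs ahead of execute parse
    p.fetch_parse_step()
out = [p.execute_parse_step() for _ in stream]
print(out)                      # ['kern0', 'kernel_args_A', 'kernel_args_B']
```

The point of the trailing execute parse pointer is that by the time it reaches an indirect item, the indirect data is (ideally) already waiting in the FIFO, hiding the fetch latency.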
  • Publication number: 20200098160
    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups.
    Type: Application
    Filed: September 26, 2018
    Publication date: March 26, 2020
    Inventors: Andrew M. Havlir, Benjamin Bowman, Jeffrey T. Brady
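The master parser's job of walking a multi-dimensional kernel grid and handing batches to distributed parsers can be sketched as follows. The row-major walk, the batch encoding, and the round-robin fabric distribution are all illustrative assumptions, not the patented design.

```python
def batches(kernel_dims, batch_size):
    """Walk a 3-D kernel grid in row-major order, yielding the starting
    coordinate and size of each batch of workgroups (a toy stand-in for
    the master parser's 'next position in multiple dimensions')."""
    xs, ys, zs = kernel_dims
    total = xs * ys * zs
    for flat in range(0, total, batch_size):
        x = flat % xs
        y = (flat // xs) % ys
        z = flat // (xs * ys)
        yield (x, y, z), min(batch_size, total - flat)

# Round-robin the batch information across two distributed parsers,
# standing in for sends over the communications fabric.
parsers = {0: [], 1: []}
for i, (start, count) in enumerate(batches((4, 2, 1), 3)):
    parsers[i % len(parsers)].append((start, count))
print(parsers)   # parser 0: batches at (0,0,0) and (2,1,0); parser 1: (3,0,0)
```

A real implementation would also track per-parser credit/backpressure on the fabric; this sketch only shows the position iteration and assignment.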
  • Patent number: 10593094
    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups.
    Type: Grant
    Filed: September 26, 2018
    Date of Patent: March 17, 2020
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Benjamin Bowman, Jeffrey T. Brady
  • Publication number: 20200065104
    Abstract: Techniques are disclosed relating to controlling an operand cache in a pipelined fashion. An operand cache may cache operands fetched from the register file or generated by previous instructions to improve performance and/or reduce power consumption. In some embodiments, instructions are pipelined and separate tag information is maintained to indicate allocation of an operand cache entry and ownership of the operand cache entry. In some embodiments, this may allow an operand to remain in the operand cache (and potentially be retrieved or modified) during an interval between allocation of the entry for another operand and ownership of the entry by the other operand. This may improve operand cache efficiency by allowing the entry to be used while retrieving the other operand from the register file, for example.
    Type: Application
    Filed: August 24, 2018
    Publication date: February 27, 2020
    Inventors: Robert D. Kenney, Terence M. Potter, Andrew M. Havlir, Sivayya V. Ayinala
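The separate allocation and ownership tags can be sketched as a small Python model. The entry layout and method names here are invented for illustration; the point is only that an old operand stays readable between allocation and ownership.

```python
class OperandCacheEntry:
    """Toy entry with separate 'allocated-for' and 'owned-by' tags, so
    the old operand stays readable while the new one is being fetched."""
    def __init__(self, owner, value):
        self.alloc_tag = owner   # operand this entry will hold next
        self.own_tag = owner     # operand whose data is valid right now
        self.value = value

    def allocate(self, new_operand):
        # Pipeline front end: claim the entry without destroying its data.
        self.alloc_tag = new_operand

    def take_ownership(self, value):
        # The register-file fetch completed: the new operand's data lands.
        self.own_tag = self.alloc_tag
        self.value = value

    def read(self, operand):
        # A read is valid only while `operand` still owns the entry.
        return self.value if self.own_tag == operand else None

e = OperandCacheEntry("r3", 7)
e.allocate("r9")                 # r9 claims the entry...
assert e.read("r3") == 7         # ...but r3 is still readable meanwhile
e.take_ownership(42)
assert e.read("r3") is None and e.read("r9") == 42
```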
  • Patent number: 10475152
    Abstract: Techniques are disclosed relating to managing dependencies in a compute control stream that specifies operations to be performed on a programmable shader (e.g., of a graphics unit). In some embodiments, the compute control stream includes commands and kernels. In some embodiments, dependency circuitry is configured to maintain dependencies such that younger kernels are allowed to execute ahead of a type of cache-related command (e.g., a command that signals a cache flush and/or invalidate). Disclosed circuitry may include separate buffers for commands and kernels, command dependency circuitry, and kernel dependency circuitry. In various embodiments, the disclosed architecture may improve performance in a highly scalable manner.
    Type: Grant
    Filed: February 14, 2018
    Date of Patent: November 12, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
  • Patent number: 10467724
    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, workgroup batch circuitry is configured to select (e.g., in a single clock cycle) multiple workgroups to be distributed to different shader circuitry. In some embodiments, iterator circuitry is configured to determine next positions in different dimensions at least partially in parallel. For example, in some embodiments, first circuitry is configured to determine a next position in a first dimension and an increment amount for a second dimension. In some embodiments, second circuitry is configured to determine at least partially in parallel with the determination of the next position in the first dimension, next positions in the second dimension for multiple possible increment amounts in the second dimension. In some embodiments, this may facilitate a configurable number of workgroups per batch and may increase performance, e.g., by increasing the overall number of workgroups dispatched per clock cycle.
    Type: Grant
    Filed: February 14, 2018
    Date of Patent: November 5, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
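The parallel next-position computation this abstract describes can be sketched functionally: one part computes the next x position and the y increment amount, while (conceptually in parallel) y candidates are precomputed for every possible increment, with the carry selecting among them. The arithmetic below is an illustrative reconstruction, not the patented circuit.

```python
def next_position(x, y, width, batch):
    """Compute the next (x, y) walk position after dispatching a batch
    of `batch` workgroups from a grid `width` workgroups wide."""
    # First circuit: next x and how many rows the batch spills into.
    flat = x + batch
    next_x = flat % width
    y_increment = flat // width
    # Second circuit (conceptually in parallel): y candidates for every
    # increment amount this batch size could possibly produce.
    y_candidates = [y + i for i in range(batch // width + 2)]
    return next_x, y_candidates[y_increment]

assert next_position(6, 0, width=8, batch=4) == (2, 1)  # wraps into row 1
assert next_position(0, 3, width=8, batch=4) == (4, 3)  # stays in row 3
```

Precomputing all candidate y values and selecting with the carry is what lets the two dimensions be resolved in the same cycle rather than serially.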
  • Patent number: 10353711
    Abstract: Techniques are disclosed relating to clause-based execution of program instructions, which may be single-instruction multiple data (SIMD) computer instructions. In some embodiments, an apparatus includes execution circuitry configured to receive clauses of instructions and SIMD groups of input data to be operated on by the clauses. In some embodiments, the apparatus further includes one or more storage elements configured to store state information for clauses processed by the execution circuitry. In some embodiments, the apparatus further includes scheduling circuitry configured to send instructions of a first clause and corresponding input data for execution by the execution circuitry and indicate, prior to sending instruction and input data of a second clause to the execution circuitry for execution, whether the second clause and the first clause are assigned to operate on groups of input data corresponding to the same instruction stream.
    Type: Grant
    Filed: September 6, 2016
    Date of Patent: July 16, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Liang Xia, Terence M. Potter
  • Patent number: 10282169
    Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies a result in a first floating-point format and the second instruction type specifies fused multiply addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
    Type: Grant
    Filed: April 6, 2016
    Date of Patent: May 7, 2019
    Assignee: Apple Inc.
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir, Yu Sun, Nicolas X. Pena, Xiao-Long Wu, Christopher A. Burns
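The down-converting instruction type can be illustrated numerically: multiply-add in the wider format, then round once into the 16-bit format. This is only a behavioral sketch; Python floats are doubles, so the wide intermediate here is fp64 rather than true fp32, and the single-cycle pipelining the patent addresses is not modeled at all.

```python
import struct

def fma_f32_to_f16(a, b, c):
    """Behavioral sketch of fused multiply-add with down-conversion:
    compute a*b + c at wide precision, then round once to IEEE binary16."""
    result = a * b + c
    # struct's 'e' format packs/unpacks IEEE half precision (binary16).
    return struct.unpack('<e', struct.pack('<e', result))[0]

# 1/3 is inexact: the single final rounding lands on the nearest
# half-precision value, 0.333251953125 (= 1365/4096).
assert fma_f32_to_f16(1.0, 1.0 / 3.0, 0.0) == 0.333251953125
```

Note that rounding only once, at the end, is the defining property of a *fused* operation; rounding the product and the sum separately could give a different (less accurate) result.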
  • Patent number: 10269091
    Abstract: Techniques are disclosed relating to storage techniques for storing primitive information with vertex re-use. In some embodiments, graphics circuitry aggregates primitive information (including vertex data) for multiple primitives into a primitive block data structure. This may include storing only a single instance of a vertex for multiple primitives that share the vertex. The graphics circuitry may switch between primitive blocks, with one being active and the others non-active. For non-active primitive blocks, the graphics circuitry may track whether vertex identifiers have been used for a new vertex, which may prevent vertex re-use. If an identifier is not used for a new vertex, however, a vertex may be re-used across deactivation and reactivation of a primitive block.
    Type: Grant
    Filed: November 10, 2017
    Date of Patent: April 23, 2019
    Assignee: Apple Inc.
    Inventors: Michael A. Mang, Andrew M. Havlir
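The vertex re-use idea (store one copy of a vertex shared by several primitives) is essentially index-based deduplication, which a short Python sketch makes concrete. The class and field names are invented; the deactivation/reactivation tracking from the abstract is omitted for brevity.

```python
class PrimitiveBlock:
    """Toy primitive block: primitives share a single stored copy of any
    vertex they have in common, referenced by index."""
    def __init__(self):
        self.vertices = []    # unique vertex data, one copy each
        self.index_of = {}    # vertex -> slot, for re-use detection
        self.primitives = []  # triangles as index triples

    def add_triangle(self, v0, v1, v2):
        tri = []
        for v in (v0, v1, v2):
            if v not in self.index_of:        # first sighting: store it
                self.index_of[v] = len(self.vertices)
                self.vertices.append(v)
            tri.append(self.index_of[v])      # otherwise re-use the slot
        self.primitives.append(tuple(tri))

blk = PrimitiveBlock()
blk.add_triangle((0, 0), (1, 0), (0, 1))
blk.add_triangle((1, 0), (0, 1), (1, 1))    # shares an edge with the first
assert len(blk.vertices) == 4               # 6 references, 4 stored vertices
assert blk.primitives == [(0, 1, 2), (1, 2, 3)]
```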
  • Publication number: 20180089090
    Abstract: In some embodiments, a system includes an execution unit, a register file, an operand cache, and a predication control circuit. Operands identified by an instruction may be stored in the operand cache. One or more entries of the operand cache that store the operands may be marked as dirty. The predication control circuit may identify an instruction as having an unresolved predication state. Subsequent to initiating execution of the instruction, the predication control circuit may receive results of at least one unresolved conditional instruction. In response to the results indicating the instruction has a known-to-execute predication state, the predication control circuit may initiate writing, in the operand cache, results of executing the instruction. In response to the results indicating the instruction has a known-not-to-execute predication state, the predication control circuit may prevent the results of executing the instruction from being written in the operand cache.
    Type: Application
    Filed: September 23, 2016
    Publication date: March 29, 2018
    Inventors: Andrew M. Havlir, Terence M. Potter
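The commit-or-drop behavior on predicate resolution can be modeled with a few lines of Python. The class shape and identifiers are purely illustrative; only the decision logic mirrors the abstract.

```python
class PredicationControl:
    """Toy model of deferring an instruction's operand-cache write until
    its predicate resolves: execute speculatively, then commit or drop."""
    def __init__(self):
        self.operand_cache = {}
        self.pending = {}    # instruction id -> (dest, speculative result)

    def execute_unresolved(self, instr_id, dest, result):
        # Predication state unknown: hold the result outside the cache.
        self.pending[instr_id] = (dest, result)

    def resolve(self, instr_id, will_execute):
        dest, result = self.pending.pop(instr_id)
        if will_execute:                     # known-to-execute: commit
            self.operand_cache[dest] = result
        # known-not-to-execute: the result is simply discarded

pc = PredicationControl()
pc.execute_unresolved(1, "r2", 99)
pc.resolve(1, will_execute=False)     # squashed: nothing written
assert "r2" not in pc.operand_cache
pc.execute_unresolved(2, "r2", 41)
pc.resolve(2, will_execute=True)      # committed to the operand cache
assert pc.operand_cache["r2"] == 41
```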
  • Publication number: 20180067748
    Abstract: Techniques are disclosed relating to clause-based execution of program instructions, which may be single-instruction multiple data (SIMD) computer instructions. In some embodiments, an apparatus includes execution circuitry configured to receive clauses of instructions and SIMD groups of input data to be operated on by the clauses. In some embodiments, the apparatus further includes one or more storage elements configured to store state information for clauses processed by the execution circuitry. In some embodiments, the apparatus further includes scheduling circuitry configured to send instructions of a first clause and corresponding input data for execution by the execution circuitry and indicate, prior to sending instruction and input data of a second clause to the execution circuitry for execution, whether the second clause and a first clause are assigned to operate on groups of input data corresponding to the same instruction stream.
    Type: Application
    Filed: September 6, 2016
    Publication date: March 8, 2018
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Liang Xia, Terence M. Potter
  • Patent number: 9846579
    Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
    Type: Grant
    Filed: June 13, 2016
    Date of Patent: December 19, 2017
    Assignee: Apple Inc.
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir
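Reusing one integer subtractor for float comparisons relies on mapping float bit patterns to integers whose ordering matches the float ordering. The sketch below uses the standard sign-flip encoding as a stand-in for the patent's padding circuitry, whose exact bit layout is not public.

```python
import struct

def f32_key(x):
    """Map a float32 bit pattern to an integer with the same ordering:
    set the top bit for non-negative floats, invert all bits for
    negative ones (the classic order-preserving encoding)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def compare(a, b):
    """Shared integer subtractor: the sign of the difference of the
    encoded keys gives the floating-point comparison result."""
    diff = f32_key(a) - f32_key(b)
    return (diff > 0) - (diff < 0)   # -1, 0, or 1

assert compare(1.5, -2.0) == 1    # positive > negative
assert compare(-3.0, -1.0) == -1  # more-negative sorts lower
assert compare(0.25, 0.25) == 0
```

Integer comparisons then use the same subtractor directly, which is the area saving the abstract points to. (This toy encoding treats -0.0 as less than +0.0 and does not special-case NaN, which real hardware must.)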
  • Publication number: 20170357506
    Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
    Type: Application
    Filed: June 13, 2016
    Publication date: December 14, 2017
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir
  • Publication number: 20170323420
    Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
    Type: Application
    Filed: July 24, 2017
    Publication date: November 9, 2017
    Inventors: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
  • Publication number: 20170293470
    Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies a result in a first floating-point format and the second instruction type specifies fused multiply addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
    Type: Application
    Filed: April 6, 2016
    Publication date: October 12, 2017
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir, Yu Sun, Nicolas X. Pena, Xiao-Long Wu, Christopher A. Burns
  • Patent number: 9785567
    Abstract: Techniques are disclosed relating to per-pipeline control for an operand cache. In some embodiments, an apparatus includes a register file and multiple execution pipelines. In some embodiments, the apparatus also includes an operand cache that includes multiple entries that each include multiple portions that are each configured to store an operand for a corresponding execution pipeline. In some embodiments, the operand cache is configured, during operation of the apparatus, to store data in only a subset of the portions of an entry. In some embodiments, the apparatus is configured to store, for each entry in the operand cache, a per-entry validity value that indicates whether the entry is valid and per-portion state information that indicates whether data for each portion is valid and whether data for each portion is modified relative to data in a corresponding entry in the register file.
    Type: Grant
    Filed: September 11, 2015
    Date of Patent: October 10, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Terence M. Potter, Liang-Kai Wang
  • Patent number: 9727944
    Abstract: Techniques are disclosed relating to low-level instruction storage in a graphics unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the hazard circuitry is configured to generate hazard information that specifies dependencies between ones of the decoded graphics instructions in the same clause. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause and hazard information generated by the decode circuitry for the clause.
    Type: Grant
    Filed: June 22, 2015
    Date of Patent: August 8, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
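The decode-once, execute-many idea behind the clause cache can be shown with a minimal sketch. The cache shape and the "decoded" representation are invented here; the behavior illustrated is just that repeated executions of a cached clause skip re-decoding.

```python
class ClauseCache:
    """Toy decoded-instruction cache keyed by clause identifier: a clause
    is decoded once, then fetched and executed repeatedly (e.g., once per
    SIMD group) from the cached entry."""
    def __init__(self):
        self.entries = {}
        self.decodes = 0     # counts how often the decoder actually ran

    def get(self, clause_id, raw_instructions):
        if clause_id not in self.entries:
            self.decodes += 1
            # Stand-in for real decoding of the clause's instructions.
            self.entries[clause_id] = [("decoded", i) for i in raw_instructions]
        return self.entries[clause_id]

cc = ClauseCache()
for _simd_group in range(3):             # same clause, three SIMD groups
    cc.get("clause0", ["fmul", "fadd"])
assert cc.decodes == 1                   # decoded once, executed thrice
```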
  • Patent number: 9652233
    Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from the register file may be energy and/or time intensive. An operand cache may be used to store a subset of operands, and may use less power and have quicker access times than the register file. Hint values may be used in some embodiments to suggest that a particular operand should be stored in the operand cache (so that it is available for current or future use). In one embodiment, a hint value indicates that an operand should be cached whenever possible. Hint values may be determined by software, such as a compiler, in some embodiments. One or more criteria may be used to determine hint values, such as how soon in the future or how frequently an operand will be used again.
    Type: Grant
    Filed: August 20, 2013
    Date of Patent: May 16, 2017
    Assignee: Apple Inc.
    Inventors: Terence M. Potter, Timothy A. Olson, James S. Blomgren, Andrew M. Havlir, Michael Geary
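A hint-driven operand cache can be sketched in a few lines of Python. The single-bit hint and the capacity policy below are simplifications chosen for illustration; the patent allows richer hint values derived by the compiler.

```python
class OperandCache:
    """Toy operand cache where a compiler-provided hint decides whether
    a fetched operand is worth keeping for re-use."""
    def __init__(self, capacity, register_file):
        self.capacity = capacity
        self.rf = register_file
        self.entries = {}              # register name -> cached value
        self.rf_reads = 0              # proxy for energy/time cost

    def read(self, reg, hint_cache=False):
        if reg in self.entries:        # hit: skip the expensive RF read
            return self.entries[reg]
        self.rf_reads += 1             # miss: fall back to register file
        value = self.rf[reg]
        if hint_cache and len(self.entries) < self.capacity:
            self.entries[reg] = value  # hint says it will be used again
        return value

rf = {"r0": 10, "r1": 20}
oc = OperandCache(capacity=1, register_file=rf)
oc.read("r0", hint_cache=True)   # miss; cached per compiler hint
oc.read("r0")                    # hit: served without touching the RF
oc.read("r1")                    # miss; no hint, so not cached
assert oc.rf_reads == 2          # the hinted operand saved one RF read
```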
  • Patent number: 9633409
    Abstract: Techniques are disclosed relating to predication. In one embodiment, a graphics processing unit is disclosed that includes a first set of architecturally-defined registers configured to store predication information. The graphics processing unit further includes a second set of registers configured to mirror the first set of registers and an execution pipeline configured to discontinue execution of an instruction sequence based on predication information in the second set of registers. In one embodiment, the second set of registers includes one or more registers proximal to an output of the execution pipeline. In some embodiments, the execution pipeline writes back a predicate value determined for a predicate writer to the second set of registers. The first set of architecturally-defined registers is then updated with the predicate value written back to the second set of registers. In some embodiments, the execution pipeline discontinues execution of the instruction sequence without stalling.
    Type: Grant
    Filed: August 26, 2013
    Date of Patent: April 25, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Michael A. Geary
  • Patent number: 9619394
    Abstract: An apparatus includes an operand cache for storing operands from a register file for use by execution circuitry. In some embodiments, eviction priority for the operand cache is based on the status of entries (e.g., whether dirty or clean) and the retention priority of entries. In some embodiments, flushes are handled differently based on their retention priority (e.g., low-priority entries may be pre-emptively flushed). In some embodiments, timing for cache clean operations is specified on a per-instruction basis. Disclosed techniques may spread out write backs in time, facilitate cache clean operations, facilitate thread switching, extend the time operands are available in an operand cache, and/or improve the use of compiler hints, in some embodiments.
    Type: Grant
    Filed: July 21, 2015
    Date of Patent: April 11, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Terence M. Potter
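The eviction policy sketched in this abstract, which weighs entry status (dirty vs. clean) against retention priority, can be illustrated as a sort. The exact ranking below is an illustrative guess: clean entries go first because they need no write-back, and low-retention-priority entries go before high within each class.

```python
def eviction_order(entries):
    """Rank operand-cache entries for eviction: clean before dirty, and
    low retention priority before high within each class. Python sorts
    False before True, so the (dirty, retention) key does both at once."""
    return sorted(entries, key=lambda e: (e["dirty"], e["retention"]))

entries = [
    {"reg": "r4", "dirty": True,  "retention": 0},
    {"reg": "r1", "dirty": False, "retention": 1},
    {"reg": "r7", "dirty": False, "retention": 0},
]
victims = [e["reg"] for e in eviction_order(entries)]
print(victims)   # ['r7', 'r1', 'r4']: clean/low-priority evicted first
```

Pre-emptively flushing the low-priority dirty entries (also mentioned in the abstract) would gradually move them into the cheap "clean" class, spreading write-backs out in time.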