Patents by Inventor Andrew M. Havlir

Andrew M. Havlir has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20200097293
    Abstract: Techniques are disclosed relating to fetching items from a compute command stream that includes compute kernels. In some embodiments, stream fetch circuitry sequentially pre-fetches items from the stream and stores them in a buffer. In some embodiments, fetch parse circuitry iterates through items in the buffer using a fetch parse pointer to detect indirect-data-access items and/or redirect items in the buffer. The fetch parse circuitry may send detected indirect data accesses to indirect-fetch circuitry, which may buffer requests. In some embodiments, execute parse circuitry iterates through items in the buffer using an execute parse pointer (e.g., which may trail the fetch parse pointer) and outputs both item data from the buffer and indirect-fetch results from indirect-fetch circuitry for execution. In various embodiments, the disclosed techniques may reduce fetch latency for compute kernels.
    Type: Application
    Filed: September 26, 2018
    Publication date: March 26, 2020
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
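The two-pointer scheme this abstract describes can be modeled as a toy Python sketch. This is not the patented circuitry: the item encoding, the `memory` backing store, and the class names are all invented here for illustration.

```python
from collections import deque

# Hypothetical item kinds; the abstract names indirect-data-access and
# redirect items but does not define a concrete encoding.
DIRECT, INDIRECT = "direct", "indirect"

class StreamParser:
    """Toy model of a buffer scanned by two pointers: a fetch parse
    pointer that runs ahead issuing indirect fetches, and an execute
    parse pointer that trails it and emits completed items in order."""

    def __init__(self, items, memory):
        self.buf = list(items)           # pre-fetched stream items
        self.memory = memory             # backing store for indirect data
        self.fetch_ptr = 0
        self.exec_ptr = 0
        self.indirect_results = deque()  # FIFO of resolved indirect fetches

    def fetch_parse_step(self):
        # Scan one item ahead; kick off the indirect fetch early so its
        # latency overlaps with execution of earlier items.
        if self.fetch_ptr < len(self.buf):
            kind, payload = self.buf[self.fetch_ptr]
            if kind == INDIRECT:
                self.indirect_results.append(self.memory[payload])
            self.fetch_ptr += 1

    def execute_parse_step(self):
        # Emit the next item in stream order, pairing indirect items with
        # the data the fetch parse pass already requested.
        if self.exec_ptr < self.fetch_ptr:
            kind, payload = self.buf[self.exec_ptr]
            self.exec_ptr += 1
            if kind == INDIRECT:
                return self.indirect_results.popleft()
            return payload
        return None

memory = {0x10: "kernel_args_A", 0x20: "kernel_args_B"}
stream = [(DIRECT, "kern0"), (INDIRECT, 0x10), (INDIRECT, 0x20)]
p = StreamParser(stream, memory)
for _ in stream:                # fetch parse runs ahead of execute parse
    p.fetch_parse_step()
out = [p.execute_parse_step() for _ in stream]
print(out)                      # ['kern0', 'kernel_args_A', 'kernel_args_B']
```

The point of the trailing execute parse pointer is that by the time it reaches an indirect item, the indirect data is (ideally) already waiting in the FIFO, hiding the fetch latency.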
  • Publication number: 20200098160
    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups.
    Type: Application
    Filed: September 26, 2018
    Publication date: March 26, 2020
    Inventors: Andrew M. Havlir, Benjamin Bowman, Jeffrey T. Brady
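The master parser's job of walking a multi-dimensional kernel grid and handing batches to distributed parsers can be sketched as follows. The row-major walk, the batch encoding, and the round-robin fabric distribution are all illustrative assumptions, not the patented design.

```python
def batches(kernel_dims, batch_size):
    """Walk a 3-D kernel grid in row-major order, yielding the starting
    coordinate and size of each batch of workgroups (a toy stand-in for
    the master parser's 'next position in multiple dimensions')."""
    xs, ys, zs = kernel_dims
    total = xs * ys * zs
    for flat in range(0, total, batch_size):
        x = flat % xs
        y = (flat // xs) % ys
        z = flat // (xs * ys)
        yield (x, y, z), min(batch_size, total - flat)

# Round-robin the batch information across two distributed parsers,
# standing in for sends over the communications fabric.
parsers = {0: [], 1: []}
for i, (start, count) in enumerate(batches((4, 2, 1), 3)):
    parsers[i % len(parsers)].append((start, count))
print(parsers)   # parser 0: batches at (0,0,0) and (2,1,0); parser 1: (3,0,0)
```

A real implementation would also track per-parser credit/backpressure on the fabric; this sketch only shows the position iteration and assignment.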
  • Patent number: 10593094
    Abstract: Techniques are disclosed relating to distributing work from compute kernels using a distributed hierarchical parser architecture. In some embodiments, an apparatus includes a plurality of shader units configured to perform operations for compute workgroups included in compute kernels processed by the apparatus, a plurality of distributed workload parser circuits, and a communications fabric connected to the plurality of distributed workload parser circuits and a master workload parser circuit. In some embodiments, the master workload parser circuit is configured to iteratively determine a next position in multiple dimensions for a next batch of workgroups from the kernel and send batch information to the distributed workload parser circuits via the communications fabric to assign the batch of workgroups.
    Type: Grant
    Filed: September 26, 2018
    Date of Patent: March 17, 2020
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Benjamin Bowman, Jeffrey T. Brady
  • Publication number: 20200065104
    Abstract: Techniques are disclosed relating to controlling an operand cache in a pipelined fashion. An operand cache may cache operands fetched from the register file or generated by previous instructions to improve performance and/or reduce power consumption. In some embodiments, instructions are pipelined and separate tag information is maintained to indicate allocation of an operand cache entry and ownership of the operand cache entry. In some embodiments, this may allow an operand to remain in the operand cache (and potentially be retrieved or modified) during an interval between allocation of the entry for another operand and ownership of the entry by the other operand. This may improve operand cache efficiency by allowing the entry to be used while retrieving the other operand from the register file, for example.
    Type: Application
    Filed: August 24, 2018
    Publication date: February 27, 2020
    Inventors: Robert D. Kenney, Terence M. Potter, Andrew M. Havlir, Sivayya V. Ayinala
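The separate allocation and ownership tags can be sketched as a small Python model. The entry layout and method names here are invented for illustration; the point is only that an old operand stays readable between allocation and ownership.

```python
class OperandCacheEntry:
    """Toy entry with separate 'allocated-for' and 'owned-by' tags, so
    the old operand stays readable while the new one is being fetched."""
    def __init__(self, owner, value):
        self.alloc_tag = owner   # operand this entry will hold next
        self.own_tag = owner     # operand whose data is valid right now
        self.value = value

    def allocate(self, new_operand):
        # Pipeline front end: claim the entry without destroying its data.
        self.alloc_tag = new_operand

    def take_ownership(self, value):
        # The register-file fetch completed: the new operand's data lands.
        self.own_tag = self.alloc_tag
        self.value = value

    def read(self, operand):
        # A read is valid only while `operand` still owns the entry.
        return self.value if self.own_tag == operand else None

e = OperandCacheEntry("r3", 7)
e.allocate("r9")                 # r9 claims the entry...
assert e.read("r3") == 7         # ...but r3 is still readable meanwhile
e.take_ownership(42)
assert e.read("r3") is None and e.read("r9") == 42
```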
  • Patent number: 10475152
    Abstract: Techniques are disclosed relating to managing dependencies in a compute control stream that specifies operations to be performed on a programmable shader (e.g., of a graphics unit). In some embodiments, the compute control stream includes commands and kernels. In some embodiments, dependency circuitry is configured to maintain dependencies such that younger kernels are allowed to execute ahead of a type of cache-related command (e.g., a command that signals a cache flush and/or invalidate). Disclosed circuitry may include separate buffers for commands and kernels, command dependency circuitry, and kernel dependency circuitry. In various embodiments, the disclosed architecture may improve performance in a highly scalable manner.
    Type: Grant
    Filed: February 14, 2018
    Date of Patent: November 12, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
  • Patent number: 10467724
    Abstract: Techniques are disclosed relating to dispatching compute work from a compute stream. In some embodiments, workgroup batch circuitry is configured to select (e.g., in a single clock cycle) multiple workgroups to be distributed to different shader circuitry. In some embodiments, iterator circuitry is configured to determine next positions in different dimensions at least partially in parallel. For example, in some embodiments, first circuitry is configured to determine a next position in a first dimension and an increment amount for a second dimension. In some embodiments, second circuitry is configured to determine at least partially in parallel with the determination of the next position in the first dimension, next positions in the second dimension for multiple possible increment amounts in the second dimension. In some embodiments, this may facilitate a configurable number of workgroups per batch and may increase performance, e.g., by increasing the overall number of workgroups dispatched per clock cycle.
    Type: Grant
    Filed: February 14, 2018
    Date of Patent: November 5, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Jeffrey T. Brady
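The parallel next-position computation this abstract describes can be sketched functionally: one part computes the next x position and the y increment amount, while (conceptually in parallel) y candidates are precomputed for every possible increment, with the carry selecting among them. The arithmetic below is an illustrative reconstruction, not the patented circuit.

```python
def next_position(x, y, width, batch):
    """Compute the next (x, y) walk position after dispatching a batch
    of `batch` workgroups from a grid `width` workgroups wide."""
    # First circuit: next x and how many rows the batch spills into.
    flat = x + batch
    next_x = flat % width
    y_increment = flat // width
    # Second circuit (conceptually in parallel): y candidates for every
    # increment amount this batch size could possibly produce.
    y_candidates = [y + i for i in range(batch // width + 2)]
    return next_x, y_candidates[y_increment]

assert next_position(6, 0, width=8, batch=4) == (2, 1)  # wraps into row 1
assert next_position(0, 3, width=8, batch=4) == (4, 3)  # stays in row 3
```

Precomputing all candidate y values and selecting with the carry is what lets the two dimensions be resolved in the same cycle rather than serially.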
  • Patent number: 10353711
    Abstract: Techniques are disclosed relating to clause-based execution of program instructions, which may be single-instruction multiple data (SIMD) computer instructions. In some embodiments, an apparatus includes execution circuitry configured to receive clauses of instructions and SIMD groups of input data to be operated on by the clauses. In some embodiments, the apparatus further includes one or more storage elements configured to store state information for clauses processed by the execution circuitry. In some embodiments, the apparatus further includes scheduling circuitry configured to send instructions of a first clause and corresponding input data for execution by the execution circuitry and indicate, prior to sending instruction and input data of a second clause to the execution circuitry for execution, whether the second clause and the first clause are assigned to operate on groups of input data corresponding to the same instruction stream.
    Type: Grant
    Filed: September 6, 2016
    Date of Patent: July 16, 2019
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Liang Xia, Terence M. Potter
  • Patent number: 10282169
    Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies a result in a first floating-point format and the second instruction type specifies fused multiply addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
    Type: Grant
    Filed: April 6, 2016
    Date of Patent: May 7, 2019
    Assignee: Apple Inc.
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir, Yu Sun, Nicolas X. Pena, Xiao-Long Wu, Christopher A. Burns
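The down-converting instruction type can be illustrated numerically: multiply-add in the wider format, then round once into the 16-bit format. This is only a behavioral sketch; Python floats are doubles, so the wide intermediate here is fp64 rather than true fp32, and the single-cycle pipelining the patent addresses is not modeled at all.

```python
import struct

def fma_f32_to_f16(a, b, c):
    """Behavioral sketch of fused multiply-add with down-conversion:
    compute a*b + c at wide precision, then round once to IEEE binary16."""
    result = a * b + c
    # struct's 'e' format packs/unpacks IEEE half precision (binary16).
    return struct.unpack('<e', struct.pack('<e', result))[0]

# 1/3 is inexact: the single final rounding lands on the nearest
# half-precision value, 0.333251953125 (= 1365/4096).
assert fma_f32_to_f16(1.0, 1.0 / 3.0, 0.0) == 0.333251953125
```

Note that rounding only once, at the end, is the defining property of a *fused* operation; rounding the product and the sum separately could give a different (less accurate) result.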
  • Patent number: 10269091
    Abstract: Techniques are disclosed relating to storage techniques for storing primitive information with vertex re-use. In some embodiments, graphics circuitry aggregates primitive information (including vertex data) for multiple primitives into a primitive block data structure. This may include storing only a single instance of a vertex for multiple primitives that share the vertex. The graphics circuitry may switch between primitive blocks, with one being active and the others non-active. For non-active primitive blocks, the graphics circuitry may track whether vertex identifiers have been used for a new vertex, which may prevent vertex re-use. If an identifier is not used for a new vertex, however, a vertex may be re-used across deactivation and reactivation of a primitive block.
    Type: Grant
    Filed: November 10, 2017
    Date of Patent: April 23, 2019
    Assignee: Apple Inc.
    Inventors: Michael A. Mang, Andrew M. Havlir
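The vertex re-use idea (store one copy of a vertex shared by several primitives) is essentially index-based deduplication, which a short Python sketch makes concrete. The class and field names are invented; the deactivation/reactivation tracking from the abstract is omitted for brevity.

```python
class PrimitiveBlock:
    """Toy primitive block: primitives share a single stored copy of any
    vertex they have in common, referenced by index."""
    def __init__(self):
        self.vertices = []    # unique vertex data, one copy each
        self.index_of = {}    # vertex -> slot, for re-use detection
        self.primitives = []  # triangles as index triples

    def add_triangle(self, v0, v1, v2):
        tri = []
        for v in (v0, v1, v2):
            if v not in self.index_of:        # first sighting: store it
                self.index_of[v] = len(self.vertices)
                self.vertices.append(v)
            tri.append(self.index_of[v])      # otherwise re-use the slot
        self.primitives.append(tuple(tri))

blk = PrimitiveBlock()
blk.add_triangle((0, 0), (1, 0), (0, 1))
blk.add_triangle((1, 0), (0, 1), (1, 1))    # shares an edge with the first
assert len(blk.vertices) == 4               # 6 references, 4 stored vertices
assert blk.primitives == [(0, 1, 2), (1, 2, 3)]
```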
  • Publication number: 20180089090
    Abstract: In some embodiments, a system includes an execution unit, a register file, an operand cache, and a predication control circuit. Operands identified by an instruction may be stored in the operand cache. One or more entries of the operand cache that store the operands may be marked as dirty. The predication control circuit may identify an instruction as having an unresolved predication state. Subsequent to initiating execution of the instruction, the predication control circuit may receive results of at least one unresolved conditional instruction. In response to the results indicating the instruction has a known-to-execute predication state, the predication control circuit may initiate writing, in the operand cache, results of executing the instruction. In response to the results indicating the instruction has a known-not-to-execute predication state, the predication control circuit may prevent the results of executing the instruction from being written in the operand cache.
    Type: Application
    Filed: September 23, 2016
    Publication date: March 29, 2018
    Inventors: Andrew M. Havlir, Terence M. Potter
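The commit-or-drop behavior on predicate resolution can be modeled with a few lines of Python. The class shape and identifiers are purely illustrative; only the decision logic mirrors the abstract.

```python
class PredicationControl:
    """Toy model of deferring an instruction's operand-cache write until
    its predicate resolves: execute speculatively, then commit or drop."""
    def __init__(self):
        self.operand_cache = {}
        self.pending = {}    # instruction id -> (dest, speculative result)

    def execute_unresolved(self, instr_id, dest, result):
        # Predication state unknown: hold the result outside the cache.
        self.pending[instr_id] = (dest, result)

    def resolve(self, instr_id, will_execute):
        dest, result = self.pending.pop(instr_id)
        if will_execute:                     # known-to-execute: commit
            self.operand_cache[dest] = result
        # known-not-to-execute: the result is simply discarded

pc = PredicationControl()
pc.execute_unresolved(1, "r2", 99)
pc.resolve(1, will_execute=False)     # squashed: nothing written
assert "r2" not in pc.operand_cache
pc.execute_unresolved(2, "r2", 41)
pc.resolve(2, will_execute=True)      # committed to the operand cache
assert pc.operand_cache["r2"] == 41
```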
  • Publication number: 20180067748
    Abstract: Techniques are disclosed relating to clause-based execution of program instructions, which may be single-instruction multiple data (SIMD) computer instructions. In some embodiments, an apparatus includes execution circuitry configured to receive clauses of instructions and SIMD groups of input data to be operated on by the clauses. In some embodiments, the apparatus further includes one or more storage elements configured to store state information for clauses processed by the execution circuitry. In some embodiments, the apparatus further includes scheduling circuitry configured to send instructions of a first clause and corresponding input data for execution by the execution circuitry and indicate, prior to sending instruction and input data of a second clause to the execution circuitry for execution, whether the second clause and a first clause are assigned to operate on groups of input data corresponding to the same instruction stream.
    Type: Application
    Filed: September 6, 2016
    Publication date: March 8, 2018
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Liang Xia, Terence M. Potter
  • Patent number: 9846579
    Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
    Type: Grant
    Filed: June 13, 2016
    Date of Patent: December 19, 2017
    Assignee: Apple Inc.
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir
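Reusing one integer subtractor for float comparisons relies on mapping float bit patterns to integers whose ordering matches the float ordering. The sketch below uses the standard sign-flip encoding as a stand-in for the patent's padding circuitry, whose exact bit layout is not public.

```python
import struct

def f32_key(x):
    """Map a float32 bit pattern to an integer with the same ordering:
    set the top bit for non-negative floats, invert all bits for
    negative ones (the classic order-preserving encoding)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def compare(a, b):
    """Shared integer subtractor: the sign of the difference of the
    encoded keys gives the floating-point comparison result."""
    diff = f32_key(a) - f32_key(b)
    return (diff > 0) - (diff < 0)   # -1, 0, or 1

assert compare(1.5, -2.0) == 1    # positive > negative
assert compare(-3.0, -1.0) == -1  # more-negative sorts lower
assert compare(0.25, 0.25) == 0
```

Integer comparisons then use the same subtractor directly, which is the area saving the abstract points to. (This toy encoding treats -0.0 as less than +0.0 and does not special-case NaN, which real hardware must.)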
  • Publication number: 20170357506
    Abstract: Techniques are disclosed relating to comparison circuitry. In some embodiments, compare circuitry is configured to generate comparison results for sets of inputs in both one or more integer formats and one or more floating-point formats. In some embodiments, the compare circuitry includes padding circuitry configured to add one or more bits to each of first and second input values to generate first and second padded values. In some embodiments, the compare circuitry also includes integer subtraction circuitry configured to subtract the first padded value from the second padded value to generate a subtraction result. In some embodiments, the compare circuitry includes output logic configured to generate the comparison result based on the subtraction result. In various embodiments, using at least a portion of the same circuitry (e.g., the subtractor) for both integer and floating-point comparisons may reduce processor area.
    Type: Application
    Filed: June 13, 2016
    Publication date: December 14, 2017
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir
  • Publication number: 20170323420
    Abstract: Techniques are disclosed relating to low-level instruction storage in a processing unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause. A given clause may be fetched and executed multiple times, e.g., for different SIMD groups, while stored in the caching circuitry.
    Type: Application
    Filed: July 24, 2017
    Publication date: November 9, 2017
    Inventors: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
  • Publication number: 20170293470
    Abstract: Techniques are disclosed relating to floating-point operations with down-conversion. In some embodiments, a floating-point unit is configured to perform fused multiply-addition operations based on first and second different instruction types. In some embodiments, the first instruction type specifies a result in a first floating-point format and the second instruction type specifies fused multiply addition of input operands in the first floating-point format to generate a result in a second, lower-precision floating-point format. For example, the first format may be a 32-bit format and the second format may be a 16-bit format. In some embodiments, the floating-point unit includes rounding circuitry, exponent circuitry, and/or increment circuitry configured to generate signals for the second instruction type in the same pipeline stage as for the first instruction type. In some embodiments, disclosed techniques may reduce the number of pipeline stages included in the floating-point circuitry.
    Type: Application
    Filed: April 6, 2016
    Publication date: October 12, 2017
    Inventors: Liang-Kai Wang, Terence M. Potter, Andrew M. Havlir, Yu Sun, Nicolas X. Pena, Xiao-Long Wu, Christopher A. Burns
  • Patent number: 9785567
    Abstract: Techniques are disclosed relating to per-pipeline control for an operand cache. In some embodiments, an apparatus includes a register file and multiple execution pipelines. In some embodiments, the apparatus also includes an operand cache that includes multiple entries that each include multiple portions that are each configured to store an operand for a corresponding execution pipeline. In some embodiments, the operand cache is configured, during operation of the apparatus, to store data in only a subset of the portions of an entry. In some embodiments, the apparatus is configured to store, for each entry in the operand cache, a per-entry validity value that indicates whether the entry is valid and per-portion state information that indicates whether data for each portion is valid and whether data for each portion is modified relative to data in a corresponding entry in the register file.
    Type: Grant
    Filed: September 11, 2015
    Date of Patent: October 10, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Terence M. Potter, Liang-Kai Wang
  • Patent number: 9727944
    Abstract: Techniques are disclosed relating to low-level instruction storage in a graphics unit. In some embodiments, a graphics unit includes execution circuitry, decode circuitry, hazard circuitry, and caching circuitry. In some embodiments, the execution circuitry is configured to execute clauses of graphics instructions. In some embodiments, the decode circuitry is configured to receive graphics instructions and a clause identifier for each received graphics instruction and to decode the received graphics instructions. In some embodiments, the hazard circuitry is configured to generate hazard information that specifies dependencies between ones of the decoded graphics instructions in the same clause. In some embodiments, the caching circuitry includes a plurality of entries each configured to store a set of decoded instructions in the same clause and hazard information generated by the decode circuitry for the clause.
    Type: Grant
    Filed: June 22, 2015
    Date of Patent: August 8, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Dzung Q. Vu, Liang Kai Wang
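The decode-once, execute-many idea behind the clause cache can be shown with a minimal sketch. The cache shape and the "decoded" representation are invented here; the behavior illustrated is just that repeated executions of a cached clause skip re-decoding.

```python
class ClauseCache:
    """Toy decoded-instruction cache keyed by clause identifier: a clause
    is decoded once, then fetched and executed repeatedly (e.g., once per
    SIMD group) from the cached entry."""
    def __init__(self):
        self.entries = {}
        self.decodes = 0     # counts how often the decoder actually ran

    def get(self, clause_id, raw_instructions):
        if clause_id not in self.entries:
            self.decodes += 1
            # Stand-in for real decoding of the clause's instructions.
            self.entries[clause_id] = [("decoded", i) for i in raw_instructions]
        return self.entries[clause_id]

cc = ClauseCache()
for _simd_group in range(3):             # same clause, three SIMD groups
    cc.get("clause0", ["fmul", "fadd"])
assert cc.decodes == 1                   # decoded once, executed thrice
```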
  • Patent number: 9652233
    Abstract: Instructions may require one or more operands to be executed, which may be provided from a register file. In the context of a GPU, however, a register file may be a relatively large structure, and reading from the register file may be energy and/or time intensive. An operand cache may be used to store a subset of operands, and may use less power and have quicker access times than the register file. Hint values may be used in some embodiments to suggest that a particular operand should be stored in the operand cache (so that it is available for current or future use). In one embodiment, a hint value indicates that an operand should be cached whenever possible. Hint values may be determined by software, such as a compiler, in some embodiments. One or more criteria may be used to determine hint values, such as how soon in the future or how frequently an operand will be used again.
    Type: Grant
    Filed: August 20, 2013
    Date of Patent: May 16, 2017
    Assignee: Apple Inc.
    Inventors: Terence M. Potter, Timothy A. Olson, James S. Blomgren, Andrew M. Havlir, Michael Geary
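A hint-driven operand cache can be sketched in a few lines of Python. The single-bit hint and the capacity policy below are simplifications chosen for illustration; the patent allows richer hint values derived by the compiler.

```python
class OperandCache:
    """Toy operand cache where a compiler-provided hint decides whether
    a fetched operand is worth keeping for re-use."""
    def __init__(self, capacity, register_file):
        self.capacity = capacity
        self.rf = register_file
        self.entries = {}              # register name -> cached value
        self.rf_reads = 0              # proxy for energy/time cost

    def read(self, reg, hint_cache=False):
        if reg in self.entries:        # hit: skip the expensive RF read
            return self.entries[reg]
        self.rf_reads += 1             # miss: fall back to register file
        value = self.rf[reg]
        if hint_cache and len(self.entries) < self.capacity:
            self.entries[reg] = value  # hint says it will be used again
        return value

rf = {"r0": 10, "r1": 20}
oc = OperandCache(capacity=1, register_file=rf)
oc.read("r0", hint_cache=True)   # miss; cached per compiler hint
oc.read("r0")                    # hit: served without touching the RF
oc.read("r1")                    # miss; no hint, so not cached
assert oc.rf_reads == 2          # the hinted operand saved one RF read
```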
  • Patent number: 9633409
    Abstract: Techniques are disclosed relating to predication. In one embodiment, a graphics processing unit is disclosed that includes a first set of architecturally-defined registers configured to store predication information. The graphics processing unit further includes a second set of registers configured to mirror the first set of registers and an execution pipeline configured to discontinue execution of an instruction sequence based on predication information in the second set of registers. In one embodiment, the second set of registers includes one or more registers proximal to an output of the execution pipeline. In some embodiments, the execution pipeline writes back a predicate value determined for a predicate writer to the second set of registers. The first set of architecturally-defined registers is then updated with the predicate value written back to the second set of registers. In some embodiments, the execution pipeline discontinues execution of the instruction sequence without stalling.
    Type: Grant
    Filed: August 26, 2013
    Date of Patent: April 25, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Brian K. Reynolds, Michael A. Geary
  • Patent number: 9619394
    Abstract: An apparatus includes an operand cache for storing operands from a register file for use by execution circuitry. In some embodiments, eviction priority for the operand cache is based on the status of entries (e.g., whether dirty or clean) and the retention priority of entries. In some embodiments, flushes are handled differently based on their retention priority (e.g., low-priority entries may be pre-emptively flushed). In some embodiments, timing for cache clean operations is specified on a per-instruction basis. Disclosed techniques may spread out write backs in time, facilitate cache clean operations, facilitate thread switching, extend the time operands are available in an operand cache, and/or improve the use of compiler hints, in some embodiments.
    Type: Grant
    Filed: July 21, 2015
    Date of Patent: April 11, 2017
    Assignee: Apple Inc.
    Inventors: Andrew M. Havlir, Terence M. Potter
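The eviction policy sketched in this abstract, which weighs entry status (dirty vs. clean) against retention priority, can be illustrated as a sort. The exact ranking below is an illustrative guess: clean entries go first because they need no write-back, and low-retention-priority entries go before high within each class.

```python
def eviction_order(entries):
    """Rank operand-cache entries for eviction: clean before dirty, and
    low retention priority before high within each class. Python sorts
    False before True, so the (dirty, retention) key does both at once."""
    return sorted(entries, key=lambda e: (e["dirty"], e["retention"]))

entries = [
    {"reg": "r4", "dirty": True,  "retention": 0},
    {"reg": "r1", "dirty": False, "retention": 1},
    {"reg": "r7", "dirty": False, "retention": 0},
]
victims = [e["reg"] for e in eviction_order(entries)]
print(victims)   # ['r7', 'r1', 'r4']: clean/low-priority evicted first
```

Pre-emptively flushing the low-priority dirty entries (also mentioned in the abstract) would gradually move them into the cheap "clean" class, spreading write-backs out in time.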