Patents by Inventor Paul Gilbert Meyer

Paul Gilbert Meyer has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20240111528
    Abstract: A technique to execute transpose and compute operations may include retrieving a set of machine instructions from an instruction buffer of a data processor. The instruction buffer has multiple entries, and each entry stores one machine instruction. A machine instruction from the set of machine instructions is executed to transpose a submatrix of an input tensor and perform computations on column elements of the submatrix. The machine instruction combines the transpose operation with computational operations into a single machine instruction.
    Type: Application
    Filed: September 21, 2022
    Publication date: April 4, 2024
    Inventors: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant
  • Publication number: 20240103813
    Abstract: An integrated circuit that combines transpose and compute operations may include a transpose circuit coupled to a set of compute channels. Each compute channel may include multiple arithmetic logic unit (ALU) circuits coupled in series. The transpose circuit is operable to receive an input tensor, transpose the input tensor, and output a transposed tensor to the set of compute channels. The set of compute channels is operable to generate outputs in parallel, with each of the outputs being generated from a corresponding vector of the transposed tensor.
    Type: Application
    Filed: September 21, 2022
    Publication date: March 28, 2024
    Inventors: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant
  • Patent number: 11941397
    Abstract: Techniques to take advantage of the single-instruction-multiple-data (SIMD) capabilities of a processor to process data blocks can include implementing an instruction to fuse the data blocks together. The fuse input instruction can have a first input vector, a second input vector, a select input, a first output vector, and a second output vector. The fuse input instruction selects a portion of the first input vector and a portion of the second input vector based on the select input, sign extends the selected portion of the first input vector and the selected portion of the second input vector, and shuffles data elements of the sign extended portion of the first input vector with data elements of the sign extended portion of the second input vector to generate the first and second output vectors.
    Type: Grant
    Filed: May 31, 2022
    Date of Patent: March 26, 2024
    Assignee: Amazon Technologies, Inc.
    Inventors: Xiaodan Tan, Paul Gilbert Meyer
  • Patent number: 11880682
    Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.
    Type: Grant
    Filed: June 30, 2021
    Date of Patent: January 23, 2024
    Assignee: Amazon Technologies, Inc.
    Inventors: Paul Gilbert Meyer, Thomas A Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
  • Patent number: 11803736
    Abstract: A systolic array can implement an architecture tailored to perform matrix multiplications on constrained fine-grained sparse weight matrices. Each processing element in the systolic array may include a weight register configured to store a weight value, and a multiplexor configured to select a feature map (FMAP) input element from multiple FMAP input data buses based on metadata associated with the weight value. Each processing element may also include a multiplier configured to multiply the selected feature map input element with the weight value to generate a multiplication result, and an adder configured to add the multiplication result to a partial sum input to generate a partial sum output.
    Type: Grant
    Filed: June 30, 2020
    Date of Patent: October 31, 2023
    Assignee: Amazon Technologies, Inc.
    Inventors: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja
  • Patent number: 11625453
    Abstract: To improve utilization of a systolic array, each row of the array is provided with a number of general purpose row input data buses. Each of the general purpose row input data buses can be operable to transfer either feature map (FMAP) input elements or weight values into the processing elements of the corresponding row of the array. By using such general purpose row input data buses, concurrent matrix multiplications as well as faster background weight loading can be achieved in the array.
    Type: Grant
    Filed: December 12, 2019
    Date of Patent: April 11, 2023
    Assignee: Amazon Technologies, Inc.
    Inventors: Paul Gilbert Meyer, Ron Diamant
  • Publication number: 20230100930
    Abstract: Techniques for compressing a neural network model by mixing compression ratios (sparsity patterns) are described. The weight tensor of a neural network model is divided into weight groups. The pruning cost of compressing the weight values according to a compression ratio is determined for each weight group, and a pruning cost distribution for the compression ratio is generated from the pruning costs of the weight groups. A cost threshold can then be selected from the pruning cost distribution, and weight groups having a pruning cost below the selected cost threshold are compressed according to the compression ratio. The remaining weight groups can be compressed using one or more less aggressive compression ratios. The cost threshold can be adjusted to tune the overall sparsity and accuracy of the compressed neural network.
    Type: Application
    Filed: September 30, 2021
    Publication date: March 30, 2023
    Inventors: Xiaodan Tan, Paul Gilbert Meyer, Gennady Pekhimenko, Randy Renfu Huang
  • Publication number: 20230004523
    Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reducer can receive a particular input and generate multiple reduced inputs from the input. The reduced inputs can include reduced input data elements and/or a reduced weights. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide multiple reduced inputs with second shorter bit-length to the array. The systolic array may perform multiply-accumulate operations on each unique combination of the multiple reduced input data elements and the reduced weights to generate multiple partial outputs. The systolic array may sum the partial outputs to generate the output.
    Type: Application
    Filed: June 30, 2021
    Publication date: January 5, 2023
    Inventors: Paul Gilbert Meyer, Thomas A. Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
  • Publication number: 20230004384
    Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.
    Type: Application
    Filed: June 30, 2021
    Publication date: January 5, 2023
    Inventors: Paul Gilbert Meyer, Thomas A Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
  • Patent number: 11500962
    Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.
    Type: Grant
    Filed: June 30, 2020
    Date of Patent: November 15, 2022
    Assignee: Amazon Technologies, Inc.
    Inventors: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja
  • Patent number: 11435941
    Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.
    Type: Grant
    Filed: June 24, 2020
    Date of Patent: September 6, 2022
    Assignee: Amazon Technologies, Inc.
    Inventors: Kun Xu, Paul Gilbert Meyer, Ron Diamant
  • Patent number: 11314648
    Abstract: Data processing apparatus comprises a data access requesting node; data access circuitry to receive a data access request from the data access requesting node and to route the data access request for fulfilment by one or more data storage nodes selected from a group of two or more data storage nodes; and indication circuitry to provide a source indication to the data access requesting node, to indicate an attribute of the one or more data storage nodes which fulfilled the data access request; the data access requesting node being configured to vary its operation in response to the source indication.
    Type: Grant
    Filed: February 8, 2017
    Date of Patent: April 26, 2022
    Assignee: Arm Limited
    Inventors: Michael Filippo, Jamshed Jalal, Kias Magnus Bruce, Alex James Waugh, Geoffray Lacourba, Paul Gilbert Meyer, Bruce James Mathewson, Phanindra Kumar Mannava
  • Patent number: 11256623
    Abstract: Apparatus and a corresponding method of operating a hub device, and a target device, in a coherent interconnect system are presented. A cache pre-population request of a set of coherency protocol transactions in the system is received from a requesting master device specifying at least one data item and the hub device responds by cause a cache pre-population trigger of the set of coherency protocol transactions specifying the at least one data item to be transmitted to a target device. This trigger can cause the target device to request that the specified at least one data item is retrieved and brought into cache. Since the target device can therefore decide whether to respond to the trigger or not, it does not receive cached data unsolicited, simplifying its configuration, whilst still allowing some data to be pre-cached.
    Type: Grant
    Filed: February 8, 2017
    Date of Patent: February 22, 2022
    Assignee: ARM LIMITED
    Inventors: Phanindra Kumar Mannava, Bruce James Mathewson, Jamshed Jalal, Klas Magnus Bruce, Michael Filippo, Paul Gilbert Meyer, Alex James Waugh, Geoffray Matthieu Lacourba
  • Patent number: 10783080
    Abstract: An interconnect system and method of operating the system are disclosed. A master device has access to a cache and a slave device has an associated data storage device for long-term storage of data items. The master device can initiate a cache maintenance operation in the interconnect system with respect to a data item temporarily stored in the cache causing action to be taken by the slave device with respect to storage of the data item in the data storage device. For long latency operations the master device can issue a separated cache maintenance request specifying the data item and the slave device. In response an intermediate device signals an acknowledgment response indicating that it has taken on responsibility for completion of the cache maintenance operation and issues the separated cache maintenance request to the slave device.
    Type: Grant
    Filed: October 29, 2018
    Date of Patent: September 22, 2020
    Assignee: ARM LIMITED
    Inventors: Phanindra Kumar Mannava, Bruce James Mathewson, Jamshed Jalal, Paul Gilbert Meyer
  • Patent number: 10761987
    Abstract: An apparatus and method are provided for processing ownership upgrade requests in relation to cached data. The apparatus has a plurality of processing units, at least some of which have associated cache storage. A coherent interconnect couples the plurality of master units with memory, the coherent interconnect having a snoop unit used to implement a cache coherency protocol when a request received by the coherent interconnect identifies a cacheable memory address within the memory. Contention management circuitry is provided to control contended access to a memory address by two or more processing units within the plurality of processing units.
    Type: Grant
    Filed: November 28, 2018
    Date of Patent: September 1, 2020
    Assignee: Arm Limited
    Inventors: Jamshed Jalal, Mark David Werkheiser, Michael Filippo, Klas Magnus Bruce, Paul Gilbert Meyer
  • Patent number: 10713187
    Abstract: A memory controller comprises memory access circuitry configured to initiate a data access of data stored in a memory in response to a data access hint message received from another node in data communication with the memory controller; to access data stored in the memory in response to a data access request received from another node in data communication with the memory controller and to provide the accessed data as a data access response to the data access request.
    Type: Grant
    Filed: July 25, 2019
    Date of Patent: July 14, 2020
    Assignee: ARM Limited
    Inventors: Michael Filippo, Jamshed Jalal, Klas Magnus Bruce, Paul Gilbert Meyer, David Joseph Hawkins, Phanindra Kumar Mannava, Joseph Michael Pusdesris
  • Patent number: 10698825
    Abstract: In a system-on-chip there is a local interconnect to connect local devices on the chip to one another, a gateway to connect the chip to a remote chip of a plurality of chips in a cache-coherent multi-chip system via an inter-chip interconnect, and a cache-coherent device. The cache-coherent device has a cache-coherency look-up table having entries for shared cache data lines. When a data access request is received via the inter-chip interconnect and the local interconnect a system-unique identifier for a request source of the data access request is generated in dependence on an inter-chip request source identifier used on the inter-chip interconnect and an identifier indicative of the remote chip. The bit-set used to express the system-unique identifier is larger than the bit-set used to express the inter-chip request source identifier.
    Type: Grant
    Filed: March 12, 2019
    Date of Patent: June 30, 2020
    Assignee: Arm Limited
    Inventors: Gurunath Ramagiri, Ashok Kumar Tummala, Mark David Werkheiser, Jamshed Jalal, Premkishore Shivakumar, Paul Gilbert Meyer
  • Publication number: 20200167284
    Abstract: An apparatus and method are provided for processing ownership upgrade requests in relation to cached data. The apparatus has a plurality of processing units, at least some of which have associated cache storage. A coherent interconnect couples the plurality of master units with memory, the coherent interconnect having a snoop unit used to implement a cache coherency protocol when a request received by the coherent interconnect identifies a cacheable memory address within the memory. Contention management circuitry is provided to control contended access to a memory address by two or more processing units within the plurality of processing units.
    Type: Application
    Filed: November 28, 2018
    Publication date: May 28, 2020
    Inventors: Jamshed JALAL, Mark David WERKHEISER, Michael FILIPPO, Klas Magnus BRUCE, Paul Gilbert MEYER
  • Publication number: 20200133865
    Abstract: An interconnect system and method of operating the system are disclosed. A master device has access to a cache and a slave device has an associated data storage device for long-term storage of data items. The master device can initiate a cache maintenance operation in the interconnect system with respect to a data item temporarily stored in the cache causing action to be taken by the slave device with respect to storage of the data item in the data storage device. For long latency operations the master device can issue a separated cache maintenance request specifying the data item and the slave device. In response an intermediate device signals an acknowledgment response indicating that it has taken on responsibility for completion of the cache maintenance operation and issues the separated cache maintenance request to the slave device.
    Type: Application
    Filed: October 29, 2018
    Publication date: April 30, 2020
    Inventors: Phanindra Kumar MANNAVA, Bruce James MATHEWSON, Jamshed JALAL, Paul Gilbert MEYER
  • Patent number: 10591977
    Abstract: A method, system, and device provide for selective control in a distributed cache system of the power state of a number of receiver partitions arranged in one or more partition groups. A power control element coupled to one or more of the receiver partitions and a coherent interconnect selectively control transition from a current power state to a new power state by each receiver partition of one or more partition groups of the plurality of partition groups.
    Type: Grant
    Filed: December 10, 2015
    Date of Patent: March 17, 2020
    Assignee: Arm Limited
    Inventors: Mark David Werkheiser, Dominic William Brown, Ashley John Crawford, Paul Gilbert Meyer