Patents by Inventor Paul Gilbert Meyer

Paul Gilbert Meyer has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Configuration of a deep vector engine using an opcode table, control table, and datapath table

Patent number: 12271732

Abstract: A technique to program a compute channel having multiple computational circuit blocks coupled in series in a pipeline can include receiving a machine instruction for the compute channel. The machine instruction is decoded to obtain an opcode, and the opcode can be used as an index to access an opcode entry in an opcode table. The opcode entry contains a pointer to a microoperation, and the pointer can be used to access a microoperation represented by a control entry in a control table and a datapath configuration entry in a datapath table. The microoperation can then be issued to the compute channel by configuring the compute channel with the control entry and the datapath configuration entry.

Type: Grant

Filed: September 30, 2022

Date of Patent: April 8, 2025

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Ron Diamant, Sundeep Amirineni
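
The lookup chain described in this abstract (opcode, then opcode table, then control and datapath tables) can be sketched in a few lines of Python. This is an illustrative model only: the field names, the table contents, and the assumption that the opcode occupies the low byte of the instruction word are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlEntry:
    # Hypothetical control fields; the abstract does not enumerate them.
    num_inputs: int
    writes_output: bool


@dataclass(frozen=True)
class DatapathEntry:
    # One hypothetical configuration name per computational block in the pipeline.
    block_configs: tuple


# Opcode -> pointer (here, an index) to a microoperation.
OPCODE_TABLE = {0x01: 0, 0x02: 1}
# A microoperation is the pair of entries found at the same pointer.
CONTROL_TABLE = [ControlEntry(2, True), ControlEntry(1, True)]
DATAPATH_TABLE = [DatapathEntry(("add", "bypass")), DatapathEntry(("mul", "relu"))]


def decode(instruction: int) -> int:
    """Assumption: the opcode is the low byte of the instruction word."""
    return instruction & 0xFF


def issue(instruction: int):
    """Return the control and datapath entries used to configure the channel."""
    opcode = decode(instruction)
    uop_pointer = OPCODE_TABLE[opcode]
    return CONTROL_TABLE[uop_pointer], DATAPATH_TABLE[uop_pointer]


control, datapath = issue(0xAB02)
```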

Throughput increase for compute engine

Patent number: 12260214

Abstract: A compute channel can have multiple computational circuit blocks coupled in series to form a pipeline. The compute channel can perform a computation on an input tensor to generate an output tensor based on an instruction. When the computation does not require all of the computational circuit blocks, the throughput of the compute channel can be increased by splitting the data elements of the input tensor into multiple input data streams. The multiple input data streams are provided to respective subsets of one or more computational circuit blocks in the pipeline using bypass circuitry of the computational circuit blocks, and the computation can be performed on multiple input data streams in the respective subsets of one or more computational circuit blocks to generate multiple output data streams corresponding to the output tensor.

Type: Grant

Filed: September 30, 2022

Date of Patent: March 25, 2025

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Ron Diamant, Sundeep Amirineni, Sunil Kumar Bathula
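
As a rough software analogy for the stream splitting described in this abstract, the sketch below assumes a computation that needs only one block of a four-block pipeline. The block count, the round-robin split, and the function name are assumptions made for illustration; the abstract does not specify how the data elements are divided.

```python
def run_split_pipeline(data, op, num_blocks=4):
    """Model a pipeline whose computation needs only one block per element.

    The input is split into num_blocks interleaved streams. Stream i is
    computed by block i and bypasses every other block, so the blocks work
    on different elements at the same time.
    """
    # Split the input tensor's elements into one stream per block.
    streams = [data[i::num_blocks] for i in range(num_blocks)]
    # Each block applies the operation to its own stream only.
    outputs = [[op(x) for x in stream] for stream in streams]
    # Re-interleave the output streams into the output tensor order.
    result = [None] * len(data)
    for i, out in enumerate(outputs):
        result[i::num_blocks] = out
    return result


doubled = run_split_pipeline(list(range(10)), lambda x: x * 2)
```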

Configurable vector compute engine

Patent number: 12242853

Abstract: A compute channel having a compute pipeline of compute stages can be configured using a configuration pipeline with a control table and a datapath table. The control table stores control entries corresponding to respective microoperations, and each control entry includes control information for the compute channel. A datapath table stores datapath configuration entries corresponding to respective microoperations, and each datapath configuration entry has a datapath configuration that includes computational circuit block configurations to configure respective computational circuit blocks in the compute pipeline of the compute channel. Control logic can issue a microoperation to the compute channel by configuring the compute channel according to the control information of the microoperation obtained from the control table, and by inputting the datapath configuration of the microoperation obtained from the datapath table into the configuration pipeline of the compute channel.

Type: Grant

Filed: September 30, 2022

Date of Patent: March 4, 2025

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Ron Diamant, Sundeep Amirineni

Increasing performance of computational array accelerators

Patent number: 12182691

Abstract: To improve performance of a computational array, the architecture of the array can be modified to allow the processing engines of a column to operate in parallel and the clock frequency of the array to be increased. The processing engines of each column of the array can be grouped into a series of row groups. The processing engines of each row group can be loaded with input values, and computations on the input values can be carried out in parallel to generate the column output. One or more flip-flop stages can be inserted into the computational logic of each of the processing engines. The computational logic can then be distributed across the flip-flop stages to reduce the propagation delay between flip-flop stages of the processing engine, hence allowing the clock frequency of the array to be increased.

Type: Grant

Filed: March 17, 2021

Date of Patent: December 31, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Sundeep Amirineni, Akshay Balasubramanian, Joshua Wayne Bowman, Ron Diamant, Paul Gilbert Meyer, Thomas Elmer

Fine-grained sparsity computations in systolic array

Patent number: 12182695

Abstract: A systolic array can implement an architecture tailored to perform matrix multiplications on sparse matrices. Each processing element in the systolic array may include a register configured to store a value, and a multiplexor configured to select an input element from multiple input data buses based on metadata associated with the value. Each processing element may also include a multiplier configured to multiply the selected input element with the value to generate a multiplication result, and an adder configured to add the multiplication result to a partial sum input to generate a partial sum output.

Type: Grant

Filed: September 25, 2023

Date of Patent: December 31, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja
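
The processing element described in this abstract (stored value, metadata-driven multiplexor, multiplier, adder) can be modeled as below. The grouping of inputs into sets of four buses and the tuple layout of each stored entry are assumptions chosen for the example, not details from the patent.

```python
def processing_element(value, metadata, input_buses, partial_sum_in):
    """One processing element: mux, multiply, accumulate."""
    selected = input_buses[metadata]          # multiplexor driven by metadata
    return partial_sum_in + value * selected  # multiplier followed by adder


def sparse_dot(elements, inputs, group_size=4):
    """Chain processing elements so partial sums flow from one to the next.

    elements: list of (group_index, value, metadata) for the stored nonzero
    values; metadata is the position of the value inside its group.
    """
    partial_sum = 0
    for group, value, metadata in elements:
        buses = inputs[group * group_size:(group + 1) * group_size]
        partial_sum = processing_element(value, metadata, buses, partial_sum)
    return partial_sum


# Dense vector [0, 3, 0, 0, 5, 0, 0, 2] stored as nonzeros plus metadata.
stored = [(0, 3, 1), (1, 5, 0), (1, 2, 3)]
inputs = [1, 2, 3, 4, 5, 6, 7, 8]
result = sparse_dot(stored, inputs)
```

Only the three stored values are multiplied, yet the result matches the dot product with the full eight-element vector.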

Matrix transpose hardware acceleration

Patent number: 12141468

Abstract: In one example, an apparatus comprises: a memory array having an array of memory elements arranged in rows and columns, each memory element being configured to store a data element; and a memory access circuit configured to: perform a row write operation to store a first group of data elements at a first row of the array of memory elements; perform a column read operation at a first column of the array of memory elements to obtain a second group of data elements; and perform a column write operation to store a third group of data elements at the first column of the array of memory elements to replace the second group of data elements.

Type: Grant

Filed: July 28, 2022

Date of Patent: November 12, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Kun Xu, Paul Gilbert Meyer, Ron Diamant
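
A small software model can show why mixing row writes, column reads, and column writes yields transposes. The class, its method names, and the `read_row` drain step are hypothetical; the abstract only names the row write, column read, and column write operations.

```python
class TransposeBuffer:
    """Toy model of a square memory array with row and column access."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def write_row(self, r, values):
        self.cells[r] = list(values)

    def read_column(self, c):
        return [self.cells[r][c] for r in range(self.n)]

    def write_column(self, c, values):
        for r, v in enumerate(values):
            self.cells[r][c] = v

    def read_row(self, r):
        # Hypothetical extra access mode, used here only to drain the buffer.
        return list(self.cells[r])


def transpose_two(first, second):
    """Transpose two matrices back to back without emptying the buffer."""
    n = len(first)
    buf = TransposeBuffer(n)
    for r, row in enumerate(first):
        buf.write_row(r, row)
    out_first = []
    for c in range(n):
        out_first.append(buf.read_column(c))  # row c of the first transpose
        buf.write_column(c, second[c])        # reuse the column just read
    out_second = [buf.read_row(r) for r in range(n)]
    return out_first, out_second


t1, t2 = transpose_two([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```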

Emulating fine-grained sparsity in a systolic array

Patent number: 12130885

Abstract: To take advantage of the architecture of a systolic array tailored to perform sparse matrix multiplications, a weight matrix can be converted into a set of constrained fine-grained sparse weight matrices. The conversion process may include receiving a request to perform a matrix multiplication operation with a weight matrix, and determining that the weight matrix satisfies a sparsity condition to convert the weight matrix into a set of constrained fine-grained sparse weight matrices. The weight matrix can then be converted into a set of constrained fine-grained sparse weight matrices. Computer instructions can then be generated for an integrated circuit device to perform the requested matrix multiplication operation as a set of sparse matrix multiplication operations using the set of constrained fine-grained sparse weight matrices.

Type: Grant

Filed: November 3, 2022

Date of Patent: October 29, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja
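
One way to picture the conversion step is to split a weight vector into several vectors that each obey a "at most N nonzeros per group of M" constraint and that sum back to the original. The sketch below assumes a 2-in-4 constraint and a simple in-order assignment of nonzeros; both are illustrative choices, and the patent's sparsity condition check is not modeled.

```python
def split_constrained(weights, group=4, max_nonzeros=2):
    """Split weights into vectors with at most max_nonzeros per group.

    The returned vectors add up elementwise to the original weights, so a
    multiplication with the original can be carried out as a sum of
    multiplications with the constrained vectors.
    """
    parts = []
    for start in range(0, len(weights), group):
        end = min(start + group, len(weights))
        nonzero = [i for i in range(start, end) if weights[i] != 0]
        for k in range(0, len(nonzero), max_nonzeros):
            part_index = k // max_nonzeros
            # Add another constrained vector only when one is needed.
            while len(parts) <= part_index:
                parts.append([0] * len(weights))
            for i in nonzero[k:k + max_nonzeros]:
                parts[part_index][i] = weights[i]
    return parts


weights = [1, 2, 3, 0, 0, 5, 0, 0]
parts = split_constrained(weights)
```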

Throughput increase for tensor operations

Patent number: 12099840

Abstract: A technique for performing a tensor operation includes inputting concatenated data words of a first input tensor and concatenated data words of a second input tensor into a compute channel having a plurality of compute stages coupled in series. The concatenated data words of the first input tensor and the second input tensor represented in a first datatype can be converted into data elements represented in a second datatype using a first subset of the compute stages. A binary operation can be performed on each data element represented in the second datatype from the first input tensor with a corresponding data element represented in the second datatype from the second input tensor to generate output data elements of an output tensor represented in the second datatype using a second subset of the compute stages. The output data elements of the output tensor can then be outputted from the compute channel.

Type: Grant

Filed: March 16, 2023

Date of Patent: September 24, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Xiaodan Tan, Paul Gilbert Meyer, Ron Diamant
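
The two-phase flow in this abstract (convert concatenated data words, then apply a binary operation) is sketched below. The abstract does not name the datatypes, so the example assumes two 16-bit bfloat16 values packed into each 32-bit word and converted to ordinary floats before an addition; the helper names are hypothetical.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen 16 bfloat16 bits to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def float_to_bf16_bits(x):
    """Narrow a float to bfloat16 bits by truncation (assumed rounding mode)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def pack_word(low, high):
    """Concatenate two bfloat16 values into one 32-bit data word."""
    return (float_to_bf16_bits(high) << 16) | float_to_bf16_bits(low)


def unpack_word(word):
    """First group of stages: convert one word into two wider elements."""
    return [bf16_bits_to_float(word & 0xFFFF), bf16_bits_to_float(word >> 16)]


def add_tensors(words_a, words_b):
    """Second group of stages: elementwise binary operation on the elements."""
    out = []
    for wa, wb in zip(words_a, words_b):
        out.extend(a + b for a, b in zip(unpack_word(wa), unpack_word(wb)))
    return out


total = add_tensors([pack_word(1.5, 2.0)], [pack_word(0.25, -1.0)])
```

Each word carries two elements per input, so one pass through the channel produces two output elements.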

Resizable scratchpad memory

Patent number: 12045475

Abstract: Techniques for implementing a dynamically resizable memory region for alternative use in a memory are described. The techniques may include using two concurrent address maps corresponding to two address ranges for a memory represented as an array of memory blocks. The first address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each row. The second address range can be mapped to the memory with starting addresses of the memory blocks incrementing sequentially along each column. When an access request is received having a target address belonging to the first address range, the target address is provided as the memory address to access the memory. When an access request is received having a target address belonging to the second address range, the target address is translated by address translation logic into a memory address to access the memory.

Type: Grant

Filed: December 3, 2021

Date of Patent: July 23, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Patricio Kaplan, Sundeep Amirineni, Laura Sharpless, Ron Diamant, Akshay Balasubramanian
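
The two address maps can be illustrated with a small translation function. The array dimensions, block size, and base addresses below are invented for the example; only the row-wise versus column-wise ordering of block starting addresses comes from the abstract.

```python
ROWS, COLS, BLOCK = 4, 8, 1024      # hypothetical array shape and block size
SIZE = ROWS * COLS * BLOCK
RANGE1_BASE = 0                     # hypothetical base of the first range
RANGE2_BASE = SIZE                  # hypothetical base of the second range


def to_memory_address(target):
    """Map a target address from either range to a physical memory address."""
    if RANGE1_BASE <= target < RANGE1_BASE + SIZE:
        # First range: blocks are numbered along each row, no translation.
        return target - RANGE1_BASE
    if RANGE2_BASE <= target < RANGE2_BASE + SIZE:
        # Second range: blocks are numbered along each column.
        block, within = divmod(target - RANGE2_BASE, BLOCK)
        col, row = divmod(block, ROWS)
        return (row * COLS + col) * BLOCK + within
    raise ValueError("address outside both ranges")


direct = to_memory_address(5)
translated = to_memory_address(RANGE2_BASE + BLOCK + 3)
```

With this layout, a contiguous region at the start of the second range occupies whole columns of blocks, which leaves the remaining columns contiguous in each row for the first range.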

Programmable vector engine for efficient beam search

Patent number: 12039330

Abstract: To perform a beam search operation on an input tensor using a data processor with native hardware support, the data processor can be programmed with a set of instructions. The set of instructions can include a first machine instruction that operates on the input tensor to obtain N largest values in the input tensor, a second machine instruction that operates on the input tensor to obtain indices corresponding to the N largest values in the input tensor, and a third machine instruction that operates on the input tensor to replace the N largest values in the input tensor with a minimum value.

Type: Grant

Filed: September 14, 2021

Date of Patent: July 16, 2024

Assignee: Amazon Technologies, Inc.

Inventor: Paul Gilbert Meyer
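
The three operations named in this abstract can be mimicked in software as follows. The function names and the use of negative infinity as the minimum value are assumptions; the abstract describes machine instructions, not a Python API.

```python
import heapq


def max_n(tensor, n):
    """First operation: the N largest values, largest first."""
    return heapq.nlargest(n, tensor)


def max_n_indices(tensor, n):
    """Second operation: indices of the N largest values."""
    return heapq.nlargest(n, range(len(tensor)), key=tensor.__getitem__)


def replace_max_n(tensor, n, minimum=float("-inf")):
    """Third operation: overwrite the N largest values with a minimum."""
    top = set(max_n_indices(tensor, n))
    return [minimum if i in top else v for i, v in enumerate(tensor)]


scores = [0.1, 0.7, 0.3, 0.9, 0.5]
best = max_n(scores, 2)
best_indices = max_n_indices(scores, 2)
remaining = replace_max_n(scores, 2)
```

Running the same three steps again on `remaining` would return the next two candidates, which is how a wider beam could be assembled from a fixed N.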

Programmable compute engine having transpose operations

Patent number: 12008368

Abstract: A technique to execute transpose and compute operations may include retrieving a set of machine instructions from an instruction buffer of a data processor. The instruction buffer has multiple entries, and each entry stores one machine instruction. A machine instruction from the set of machine instructions is executed to transpose a submatrix of an input tensor and perform computations on column elements of the submatrix. The machine instruction combines the transpose operation with computational operations into a single machine instruction.

Type: Grant

Filed: September 21, 2022

Date of Patent: June 11, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant

PROGRAMMABLE COMPUTE ENGINE HAVING TRANSPOSE OPERATIONS

Publication number: 20240111528

Abstract: A technique to execute transpose and compute operations may include retrieving a set of machine instructions from an instruction buffer of a data processor. The instruction buffer has multiple entries, and each entry stores one machine instruction. A machine instruction from the set of machine instructions is executed to transpose a submatrix of an input tensor and perform computations on column elements of the submatrix. The machine instruction combines the transpose operation with computational operations into a single machine instruction.

Type: Application

Filed: September 21, 2022

Publication date: April 4, 2024

Inventors: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant

COMPUTE ENGINE WITH TRANSPOSE CIRCUITRY

Publication number: 20240103813

Abstract: An integrated circuit that combines transpose and compute operations may include a transpose circuit coupled to a set of compute channels. Each compute channel may include multiple arithmetic logic unit (ALU) circuits coupled in series. The transpose circuit is operable to receive an input tensor, transpose the input tensor, and output a transposed tensor to the set of compute channels. The set of compute channels is operable to generate outputs in parallel, with each of the outputs being generated from a corresponding vector of the transposed tensor.

Type: Application

Filed: September 21, 2022

Publication date: March 28, 2024

Inventors: Xiaodan Tan, Paul Gilbert Meyer, Sheng Xu, Ron Diamant
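
In software terms, the arrangement in this abstract resembles transposing a tensor and then handing each resulting vector to its own compute channel. The sketch below uses a plain Python function as a stand-in for a channel; the names are hypothetical and the parallelism is only implied by the independent per-vector calls.

```python
def transpose(tensor):
    """Transpose circuit: rows of the input become columns of the output."""
    return [list(column) for column in zip(*tensor)]


def run_compute_channels(tensor, channel_op):
    """Feed each vector of the transposed tensor to its own channel."""
    return [channel_op(vector) for vector in transpose(tensor)]


column_sums = run_compute_channels([[1, 2, 3], [4, 5, 6]], sum)
```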

Machine instructions for decoding acceleration including fuse input instructions to fuse multiple JPEG data blocks together to take advantage of a full SIMD width of a processor

Patent number: 11941397

Abstract: Techniques to take advantage of the single-instruction-multiple-data (SIMD) capabilities of a processor to process data blocks can include implementing an instruction to fuse the data blocks together. The fuse input instruction can have a first input vector, a second input vector, a select input, a first output vector, and a second output vector. The fuse input instruction selects a portion of the first input vector and a portion of the second input vector based on the select input, sign extends the selected portion of the first input vector and the selected portion of the second input vector, and shuffles data elements of the sign extended portion of the first input vector with data elements of the sign extended portion of the second input vector to generate the first and second output vectors.

Type: Grant

Filed: May 31, 2022

Date of Patent: March 26, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Xiaodan Tan, Paul Gilbert Meyer

Systolic array with efficient input reduction and extended array performance

Patent number: 11880682

Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.

Type: Grant

Filed: June 30, 2021

Date of Patent: January 23, 2024

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Thomas A Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
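
Dropping trailing bits with optional rounding, as this abstract describes for the reducer, can be illustrated on a 32-bit float. The choice of a 10-bit kept mantissa and the round-half-up rule are assumptions for the example; the abstract does not state the bit-lengths or the rounding mode.

```python
import struct


def reduce_float32(x, kept_mantissa_bits=10, round_nearest=True):
    """Zero the trailing mantissa bits of a float32, optionally rounding first.

    Assumes kept_mantissa_bits is smaller than 23, the float32 mantissa width.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    dropped = 23 - kept_mantissa_bits
    if round_nearest:
        # Add half of the last kept unit so the truncation below rounds.
        bits += 1 << (dropped - 1)
    # Clear the dropped bits and keep the result within 32 bits.
    bits &= ~((1 << dropped) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 1.0 + 2.0 ** -11
rounded = reduce_float32(value)
truncated = reduce_float32(value, round_nearest=False)
```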

Fine-grained sparsity computations in systolic array

Patent number: 11803736

Abstract: A systolic array can implement an architecture tailored to perform matrix multiplications on constrained fine-grained sparse weight matrices. Each processing element in the systolic array may include a weight register configured to store a weight value, and a multiplexor configured to select a feature map (FMAP) input element from multiple FMAP input data buses based on metadata associated with the weight value. Each processing element may also include a multiplier configured to multiply the selected feature map input element with the weight value to generate a multiplication result, and an adder configured to add the multiplication result to a partial sum input to generate a partial sum output.

Type: Grant

Filed: June 30, 2020

Date of Patent: October 31, 2023

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Thiam Khean Hah, Randy Renfu Huang, Ron Diamant, Vignesh Vivekraja

Using shared data bus to support systolic array tiling

Patent number: 11625453

Abstract: To improve utilization of a systolic array, each row of the array is provided with a number of general purpose row input data buses. Each of the general purpose row input data buses can be operable to transfer either feature map (FMAP) input elements or weight values into the processing elements of the corresponding row of the array. By using such general purpose row input data buses, concurrent matrix multiplications as well as faster background weight loading can be achieved in the array.

Type: Grant

Filed: December 12, 2019

Date of Patent: April 11, 2023

Assignee: Amazon Technologies, Inc.

Inventors: Paul Gilbert Meyer, Ron Diamant

MIXING SPARSITY COMPRESSION

Publication number: 20230100930

Abstract: Techniques for compressing a neural network model by mixing compression ratios (sparsity patterns) are described. The weight tensor of a neural network model is divided into weight groups. The pruning cost of compressing the weight values according to a compression ratio is determined for each weight group, and a pruning cost distribution for the compression ratio is generated from the pruning costs of the weight groups. A cost threshold can then be selected from the pruning cost distribution, and weight groups having a pruning cost below the selected cost threshold are compressed according to the compression ratio. The remaining weight groups can be compressed using one or more less aggressive compression ratios. The cost threshold can be adjusted to tune the overall sparsity and accuracy of the compressed neural network.

Type: Application

Filed: September 30, 2021

Publication date: March 30, 2023

Inventors: Xiaodan Tan, Paul Gilbert Meyer, Gennady Pekhimenko, Randy Renfu Huang
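
The threshold-based mixing in this abstract can be sketched with a toy cost function. Everything concrete here is an assumption: the cost is taken to be the total magnitude removed, the threshold is picked as a quantile of the sorted costs, groups at or below the threshold get the aggressive ratio, and the ratios themselves (keep 1 or keep 2 per group) are arbitrary.

```python
def prune_group(group, keep):
    """Keep the `keep` largest-magnitude weights and zero the rest."""
    order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(group)]


def pruning_cost(group, keep):
    """Hypothetical cost: total magnitude removed by pruning."""
    return sum(abs(w - p) for w, p in zip(group, prune_group(group, keep)))


def mixed_compress(groups, aggressive_keep=1, mild_keep=2, quantile=0.5):
    """Prune cheap groups aggressively and the remaining groups mildly."""
    # Pruning cost distribution for the aggressive ratio.
    costs = sorted(pruning_cost(g, aggressive_keep) for g in groups)
    # Pick the threshold from the distribution; quantile tunes the mix.
    threshold = costs[int(quantile * (len(costs) - 1))]
    result = []
    for g in groups:
        cheap = pruning_cost(g, aggressive_keep) <= threshold
        result.append(prune_group(g, aggressive_keep if cheap else mild_keep))
    return result


groups = [[0.9, 0.01, 0.02, 0.0], [0.5, 0.4, 0.3, 0.2]]
compressed = mixed_compress(groups)
```

Raising `quantile` moves more groups to the aggressive ratio, which increases overall sparsity at the expense of removing more weight magnitude.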

SYSTOLIC ARRAY WITH INPUT REDUCTION TO MULTIPLE REDUCED INPUTS

Publication number: 20230004523

Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reducer can receive a particular input and generate multiple reduced inputs from the input. The reduced inputs can include reduced input data elements and/or reduced weights. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide multiple reduced inputs with the second shorter bit-length to the array. The systolic array may perform multiply-accumulate operations on each unique combination of the multiple reduced input data elements and the reduced weights to generate multiple partial outputs. The systolic array may sum the partial outputs to generate the output.

Type: Application

Filed: June 30, 2021

Publication date: January 5, 2023

Inventors: Paul Gilbert Meyer, Thomas A. Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
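
The idea of multiplying every combination of reduced pieces and summing the partial outputs can be checked with an integer analogue. The abstract does not say the inputs are integers or that the split is into high and low bit fields; both are assumptions made so the arithmetic is exact and easy to verify.

```python
def reduce_input(x, low_bits=8):
    """Split x into two pieces that add back up to x.

    The low piece holds the trailing low_bits bits; the high piece holds
    the rest, so each piece has fewer significant bits than x.
    """
    low = x & ((1 << low_bits) - 1)
    high = x - low
    return [high, low]


def multiply_via_reduced(data_element, weight):
    """Multiply each combination of reduced pieces and sum the partials."""
    partial_outputs = [
        d * w
        for d in reduce_input(data_element)
        for w in reduce_input(weight)
    ]
    return sum(partial_outputs)


product = multiply_via_reduced(0x1234, 0x0F0F)
```

The four partial products correspond to the four combinations of high and low pieces, and their sum equals the full-width product by distributivity.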

SYSTOLIC ARRAY WITH EFFICIENT INPUT REDUCTION AND EXTENDED ARRAY PERFORMANCE

Publication number: 20230004384

Abstract: Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.

Type: Application

Filed: June 30, 2021

Publication date: January 5, 2023

Inventors: Paul Gilbert Meyer, Thomas A Volpe, Ron Diamant, Joshua Wayne Bowman, Nishith Desai, Thomas Elmer
