Patents by Inventor Elliott Delaye

Elliott Delaye has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

HETEROGENEOUS INFERENCE ACCELERATION

Publication number: 20260161999

Abstract: The embodiments herein describe techniques for performing ML compilation using a unified interface that combines different processors in a heterogeneous processing system which allows for intelligent partitioning of a ML model. Unlike prior solutions which rely on user preferences to assign the ML model, the unified interface can violate or break the user preferences when partitioning the ML model. The unified interface can receive information from the processors (e.g., a NPU, CPU, GPU, etc.) and determine the capabilities, current workload, power metrics, subgraphs of the ML model they can execute, and the like. With this information, the unified interface can intelligently choose when to violate or break the user-entered priority based instructions.

Type: Application

Filed: December 5, 2024

Publication date: June 11, 2026

Inventors: Gabor SINES, Elliott DELAYE, Vinod KATHAIL
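
The partitioning idea in the abstract above can be illustrated with a minimal sketch. It assumes each processor reports a set of supported operator types and a load figure, and that the user preference is an ordered list of device names. The `Processor` class, the `assign_subgraphs` function, and the `max_load` threshold are hypothetical names chosen for illustration and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    supported_ops: frozenset  # operator types the device reports it can run
    load: float               # reported utilization, 0.0 (idle) to 1.0 (saturated)


def assign_subgraphs(subgraphs, processors, user_priority, max_load=0.9):
    """Map each subgraph name to a processor name.

    The user's priority order is followed unless a device cannot run the
    subgraph or reports a load above max_load (an assumed threshold); in
    that case the preference is overridden and the next device is tried.
    """
    by_name = {p.name: p for p in processors}
    placement = {}
    for sg_name, ops in subgraphs.items():
        capable = [by_name[n] for n in user_priority
                   if n in by_name and ops <= by_name[n].supported_ops]
        if not capable:
            raise ValueError(f"no processor can execute {sg_name}")
        # First capable device under the threshold wins; if every capable
        # device is busy, fall back to the least loaded one.
        not_busy = [p for p in capable if p.load <= max_load]
        chosen = not_busy[0] if not_busy else min(capable, key=lambda p: p.load)
        placement[sg_name] = chosen.name
    return placement


processors = [
    Processor("NPU", frozenset({"conv", "matmul"}), 0.95),
    Processor("GPU", frozenset({"conv", "matmul", "softmax"}), 0.30),
    Processor("CPU", frozenset({"conv", "matmul", "softmax", "topk"}), 0.10),
]
subgraphs = {"backbone": {"conv", "matmul"}, "head": {"softmax"}, "post": {"topk"}}
placement = assign_subgraphs(subgraphs, processors, ["NPU", "GPU", "CPU"])
print(placement)  # {'backbone': 'GPU', 'head': 'GPU', 'post': 'CPU'}
```

Here the user prefers the NPU, but the sketch overrides that preference for the backbone because the NPU reports a load above the threshold.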

HARDWARE-OPTIMIZED MATRIX MULTIPLICATION OPERATIONS FOR LARGE LANGUAGE MODELS

Publication number: 20260050475

Abstract: A processing system configured to implement a large language model (LLM) includes an accelerator unit (AU) having hardware configured to perform matrix multiplication operations for the LLM using sets of predetermined matrix dimensions. Further, to help optimize the LLM for the processing system, the processing system includes a processor that modifies one or more matrix multiplication operations of the LLM based on the sets of predetermined matrix dimensions supported by the hardware of the AU. The processor then recompiles the LLM using the modified multiplication operations and implements the recompiled LLM.

Type: Application

Filed: August 14, 2024

Publication date: February 19, 2026

Inventors: Rajeev Patwari, Abid Karumannil, Ashish Sirasao, Elliott Delaye, Jorn Tuyls, Tejus Siddagangaiah
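
One way to picture "modifying a matrix multiplication to match supported dimensions" is zero-padding the operands up to the nearest supported shape and slicing the result back. The abstract above does not say that padding is the modification used, so this is an assumption, and the shape list `SUPPORTED_SHAPES` is invented for illustration.

```python
# Hypothetical (M, K, N) shapes the accelerator hardware is assumed to support.
SUPPORTED_SHAPES = [(4, 4, 4), (8, 8, 8), (8, 16, 8)]


def pick_shape(m, k, n, supported=SUPPORTED_SHAPES):
    """Return the smallest supported (M, K, N) that can hold an m x k by k x n product."""
    fits = [s for s in supported if s[0] >= m and s[1] >= k and s[2] >= n]
    if not fits:
        raise ValueError("no supported shape is large enough")
    return min(fits, key=lambda s: s[0] * s[1] * s[2])


def pad(matrix, rows, cols):
    """Zero-pad a list-of-lists matrix to rows x cols."""
    out = [row + [0] * (cols - len(row)) for row in matrix]
    out += [[0] * cols for _ in range(rows - len(matrix))]
    return out


def matmul(a, b):
    """Plain reference matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def padded_matmul(a, b):
    """Multiply using a supported shape, then slice away the padding."""
    m, k, n = len(a), len(b), len(b[0])
    pm, pk, pn = pick_shape(m, k, n)
    full = matmul(pad(a, pm, pk), pad(b, pk, pn))
    return [row[:n] for row in full[:m]]


a = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
b = [[1, 0], [0, 1], [1, 1]]      # 3 x 2
print(pick_shape(2, 3, 2))        # (4, 4, 4)
print(padded_matmul(a, b))        # [[4, 5], [10, 11]]
```

Because the padding is all zeros, the sliced result equals the unpadded product.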

Hardware acceleration of machine learning designs

Patent number: 12511460

Abstract: Hardware acceleration of machine learning (ML) designs includes translating an ML primitive into an intermediate representation. The intermediate representation is subdivided to specify a functional compute block. The functional compute block is sized according to a compute node primitive adapted for implementing the ML primitive on target hardware. An overlay is generated for the ML primitive, at least in part, by mapping the functional compute block to the compute node primitive. The overlay is synthesizable to implement the ML primitive on the target hardware. The overlay can be scheduled for operation within the target hardware as part of an ML design including the ML primitive.

Type: Grant

Filed: June 14, 2022

Date of Patent: December 30, 2025

Assignee: Xilinx, Inc.

Inventors: Ehsan Ghasemi, Rajeev Patwari, Elliott Delaye, Jorn Tuyls, Ephrem C. Wu, Xiao Teng, Sanket Pandit

SYSTEMS AND METHODS FOR PARALLELIZATION OF EMBEDDING OPERATIONS

Publication number: 20250190221

Abstract: A disclosed method may include initializing a deep learning recommendation model (DLRM) comprising a plurality of embedding tables, each embedding table comprising a plurality of embeddings. The method may also include receiving input data associated with accessing embeddings from the plurality of embedding tables and applying a parallelization strategy to process the plurality of embedding tables, the parallelization strategy configured to improve performance by distributing computational workloads and optimizing memory access. The method may also include processing the embeddings based on the input data in accordance with the parallelization strategy, the processing comprising aggregating embeddings accessed from the plurality of embedding tables. The method may also include generating, for further processing, output data based on the processed embeddings. Various other methods, systems, and computer-readable media are also disclosed.

Type: Application

Filed: December 9, 2024

Publication date: June 12, 2025

Applicants: Advanced Micro Devices, Inc., Xilinx, Inc.

Inventors: Krishnakumar Nair, Meenakshi Arunachalam, John Kalamatianos, Rishabh Jain, Varun Agrawal, Avinash Chandra Pandey, Siddappa Yallappa Karabannavar, Ashish Sirasao, Elliott Delaye
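
The flow in the abstract above (look up embeddings per table, pool them, aggregate across tables) can be sketched as follows. The abstract does not name a specific parallelization strategy, so table-wise distribution over a thread pool, sum pooling, and concatenation as the aggregation step are all assumptions. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def pooled_lookup(table, indices):
    """Sum the embedding rows selected by indices (sum pooling)."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for i in indices:
        for d in range(dim):
            pooled[d] += table[i][d]
    return pooled


def parallel_embedding(tables, indices_per_table, workers=2):
    """Process each embedding table in its own task, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so table 0's vector comes first.
        pooled = list(pool.map(pooled_lookup, tables, indices_per_table))
    return [x for vec in pooled for x in vec]


tables = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # table 0: three embeddings of width 2
    [[0.5, 0.5], [1.5, 1.5]],              # table 1: two embeddings of width 2
]
indices = [[0, 2], [1, 1]]
features = parallel_embedding(tables, indices)
print(features)  # [6.0, 8.0, 3.0, 3.0]
```

Threads in CPython will not speed up this pure-Python arithmetic; the sketch only shows how the work is split and recombined.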

Instruction set architecture for data processing array control

Patent number: 12248786

Abstract: Controlling a data processing (DP) array includes creating a replica of a register address space of the DP array based on the design and the DP array. A sequence of instructions, including write instructions and read instructions, is received. The write instructions correspond to buffer descriptors specifying runtime data movements for a design for a DP array. The write instructions are converted into transaction instructions and the read instructions are converted into wait instructions based on the replica of the register address space. The transaction instructions and the wait instructions are included in an instruction buffer. The instruction buffer is provided to a microcontroller configured to execute the transaction instructions and the wait instructions to implement the runtime data movements for the design as implemented in the DP array. In another aspect, the instruction buffer is stored in a file for subsequent execution by the microcontroller.

Type: Grant

Filed: August 8, 2022

Date of Patent: March 11, 2025

Assignee: Xilinx, Inc.

Inventors: Xiao Teng, Tejus Siddagangaiah, Bryan Lozano, Ehsan Ghasemi, Rajeev Patwari, Elliott Delaye, Jorn Tuyls, Aaron Ng, Sanket Pandit, Pramod Peethambaran, Satyaprakash Pareek

Scalable acceleration of reentrant compute operations

Patent number: 12147379

Abstract: Examples herein describe techniques for performing parallel processing using a plurality of processing elements (PEs) and a controller for data that has data dependencies. For example, a calculation may require an entire row or column to be summed, or to determine its mean. The PEs can be assigned different chunks of a data set (e.g., a tensor set, a column, or a row) for processing. The PEs can use one or more tokens to inform the controller when they are done with partial processing of their data chunks. The controller can then gather the partial results and determine an intermediate value for the data set. The controller can then distribute this intermediate value to the PEs which then re-process their respective data chunks using the intermediate value to generate final results.

Type: Grant

Filed: December 28, 2022

Date of Patent: November 19, 2024

Assignee: XILINX, INC.

Inventors: Rajeev Patwari, Jorn Tuyls, Elliott Delaye, Xiao Teng, Ephrem Wu
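
The two-pass pattern in the abstract above can be sketched with mean-centering of a row as the example calculation. The processing elements and controller are modeled as ordinary functions run sequentially, and the returned partial result stands in for the "done" token. All function names are hypothetical.

```python
def split_chunks(data, num_pes):
    """Split data into at most num_pes contiguous chunks."""
    size = -(-len(data) // num_pes)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def pe_partial(chunk):
    """First pass on a PE: partial sum and element count for its chunk."""
    return sum(chunk), len(chunk)


def pe_finalize(chunk, mean):
    """Second pass on a PE: re-process the same chunk using the intermediate value."""
    return [x - mean for x in chunk]


def center_row(row, num_pes=3):
    if not row:
        return []
    chunks = split_chunks(row, num_pes)
    # Each returned partial plays the role of a PE's completion token.
    partials = [pe_partial(c) for c in chunks]
    # Controller: gather partial results and form the intermediate value.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    mean = total / count
    # Controller distributes the mean; PEs produce the final results.
    return [y for c in chunks for y in pe_finalize(c, mean)]


centered = center_row([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(centered)  # [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
```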

Software defined neural network layer pipelining

Patent number: 12086572

Abstract: Embodiments herein describe techniques for expressing the layers of a neural network in a software model. In one embodiment, the software model includes a class that describes the various functional blocks (e.g., convolution units, max-pooling units, rectified linear units (ReLU), and scaling functions) used to execute the neural network layers. In turn, other classes in the software model can describe the operation of each of the functional blocks. In addition, the software model can include conditional logic for expressing how the data flows between the functional blocks since different layers in the neural network can process the data differently. A compiler can convert the high-level code in the software model (e.g., C++) into a hardware description language (e.g., register transfer level (RTL)) which is used to configure a hardware system to implement a neural network accelerator.

Type: Grant

Filed: October 17, 2017

Date of Patent: September 10, 2024

Assignee: XILINX, INC.

Inventors: Yongjun Wu, Jindrich Zejda, Elliott Delaye, Ashish Sirasao

Reconfigurable neural engine with extensible instruction set architecture

Patent number: 12079158

Abstract: An integrated circuit includes a plurality of kernels and a virtual machine coupled to the plurality of kernels. The virtual machine is configured to interpret instructions directed to different ones of the plurality of kernels. The virtual machine is configured to control operation of the different ones of the plurality of kernels responsive to the instructions.

Type: Grant

Filed: July 25, 2022

Date of Patent: September 3, 2024

Assignee: Xilinx, Inc.

Inventors: Sanket Pandit, Jorn Tuyls, Xiao Teng, Rajeev Patwari, Ehsan Ghasemi, Elliott Delaye, Aaron Ng
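
As a software analogy for the arrangement in the abstract above, the sketch below models a virtual machine that interprets instructions and dispatches each one to a registered kernel. The opcodes, kernel functions, and `VirtualMachine` class are hypothetical, and the patented design is hardware rather than Python.

```python
def relu_kernel(xs):
    return [max(0, x) for x in xs]


def scale_kernel(xs, factor):
    return [x * factor for x in xs]


def add_kernel(xs, ys):
    return [x + y for x, y in zip(xs, ys)]


class VirtualMachine:
    """Interprets (opcode, operands) instructions and routes them to kernels."""

    def __init__(self):
        self.kernels = {}

    def register(self, opcode, kernel):
        # Registering a new opcode is how the instruction set is extended.
        self.kernels[opcode] = kernel

    def run(self, program, data):
        for opcode, operands in program:
            kernel = self.kernels.get(opcode)
            if kernel is None:
                raise ValueError(f"unknown opcode: {opcode}")
            data = kernel(data, *operands)
        return data


vm = VirtualMachine()
vm.register("ADD", add_kernel)
vm.register("SCALE", scale_kernel)
vm.register("RELU", relu_kernel)

program = [("ADD", ([1, 1, 1],)), ("SCALE", (2,)), ("RELU", ())]
result = vm.run(program, [-3, 0, 2])
print(result)  # [0, 2, 6]
```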

Static block scheduling in massively parallel software defined hardware systems

Patent number: 12061990

Abstract: Embodiments herein describe techniques for static scheduling a neural network implemented in a massively parallel hardware system. The neural network may be scheduled using three different scheduling levels referred to herein as an upper level, an intermediate level, and a lower level. In one embodiment, the upper level includes a hardware or software model of the layers in the neural network that establishes a sequential order of functions that operate concurrently in the hardware system. In the intermediate level, identical processes in the functions defined in the upper level are connected to form a systolic array or mesh and balanced data flow channels are used to minimize latency. In the lower level, a compiler can assign the operations performed by the processing elements in the systolic array to different portions of the hardware system to provide a static schedule for the neural network.

Type: Grant

Filed: October 17, 2017

Date of Patent: August 13, 2024

Assignee: XILINX, INC.

Inventors: Yongjun Wu, Jindrich Zejda, Elliott Delaye, Ashish Sirasao

SCALABLE ACCELERATION OF REENTRANT COMPUTE OPERATIONS

Publication number: 20240220444

Abstract: Examples herein describe techniques for performing parallel processing using a plurality of processing elements (PEs) and a controller for data that has data dependencies. For example, a calculation may require an entire row or column to be summed, or to determine its mean. The PEs can be assigned different chunks of a data set (e.g., a tensor set, a column, or a row) for processing. The PEs can use one or more tokens to inform the controller when they are done with partial processing of their data chunks. The controller can then gather the partial results and determine an intermediate value for the data set. The controller can then distribute this intermediate value to the PEs which then re-process their respective data chunks using the intermediate value to generate final results.

Type: Application

Filed: December 28, 2022

Publication date: July 4, 2024

Inventors: Rajeev PATWARI, Jorn TUYLS, Elliott DELAYE, Xiao TENG, Ephrem WU

INSTRUCTION GENERATION AND PROGRAMMING MODEL FOR A DATA PROCESSING ARRAY AND MICROCONTROLLER

Publication number: 20240069511

Abstract: Instruction generation for a data processing array and microcontroller includes generating a tensor-level intermediate representation from a machine learning model using kernel expressions. Statements of the tensor-level intermediate representation are partitioned into a first set of statements and a second set of statements. From the first set of statements, kernel instructions are generated based on a reconfigurable neural engine model. The kernel instructions are executable by a compute tile of a data processing array to implement compute functions of the machine learning model. From the set of second statements, microcontroller instructions are generated based on a super-graph model. The microcontroller instructions are executable by a microcontroller of the data processing array to move data into and out from the data processing array.

Type: Application

Filed: August 31, 2022

Publication date: February 29, 2024

Applicant: Xilinx, Inc.

Inventors: Jorn Tuyls, Xiao Teng, Sanket Pandit, Rajeev Patwari, Qian Zhou, Ehsan Ghasemi, Ephrem C. Wu, Elliott Delaye, Aaron Ng

INSTRUCTION SET ARCHITECTURE FOR DATA PROCESSING ARRAY CONTROL

Publication number: 20240045692

Abstract: Controlling a data processing (DP) array includes creating a replica of a register address space of the DP array based on the design and the DP array. A sequence of instructions, including write instructions and read instructions, is received. The write instructions correspond to buffer descriptors specifying runtime data movements for a design for a DP array. The write instructions are converted into transaction instructions and the read instructions are converted into wait instructions based on the replica of the register address space. The transaction instructions and the wait instructions are included in an instruction buffer. The instruction buffer is provided to a microcontroller configured to execute the transaction instructions and the wait instructions to implement the runtime data movements for the design as implemented in the DP array. In another aspect, the instruction buffer is stored in a file for subsequent execution by the microcontroller.

Type: Application

Filed: August 8, 2022

Publication date: February 8, 2024

Applicant: Xilinx, Inc.

Inventors: Xiao Teng, Tejus Siddagangaiah, Bryan Lozano, Ehsan Ghasemi, Rajeev Patwari, Elliott Delaye, Jorn Tuyls, Aaron Ng, Sanket Pandit, Pramod Peethambaran, Satyaprakash Pareek

RECONFIGURABLE NEURAL ENGINE WITH EXTENSIBLE INSTRUCTION SET ARCHITECTURE

Publication number: 20240028556

Abstract: An integrated circuit includes a plurality of kernels and a virtual machine coupled to the plurality of kernels. The virtual machine is configured to interpret instructions directed to different ones of the plurality of kernels. The virtual machine is configured to control operation of the different ones of the plurality of kernels responsive to the instructions.

Type: Application

Filed: July 25, 2022

Publication date: January 25, 2024

Applicant: Xilinx, Inc.

Inventors: Sanket Pandit, Jorn Tuyls, Xiao Teng, Rajeev Patwari, Ehsan Ghasemi, Elliott Delaye, Aaron Ng

HARDWARE ACCELERATION OF MACHINE LEARNING DESIGNS

Publication number: 20230401480

Abstract: Hardware acceleration of machine learning (ML) designs includes translating an ML primitive into an intermediate representation. The intermediate representation is subdivided to specify a functional compute block. The functional compute block is sized according to a compute node primitive adapted for implementing the ML primitive on target hardware. An overlay is generated for the ML primitive, at least in part, by mapping the functional compute block to the compute node primitive. The overlay is synthesizable to implement the ML primitive on the target hardware. The overlay can be scheduled for operation within the target hardware as part of an ML design including the ML primitive.

Type: Application

Filed: June 14, 2022

Publication date: December 14, 2023

Applicant: Xilinx, Inc.

Inventors: Ehsan Ghasemi, Rajeev Patwari, Elliott Delaye, Jorn Tuyls, Ephrem C. Wu, Xiao Teng, Sanket Pandit

Machine learning runtime library for neural network acceleration

Patent number: 11694066

Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., a FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage which each correspond to different threads. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the tasks. Because the stages correspond to different threads, the library can process multiple packets in parallel which can increase the utilization of the neural network accelerator on the hardware system.

Type: Grant

Filed: October 17, 2017

Date of Patent: July 4, 2023

Assignee: XILINX, INC.

Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Sonal Santan, Soren T. Soe, Ashish Sirasao, Ehsan Ghasemi, Sean Settle
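
The three-stage, thread-per-stage pipeline in the abstract above can be sketched with standard-library queues. The accelerator stage is replaced by a stand-in function, and the packet fields (`raw`, `input`, `scores`, `label`) are hypothetical; the patent's actual packet layout is not given here.

```python
import queue
import threading

STOP = object()  # sentinel that shuts each stage down in turn


def stage(fn, inbox, outbox):
    """Run one pipeline stage in its own thread until STOP arrives."""
    while True:
        packet = inbox.get()
        if packet is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(packet))


def preprocess(packet):
    packet["input"] = [x / 255 for x in packet["raw"]]
    return packet


def execute(packet):
    # Stand-in for the accelerator execution stage.
    packet["scores"] = [2 * x for x in packet["input"]]
    return packet


def postprocess(packet):
    scores = packet["scores"]
    packet["label"] = max(range(len(scores)), key=scores.__getitem__)
    return packet


def run_pipeline(tasks):
    queues = [queue.Queue() for _ in range(4)]
    fns = [preprocess, execute, postprocess]
    threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:
        t.start()
    # Each task becomes a packet carrying what the later stages need.
    for task_id, raw in enumerate(tasks):
        queues[0].put({"id": task_id, "raw": raw})
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[3].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


results = run_pipeline([[0, 255, 51], [255, 0, 0]])
print([r["label"] for r in results])  # [1, 0]
```

With one thread per stage and FIFO queues, several packets are in flight at once while their order is preserved.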

Multi-layer neural network processing by a neural network accelerator using host communicated merged weights and a package of per-layer instructions

Patent number: 11620490

Abstract: In the disclosed methods and systems for processing in a neural network system, a host computer system writes a plurality of weight matrices associated with a plurality of layers of a neural network to a memory shared with a neural network accelerator. The host computer system further assembles a plurality of per-layer instructions into an instruction package. Each per-layer instruction specifies processing of a respective layer of the plurality of layers of the neural network, and respective offsets of weight matrices in a shared memory. The host computer system writes input data and the instruction package to the shared memory. The neural network accelerator reads the instruction package from the shared memory and processes the plurality of per-layer instructions of the instruction package.

Type: Grant

Filed: October 17, 2017

Date of Patent: April 4, 2023

Assignee: XILINX, INC.

Inventors: Aaron Ng, Elliott Delaye, Ehsan Ghasemi, Xiao Teng, Jindrich Zejda, Yongjun Wu, Sean Settle, Ashish Sirasao
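
A minimal sketch of the merged-weights idea in the abstract above: the host flattens all weight matrices into one buffer, records each layer's offset in a per-layer instruction, and the accelerator walks the instruction package. Shared memory is modeled as a plain list, each layer is reduced to a matrix-vector product, and the instruction fields are hypothetical.

```python
def merge_weights(weight_matrices):
    """Flatten every weight matrix into one buffer and record each start offset."""
    buffer, offsets = [], []
    for w in weight_matrices:
        offsets.append(len(buffer))
        for row in w:
            buffer.extend(row)
    return buffer, offsets


def build_package(weight_matrices, offsets):
    """Assemble one instruction per layer, each pointing at its weights."""
    return [{"layer": i, "offset": off, "rows": len(w), "cols": len(w[0])}
            for i, (w, off) in enumerate(zip(weight_matrices, offsets))]


def accelerator_run(shared_weights, package, x):
    """Process the per-layer instructions in order (matrix-vector product per layer)."""
    for instr in package:
        off, rows, cols = instr["offset"], instr["rows"], instr["cols"]
        x = [sum(shared_weights[off + i * cols + j] * x[j] for j in range(cols))
             for i in range(rows)]
    return x


weights = [
    [[1, 2], [3, 4], [5, 6]],  # layer 0: 3 x 2
    [[1, 0, 1]],               # layer 1: 1 x 3
]
shared_weights, offsets = merge_weights(weights)
package = build_package(weights, offsets)
output = accelerator_run(shared_weights, package, [1, 1])
print(offsets)  # [0, 6]
print(output)   # [14]
```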

Neural network processing system having host controlled kernel acclerators

Patent number: 11568218

Abstract: A disclosed neural network processing system includes a host computer system, RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that when executed causes the host computer system to write input data and work requests to the RAMs. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse grain tasks and fine grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.

Type: Grant

Filed: October 17, 2017

Date of Patent: January 31, 2023

Assignee: XILINX, INC.

Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Ashish Sirasao

Host-directed multi-layer neural network processing via per-layer work requests

Patent number: 11429848

Abstract: In disclosed approaches of neural network processing, a host computer system copies an input data matrix from host memory to a shared memory for performing neural network operations of a first layer of a neural network by a neural network accelerator. The host instructs the neural network accelerator to perform neural network operations of each layer of the neural network beginning with the input data matrix. The neural network accelerator performs neural network operations of each layer in response to the instruction from the host. The host waits until the neural network accelerator signals completion of performing neural network operations of layer i before instructing the neural network accelerator to commence performing neural network operations of layer i+1, for i ≥ 1. The host instructs the neural network accelerator to use a results data matrix in the shared memory from layer i as an input data matrix for layer i+1 for i ≥ 1.

Type: Grant

Filed: October 17, 2017

Date of Patent: August 30, 2022

Assignee: XILINX, INC.

Inventors: Aaron Ng, Elliott Delaye, Jindrich Zejda, Ashish Sirasao
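
The host-side control loop in the abstract above can be sketched as follows. The accelerator is simulated by a worker thread, the completion signal by a `threading.Event`, and shared memory by a dictionary of named buffers. The `Accelerator` class and buffer names are hypothetical.

```python
import threading


class Accelerator:
    """Simulated accelerator that runs one layer per request and signals completion."""

    def __init__(self, layers, shared):
        self.layers = layers
        self.shared = shared
        self.done = threading.Event()
        self.worker = None

    def start_layer(self, i, src, dst):
        self.done.clear()

        def work():
            self.shared[dst] = self.layers[i](self.shared[src])
            self.done.set()  # completion signal back to the host

        self.worker = threading.Thread(target=work)
        self.worker.start()


def host_run(layers, input_matrix):
    # Host copies the input data matrix into shared memory.
    shared = {"buf0": [row[:] for row in input_matrix]}
    acc = Accelerator(layers, shared)
    src = "buf0"
    for i in range(len(layers)):
        dst = f"buf{i + 1}"
        acc.start_layer(i, src, dst)
        acc.done.wait()    # host waits for layer i before issuing layer i+1
        acc.worker.join()
        src = dst          # results of layer i become the input of layer i+1
    return shared[src]


layers = [
    lambda m: [[2 * x for x in row] for row in m],
    lambda m: [[x + 1 for x in row] for row in m],
]
output = host_run(layers, [[1, 2], [3, 4]])
print(output)  # [[3, 5], [7, 9]]
```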

Image preprocessing for generalized image processing

Patent number: 11386644

Abstract: An example preprocessor circuit includes: a first buffer configured to store rows of image data and output a row thereof; a second buffer, coupled to the first buffer, including storage locations to store respective image samples of the row output by the first buffer; shift registers; an interconnect network including connections, each connection coupling a respective one of the shift registers to more than one of the storage locations, one or more of the storage locations being coupled to more than one of the connections; and a control circuit configured to load the shift registers with the image samples based on the connections and shift the shift registers to output streams of image samples.

Type: Grant

Filed: October 17, 2017

Date of Patent: July 12, 2022

Assignee: XILINX, INC.

Inventors: Elliott Delaye, Ashish Sirasao, Aaron Ng, Yongjun Wu, Jindrich Zejda

Neural network processing system having multiple processors and a neural network accelerator

Patent number: 11222256

Abstract: At least one neural network accelerator performs operations of a first subset of layers of a neural network on an input data set, generates an intermediate data set, and stores the intermediate data set in a shared memory queue in a shared memory. A first processor element of a host computer system provides input data to the neural network accelerator and signals the neural network accelerator to perform the operations of the first subset of layers of the neural network on the input data set. A second processor element of the host computer system reads the intermediate data set from the shared memory queue, performs operations of a second subset of layers of the neural network on the intermediate data set, and generates an output data set while the neural network accelerator is performing the operations of the first subset of layers of the neural network on another input data set.

Type: Grant

Filed: October 17, 2017

Date of Patent: January 11, 2022

Assignee: XILINX, INC.

Inventors: Xiao Teng, Aaron Ng, Ashish Sirasao, Elliott Delaye
