Patents by Inventor Ashish Sirasao

Ashish Sirasao has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 10678983
    Abstract: Local retiming for a circuit design includes determining, using computer hardware, a load of a synchronous circuit element within the circuit design tagged for forward retiming, traversing, using the computer hardware, each input of the load backward through the circuit design until a sequential circuit element or a primary input is reached, and adding, using the computer hardware, each synchronous circuit element encountered in the traversing to a forward retiming list. In response to determining that the forward retiming criteria are met for the forward retiming list, the computer hardware modifies the circuit design by creating a new synchronous circuit element at an output of the load.
    Type: Grant
    Filed: May 23, 2018
    Date of Patent: June 9, 2020
    Assignee: Xilinx, Inc.
    Inventors: Shangzhi Sun, Chaithanya Dudha, Bing Tian, Ashish Sirasao
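    A minimal Python sketch of the backward traversal described above; the Node model and the criteria check (every backward path must end at a flip-flop rather than a primary input) are illustrative assumptions, not the patented implementation:

    ```python
    class Node:
        def __init__(self, name, kind, fanin=()):
            self.name = name
            self.kind = kind          # 'ff' (sequential), 'pi' (primary input), or 'comb'
            self.fanin = list(fanin)  # driver nodes

    def forward_retiming_list(load):
        """Traverse each input of the load backward; collect flip-flops reached."""
        ffs, ok = [], True
        for driver in load.fanin:
            stack = [driver]
            while stack:
                n = stack.pop()
                if n.kind == 'ff':
                    ffs.append(n)          # sequential element: stop this path
                elif n.kind == 'pi':
                    ok = False             # primary input reached: criteria not met
                else:
                    stack.extend(n.fanin)  # combinational: keep walking backward
        return ffs, ok

    # Example: two flip-flops feed an AND gate (the load).
    a, b = Node('ff_a', 'ff'), Node('ff_b', 'ff')
    gate = Node('and1', 'comb', fanin=[a, b])
    flops, criteria_met = forward_retiming_list(gate)
    if criteria_met:
        # Model the retiming move: a new synchronous element at the load's output.
        new_ff = Node('ff_new', 'ff', fanin=[gate])
        print('retime', [f.name for f in flops], '->', new_ff.name)
    ```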
  • Patent number: 10678509
    Abstract: An example multiply accumulate (MACC) circuit includes a multiply-accumulator having an accumulator output register, a scaler coupled to the multiply-accumulator, and a control circuit coupled to the multiply-accumulator and the scaler. The control circuit is configured to provide control data to the scaler, the control data indicative of: a most-significant bit (MSB) to least-significant bit (LSB) range for selecting bit indices from the accumulator output register for implementing a first right shift; a multiplier; and a second right shift.
    Type: Grant
    Filed: August 21, 2018
    Date of Patent: June 9, 2020
    Assignee: XILINX, INC.
    Inventors: Sean Settle, Elliott Delaye, Aaron Ng, Ehsan Ghasemi, Ashish Sirasao, Xiao Teng, Jindrich Zejda
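    A minimal arithmetic sketch of the scaling path described above: a bit range from the accumulator output register is selected (the first right shift), multiplied, then shifted right a second time. The bit widths and constants are assumed values for illustration only:

    ```python
    def scale(acc, lsb, msb, multiplier, shift2):
        width = msb - lsb + 1
        sel = (acc >> lsb) & ((1 << width) - 1)   # first right shift + MSB..LSB select
        return (sel * multiplier) >> shift2       # multiply, then second right shift

    # Multiply-accumulate three products, then scale the accumulator output.
    acc = sum(w * x for w, x in zip([3, -1, 4], [2, 5, 6]))   # acc = 25
    print(scale(acc, lsb=0, msb=15, multiplier=77, shift2=8)) # (25*77) >> 8 = 7
    ```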
  • Publication number: 20200089472
    Abstract: Circuits and methods for multiplying floating point operands. An exponent adder circuit sums a first exponent and a second exponent and generates an output exponent. A mantissa multiplier circuit multiplies a first mantissa and a second mantissa and generates an output mantissa. A first conversion circuit converts the output exponent and output mantissa into a fixed point number. An accumulator circuit sums contents of an accumulation register and the fixed point number into an accumulated value and stores the accumulated value in the accumulation register.
    Type: Application
    Filed: September 19, 2018
    Publication date: March 19, 2020
    Applicant: Xilinx, Inc.
    Inventors: Satyaprakash Pareek, Anup Hosangadi, Bing Tian, Ashish Sirasao, Yao Fu, Oscar Fernando C. Fernandez, Michael Wu, Christopher H. Dick
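    A minimal numeric sketch of the datapath described above: exponents are summed, mantissas multiplied, the product converted to fixed point, and the fixed-point values accumulated. The 8-bit fraction width and the operands are assumptions for illustration:

    ```python
    import math

    FRAC_BITS = 8  # fixed-point fraction width (assumed)

    def fp_multiply_to_fixed(a, b):
        ma, ea = math.frexp(a)      # a = ma * 2**ea, mantissa in [0.5, 1)
        mb, eb = math.frexp(b)
        e_out = ea + eb             # exponent adder circuit
        m_out = ma * mb             # mantissa multiplier circuit
        return round(m_out * 2 ** e_out * 2 ** FRAC_BITS)  # conversion circuit

    accumulator = 0                 # accumulation register
    for a, b in [(1.5, 2.0), (0.25, 4.0)]:
        accumulator += fp_multiply_to_fixed(a, b)
    print(accumulator / 2 ** FRAC_BITS)  # 4.0
    ```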
  • Patent number: 10572409
    Abstract: A memory arrangement can store a matrix of matrix data elements specified as index-value pairs that indicate row and column indices and associated values. First split-and-merge circuitry is coupled between the memory arrangement and a first set of FIFO buffers for reading the matrix data elements from the memory arrangement and putting the matrix data elements in the first set of FIFO buffers based on column indices. A pairing circuit is configured to read vector data elements, pair the vector data elements with the matrix data elements, and put the paired matrix and vector data elements in a second set of FIFO buffers based on column indices. Second split-and-merge circuitry is configured to read paired matrix and vector data elements from the second set of FIFO buffers and put the paired matrix and vector data elements in a third set of FIFO buffers based on row indices.
    Type: Grant
    Filed: May 10, 2018
    Date of Patent: February 25, 2020
    Assignee: XILINX, INC.
    Inventors: Jindrich Zejda, Ling Liu, Yifei Zhou, Ashish Sirasao
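    A minimal software sketch of the dataflow described above: index-value pairs are split into FIFOs by column index, paired with vector elements, then split again by row index and accumulated. Modeling each FIFO set as a dict of deques is an assumption mirroring the abstract:

    ```python
    from collections import defaultdict, deque

    matrix = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0)]  # (row, col, value) pairs
    vector = [1.0, 2.0, 4.0]

    col_fifos = defaultdict(deque)            # first split-and-merge: by column
    for row, col, val in matrix:
        col_fifos[col].append((row, val))

    pair_fifos = defaultdict(deque)           # pairing circuit output
    for col, fifo in col_fifos.items():
        while fifo:
            row, val = fifo.popleft()
            pair_fifos[col].append((row, val, vector[col]))

    row_fifos = defaultdict(deque)            # second split-and-merge: by row
    for fifo in pair_fifos.values():
        while fifo:
            row, mval, vval = fifo.popleft()
            row_fifos[row].append(mval * vval)

    print({row: sum(fifo) for row, fifo in row_fifos.items()})  # {0: 6.0, 1: 6.0}
    ```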
  • Patent number: 10572225
    Abstract: A request generator circuit is configured to read data elements of a three-dimensional (3-D) input feature map (IFM) from a memory and store a subset of the data elements in one of a plurality of N line buffers. Each line buffer is configured for storage of M data elements. A pixel iterator circuit is coupled to the line buffers and is configured to generate a sequence of addresses for reading the stored data elements from the line buffers based on a sequence of IFM height values and a sequence of IFM width values.
    Type: Grant
    Filed: September 26, 2018
    Date of Patent: February 25, 2020
    Assignee: XILINX, INC.
    Inventors: Ehsan Ghasemi, Elliott Delaye, Ashish Sirasao, Sean Settle
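    A minimal sketch of the pixel-iterator addressing described above: read addresses are generated from a sequence of IFM height values and width values against N line buffers of M elements each. Row-major placement of rows across the buffers is an assumption:

    ```python
    N, M = 4, 8   # number of line buffers and elements per buffer (assumed)

    def read_addresses(heights, widths):
        for h in heights:
            for w in widths:
                yield h % N, w % M   # (line buffer holding row h, element index)

    for buf, addr in read_addresses(heights=range(2), widths=range(3)):
        print(f"line buffer {buf}, element {addr}")
    ```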
  • Patent number: 10515135
    Abstract: Methods and apparatus are described for performing data-intensive compute algorithms, such as fast massively parallel general matrix multiplication (GEMM), using a particular data format for both storing data to and reading data from memory. This data format may be utilized for arbitrarily-sized input matrices for GEMM implemented on a finite-size GEMM accelerator in the form of a rectangular compute array of digital signal processing (DSP) elements or similar compute cores. This data format solves the issue of double data rate (DDR) dynamic random access memory (DRAM) bandwidth by allowing both linear DDR addressing and single cycle loading of data into the compute array, avoiding input/output (I/O) and/or DDR bottlenecks.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: December 24, 2019
    Assignee: XILINX, INC.
    Inventors: Jindrich Zejda, Elliott Delaye, Aaron Ng, Ashish Sirasao, Yongjun Wu
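    A minimal sketch of the data format described above: an arbitrarily sized matrix is reordered so that each compute-array-sized tile is contiguous in memory, allowing linear addressing per tile. The 2x2 tile and NumPy model are assumptions for illustration:

    ```python
    import numpy as np

    TILE = 2   # compute array dimension (assumed)

    def tile_linearize(a):
        rows, cols = a.shape   # assumed to be multiples of TILE
        tiles = [a[r:r + TILE, c:c + TILE].ravel()
                 for r in range(0, rows, TILE)
                 for c in range(0, cols, TILE)]
        return np.concatenate(tiles)   # each tile now occupies a linear range

    a = np.arange(16).reshape(4, 4)
    print(tile_linearize(a))   # tile (0,0) first: 0 1 4 5, then tile (0,1): 2 3 6 7, ...
    ```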
  • Patent number: 10460416
    Abstract: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a plurality of memory banks configured to store the image data; multiplexer circuitry coupled to the memory banks; a first plurality of registers coupled to the multiplexer circuitry; a second plurality of registers coupled to the first plurality of registers, outputs of the second plurality of registers configured to provide the plurality of streams of image samples; and control circuitry configured to generate addresses for the plurality of memory banks, control the multiplexer circuitry to select among outputs of the plurality of memory banks, control the first plurality of registers to store outputs of the multiplexer circuitry, and control the second plurality of registers to store outputs of the first plurality of registers.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: October 29, 2019
    Assignee: XILINX, INC.
    Inventors: Ashish Sirasao, Elliott Delaye, Aaron Ng, Ehsan Ghasemi
  • Patent number: 10430539
    Abstract: Methods and apparatus relating generally to synthesis are described. In such a method, a directed graph for a circuit design is generated. A cascaded chain is identified in the directed graph with a timing violation. A pipeline register stage of the cascaded chain is moved (or added) to remove the timing violation. The circuit design is transformed to provide a netlist including the pipeline register stage.
    Type: Grant
    Filed: December 16, 2016
    Date of Patent: October 1, 2019
    Assignee: XILINX, INC.
    Inventors: Chaithanya Dudha, Zhao Ma, Krishna Garlapati, Ashish Sirasao
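    A minimal sketch of the transformation described above: walk a cascaded chain and insert a pipeline register stage where the accumulated delay first exceeds the clock period. The delay values and single-chain model are illustrative assumptions:

    ```python
    CLOCK_PERIOD = 2.0   # target period in ns (assumed)

    chain = [("dsp0", 0.9), ("dsp1", 0.8), ("dsp2", 0.7), ("dsp3", 0.9)]

    def insert_pipeline_registers(chain, period):
        staged, elapsed = [], 0.0
        for name, delay in chain:
            if elapsed + delay > period:          # timing violation on this path
                staged.append(("pipeline_reg", 0.0))
                elapsed = 0.0
            staged.append((name, delay))
            elapsed += delay
        return staged

    print(insert_pipeline_registers(chain, CLOCK_PERIOD))
    # a pipeline_reg lands between dsp1 and dsp2, removing the violation
    ```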
  • Patent number: 10411709
    Abstract: Disclosed circuits and methods include N line buffers. Each line buffer is configured for storage of M data elements of a three-dimensional (3-D) input feature map (IFM). A request generator circuit is coupled to the N line buffers and to a memory configured for storage of the 3-D IFM. The request generator circuit divides the 3-D IFM into a plurality of IFM sub-volumes based on values of N, M, and dimensions of the 3-D IFM. The request generator circuit reads from the memory data elements at addresses of an unprocessed one of the IFM sub-volumes and stores the data elements of the unprocessed one of the IFM sub-volumes in the N line buffers. In response to a completion signal, the request generator circuit repeats the reading of an unprocessed one of the IFM sub-volumes and the storing of the data elements in the N line buffers.
    Type: Grant
    Filed: July 25, 2018
    Date of Patent: September 10, 2019
    Assignee: XILINX, INC.
    Inventors: Ehsan Ghasemi, Elliott Delaye, Ashish Sirasao
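    A minimal sketch of the request-generator bookkeeping described above: the 3-D IFM is divided into sub-volumes sized to fit N line buffers of M elements, which are then read one at a time. Splitting along height and width only is an illustrative assumption:

    ```python
    N, M = 4, 8                        # line buffers and elements per buffer (assumed)
    height, width, depth = 16, 16, 3   # 3-D IFM dimensions (assumed)

    def ifm_subvolumes(h, w, d, n, m):
        for h0 in range(0, h, n):      # n rows span the line buffers
            for w0 in range(0, w, m):  # m elements fill one line buffer
                yield (h0, min(h0 + n, h)), (w0, min(w0 + m, w)), (0, d)

    for sub in ifm_subvolumes(height, width, depth, N, M):
        print("read sub-volume", sub)  # then wait for the completion signal
    ```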
  • Patent number: 10354733
    Abstract: Methods and apparatus are described for partitioning and reordering block-based matrix multiplications for high-speed data streaming in general matrix multiplication (GEMM), which may be implemented by a programmable integrated circuit (IC). By preloading and hierarchically caching the blocks, examples of the present disclosure reduce the double data rate (DDR) memory intake bandwidth for software-defined GEMM accelerators.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: July 16, 2019
    Assignee: XILINX, INC.
    Inventors: Jindrich Zejda, Elliott Delaye, Ashish Sirasao, Yongjun Wu, Aaron Ng
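    A minimal sketch of the block reordering described above: one block of A is preloaded and reused against an entire row of B blocks before the next block is fetched, cutting repeated memory reads. The block size and loop order are assumptions:

    ```python
    import numpy as np

    BLK = 2   # block size (assumed)
    A, B = np.arange(16.0).reshape(4, 4), np.ones((4, 4))
    C = np.zeros((4, 4))

    for i in range(0, 4, BLK):
        for k in range(0, 4, BLK):
            a_blk = A[i:i + BLK, k:k + BLK]   # preloaded, cached block of A
            for j in range(0, 4, BLK):        # reused across this whole loop
                C[i:i + BLK, j:j + BLK] += a_blk @ B[k:k + BLK, j:j + BLK]

    assert np.allclose(C, A @ B)              # same result as unblocked GEMM
    ```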
  • Patent number: 10331836
    Abstract: Implementing a circuit design can include determining a chain of a plurality of loop elements of a circuit design, wherein each loop element includes a bit select node configured to perform a bit assignment operation and a corresponding address calculation node, wherein the address calculation nodes use a common variable to calculate a starting bit location provided to the corresponding bit select node. In response to the determining, the chain is replicated, resulting in one chain for each value of the common variable, and each chain is transformed into a plurality of wires. A multiplexer is inserted into the circuit design. The plurality of wires for each chain is coupled to inputs of the multiplexer, and the common variable is provided to the multiplexer as a select signal.
    Type: Grant
    Filed: October 11, 2017
    Date of Patent: June 25, 2019
    Assignee: XILINX, INC.
    Inventors: Anup Hosangadi, Sumanta Datta, Aman Gayasen, Ashish Sirasao
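    A minimal behavioral sketch of the transformation described above: a chain of bit assignments whose starting bit depends on a common variable is replicated once per value of that variable, each replica reduces to constant wires, and a multiplexer selects among them. The 4-bit chain and four-value range are assumptions:

    ```python
    def chain(x, s):                 # bit-select chain: start bit depends on s
        for i in range(4):
            x |= 1 << (s + i)        # bit assignment at location s + i
        return x

    # Replicate the chain for each value of the common variable; each replica
    # collapses to a constant bundle of wires.
    wires = [chain(0, s) for s in range(4)]

    def mux(s):                      # multiplexer with s as the select signal
        return wires[s]

    print(bin(mux(2)))               # 0b111100
    ```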
  • Patent number: 10303833
    Abstract: Parallelizing operations for implementing a circuit design can include dividing, using a processor, the circuit design into a plurality of partitions, wherein each partition is stored as a separate file; for each partition, generating, using the processor, a timing arc file specifying boundary delays for the partition; and generating, using the processor, a partition design file specifying interfaces of the partitions. Using the processor, a plurality of processes executing in parallel can be initiated. Each process is adapted to operate on a selected partition using the partition design file and the timing arc files for the other partitions to generate an updated file for the selected partition.
    Type: Grant
    Filed: February 9, 2017
    Date of Patent: May 28, 2019
    Assignee: XILINX, INC.
    Inventors: Aman Gayasen, Surya Pratik Saha, Elliott Delaye, Shangzhi Sun, Ashish Sirasao
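    A minimal software sketch of the parallel flow described above: each worker processes one partition while seeing only the boundary-delay (timing arc) data of the others. The file contents are replaced by in-memory stand-ins, and the "update" step is a placeholder, not the patented tool flow:

    ```python
    from concurrent.futures import ProcessPoolExecutor

    partitions = {"p0": 1.2, "p1": 0.8, "p2": 1.5}   # partition -> boundary delay

    def process_partition(name, own_delay, other_arcs):
        # Stand-in for optimizing a partition against the others' timing arcs.
        budget = 5.0 - sum(other_arcs.values())
        return name, f"updated (slack {budget - own_delay:.1f})"

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(process_partition, n, d,
                                   {k: v for k, v in partitions.items() if k != n})
                       for n, d in partitions.items()]
            for f in futures:
                print(*f.result())
    ```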
  • Patent number: 10289786
    Abstract: Reducing latency of a circuit design can include determining, using a processor, a set of sequential circuit elements of a circuit design that meets a condition for removal from the circuit design, wherein the condition is dependent upon a target technology process and a target operating frequency. Using the processor, a feasible cut for a selected sequential circuit element of the set is determined. The selected sequential circuit element and each other sequential circuit element of the set that is part of the cut are removed from the circuit design using the processor.
    Type: Grant
    Filed: June 27, 2017
    Date of Patent: May 14, 2019
    Assignee: XILINX, INC.
    Inventors: Chaithanya Dudha, Shangzhi Sun, Ashish Sirasao, Nithin Kumar Guggilla
  • Publication number: 20190114538
    Abstract: In disclosed approaches of neural network processing, a host computer system copies an input data matrix from host memory to a shared memory for performing neural network operations of a first layer of a neural network by a neural network accelerator. The host instructs the neural network accelerator to perform neural network operations of each layer of the neural network beginning with the input data matrix. The neural network accelerator performs neural network operations of each layer in response to the instruction from the host. The host waits until the neural network accelerator signals completion of performing neural network operations of layer i before instructing the neural network accelerator to commence performing neural network operations of layer i+1, for i≥1. The host instructs the neural network accelerator to use a results data matrix in the shared memory from layer i as an input data matrix for layer i+1 for i≥1.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Elliott Delaye, Jindrich Zejda, Ashish Sirasao
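    A minimal sketch of the layer-by-layer handshake described above, with the accelerator simulated by a thread: the host issues the instruction for layer i, then blocks on the completion signal before commencing layer i+1. The queue-based signaling and the doubling "layer" are assumptions:

    ```python
    import threading, queue

    commands, completions = queue.Queue(), queue.Queue()
    shared_memory = {"data": [1.0, 2.0, 3.0]}

    def accelerator():
        while (layer := commands.get()) is not None:
            # Run this layer on the shared-memory data, then signal completion.
            shared_memory["data"] = [x * 2 for x in shared_memory["data"]]
            completions.put(layer)

    t = threading.Thread(target=accelerator)
    t.start()
    for layer in range(3):
        commands.put(layer)    # instruct layer i
        completions.get()      # wait before commencing layer i + 1
    commands.put(None)
    t.join()
    print(shared_memory["data"])   # result of the final layer: [8.0, 16.0, 24.0]
    ```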
  • Publication number: 20190114548
    Abstract: Embodiments herein describe techniques for static scheduling a neural network implemented in a massively parallel hardware system. The neural network may be scheduled using three different scheduling levels referred to herein as an upper level, an intermediate level, and a lower level. In one embodiment, the upper level includes a hardware or software model of the layers in the neural network that establishes a sequential order of functions that operate concurrently in the hardware system. In the intermediate level, identical processes in the functions defined in the upper level are connected to form a systolic array or mesh and balanced data flow channels are used to minimize latency. In the lower level, a compiler can assign the operations performed by the processing elements in the systolic array to different portions of the hardware system to provide a static schedule for the neural network.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Yongjun Wu, Jindrich Zejda, Elliott Delaye, Ashish Sirasao
  • Publication number: 20190114499
    Abstract: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a first buffer configured to store a plurality of rows of the image data and output a row of the plurality of rows; a second buffer, coupled to the first buffer, including a plurality of storage locations to store a respective plurality of image samples of the row output by the first buffer; a plurality of shift registers; an interconnect network including a plurality of connections, each connection coupling a respective one of the plurality of shift registers to more than one of the plurality of storage locations, one or more of the plurality of storage locations being coupled to more than one of the plurality of connections; and a control circuit configured to load the plurality of shift registers with the plurality of image samples based on the plurality of connections and shift the plurality of shift registers to output the plurality of streams of image samples.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Elliott Delaye, Ashish Sirasao, Aaron Ng, Yongjun Wu, Jindrich Zejda
  • Publication number: 20190114534
    Abstract: At least one neural network accelerator performs operations of a first subset of layers of a neural network on an input data set, generates an intermediate data set, and stores the intermediate data set in a shared memory queue in a shared memory. A first processor element of a host computer system provides input data to the neural network accelerator and signals the neural network accelerator to perform the operations of the first subset of layers of the neural network on the input data set. A second processor element of the host computer system reads the intermediate data set from the shared memory queue, performs operations of a second subset of layers of the neural network on the intermediate data set, and generates an output data set while the neural network accelerator is performing the operations of the first subset of layers of the neural network on another input data set.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Xiao Teng, Aaron Ng, Ashish Sirasao, Elliott Delaye
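    A minimal sketch of the overlap described above: one worker (standing in for the neural network accelerator) runs the first subset of layers and pushes intermediate data into a shared queue, while a second worker runs the remaining layers on earlier results. Layer functions and queue depth are assumptions:

    ```python
    import threading, queue

    shared_queue = queue.Queue(maxsize=2)   # shared-memory queue stand-in
    inputs = [[1, 2], [3, 4], [5, 6]]

    def accelerator():                      # first subset of layers
        for x in inputs:
            shared_queue.put([v + 1 for v in x])
        shared_queue.put(None)              # end-of-stream marker

    def second_processor():                 # second subset of layers
        while (mid := shared_queue.get()) is not None:
            print("output:", [v * 10 for v in mid])

    workers = [threading.Thread(target=accelerator),
               threading.Thread(target=second_processor)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    ```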
  • Publication number: 20190114533
    Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., an FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage, which each correspond to different threads. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the task. Because the stages correspond to different threads, the library can process multiple packets in parallel, which can increase the utilization of the neural network accelerator on the hardware system.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Sonal Santan, Soren T. Soe, Ashish Sirasao, Ehsan Ghasemi, Sean Settle
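    A minimal sketch of the packet idea described above: when a task arrives, the library bundles everything each pipeline stage needs into one packet, so the pre-processing, FPGA-execution, and post-processing threads can each work on a different packet concurrently. All field names and stage bodies are assumptions:

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class Packet:
        task_id: int
        raw_input: list                                  # read by pre-processing
        prepared: list = field(default_factory=list)     # filled by pre-processing
        raw_output: list = field(default_factory=list)   # filled by FPGA execution
        result: list = field(default_factory=list)       # filled by post-processing

    def preprocess(p):
        p.prepared = [x / 255 for x in p.raw_input]

    def execute(p):
        p.raw_output = [x * 2 for x in p.prepared]       # stand-in for the FPGA stage

    def postprocess(p):
        p.result = [round(x, 3) for x in p.raw_output]

    pkt = Packet(task_id=0, raw_input=[51, 102])
    for stage in (preprocess, execute, postprocess):     # each stage is its own thread
        stage(pkt)
    print(pkt.result)   # [0.4, 0.8]
    ```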
  • Publication number: 20190114535
    Abstract: A disclosed neural network processing system includes a host computer system, RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that when executed causes the host computer system to write input data and work requests to the RAMs. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse-grain tasks and fine-grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Ashish Sirasao
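    A minimal sketch of the scheduling step described above: operations form a dependency graph that is walked in topological order, with each ready subtask dispatched to an accelerator kernel of matching capability. The graph and the name-based kernel matching are assumptions:

    ```python
    from graphlib import TopologicalSorter   # Python 3.9+

    deps = {"conv1": set(), "conv2": {"conv1"},
            "pool1": {"conv1"}, "fc1": {"conv2", "pool1"}}
    kernels = {"conv": "conv_kernel", "pool": "pool_kernel", "fc": "fc_kernel"}

    for op in TopologicalSorter(deps).static_order():
        kernel = kernels[op.rstrip("0123456789")]   # match subtask to kernel type
        print(f"schedule {op} on {kernel}")
    ```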
  • Publication number: 20190114529
    Abstract: In the disclosed methods and systems for processing in a neural network system, a host computer system writes a plurality of weight matrices associated with a plurality of layers of a neural network to a memory shared with a neural network accelerator. The host computer system further assembles a plurality of per-layer instructions into an instruction package. Each per-layer instruction specifies processing of a respective layer of the plurality of layers of the neural network, and respective offsets of weight matrices in a shared memory. The host computer system writes input data and the instruction package to the shared memory. The neural network accelerator reads the instruction package from the shared memory and processes the plurality of per-layer instructions of the instruction package.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Elliott Delaye, Ehsan Ghasemi, Xiao Teng, Jindrich Zejda, Yongjun Wu, Sean Settle, Ashish Sirasao
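    A minimal sketch of the package assembly described above: the host writes weight matrices into a shared-memory model, records each matrix's offset, and assembles one per-layer instruction per layer. The field layout is an assumption for illustration:

    ```python
    weights = {"layer0": [0.1] * 64, "layer1": [0.2] * 128}

    shared_memory, offsets, cursor = [], {}, 0
    for name, w in weights.items():        # write the weight matrices
        offsets[name] = cursor
        shared_memory.extend(w)
        cursor += len(w)

    instruction_package = [                # one per-layer instruction each
        {"layer": name, "weight_offset": offsets[name], "size": len(w)}
        for name, w in weights.items()
    ]
    print(instruction_package)
    # The accelerator reads the package and processes each per-layer instruction.
    ```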