Patents by Inventor Ashish Sirasao

Ashish Sirasao has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11694066
    Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., a FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage which each correspond to different threads. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the tasks. Because the stages correspond to different threads, the library can process multiple packets in parallel which can increase the utilization of the neural network accelerator on the hardware system.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: July 4, 2023
    Assignee: XILINX, INC.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Sonal Santan, Soren T. Soe, Ashish Sirasao, Ehsan Ghasemi, Sean Settle
  • Publication number: 20230153583
    Abstract: Processing of a neural network specification includes gathering first layers of a neural network graph into groups of layers based on profiled compute times of the layers and equalized compute times between the groups. Each group is a subgraph of one or more of the layers of the neural network. The neural network graph is compiled into instructions for pipelined execution of the neural network graph by compute circuits. The compiling includes designating, for each first subgraph of the subgraphs having output activations that are input activations of a second subgraph of the subgraphs, operations of the first subgraph to be performed by a first compute circuit and operations of the second subgraph to be performed by a second compute circuit. The compute circuits are configured to execute the instructions.
    Type: Application
    Filed: November 15, 2021
    Publication date: May 18, 2023
    Applicant: Xilinx, Inc.
    Inventors: Ashish Sirasao, Vishal Kumar Jain, Sumit Nagpal
  • Patent number: 11620490
    Abstract: In the disclosed methods and systems for processing in a neural network system, a host computer system writes a plurality of weight matrices associated with a plurality of layers of a neural network to a memory shared with a neural network accelerator. The host computer system further assembles a plurality of per-layer instructions into an instruction package. Each per-layer instruction specifies processing of a respective layer of the plurality of layers of the neural network, and respective offsets of weight matrices in a shared memory. The host computer system writes input data and the instruction package to the shared memory. The neural network accelerator reads the instruction package from the shared memory and processes the plurality of per-layer instructions of the instruction package.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: April 4, 2023
    Assignee: XILINX, INC.
    Inventors: Aaron Ng, Elliott Delaye, Ehsan Ghasemi, Xiao Teng, Jindrich Zejda, Yongjun Wu, Sean Settle, Ashish Sirasao
  • Patent number: 11568218
    Abstract: A disclosed neural network processing system includes a host computer system, a RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that when executed causes the host computer system to write input data and work requests to the RAMS. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse grain tasks and fine grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: January 31, 2023
    Assignee: XILINX, INC.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Ashish Sirasao
  • Patent number: 11449347
    Abstract: Time-multiplexing implementation of hardware accelerated functions includes associating each function of a plurality of functions from program code with an accelerator binary image specifying a hardware accelerated version of the associated function and determining which accelerator binary images are data independent. Using the computer hardware, the accelerator binary images can be scheduled for implementation in a programmable integrated circuit within each of a plurality of partial reconfiguration regions based on data independence.
    Type: Grant
    Filed: May 23, 2019
    Date of Patent: September 20, 2022
    Inventors: Raymond Kong, Brian S. Martin, Hao Yu, Jun Liu, Ashish Sirasao
  • Patent number: 11429848
    Abstract: In disclosed approaches of neural network processing, a host computer system copies an input data matrix from host memory to a shared memory for performing neural network operations of a first layer of a neural network by a neural network accelerator. The host instructs the neural network accelerator to perform neural network operations of each layer of the neural network beginning with the input data matrix. The neural network accelerator performs neural network operations of each layer in response to the instruction from the host. The host waits until the neural network accelerator signals completion of performing neural network operations of layer i before instructing the neural network accelerator to commence performing neural network operations of layer i+1, for i?1. The host instructs the neural network accelerator to use a results data matrix in the shared memory from layer i as an input data matrix for layer i+1 for i?1.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: August 30, 2022
    Assignee: XILINX, INC.
    Inventors: Aaron Ng, Elliott Delaye, Jindrich Zejda, Ashish Sirasao
  • Patent number: 11386644
    Abstract: An example preprocessor circuit includes: a first buffer configured to store rows of image data and output a row thereof; a second buffer, coupled to the first buffer, including storage locations to store respective image samples of the row output by the first buffer; shift registers; an interconnect network including connections, each connection coupling a respective one of the shift registers to more than one of the storage locations, one or more of the storage locations being coupled to more than one of the connections; and a control circuit configured to load the shift registers with the image samples based on the connections and shift the shift registers to output streams of image samples.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: July 12, 2022
    Assignee: XILINX, INC.
    Inventors: Elliott Delaye, Ashish Sirasao, Aaron Ng, Yongjun Wu, Jindrich Zejda
  • Patent number: 11222256
    Abstract: At least one neural network accelerator performs operations of a first subset of layers of a neural network on an input data set, generates an intermediate data set, and stores the intermediate data set in a shared memory queue in a shared memory. A first processor element of a host computer system provides input data to the neural network accelerator and signals the neural network accelerator to perform the operations of the first subset of layers of the neural network on the input data set. A second processor element of the host computer system reads the intermediate data set from the shared memory queue, performs operations of a second subset of layers of the neural network on the intermediate data set, and generates an output data set while the neural network accelerator is performing the operations of the first subset of layers of the neural network on another input data set.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: January 11, 2022
    Assignee: XILINX, INC.
    Inventors: Xiao Teng, Aaron Ng, Ashish Sirasao, Elliott Delaye
  • Patent number: 11204747
    Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator that operate on two heterogeneous computing systems. For example, the neural network application may execute on a central processing unit (CPU) in a computing system while the neural network accelerator executes on a FPGA. As a result, when moving a software-hardware boundary between the two heterogeneous systems, changes may be made to both the neural network application (using software code) and to the accelerator (using RTL). The embodiments herein describe a software defined approach where shared interface code is used to express both sides of the interface between the two heterogeneous systems in a single abstraction (e.g., a software class).
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: December 21, 2021
    Assignee: XILINX, INC.
    Inventors: Jindrich Zejda, Elliott Delaye, Yongjun Wu, Aaron Ng, Ashish Sirasao, Khang K. Dao, Christopher J. Case
  • Patent number: 11188697
    Abstract: Determining on-chip memory access patterns can include modifying a circuit design to include a profiler circuit for a random-access memory (RAM) of the circuit design, wherein the profiler circuit is configured to monitor an address bus of the RAM, and modifying the circuit design to include a debug circuit connected to the profiler circuit. Usage data for the RAM can be generated by detecting, using the profiler circuit, addresses of the RAM accessed during a test of the circuit design, as implemented in an integrated circuit. The usage data for the RAM can be output using the debug circuit.
    Type: Grant
    Filed: January 5, 2021
    Date of Patent: November 30, 2021
    Assignee: Xilinx, Inc.
    Inventors: Chaithanya Dudha, Rajeev Patwari, Nithin Kumar Guggilla, Ashish Sirasao, Krishna Garlapati
  • Patent number: 11106968
    Abstract: A circuit arrangement includes a buffer, a height traversal circuit configured to generate a sequence of IFM height values in response to first control signals, a width traversal circuit configured to generate a sequence of IFM width values in response to second control signals, a control circuit, and an address generation circuit. The control circuit is configured to input an OFM height, an OFM width, a kernel height, and a kernel width; generate the first control signals at times based on the OFM height and the kernel height; and generate the second control signals at times based on the OFM width and the kernel width. The address generation circuit is configured to generate a sequence of addresses based on the sequences of IFM height values and IFM width values, provide the sequence of addresses to the buffer, and enable reading from the buffer.
    Type: Grant
    Filed: May 24, 2018
    Date of Patent: August 31, 2021
    Assignee: XILINX, INC.
    Inventors: Ehsan Ghasemi, Elliott Delaye, Ashish Sirasao
  • Patent number: 11036827
    Abstract: Methods and apparatus are described for simultaneously buffering and reformatting (e.g., transposing) a matrix for high-speed data streaming in general matrix multiplication (GEMM), which may be implemented by a programmable integrated circuit (IC). Examples of the present disclosure increase the effective double data rate (DDR) memory throughput for streaming data into GEMM digital signal processing (DSP) engine multifold, as well as eliminate slow data reformatting on a host central processing unit (CPU). This may be accomplished through software-defined (e.g., C++) data structures and access patterns that result in hardware logic that simultaneously buffers and reorganizes the data to achieve linear DDR addressing.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: June 15, 2021
    Assignee: XILINX, INC.
    Inventors: Jindrich Zejda, Elliott Delaye, Yongjun Wu, Aaron Ng, Ashish Sirasao, Khang K. Dao
  • Patent number: 10990736
    Abstract: Implementing a circuit design can include detecting, using computer hardware, a re-convergent section of a circuit design, masking, using the computer hardware, a sequential circuit element of the re-convergent section located between a start and an end of the re-convergent section, and performing, using the computer hardware, an optimization operation on combinatorial logic of the re-convergent section to create optimized combinatorial logic. Using the computer hardware, the optimized combinatorial logic of the re-convergent section can be mapped. Further, the re-convergent section can be modified subsequent to the mapping to match timing of the re-convergent section prior to the masking.
    Type: Grant
    Filed: March 17, 2020
    Date of Patent: April 27, 2021
    Assignee: Xilinx, Inc.
    Inventors: Chaithanya Dudha, Satyaprakash Pareek, Krishna Garlapati, Ashish Sirasao
  • Patent number: 10990826
    Abstract: Detecting objects in video may include receiving object detections for a plurality of selected frames of a video from a still image detector, wherein the plurality of selected frames are non-adjacent frames of the video, propagating the object detections from the plurality of selected frames to sequential frames of the video adjacent to the plurality of selected frames based on a distance metric and vector flow data for the sequential frames, suppressing false positive object detections from the propagating, and outputting resulting object detections for the sequential frames of the video.
    Type: Grant
    Filed: March 20, 2019
    Date of Patent: April 27, 2021
    Assignee: Xilinx, Inc.
    Inventors: Mujib Haider, Venkata V. Dhanikonda, Ashish Sirasao
  • Patent number: 10984500
    Abstract: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a plurality of memory banks configured to store the image data; multiplexer circuitry coupled to the memory banks; a first plurality of registers coupled to the multiplexer circuitry; a second plurality of registers coupled to the first plurality of registers, outputs of the second plurality of registers configured to provide the plurality of streams of image samples; bank address and control circuitry coupled to control inputs of the plurality of memory banks, the multiplexer circuitry, and the first plurality of registers; output control circuitry coupled to control inputs of the second plurality of registers; and a control state machine coupled to the bank address and control circuitry and the output control circuitry.
    Type: Grant
    Filed: September 19, 2019
    Date of Patent: April 20, 2021
    Assignee: XILINX, INC.
    Inventors: Ashish Sirasao, Elliott Delaye, Aaron Ng, Ehsan Ghasemi
  • Patent number: 10943039
    Abstract: An example multiply accumulate (MACC) circuit includes: a multiply-accumulator having an accumulator output register; a quantizer, coupled to the multiply accumulator; and a control circuit coupled to the multiply-accumulator and the quantizer, the control circuit configured to provide control data to the quantizer, the control data indicative of a most-significant bit (MSB) to least significant bit (LSB) range for selecting bit indices from the accumulator output register.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: March 9, 2021
    Assignee: XILINX, INC.
    Inventors: Ashish Sirasao, Elliott Delaye, Sean Settle, Zhao Ma, Ehsan Ghasemi, Xiao Teng, Aaron Ng, Jindrich Zejda
  • Patent number: 10936311
    Abstract: Disclosed approaches for multiplying a sparse matrix by dense a vector or matrix include first memory banks for storage of column indices, second memory banks for storage of row indices, and third memory banks for storage of non-zero values of a sparse matrix. A pairing circuit distributes an input stream of vector elements across first first-in-first-out (FIFO) buffers according to the buffered column indices. Multiplication circuitry multiplies vector elements output from the first FIFO buffers by corresponding ones of the non-zero values from the third memory banks, and stores products in second FIFO buffers. Row-aligner circuitry organize the products output from the second FIFO buffers into third FIFO buffers according to row indices in the second memory banks. Accumulation circuitry accumulates respective totals from products output from the third FIFO buffers.
    Type: Grant
    Filed: July 9, 2019
    Date of Patent: March 2, 2021
    Assignee: Xilinx, Inc.
    Inventors: Ling Liu, Yifei Zhou, Xiao Teng, Ashish Sirasao, Chuanhua Song, Aaron Ng
  • Patent number: 10824434
    Abstract: Examples described herein relate to dynamically structured single instruction, multiple data (SIMD) instructions, and systems and circuits implementing such dynamically structured SIMD instructions. An example is a method for processing data. A first SIMD structure is determined by a processor. A characteristic of the first SIMD structure is altered by the processor to obtain a second SIMD structure. An indication of the second SIMD structure is communicated from the processor to a numerical engine. Data is packed by the numerical engine into an SIMD instruction according to the second SIMD structure. The SIMD instruction is transmitted from the numerical engine.
    Type: Grant
    Filed: November 29, 2018
    Date of Patent: November 3, 2020
    Assignee: XILINX, INC.
    Inventors: Sean Settle, Ehsan Ghasemi, Ashish Sirasao, Ralph D. Wittig
  • Patent number: 10747502
    Abstract: Circuits and method for multiplying floating point operands. An exponent adder circuit sums a first exponent and a second exponent and generates an output exponent. A mantissa multiplier circuit multiplies a first mantissa and a second mantissa and generates an output mantissa. A first conversion circuit converts the output exponent and output mantissa into a fixed point number. An accumulator circuit sums contents of an accumulation register and the fixed point number into an accumulated value and stores the accumulated value in the accumulation register.
    Type: Grant
    Filed: September 19, 2018
    Date of Patent: August 18, 2020
    Assignee: Xilinx, Inc.
    Inventors: Satyaprakash Pareek, Anup Hosangadi, Bing Tian, Ashish Sirasao, Yao Fu, Oscar Fernando C. Fernandez, Michael Wu, Christopher H. Dick
  • Patent number: 10726175
    Abstract: A memory optimization method includes identifying, within a circuit design, a memory having an arithmetic operator at an output side and/or an input side of the memory. The memory may include a read-only memory (ROM). In some examples, an input of the arithmetic operator includes a constant value. In some embodiments, the memory optimization method further includes absorbing a function of the arithmetic operator into the memory. By way of example, the absorbing the function includes modifying contents of the memory based on the function of the arithmetic operator to provide an updated memory and removing the arithmetic operator from the circuit design.
    Type: Grant
    Filed: March 4, 2019
    Date of Patent: July 28, 2020
    Assignee: Xilinx, Inc.
    Inventors: Chaithanya Dudha, Satyaprakash Pareek, Bing Tian, Ashish Sirasao