Patents by Inventor Jindrich Zejda

Jindrich Zejda has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20200410354
    Abstract: Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced in length to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and a debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the neural network at that shortest length, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.
    Type: Application
    Filed: June 27, 2019
    Publication date: December 31, 2020
    Inventors: Jindrich Zejda, Jeffrey T. Huynh, Drazen Borkovic, Se jong Oh, Ron Diamant, Randy Renfu Huang
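In software terms, the bisection-style flow in this abstract reduces to scanning reduced network lengths for the first one whose on-device output diverges from the reference. Below is a minimal Python sketch; compile_to_length, run_on_device, and reference_tensors are hypothetical stand-ins for the compiler, target processor, and reference processor, not names from the patent.

```python
import numpy as np

def find_shortest_failing_length(max_length, compile_to_length, run_on_device,
                                 reference_tensors, atol=1e-5):
    """Return the shortest reduced-network length whose device tensor no
    longer matches the reference tensor, or None if every length matches."""
    for length in range(1, max_length + 1):
        machine_instructions = compile_to_length(length)     # compiler step
        device_tensor = run_on_device(machine_instructions)  # target processor
        if not np.allclose(device_tensor, reference_tensors[length], atol=atol):
            return length  # shortest mismatch; recompile this length with
                           # lower-level tensor output enabled for debugging
    return None
```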
  • Publication number: 20200409717
    Abstract: Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
    Type: Application
    Filed: June 26, 2019
    Publication date: December 31, 2020
    Inventors: Jeffrey T. Huynh, Drazen Borkovic, Jindrich Zejda, Randy Renfu Huang, Ron Diamant
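As a rough software analogy, the reordering step can be modeled as greedy list scheduling: among operations whose dependencies are satisfied, pick the one whose execution engine frees up earliest, so idle hardware is filled before a busy engine is queued further. The engine names and cost model below are invented for illustration.

```python
from collections import defaultdict

def reorder(ops, deps):
    """ops: op -> (engine, duration); deps: op -> set of prerequisite ops.
    Returns an order that pulls work onto idle engines as early as possible."""
    engine_free = defaultdict(int)   # engine -> time at which it becomes idle
    finish, order = {}, []
    remaining = set(ops)
    while remaining:
        ready = [o for o in remaining if deps.get(o, set()) <= finish.keys()]
        op = min(ready, key=lambda o: engine_free[ops[o][0]])  # idlest engine first
        engine, dur = ops[op]
        start = max(engine_free[engine],
                    max((finish[d] for d in deps.get(op, ())), default=0))
        finish[op] = start + dur
        engine_free[engine] = finish[op]
        remaining.remove(op)
        order.append(op)
    return order

ops = {"load": ("dma", 2), "matmul": ("pe", 4), "act": ("act", 1)}
print(reorder(ops, {"matmul": {"load"}, "act": {"matmul"}}))
```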
  • Patent number: 10846621
    Abstract: Provided are systems, methods, and integrated circuits for a neural network processor that can execute a fast context switch between one neural network and another. In various implementations, a neural network processor can include a plurality of memory banks storing a first set of weight values for a first neural network. When the neural network processor receives first input data, the neural network processor can compute a first result using the first set of weight values and the first input data. While computing the first result, the neural network processor can store, in the memory banks, a second set of weight values for a second neural network. When the neural network processor receives second input data, the neural network processor can compute a second result using the second set of weight values and the second input data, where the computation occurs upon completion of computation of the first result.
    Type: Grant
    Filed: December 12, 2017
    Date of Patent: November 24, 2020
    Assignee: Amazon Technologies, Inc.
    Inventors: Randy Huang, Ron Diamant, Jindrich Zejda, Drazen Borkovic
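The context-switch mechanism amounts to double buffering the weight storage: compute reads one bank while the next network's weights stream into the other. A toy behavioral model, with invented class and method names:

```python
class DoubleBufferedWeights:
    """Toy stand-in for the two-bank weight store: compute reads the active
    bank while the inactive bank is loaded with the next network's weights."""
    def __init__(self, first_weights):
        self.banks = [first_weights, None]
        self.active = 0

    def preload(self, next_weights):
        self.banks[1 - self.active] = next_weights  # overlaps with compute

    def switch(self):
        self.active = 1 - self.active               # the fast context switch

    def compute(self, x):
        w = self.banks[self.active]
        return sum(wi * xi for wi, xi in zip(w, x))  # stand-in for the real math

store = DoubleBufferedWeights([1.0, 2.0])
store.preload([3.0, 4.0])           # second network loads during first's use
print(store.compute([1.0, 1.0]))    # 3.0: first network
store.switch()
print(store.compute([1.0, 1.0]))    # 7.0: second network, no reload stall
```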
  • Patent number: 10846201
    Abstract: Disclosed herein are techniques for debugging the performance of a neural network. In one embodiment, a neural network processor includes a processing engine, a debugging circuit coupled to the processing engine, and an interface to a memory device. The processing engine is configured to execute instructions for implementing a neural network. The debugging circuit is configurable to determine, for each instruction in a set of instructions, a first timestamp indicating a start time of executing the instruction and a second timestamp indicating an end time of executing the instruction by the processing engine. The interface is configured to save the first timestamp and the second timestamp for each instruction in the set of instructions into the memory device. The debugging circuit can be configured to different debug levels. The neural network processor can include multiple debugging circuits for multiple processing engines that operate in parallel.
    Type: Grant
    Filed: September 21, 2018
    Date of Patent: November 24, 2020
    Assignee: Amazon Technologies, Inc.
    Inventors: Ron Diamant, Jindrich Zejda, Drazen Borkovic, Thomas A. Volpe
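In software terms, the debugging circuit's behavior is a pair of timestamps bracketing each instruction, from which per-instruction durations (and idle gaps) fall out. A hypothetical sketch, with made-up instruction functions standing in for the processing engine's work:

```python
import time

def profile(instructions):
    """Record (name, start, end) timestamps around each executed instruction,
    mimicking what the debugging circuit saves to the memory device."""
    records = []
    for instr in instructions:
        start = time.perf_counter_ns()   # first timestamp: execution start
        instr()                          # the processing engine's work
        end = time.perf_counter_ns()     # second timestamp: execution end
        records.append((instr.__name__, start, end))
    return records

def small_op(): sum(range(10**3))
def large_op(): sum(range(10**6))
for name, s, e in profile([small_op, large_op]):
    print(f"{name}: {e - s} ns")         # per-instruction execution time
```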
  • Patent number: 10761822
    Abstract: Provided are systems and methods for generating program code for an integrated circuit, where instructions in the code synchronize computation engines that support non-blocking instructions. In various examples, a computing device can receive an input data set including operations to be performed by an integrated circuit device and dependencies between the operations. The input data set can include a non-blocking instruction, and an operation that requires that the non-blocking instruction be completed. The computing device can generate instructions for performing the operation including a particular instruction to wait for a value to be set in a register of the integrated circuit device. The computing device can further generate program code including the non-blocking instruction and the instructions for performing the operation, wherein the non-blocking instruction is configured to set the value in the register.
    Type: Grant
    Filed: December 12, 2018
    Date of Patent: September 1, 2020
    Assignee: Amazon Technologies, Inc.
    Inventors: Drazen Borkovic, Jindrich Zejda, Taemin Kim, Ron Diamant
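The synchronization pattern is: tag the non-blocking instruction to set a register on completion, then emit an explicit wait-on-register before any dependent operation. A hypothetical code-generation sketch (the opcode names are invented, not the integrated circuit's real instruction set):

```python
def generate_program(nonblocking_op, dependent_op, sync_reg=0):
    """Emit a program where the dependent op waits on a register that the
    non-blocking op is configured to set when it completes."""
    program = [
        # non-blocking instruction, tagged to set sync_reg on completion
        ("START_NONBLOCKING", nonblocking_op, {"set_reg": sync_reg}),
        # ...independent instructions could be scheduled here...
        ("WAIT_ON_REG", sync_reg),       # block until the register is set
        ("EXECUTE", dependent_op),       # now safe: its input is ready
    ]
    return program

for instr in generate_program("dma_load_tile", "matmul_tile"):
    print(instr)
```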
  • Patent number: 10678509
    Abstract: An example multiply-accumulate (MACC) circuit includes a multiply-accumulator having an accumulator output register, a scaler coupled to the multiply-accumulator, and a control circuit coupled to the multiply-accumulator and the scaler. The control circuit is configured to provide control data to the scaler, the control data indicative of: a most-significant-bit (MSB) to least-significant-bit (LSB) range for selecting bit indices from the accumulator output register for implementing a first right shift; a multiplier; and a second right shift.
    Type: Grant
    Filed: August 21, 2018
    Date of Patent: June 9, 2020
    Assignee: Xilinx, Inc.
    Inventors: Sean Settle, Elliott Delaye, Aaron Ng, Ehsan Ghasemi, Ashish Sirasao, Xiao Teng, Jindrich Zejda
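The scaler's datapath is plain fixed-point arithmetic: select an MSB..LSB bit window from the accumulator (the first right shift), multiply, then apply a second right shift. A worked example with arbitrary bit positions (the widths are not from the patent):

```python
def scale(acc, lsb, msb, multiplier, shift2):
    """Model of the scaler: window the accumulator, multiply, shift again."""
    window = (acc >> lsb) & ((1 << (msb - lsb + 1)) - 1)  # first right shift + mask
    return (window * multiplier) >> shift2                # multiply, second shift

# e.g. take bits 23..8 of the accumulator, scale by 3, then divide by 2**4
print(scale(acc=0x00ABCD12, lsb=8, msb=23, multiplier=3, shift2=4))  # 8246
```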
  • Patent number: 10572409
    Abstract: A memory arrangement can store a matrix of matrix data elements specified as index-value pairs that indicate row and column indices and associated values. First split-and-merge circuitry is coupled between the memory arrangement and a first set of FIFO buffers for reading the matrix data elements from the memory arrangement and putting the matrix data elements in the first set of FIFO buffers based on column indices. A pairing circuit is configured to read vector data elements, pair the vector data elements with the matrix data elements, and put the paired matrix and vector data elements in a second set of FIFO buffers based on column indices. Second split-and-merge circuitry is configured to read paired matrix and vector data elements from the second set of FIFO buffers and put the paired matrix and vector data elements in a third set of FIFO buffers based on row indices.
    Type: Grant
    Filed: May 10, 2018
    Date of Patent: February 25, 2020
    Assignee: Xilinx, Inc.
    Inventors: Jindrich Zejda, Ling Liu, Yifei Zhou, Ashish Sirasao
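Stripped of the FIFO banking, the dataflow is: pair each (row, col, value) matrix element with the vector element for its column, then merge the products by row index. A small software model of that flow (the function name and data layout are illustrative):

```python
from collections import defaultdict

def spmv(entries, vector):
    """entries: (row, col, value) index-value triples of the sparse matrix."""
    paired = [(r, c, v, vector[c]) for (r, c, v) in entries]  # pair by column
    y = defaultdict(float)
    for r, _, v, x in paired:   # merge the products by row index
        y[r] += v * x
    return dict(y)

print(spmv([(0, 1, 2.0), (1, 0, 3.0), (1, 2, 1.0)], [1.0, 10.0, 5.0]))
# {0: 20.0, 1: 8.0}
```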
  • Patent number: 10515135
    Abstract: Methods and apparatus are described for performing data-intensive compute algorithms, such as fast massively parallel general matrix multiplication (GEMM), using a particular data format for both storing data to and reading data from memory. This data format may be utilized for arbitrarily-sized input matrices for GEMM implemented on a finite-size GEMM accelerator in the form of a rectangular compute array of digital signal processing (DSP) elements or similar compute cores. This data format addresses double data rate (DDR) dynamic random access memory (DRAM) bandwidth limits by allowing both linear DDR addressing and single-cycle loading of data into the compute array, avoiding input/output (I/O) and/or DDR bottlenecks.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: December 24, 2019
    Assignee: Xilinx, Inc.
    Inventors: Jindrich Zejda, Elliott Delaye, Aaron Ng, Ashish Sirasao, Yongjun Wu
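One way to picture such a format is block-major packing: pad the matrix to a tile multiple and lay each compute-array tile out contiguously, so loading a tile is one linear DDR burst. The tile size and padding policy below are assumptions for illustration, not the patent's actual format.

```python
import numpy as np

def pack_block_major(a, tile=4):
    """Reorder an arbitrary-size matrix so each tile is contiguous in memory."""
    rows = -(-a.shape[0] // tile) * tile   # ceil to a tile multiple
    cols = -(-a.shape[1] // tile) * tile
    p = np.zeros((rows, cols), dtype=a.dtype)
    p[:a.shape[0], :a.shape[1]] = a        # zero-pad the ragged edges
    return (p.reshape(rows // tile, tile, cols // tile, tile)
             .transpose(0, 2, 1, 3)        # group by (row-tile, col-tile)
             .reshape(-1, tile * tile))    # one row per tile: linear addressing

print(pack_block_major(np.arange(30).reshape(5, 6)).shape)  # (4 tiles, 16 elements)
```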
  • Patent number: 10354733
    Abstract: Methods and apparatus are described for partitioning and reordering block-based matrix multiplications for high-speed data streaming in general matrix multiplication (GEMM), which may be implemented by a programmable integrated circuit (IC). By preloading and hierarchically caching the blocks, examples of the present disclosure reduce the double data rate (DDR) memory intake bandwidth for software-defined GEMM accelerators.
    Type: Grant
    Filed: October 17, 2017
    Date of Patent: July 16, 2019
    Assignee: Xilinx, Inc.
    Inventors: Jindrich Zejda, Elliott Delaye, Ashish Sirasao, Yongjun Wu, Aaron Ng
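The bandwidth saving comes from loop ordering: keep one block cached across the inner loop so it is fetched from DDR once rather than once per output block. A NumPy sketch of blocked GEMM with that reuse (the specific partitioning in the patent may differ):

```python
import numpy as np

def blocked_gemm(A, B, tile=2):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for p in range(0, k, tile):
            a_blk = A[i:i+tile, p:p+tile]   # fetched once, reused below
            for j in range(0, m, tile):     # stream B tiles past the cached a_blk
                C[i:i+tile, j:j+tile] += a_blk @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(4, 6), np.random.rand(6, 4)
assert np.allclose(blocked_gemm(A, B), A @ B)
```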
  • Publication number: 20190179795
    Abstract: Provided are systems, methods, and integrated circuits for a neural network processor that can execute a fast context switch between one neural network and another. In various implementations, a neural network processor can include a plurality of memory banks storing a first set of weight values for a first neural network. When the neural network processor receives first input data, the neural network processor can compute a first result using the first set of weight values and the first input data. While computing the first result, the neural network processor can store, in the memory banks, a second set of weight values for a second neural network. When the neural network processor receives second input data, the neural network processor can compute a second result using the second set of weight values and the second input data, where the computation occurs upon completion of computation of the first result.
    Type: Application
    Filed: December 12, 2017
    Publication date: June 13, 2019
    Inventors: Randy Huang, Ron Diamant, Jindrich Zejda, Drazen Borkovic
  • Publication number: 20190114548
    Abstract: Embodiments herein describe techniques for statically scheduling a neural network implemented in a massively parallel hardware system. The neural network may be scheduled using three different scheduling levels referred to herein as an upper level, an intermediate level, and a lower level. In one embodiment, the upper level includes a hardware or software model of the layers in the neural network that establishes a sequential order of functions that operate concurrently in the hardware system. In the intermediate level, identical processes in the functions defined in the upper level are connected to form a systolic array or mesh and balanced data flow channels are used to minimize latency. In the lower level, a compiler can assign the operations performed by the processing elements in the systolic array to different portions of the hardware system to provide a static schedule for the neural network.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Yongjun Wu, Jindrich Zejda, Elliott Delaye, Ashish Sirasao
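The end product of the lower level is a static schedule: every operation is pinned to an engine and a time slot at compile time, so nothing is arbitrated at runtime. A deliberately simplified illustration (the engines, operations, and slots are invented, and the three-level hierarchy is collapsed into one table):

```python
# (slot, engine, operation) fixed ahead of time by the compiler
static_schedule = [
    (0, "conv_array", "layer1_conv"),
    (1, "pool_unit",  "layer1_pool"),
    (1, "conv_array", "layer2_conv"),  # runs concurrently with layer1_pool
    (2, "pool_unit",  "layer2_pool"),
]
for slot, engine, op in sorted(static_schedule):
    print(f"t={slot}: {engine} runs {op}")
```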
  • Publication number: 20190114533
    Abstract: Embodiments herein describe techniques for interfacing a neural network application with a neural network accelerator using a library. The neural network application may execute on a host computing system while the neural network accelerator executes on a massively parallel hardware system, e.g., an FPGA. The library operates a pipeline for submitting the tasks received from the neural network application to the neural network accelerator. In one embodiment, the pipeline includes a pre-processing stage, an FPGA execution stage, and a post-processing stage, each of which corresponds to a different thread. When receiving a task from the neural network application, the library generates a packet that includes the information required for the different stages in the pipeline to perform the task. Because the stages correspond to different threads, the library can process multiple packets in parallel, which can increase the utilization of the neural network accelerator on the hardware system.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Sonal Santan, Soren T. Soe, Ashish Sirasao, Ehsan Ghasemi, Sean Settle
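A minimal sketch of the pipeline shape the library operates: three stages on separate threads linked by queues, so several task packets are in flight at once. The stage bodies are placeholders, not the library's actual API.

```python
import threading, queue

def stage(fn, inbox, outbox):
    while True:
        packet = inbox.get()
        if packet is None:        # shutdown marker, forwarded downstream
            outbox.put(None)
            break
        outbox.put(fn(packet))    # do this stage's share of the task

pre_q, exec_q, post_q, done_q = (queue.Queue() for _ in range(4))
for fn, i, o in [(lambda p: p, pre_q, exec_q),    # pre-processing placeholder
                 (lambda p: p, exec_q, post_q),   # FPGA-execution placeholder
                 (lambda p: p, post_q, done_q)]:  # post-processing placeholder
    threading.Thread(target=stage, args=(fn, i, o), daemon=True).start()

for task in range(3):
    pre_q.put({"task": task})     # one packet per task from the application
pre_q.put(None)
while (r := done_q.get()) is not None:
    print("finished", r)
```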
  • Publication number: 20190114538
    Abstract: In disclosed approaches of neural network processing, a host computer system copies an input data matrix from host memory to a shared memory for performing neural network operations of a first layer of a neural network by a neural network accelerator. The host instructs the neural network accelerator to perform neural network operations of each layer of the neural network beginning with the input data matrix. The neural network accelerator performs neural network operations of each layer in response to the instruction from the host. The host waits until the neural network accelerator signals completion of performing neural network operations of layer i before instructing the neural network accelerator to commence performing neural network operations of layer i+1, for i≥1. The host instructs the neural network accelerator to use a results data matrix in the shared memory from layer i as an input data matrix for layer i+1 for i≥1.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Elliott Delaye, Jindrich Zejda, Ashish Sirasao
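The host-side control loop is a simple start/wait handshake per layer, with layer i's results matrix left in shared memory as layer i+1's input. A sketch with a fake accelerator object standing in for the real device interface:

```python
class FakeAccel:
    """Stand-in for the accelerator's host interface (not a real API)."""
    def start_layer(self, i, shared_mem):
        shared_mem["data"] = [x + 1 for x in shared_mem["data"]]  # toy layer op
    def wait_done(self):
        pass  # the real host blocks here on the completion signal

def run_network(accel, shared_mem, input_matrix, num_layers):
    shared_mem["data"] = input_matrix     # copy input to shared memory
    for i in range(1, num_layers + 1):
        accel.start_layer(i, shared_mem)  # instruct: run layer i
        accel.wait_done()                 # wait before commencing layer i+1
        # layer i's results stay in shared memory as layer i+1's input
    return shared_mem["data"]

print(run_network(FakeAccel(), {}, [0, 0], num_layers=3))  # [3, 3]
```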
  • Publication number: 20190114535
    Abstract: A disclosed neural network processing system includes a host computer system, RAMs coupled to the host computer system, and neural network accelerators coupled to the RAMs, respectively. The host computer system is configured with software that, when executed, causes the host computer system to write input data and work requests to the RAMs. Each work request specifies a subset of neural network operations to perform and memory locations in a RAM of the input data and parameters. A graph of dependencies among neural network operations is built and additional dependencies added. The operations are partitioned into coarse grain tasks and fine grain subtasks for optimal scheduling for parallel execution. The subtasks are scheduled to accelerator kernels of matching capabilities. Each neural network accelerator is configured to read a work request from the respective RAM and perform the subset of neural network operations on the input data using the parameters.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Jindrich Zejda, Elliott Delaye, Xiao Teng, Ashish Sirasao
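The scheduling step can be pictured as a topological walk of the dependency graph that hands each ready subtask to a kernel whose capability matches. A toy version using Python's graphlib (the capability labels and graph format are invented):

```python
from graphlib import TopologicalSorter

deps = {"conv1": set(), "pool1": {"conv1"}, "fc1": {"pool1"}}   # dependency graph
kind = {"conv1": "conv", "pool1": "pool", "fc1": "gemm"}        # subtask types
kernels = {"conv": "kernel_A", "pool": "kernel_A", "gemm": "kernel_B"}

for subtask in TopologicalSorter(deps).static_order():  # respects dependencies
    print(f"{subtask} -> {kernels[kind[subtask]]}")     # matched by capability
```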
  • Publication number: 20190114499
    Abstract: An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a first buffer configured to store a plurality of rows of the image data and output a row of the plurality of rows; a second buffer, coupled to the first buffer, including a plurality of storage locations to store a respective plurality of image samples of the row output by the first buffer; a plurality of shift registers; an interconnect network including a plurality of connections, each connection coupling a respective one of the plurality of shift registers to more than one of the plurality of storage locations, one or more of the plurality of storage locations being coupled to more than one of the plurality of connections; and a control circuit configured to load the plurality of shift registers with the plurality of image samples based on the plurality of connections and shift the plurality of shift registers to output the plurality of streams of image samples.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Elliott Delaye, Ashish Sirasao, Aaron Ng, Yongjun Wu, Jindrich Zejda
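Behaviorally, the circuit buffers rows and then fans samples out to several parallel streams, one per shift register. The Python stand-in below only mirrors that buffer-then-distribute dataflow; the interconnect network and register shifting are abstracted away entirely.

```python
def to_streams(image_rows, num_streams):
    """Distribute buffered row samples round-robin across output streams."""
    streams = [[] for _ in range(num_streams)]
    for row in image_rows:                # first buffer: one row at a time
        for i, sample in enumerate(row):  # second buffer: that row's samples
            streams[i % num_streams].append(sample)  # route to a "register"
    return streams

print(to_streams([[1, 2, 3, 4], [5, 6, 7, 8]], num_streams=2))
# [[1, 3, 5, 7], [2, 4, 6, 8]]
```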
  • Publication number: 20190114529
    Abstract: In the disclosed methods and systems for processing in a neural network system, a host computer system writes a plurality of weight matrices associated with a plurality of layers of a neural network to a memory shared with a neural network accelerator. The host computer system further assembles a plurality of per-layer instructions into an instruction package. Each per-layer instruction specifies processing of a respective layer of the plurality of layers of the neural network, and respective offsets of weight matrices in a shared memory. The host computer system writes input data and the instruction package to the shared memory. The neural network accelerator reads the instruction package from the shared memory and processes the plurality of per-layer instructions of the instruction package.
    Type: Application
    Filed: October 17, 2017
    Publication date: April 18, 2019
    Applicant: Xilinx, Inc.
    Inventors: Aaron Ng, Elliott Delaye, Ehsan Ghasemi, Xiao Teng, Jindrich Zejda, Yongjun Wu, Sean Settle, Ashish Sirasao
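Assembling the instruction package is essentially serializing, per layer, an opcode plus the offset where that layer's weight matrix was written in shared memory. A sketch with invented field names (the patent's actual encoding is not given in this abstract):

```python
import numpy as np

def build_package(layers, shared_mem):
    package, offset = [], 0
    for layer in layers:
        w = layer["weights"].ravel()
        shared_mem[offset:offset + w.size] = w    # weights into shared memory
        package.append({"layer": layer["name"],
                        "op": layer["op"],
                        "weights_offset": offset})  # per-layer offset of W
        offset += w.size
    return package  # written to shared memory, read whole by the accelerator

mem = np.zeros(64)
print(build_package([{"name": "fc1", "op": "gemm",
                      "weights": np.ones((4, 4))}], mem))
```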
  • Patent number: 9842187
    Abstract: Approaches for processing a circuit design include determining pin slack values for pins of the circuit elements in the circuit design. A processor selects a subset of endpoints based on pin slack values of the endpoints being in a critical slack range and determines startpoints of the circuit design that are in respective critical fanin cones. For each endpoint of the subset, the processor determines an arrival time from each startpoint in the respective critical fanin cone and determines for each startpoint-endpoint pair, a respective set of constraint values as a function of the respective arrival time from the startpoint. The processor generates a graph in the memory circuit from the startpoint-endpoint pairs. First nodes in the graph represent the startpoints and second nodes in the graph represent the endpoints, and values in the respective set of constraint values are associated with edges that connect the nodes.
    Type: Grant
    Filed: March 28, 2016
    Date of Patent: December 12, 2017
    Assignee: Xilinx, Inc.
    Inventors: Jindrich Zejda, Atul Srinivasan, Ilya K. Ganusov, Walter A. Manaker, Jr., Benjamin S. Devlin, Satish B. Sivaswamy
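The selection step can be sketched as: keep the endpoints whose slack falls in the critical range, then record one graph edge per startpoint-endpoint pair carrying the arrival time from that startpoint. The data structures below are invented for illustration.

```python
def build_constraint_graph(endpoints, slack, arrivals, lo, hi):
    """slack: pin -> slack value; arrivals: (startpoint, endpoint) -> arrival
    time along the critical fanin cone. Returns edges of the constraint graph."""
    critical = {e for e in endpoints if lo <= slack[e] <= hi}
    return {(s, e): {"arrival": t}            # constraint values derive from t
            for (s, e), t in arrivals.items() if e in critical}

slack = {"e1": 0.02, "e2": 0.50}
arrivals = {("s1", "e1"): 1.3, ("s1", "e2"): 0.9}
print(build_constraint_graph(["e1", "e2"], slack, arrivals, lo=0.0, hi=0.1))
# {('s1', 'e1'): {'arrival': 1.3}}
```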
  • Patent number: 8321824
    Abstract: Embodiments of a computer system, a method, an integrated circuit and a computer-program product (i.e., software) for use with the computer system are described. These devices and techniques may be used to perform static timing analysis (STA) for circuits that include multiple power domains. Power-domain crossing information and optionally the delay in each power domain can be propagated during the full circuit graph-based STA to accurately perform STA without enumerating all paths. Some embodiments can use a tag-based engine to track power-domain crossing(s) during graph-based STA. If a power domain is crossed in a path, pessimism may be added to the cumulative delay at the end point of the path. For those paths that do not cross a power domain, pessimism may be removed from the cumulative delay at their end points. In some embodiments, pessimism may be removed from the cumulative delay at end points for paths that cross power domains.
    Type: Grant
    Filed: April 30, 2009
    Date of Patent: November 27, 2012
    Assignee: Synopsys, Inc.
    Inventors: Jindrich Zejda, William Chiu-Ting Shu, Khalid Rahmat, Feroze Taraporevala
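Conceptually, the tag-based engine carries a "crossed a power domain" flag along with each path's cumulative delay and applies pessimism at the endpoint only when the flag is set. A toy illustration (the delays, domains, and pessimism value are made up):

```python
def endpoint_delay(path, domain_of, pessimism=0.05):
    """path: (from, to, delay) edges; domain_of: node -> power domain."""
    delay, crossed = 0.0, False
    for a, b, d in path:
        delay += d
        crossed |= domain_of[a] != domain_of[b]     # the tag tracks crossings
    return delay + (pessimism if crossed else 0.0)  # pessimism only if crossed

path = [("u1", "u2", 1.0), ("u2", "u3", 2.0)]
print(endpoint_delay(path, {"u1": "VDD1", "u2": "VDD1", "u3": "VDD2"}))  # 3.05
```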
  • Patent number: 8091049
    Abstract: One embodiment of the present invention provides systems and techniques for generating a transistor-level description of a subcircuit. A user may want to simulate a subcircuit in a circuit using a transistor-level simulator, and one or more cells in the subcircuit may need to be sensitized so that the cells are in a desired state when the subcircuit is simulated. An embodiment modifies the subcircuit by inserting analog switches in front of the cells that need to be sensitized, so that the analog switches can be used to apply a sensitization sequence to the cells during the transistor-level simulation. The embodiment can then generate a transistor-level description of the modified subcircuit. Next, the transistor-level description of the subcircuit can be stored, thereby enabling the transistor-level simulator to simulate the subcircuit.
    Type: Grant
    Filed: July 2, 2008
    Date of Patent: January 3, 2012
    Assignee: Synopsys, Inc.
    Inventors: Jindrich Zejda, Narender Hanchate, Rupesh Nayak, Li Ding
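The netlist transformation amounts to splicing a switch (behaviorally, a mux) in front of each cell input to be sensitized, so a stimulus source can drive it during simulation while the original driver remains available. The netlist representation below is invented for the example.

```python
def insert_switch(netlist, cell, pin):
    """Redirect cell.pin through a switch: input a = original net,
    input b = sensitization stimulus, output feeds the cell."""
    switch = f"sw_{cell}_{pin}"
    netlist[(switch, "a")] = netlist[(cell, pin)]      # original driver
    netlist[(switch, "b")] = f"stim_{cell}_{pin}"      # sensitization source
    netlist[(cell, pin)] = f"{switch}.out"             # cell now fed by switch
    return netlist

print(insert_switch({("U1", "A"): "net42"}, "U1", "A"))
```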
  • Patent number: 7861198
    Abstract: A method to perform timing analysis for a complex logic cell with a distorted input waveform and coupled load networks is presented. Timing arc based models are used in conjunction with channel connected block (CCB) based current models of portions of the logic cell to compute the output signal of the logic cell. For example, an intermediary signal is generated using a first timing arc based model and an equivalent coupled network output signal is generated using a CCB based current model.
    Type: Grant
    Filed: September 28, 2007
    Date of Patent: December 28, 2010
    Assignee: Synopsys, Inc.
    Inventors: Li Ding, Peivand Tehrani, Jindrich Zejda, Alireza Kasnavi
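The composition the abstract describes is direct: a timing-arc model produces an intermediary signal, and a CCB current model of the output stage then drives the coupled load network. Expressed as a placeholder function pipeline (neither model function is from the patent):

```python
def cell_output(input_waveform, arc_model, ccb_model, coupled_load):
    """arc_model and ccb_model are placeholder callables for the two model
    types named in the abstract; neither signature is from the patent."""
    intermediary = arc_model(input_waveform)       # timing-arc based model
    return ccb_model(intermediary, coupled_load)   # CCB based current model

# Toy models: attenuate the waveform, then scale it for the coupled load.
out = cell_output([0.0, 0.5, 1.0],
                  lambda w: [v * 0.9 for v in w],        # arc model stand-in
                  lambda w, load: [v / load for v in w], # CCB model stand-in
                  coupled_load=2.0)
print(out)  # [0.0, 0.225, 0.45]
```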