Patents by Inventor Raymond Jit-Hung Sung
Raymond Jit-Hung Sung has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20230351181
Abstract: An activation function unit can compute activation functions approximated by Taylor series. The activation function unit may include a plurality of compute elements. Each compute element may include two multipliers and an accumulator. The first multiplier may compute intermediate products using an activation, such as an output activation of a DNN layer. The second multiplier may compute terms of Taylor series approximating an activation function based on the intermediate products from the first multiplier and coefficients of the Taylor series. The accumulator may compute a partial sum of the terms as an output of the activation function. The number of the terms may be determined based on a predetermined accuracy of the output of the activation function. The activation function unit may process multiple activations. Different activations may be input into different compute elements in different clock cycles. The activation function unit may compute activation functions with different accuracies.
Type: Application
Filed: July 5, 2023
Publication date: November 2, 2023
Inventors: Umer Iftikhar Cheema, Deepak Abraham Mathaikutty, Arnab Raha, Dinakar Kondru, Raymond Jit-Hung Sung, Soumendu Kumar Ghosh
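The two-multiplier, one-accumulator arrangement in this abstract can be illustrated in software. The following is a minimal sketch only, not the patented circuit: the function names `taylor_activation` and `exp_coeffs` are hypothetical, and the exponential function is used purely as an example of a function with a well-known Taylor series.

```python
import math


def taylor_activation(x, coeffs):
    """Approximate f(x) as sum(coeffs[k] * x**k).

    Mirrors the abstract loosely: one multiply produces the running
    power of x (the "intermediate product"), a second multiply scales
    it by a series coefficient, and an accumulator sums the terms.
    """
    power = 1.0        # x**0, the first intermediate product
    partial_sum = 0.0  # accumulator
    for c in coeffs:
        partial_sum += c * power  # second multiply: coefficient * power
        power *= x                # first multiply: next power of x
    return partial_sum


def exp_coeffs(n_terms):
    """Taylor coefficients of exp(x) around 0 (illustrative choice)."""
    return [1.0 / math.factorial(k) for k in range(n_terms)]


# More terms give a more accurate output, at the cost of more cycles.
coarse = taylor_activation(0.5, exp_coeffs(3))
fine = taylor_activation(0.5, exp_coeffs(10))
```

Choosing the length of `coeffs` is the software analogue of picking the number of terms from a target accuracy.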
-
Publication number: 20230229507
Abstract: Computations in processing elements (PEs) for executing a deep neural network are scheduled via a computation scheduler based on sparsity in input data of the computations to reduce voltage droops. Each PE may compute an input operand and a weight operand in a computation. The computation scheduler may predict the workload of the PE for the computation based on a combined sparsity bitmap, which may be generated based on a sparsity bitmap of the input operand and a sparsity bitmap of the weight operand. The computation scheduler can schedule the starts of the computations in the PEs based on the predicted workloads of the PEs. The computation scheduler may instruct the PE having the highest workload to start the computation first and instruct the other PEs to start computations later. In some embodiments, the computations in the PEs may end in the same clock cycle.
Type: Application
Filed: March 8, 2023
Publication date: July 20, 2023
Applicant: Intel Corporation
Inventors: Raymond Jit-Hung Sung, Arnab Raha, Deepak Abraham Mathaikutty, Umer Iftikhar Cheema
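The scheduling idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: a bit value of 1 is assumed to mark a nonzero element, each surviving input-weight pair is assumed to cost one clock cycle, and all function names are hypothetical.

```python
def combined_bitmap(input_bits, weight_bits):
    """Bitwise AND: a pair needs work only if both elements are nonzero."""
    return [a & b for a, b in zip(input_bits, weight_bits)]


def predicted_workload(input_bits, weight_bits):
    """Assumed cost model: one cycle per set bit in the combined bitmap."""
    return sum(combined_bitmap(input_bits, weight_bits))


def start_cycles(workloads):
    """Start the heaviest PE at cycle 0 and delay the lighter ones so
    that every PE finishes in the same cycle."""
    heaviest = max(workloads)
    return [heaviest - w for w in workloads]


workloads = [
    predicted_workload([1, 1, 1, 0], [1, 1, 1, 1]),  # 3 cycles
    predicted_workload([1, 0, 0, 0], [1, 1, 0, 1]),  # 1 cycle
    predicted_workload([1, 1, 0, 1], [0, 1, 1, 1]),  # 2 cycles
]
starts = start_cycles(workloads)
```

Staggering the starts means the PEs do not all begin switching at once, which is the property the abstract associates with reduced voltage droop.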
-
Publication number: 20230221994
Abstract: A compute block can dynamically uncompress compressed data for executing a channel-separable operation. The compressed data includes one or more nonzero-valued data elements. The compressed data may be stored in a datastore along with a sparsity bitmap of an input operand including the compressed data. An uncompressing module may determine whether the input operand includes any zero-valued data element, e.g., by determining whether the sparsity bitmap includes a zero-valued bit. After determining that the sparsity bitmap includes a zero-valued bit, the uncompressing module inserts a zero-valued data element into the compressed data based on a position of the bit in the sparsity bitmap, generates uncompressed data, and updates the sparsity bitmap so that all the bits become ones. The uncompressed dense data is transmitted to one or more processing elements (PEs) in the compute block for computing an output operand based on the uncompressed dense data.
Type: Application
Filed: March 16, 2023
Publication date: July 13, 2023
Inventors: Arnab Raha, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Umer Iftikhar Cheema, Dinakar Kondru, Soumendu Kumar Ghosh
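The uncompression step described in this abstract amounts to re-inserting zeros at the positions the sparsity bitmap marks as zero. A minimal sketch, assuming a bit value of 1 marks a nonzero element and using the hypothetical function name `uncompress`:

```python
def uncompress(compressed, bitmap):
    """Expand compressed (nonzero-only) data into dense data.

    Returns the dense data together with the updated bitmap, in which
    every bit is 1 because every position now holds an explicit value.
    """
    if sum(bitmap) != len(compressed):
        raise ValueError("bitmap must have one set bit per stored element")
    values = iter(compressed)
    dense = [next(values) if bit else 0 for bit in bitmap]
    return dense, [1] * len(bitmap)


dense, updated_bitmap = uncompress([5, 7, 9], [1, 0, 1, 1, 0])
```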
-
Publication number: 20230140173
Abstract: A DNN accelerator includes one or more heterogenous tile sets. A heterogenous tile set includes tiles of different sizes, e.g., PE arrays including different numbers of columns or rows. The DNN accelerator may identify a tile set from the tile sets for running a DNN model based on dimensions of output tensors of convolutional layers in the DNN. Within the selected tile set, a tile may be selected for a convolutional layer in the DNN, e.g., based on dimensions of the output tensor of the convolutional layer and the size of the tile. After the tile is selected, the workload for running a convolutional operation of the layer may be partitioned and assigned to individual PEs in the tile by partitioning the output tensor into output tensor segments. The workload of computing an individual output tensor segment can be assigned to an individual PE in the tile.
Type: Application
Filed: August 19, 2022
Publication date: May 4, 2023
Inventors: Arnab Raha, Umer Iftikhar Cheema, Praveen Kumar Gupta, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung
-
Publication number: 20230073661
Abstract: A DNN (deep neural network) accelerator may accelerate deep learning, such as convolutions in frontend layers through a scheduler for loading data to be processed. The DNN accelerator may store, in a memory, an input tensor of a convolutional layer in a DNN. The convolutional layer may be the first layer or a layer that is arranged before the one or more other convolutional layers in the DNN such that data processed by the first layer can be efficiently reused across data load rounds. The input tensor includes one or more channels. A channel includes activations arranged in rows and columns. The DNN accelerator may read at least a portion of the input tensor from the memory into a datastore. The datastore includes some databanks. The DNN accelerator may provide a vector of one or more activations to a processing element for operations such as multiplications on the vector.
Type: Application
Filed: November 14, 2022
Publication date: March 9, 2023
Applicant: Intel Corporation
Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Umer Iftikhar Cheema, Raymond Jit-Hung Sung
-
Publication number: 20230059976
Abstract: A DNN accelerator may include a PE array performing MAC operations. The PE array may include PEs capable of MAC operations on quantized values. A PE may include subtractors for subtracting zeropoints from quantized activations and quantized weights to generate intermediate activations and intermediate weights. The intermediate activations and intermediate weights may be stored in data storage units in the PE and may be used by a MAC unit in the PE. The subtractors may be placed outside the MAC unit but inside the PE. The MAC unit may perform sequential cycles of MAC operations. The MAC unit may include a plurality of multipliers. The intermediate activations and intermediate weights stored in the data storage units may be reused by different multipliers in different cycles of MAC operations. An output of the MAC unit or of the PE may be multiplied with a quantization scale to produce a floating-point value.
Type: Application
Filed: October 18, 2022
Publication date: February 23, 2023
Applicant: Intel Corporation
Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, Martin Power, Umer Iftikhar Cheema, David Thomas Bernard
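The data flow in this abstract (subtract zeropoints once, run integer MACs on the intermediate values, then apply a quantization scale) can be sketched in a few lines. This is an illustration only: the function name `quantized_mac`, its parameters, and the use of a single combined scale factor are assumptions, not details taken from the patent.

```python
def quantized_mac(q_acts, q_weights, act_zero_point, weight_zero_point, scale):
    """Multiply-accumulate on quantized values.

    The zeropoints are subtracted before the MAC loop, so the resulting
    intermediate values can be stored and reused across MAC cycles
    instead of being recomputed inside the MAC unit.
    """
    inter_acts = [a - act_zero_point for a in q_acts]
    inter_weights = [w - weight_zero_point for w in q_weights]

    acc = 0  # integer accumulator
    for a, w in zip(inter_acts, inter_weights):
        acc += a * w

    # Scaling the integer result yields a floating-point value.
    return acc * scale


result = quantized_mac([130, 128, 126], [3, 5, 1],
                       act_zero_point=128, weight_zero_point=2, scale=0.5)
```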
-
Publication number: 20230018857
Abstract: Sparsity processing within a compute block can be done on unpacked data. The compute block includes a sparsity decoder that generates a combined sparsity vector from an activation sparsity vector and a weight sparsity vector. The activation sparsity vector indicates positions of non-zero valued activations in an activation context. The weight sparsity vector indicates positions of non-zero valued weights in a weight context. The combined sparsity vector comprises one or more zero valued bits and one or more non-zero valued bits. The sparsity decoder may determine the position of a non-zero valued bit in the combined sparsity vector and determine an address for the non-zero valued activation and the non-zero valued weight based on the position of the non-zero valued bit. The non-zero valued activation and the non-zero valued weight may be provided to a PE for performing MAC operations.
Type: Application
Filed: September 19, 2022
Publication date: January 19, 2023
Inventors: Martin Power, Conor Byrne, Niall Hanrahan, Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, David Thomas Bernard, Kevin Brady, Martin-Thomas Grymel
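The decoding step in this abstract can be sketched as a bitwise AND followed by a position lookup. The sketch below assumes that, because the data is unpacked, the address of an element equals the position of its bit; that mapping and all function names are assumptions made for illustration.

```python
def combine(act_vector, weight_vector):
    """A bit survives only where both activation and weight are nonzero."""
    return [a & w for a, w in zip(act_vector, weight_vector)]


def nonzero_positions(combined):
    """Positions of the nonzero bits in the combined sparsity vector."""
    return [i for i, bit in enumerate(combined) if bit]


def select_pairs(acts, weights, act_vector, weight_vector):
    """Fetch the activation-weight pairs that a PE would multiply.

    With unpacked data, the bit position is used directly as the address.
    """
    positions = nonzero_positions(combine(act_vector, weight_vector))
    return [(acts[i], weights[i]) for i in positions]


pairs = select_pairs(acts=[4, 0, 6, 0], weights=[0, 2, 3, 0],
                     act_vector=[1, 0, 1, 0], weight_vector=[0, 1, 1, 0])
```

Only the pair at position 2 reaches the PE here; every other product would have been zero.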
-
Publication number: 20230014656
Abstract: A memory array of a compute tile may store activations or weights of a DNN. The memory array may include databanks for storing contexts, context MUXs, and byte MUXs. A databank may store a context with flip-flop arrays, each of which includes a sequence of flip-flops. A logic gate and an ICG unit may gate flip-flops and control whether states of the flip-flops can be changed. The data gating can prevent a context not selected for the databank from inadvertently toggling and wasting power. A context MUX may read a context from different flip-flop arrays in a databank based on gray-coded addresses. A byte MUX can combine bits from different bytes in a context read by the context MUX. The memory array may be implemented with bit packing to reduce distance between the context MUX and byte MUX to reduce lengths of wires connecting the context MUXs and byte MUXs.
Type: Application
Filed: September 23, 2022
Publication date: January 19, 2023
Inventors: Raymond Jit-Hung Sung, Deepak Abraham Mathaikutty, Amit Agarwal, David Thomas Bernard, Steven Hsu, Martin Power, Conor Byme, Arnab Raha
-
Publication number: 20220261623
Abstract: A DNN accelerator includes a column of PEs and an external adder assembly for performing depthwise convolution. Each PE includes register files, multipliers, and an internal adder assembly. Each register file can store an operand (input operand, weight operand, etc.) of the depthwise convolution. The operand includes a sequence of elements, each of which corresponds to a different depthwise channel. A multiplier can perform a sequence of multiplications on two operands, e.g., an input operand and a weight operand, and generate a product operand. The internal adder assembly can accumulate product operands and generate an output operand of the PE. The output operand includes output elements, each of which corresponds to a different depthwise channel. The operands may be reused in different rounds of operations by the multipliers. The external adder assembly can accumulate output operands of multiple PEs and generate an output operand of the PE column.
Type: Application
Filed: April 29, 2022
Publication date: August 18, 2022
Applicant: Intel Corporation
Inventors: Raymond Jit-Hung Sung, Debabrata Mohapatra, Arnab Raha, Deepak Abraham Mathaikutty, Praveen Kumar Gupta
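The per-channel arithmetic in this abstract can be sketched as follows. Each operand is modeled as a list with one element per depthwise channel; the function names `pe_output` and `column_output` are hypothetical, and the sketch shows only the arithmetic, not the register files or operand reuse.

```python
def pe_output(input_operands, weight_operands):
    """One PE: multiply operands element by element (channel by channel)
    and accumulate the product operands with the internal adder."""
    n_channels = len(input_operands[0])
    out = [0] * n_channels
    for x, w in zip(input_operands, weight_operands):
        product = [xi * wi for xi, wi in zip(x, w)]    # product operand
        out = [o + p for o, p in zip(out, product)]    # internal adder
    return out


def column_output(pe_outputs):
    """External adder: accumulate the PEs' output operands per channel."""
    return [sum(channel) for channel in zip(*pe_outputs)]


pe0 = pe_output([[1, 2], [3, 4]], [[10, 10], [1, 1]])
column = column_output([pe0, [1, 1]])
```

Channels are never summed with one another, which is what distinguishes depthwise convolution from a standard convolution.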
-
Publication number: 20220188638
Abstract: An apparatus for convolution operations is provided. The apparatus includes a PE array, a datastore, writing modules, reading modules, and a controlling module. The PE array performs MAC operations. The datastore includes databanks, each of which stores data to be used by a column of the PE array. The writing modules transfer data from a memory to the datastore. The reading modules transfer data from the datastore to the PE array. Each reading module may transfer data to a particular column of the PE array. The controlling module can determine the rounds of a convolution operation. Each round includes MAC operations based on a weight. The controlling module controls the writing modules and reading modules so that the same data in a databank can be reused in multiple rounds. For different rounds, the controlling module can provide a reading module access to different databanks.
Type: Application
Filed: March 2, 2022
Publication date: June 16, 2022
Applicant: Intel Corporation
Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, Debabrata Mohapatra
-
Publication number: 20220188075
Abstract: An FPMAC operation has two operands: an input operand and a weight operand. The operands may have a format of FP16, BF16, or INT8. Each operand is split into two portions. The two portions are stored in separate storage units. Then operands are transferred to register files of a PE, with each register file storing bits of an operand sequentially. The PE performs the FPMAC operation based on the operands. The PE may include an FPMAC unit configured to compute an individual partial sum of the PE. The PE may also include an FP adder to accumulate the individual partial sum with other data, such as an output from another PE or an output from another PE array. The FP adder may be fused with the FPMAC unit in a single circuit that can do speculative alignment and has separate critical paths for alignment and normalization.
Type: Application
Filed: March 7, 2022
Publication date: June 16, 2022
Applicant: Intel Corporation
Inventors: Arnab Raha, Mark A. Anders, Raymond Jit-Hung Sung, Debabrata Mohapatra, Deepak Abraham Mathaikutty, Ram K. Krishnamurthy, Himanshu Kaul
-
Publication number: 20220083843
Abstract: An apparatus is provided to access a weight vector of a layer in a sequence of layers in the DNN. The weight vector includes a first sequence of weights having different values. A bitmap is generated based on the weight vector. The bitmap includes a second sequence of bitmap elements. Each bitmap element corresponds to a different weight and has a value determined based at least on the value of the corresponding weight. The index of each bitmap element in the second sequence matches the index of the corresponding weight in the first sequence. A new bitmap is generated by rearranging the bitmap elements in the second sequence based on the values of the bitmap elements. The weight vector is rearranged based on the new bitmap. The rearranged weight vector is divided into subsets, each of which is assigned to a different PE for a MAC operation.
Type: Application
Filed: November 24, 2021
Publication date: March 17, 2022
Applicant: Intel Corporation
Inventors: Arnab Raha, Debabrata Mohapatra, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Cormac Michael Brick
-
Publication number: 20220075659
Abstract: There is disclosed a system and method of performing an artificial intelligence (AI) inference, including: programming an AI accelerator circuit to solve an AI problem with a plurality of layer-specific register file (RF) size allocations, wherein the AI accelerator circuit comprises processing elements (PEs) with respective associated RFs, wherein the RFs individually are divided into K sub-banks of size B bytes, wherein B and K are integers, and wherein the RFs include circuitry to individually allocate a sub-bank to one of input feature (IF), output feature (OF), or filter weight (FL), and wherein programming the plurality of layer-specific RF size allocations comprises accounting for sparse data within the layer; and causing the AI accelerator circuit to execute the AI problem, including applying the layer-specific RF size allocations at run-time.
Type: Application
Filed: November 18, 2021
Publication date: March 10, 2022
Applicant: Intel Corporation
Inventors: Debabrata Mohapatra, Arnab Raha, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Cormac Michael Brick
-
MEMORY ARCHITECTURE HAVING MULTIPLE PARTIAL WORDLINE DRIVERS AND CONTACTED AND FEED-THROUGH BITLINES
Publication number: 20120195152
Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio, while migrating between different process nodes or the same process generation, while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache Ways among the two halves. The wordline may be split among the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
Type: Application
Filed: July 26, 2011
Publication date: August 2, 2012
Applicant: BROADCOM CORPORATION
Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
-
MEMORY ARCHITECTURE HAVING MULTIPLE PARTIAL WORDLINE DRIVERS AND CONTACTED AND FEED-THROUGH BITLINES
Publication number: 20100177586
Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio, while migrating between different process nodes or the same process generation, while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache Ways among the two halves. The wordline may be split among the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
Type: Application
Filed: March 24, 2010
Publication date: July 15, 2010
Applicant: BROADCOM CORPORATION
Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
-
Memory architecture having multiple partial wordline drivers and contacted and feed-through bitlines
Patent number: 7697364
Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio, while migrating between different process nodes or the same process generation, while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache Ways among the two halves. The wordline may be split among the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
Type: Grant
Filed: December 1, 2005
Date of Patent: April 13, 2010
Assignee: Broadcom Corporation
Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
-
Patent number: 7123056
Abstract: A systematic method for single-rail domino logic circuits is provided, in which inverting and non-monotonic logic functions can be integrated into a pipelined system with almost zero overhead. This logic family, called Clock Logic (CL)-domino, is functionally complete while tolerating skew and minimizing the number of clock phases that must be distributed. Simulation results for a CL-domino ALU at 1-GHz under high skew (1-FO4) conditions show a power reduction of 41% over the same ALU implemented in dual-rail skew-tolerant domino logic. This power reduction incurs no performance penalty over dual-rail techniques, although in some cases additional design effort is required.
Type: Grant
Filed: December 9, 2003
Date of Patent: October 17, 2006
Assignee: Mosaid Technologies Incorporated
Inventors: Raymond Jit-Hung Sung, Duncan George Elliott
-
Method for scalable architectures in stackable three-dimensional integrated circuits and electronics
Patent number: 7046522
Abstract: The design methods described enable three-dimensional integrated circuit systems in which all of the dies, in a vertically bonded stack of dies, are identical. Only one mask set and wafer type is required since a single circuit design is produced for one die in the stack and reused for all the dies with little or no modification. The system scales directly as the level of stacking is increased while incurring no extra design effort, beyond that required for the initial design.
Type: Grant
Filed: March 20, 2003
Date of Patent: May 16, 2006
Inventors: Raymond Jit-Hung Sung, Tyler Lee Brandon, John Conrad Koob, Duncan George Elliott, Daniel Arie Leder
-
Patent number: 6806737
Abstract: A circuit and method for accelerating bus line communication in an integrated circuit is disclosed. High speed transmission of signals along a bus line is achieved by driving a series of bus line segments with their own bi-directional bus amplification circuits. Because each bus line segment has less capacitive loading than longer non-segmented bus lines, voltage reversal, or data inversion of a pair of complementary lines of a bus line segment is accomplished at high speed. Each bi-directional bus amplification circuit includes a precharge circuit for precharging each complementary pair of lines to known logic levels, and a drive circuit for changing the logic level of each line.
Type: Grant
Filed: March 21, 2003
Date of Patent: October 19, 2004
Inventors: Raymond Jit-Hung Sung, John Conrad Koob, Tyler Lee Brandon, Duncan George Elliot
-
Patent number: 6803782
Abstract: A column redundancy architecture for arrayed parallel processor devices is disclosed. In particular, daisy chained communication between processing elements is preserved after defective memory columns and their associated processing elements are disabled, by setting a bypass circuit within the processing element to be disabled. An address remapping circuit ensures that spare memory columns and associated processing elements replacing the defective memory columns and processing elements can be addressed in a linear column order. The column redundancy architecture is flexible as it permits replacement of arbitrary numbers of series adjacent processing elements as well as non adjacent processing elements with a minimal impact on device performance.
Type: Grant
Filed: March 21, 2003
Date of Patent: October 12, 2004
Inventors: John Conrad Koob, Raymond Jit-Hung Sung, Tyler Lee Brandon, Duncan George Elliot