Patents by Inventor Raymond Jit-Hung Sung

Raymond Jit-Hung Sung has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230351181
    Abstract: An activation function unit can compute activation functions approximated by Taylor series. The activation function unit may include a plurality of compute elements. Each compute element may include two multipliers and an accumulator. The first multiplier may compute intermediate products using an activation, such as an output activation of a DNN layer. The second multiplier may compute terms of a Taylor series approximating an activation function, based on the intermediate products from the first multiplier and coefficients of the Taylor series. The accumulator may compute a partial sum of the terms as an output of the activation function. The number of terms may be determined based on a predetermined accuracy of the output of the activation function. The activation function unit may process multiple activations. Different activations may be input into different compute elements in different clock cycles. The activation function unit may compute activation functions with different accuracies.
    Type: Application
    Filed: July 5, 2023
    Publication date: November 2, 2023
    Inventors: Umer Iftikhar Cheema, Deepak Abraham Mathaikutty, Arnab Raha, Dinakar Kondru, Raymond Jit-Hung Sung, Soumendu Kumar Ghosh
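To make the datapath in the abstract above concrete, here is a minimal Python sketch of a truncated-Taylor-series activation: a running power stands in for the first multiplier, a coefficient multiply stands in for the second, and a scalar variable plays the accumulator. The name `taylor_activation` and the exp() coefficients are illustrative only, not taken from the patent.

```python
import math

def taylor_activation(x, coeffs, num_terms):
    """Approximate an activation function as a truncated Taylor series.

    Mirrors the datapath in the abstract: one multiplier keeps an
    intermediate product (x^k), a second multiplier scales it by the
    series coefficient, and an accumulator sums the terms.
    """
    power = 1.0          # intermediate product from the "first multiplier"
    partial_sum = 0.0    # accumulator output
    for k in range(num_terms):
        term = coeffs[k] * power   # "second multiplier": coefficient * x^k
        partial_sum += term        # accumulator
        power *= x                 # next intermediate product
    return partial_sum

# Example: exp(x) around 0; the number of terms is chosen per the
# accuracy target (more terms -> higher accuracy, more cycles).
exp_coeffs = [1.0 / math.factorial(k) for k in range(8)]
print(taylor_activation(0.5, exp_coeffs, num_terms=6))  # ~= math.exp(0.5)
```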
  • Publication number: 20230229507
    Abstract: Computations in processing elements (PEs) for executing a deep neural network are scheduled via a computation scheduler based on sparsity in input data of the computations to reduce voltage droops. Each PE may perform a computation on an input operand and a weight operand. The computation scheduler may predict the workload of the PE for the computation based on a combined sparsity bitmap, which may be generated based on a sparsity bitmap of the input operand and a sparsity bitmap of the weight operand. The computation scheduler can schedule the starts of the computations in the PEs based on the predicted workloads of the PEs. The computation scheduler may instruct the PE having the highest workload to start the computation first and instruct the other PEs to start computations later. In some embodiments, the computations in the PEs may end in the same clock cycle.
    Type: Application
    Filed: March 8, 2023
    Publication date: July 20, 2023
    Applicant: Intel Corporation
    Inventors: Raymond Jit-Hung Sung, Arnab Raha, Deepak Abraham Mathaikutty, Umer Iftikhar Cheema
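A small Python sketch of the scheduling idea described above, under the assumption that a PE's workload is simply the popcount of the AND of its two sparsity bitmaps; the helper names `combined_bitmap` and `schedule_starts` are hypothetical.

```python
def combined_bitmap(act_bits, wgt_bits):
    """Bitwise AND of the two sparsity bitmaps: a position contributes a
    MAC only when both the activation and the weight are non-zero."""
    return [a & w for a, w in zip(act_bits, wgt_bits)]

def schedule_starts(act_bitmaps, wgt_bitmaps):
    """Return per-PE start offsets (in cycles) so that lighter-workload
    PEs start later and all PEs finish in the same cycle."""
    workloads = [sum(combined_bitmap(a, w))
                 for a, w in zip(act_bitmaps, wgt_bitmaps)]
    longest = max(workloads)
    # The heaviest PE starts at cycle 0; lighter PEs are delayed so the
    # end cycles line up, staggering current draw to limit voltage droop.
    return [longest - wl for wl in workloads]

acts = [[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1]]
wgts = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 1, 1]]
print(schedule_starts(acts, wgts))  # [0, 2, 0]
```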
  • Publication number: 20230221994
    Abstract: A compute block can dynamically uncompress compressed data for executing a channel-separable operation. The compressed data includes one or more nonzero-valued data elements. The compressed data may be stored in a datastore along with a sparsity bitmap of an input operand including the compressed data. An uncompressing module may determine whether the input operand includes any zero-valued data element, e.g., by determining whether the sparsity bitmap includes a zero-valued bit. After determining that the sparsity bitmap includes a zero-valued bit, the uncompressing module inserts a zero-valued data element into the compressed data based on a position of the bit in the sparsity bitmap, generates uncompressed data, and updates the sparsity bitmap so that all the bits become ones. The uncompressed dense data is transmitted to one or more processing elements (PEs) in the compute block for computing an output operand based on the uncompressed dense data.
    Type: Application
    Filed: March 16, 2023
    Publication date: July 13, 2023
    Inventors: Arnab Raha, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Umer Iftikhar Cheema, Dinakar Kondru, Soumendu Kumar Ghosh
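The zero-reinsertion step above can be sketched in a few lines of Python; `uncompress` is an illustrative name, and the byte-level packing of a real datastore is not modelled.

```python
def uncompress(packed, bitmap):
    """Expand compressed non-zero elements back to dense form.

    `packed` holds only the non-zero values; `bitmap[i]` is 1 where the
    dense tensor has a non-zero element. After expansion every bitmap
    bit is 1, so downstream PEs see dense data.
    """
    dense = []
    it = iter(packed)
    for bit in bitmap:
        dense.append(next(it) if bit else 0)   # insert a zero where the bit is 0
    new_bitmap = [1] * len(bitmap)             # all positions now populated
    return dense, new_bitmap

values = [7, 3, 5]
bitmap = [1, 0, 0, 1, 1, 0]
print(uncompress(values, bitmap))  # ([7, 0, 0, 3, 5, 0], [1, 1, 1, 1, 1, 1])
```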
  • Publication number: 20230140173
    Abstract: A DNN accelerator includes one or more heterogeneous tile sets. A heterogeneous tile set includes tiles of different sizes, e.g., PE arrays including different numbers of columns or rows. The DNN accelerator may identify a tile set from the tile sets for running a DNN model based on dimensions of output tensors of convolutional layers in the DNN. Within the selected tile set, a tile may be selected for a convolutional layer in the DNN, e.g., based on dimensions of the output tensor of the convolutional layer and the size of the tile. After the tile is selected, the workload for running a convolutional operation of the layer may be partitioned and assigned to individual PEs in the tile by partitioning the output tensor into output tensor segments. The workload of computing an individual output tensor segment can be assigned to an individual PE in the tile.
    Type: Application
    Filed: August 19, 2022
    Publication date: May 4, 2023
    Inventors: Arnab Raha, Umer Iftikhar Cheema, Praveen Kumar Gupta, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung
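One plausible reading of the tile-selection and partitioning flow above, sketched in Python; the waste-minimizing selection rule and the ceiling-division segment sizes are assumptions for illustration, not details from the patent.

```python
def pick_tile(tiles, out_h, out_w):
    """Pick the PE-array shape (rows, cols) that wastes the fewest PEs
    for a given output tensor -- a stand-in for the selection policy."""
    def waste(tile):
        rows, cols = tile
        return rows * cols - min(rows, out_h) * min(cols, out_w)
    return min(tiles, key=waste)

def partition_output(out_h, out_w, tile):
    """Split the output tensor into per-PE segments (row/col ranges)."""
    rows, cols = tile
    seg_h = -(-out_h // rows)   # ceiling division
    seg_w = -(-out_w // cols)
    return [(r * seg_h, min((r + 1) * seg_h, out_h),
             c * seg_w, min((c + 1) * seg_w, out_w))
            for r in range(rows) for c in range(cols)
            if r * seg_h < out_h and c * seg_w < out_w]

tiles = [(4, 4), (2, 8), (8, 2)]            # heterogeneous tile shapes
tile = pick_tile(tiles, out_h=6, out_w=16)  # -> (4, 4), the first zero-waste tile
segments = partition_output(6, 16, tile)
print(tile, len(segments), segments[0])     # (4, 4) 12 (0, 2, 0, 4)
```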
  • Publication number: 20230073661
    Abstract: A DNN (deep neural network) accelerator may accelerate deep learning, such as convolutions in frontend layers, through a scheduler for loading data to be processed. The DNN accelerator may store, in a memory, an input tensor of a convolutional layer in a DNN. The convolutional layer may be the first layer or a layer arranged before one or more other convolutional layers in the DNN, such that data processed by the first layer can be efficiently reused across data load rounds. The input tensor includes one or more channels. A channel includes activations arranged in rows and columns. The DNN accelerator may read at least a portion of the input tensor from the memory into a datastore. The datastore includes a number of databanks. The DNN accelerator may provide a vector of one or more activations to a processing element for operations such as multiplications on the vector.
    Type: Application
    Filed: November 14, 2022
    Publication date: March 9, 2023
    Applicant: Intel Corporation
    Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Umer Iftikhar Cheema, Raymond Jit-Hung Sung
  • Publication number: 20230059976
    Abstract: A DNN accelerator may include a PE array performing MAC operations. The PE array may include PEs capable of MAC operations on quantized values. A PE may include subtractors for subtracting zeropoints from quantized activations and quantized weights to generate intermediate activations and intermediate weights. The intermediate activations and intermediate weights may be stored in data storage units in the PE and may be used by a MAC unit in the PE. The subtractors may be placed outside the MAC unit but inside the PE. The MAC unit may perform sequential cycles of MAC operations. The MAC unit may include a plurality of multipliers. The intermediate activations and intermediate weights stored in the data storage units may be reused by different multipliers in different cycles of MAC operations. An output of the MAC unit or of the PE may be multiplied by a quantization scale to produce a floating-point value.
    Type: Application
    Filed: October 18, 2022
    Publication date: February 23, 2023
    Applicant: Intel Corporation
    Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, Martin Power, Umer Iftikhar Cheema, David Thomas Bernard
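The zero-point handling described above amounts to the standard asymmetric-quantization arithmetic: subtract the zero-points once, accumulate integer products, then rescale to floating point. A minimal Python sketch with made-up operand values:

```python
def quantized_mac(acts, wgts, act_zp, wgt_zp, scale):
    """Integer MAC with zero-point handling, in the spirit of the abstract.

    Zero-points are subtracted once, outside the MAC loop, so the stored
    intermediate values can be reused across MAC cycles; the integer
    accumulator is rescaled to floating point at the end.
    """
    # Subtractors in the PE (outside the MAC unit):
    acts_i = [a - act_zp for a in acts]
    wgts_i = [w - wgt_zp for w in wgts]
    # Sequential MAC cycles:
    acc = 0
    for a, w in zip(acts_i, wgts_i):
        acc += a * w
    # Requantization: multiply by the quantization scale.
    return acc * scale

print(quantized_mac([130, 128, 140], [10, 12, 8],
                    act_zp=128, wgt_zp=9, scale=0.02))   # -0.2
```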
  • Publication number: 20230018857
    Abstract: Sparsity processing within a compute block can be done on unpacked data. The compute block includes a sparsity decoder that generates a combined sparsity vector from an activation sparsity vector and a weight sparsity vector. The activation sparsity vector indicates positions of non-zero valued activations in an activation context. The weight sparsity vector indicates positions of non-zero valued weights in a weight context. The combined sparsity vector comprises one or more zero valued bits and one or more non-zero valued bits. The sparsity decoder may determine the position of a non-zero valued bit in the combined sparsity vector and determine an address for the non-zero valued activation and the non-zero valued weight based on the position of the non-zero valued bit. The non-zero valued activation and the non-zero valued weight may be provided to a PE for performing MAC operations.
    Type: Application
    Filed: September 19, 2022
    Publication date: January 19, 2023
    Inventors: Martin Power, Conor Byrne, Niall Hanrahan, Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, David Thomas Bernard, Kevin Brady, Martin-Thomas Grymel
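A Python sketch of the sparsity-decoder bookkeeping described above: the address of each non-zero value in its compressed array is taken as the running popcount of its sparsity vector up to that position. The abstract does not spell out the address encoding, so this prefix-popcount scheme is an assumption.

```python
def decode_pairs(act_vec, wgt_vec):
    """For each position where both sparsity bits are set, return the
    addresses of the corresponding values in the compressed activation
    and weight arrays (a prefix popcount over each sparsity vector)."""
    combined = [a & w for a, w in zip(act_vec, wgt_vec)]
    pairs = []
    act_addr = wgt_addr = 0
    for a_bit, w_bit, c_bit in zip(act_vec, wgt_vec, combined):
        if c_bit:
            pairs.append((act_addr, wgt_addr))  # (activation, weight) pair for a PE
        act_addr += a_bit
        wgt_addr += w_bit
    return pairs

act_vec = [1, 0, 1, 1, 0, 1]
wgt_vec = [1, 1, 0, 1, 0, 1]
print(decode_pairs(act_vec, wgt_vec))  # [(0, 0), (2, 2), (3, 3)]
```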
  • Publication number: 20230014656
    Abstract: A memory array of a compute tile may store activations or weights of a DNN. The memory array may include databanks for storing contexts, context MUXs, and byte MUXs. A databank may store a context with flip-flop arrays, each of which includes a sequence of flip-flops. A logic gate and an ICG unit may gate flip-flops and control whether states of the flip-flops can be changed. The data gating can prevent a context not selected for the databank from inadvertently toggling and wasting power. A context MUX may read a context from different flip-flop arrays in a databank based on gray-coded addresses. A byte MUX can combine bits from different bytes in a context read by the context MUX. The memory array may be implemented with bit packing to reduce distance between the context MUX and byte MUX to reduce lengths of wires connecting the context MUXs and byte MUXs.
    Type: Application
    Filed: September 23, 2022
    Publication date: January 19, 2023
    Inventors: Raymond Jit-Hung Sung, Deepak Abraham Mathaikutty, Amit Agarwal, David Thomas Bernard, Steven Hsu, Martin Power, Conor Byrne, Arnab Raha
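The gray-coded addressing mentioned above is the standard binary-reflected Gray code, shown below in Python; the flip-flop arrays, ICG-based data gating, and bit packing are circuit-level details that a software sketch does not capture.

```python
def to_gray(n):
    """Binary-to-Gray conversion: adjacent addresses differ by one bit,
    which reduces toggling on the context-MUX select lines."""
    return n ^ (n >> 1)

def from_gray(g):
    """Gray-to-binary conversion (inverse of to_gray)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

for addr in range(8):
    g = to_gray(addr)
    assert from_gray(g) == addr
    print(f"binary {addr:03b} -> gray {g:03b}")
```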
  • Publication number: 20220261623
    Abstract: A DNN accelerator includes a column of PEs and an external adder assembly for performing depthwise convolution. Each PE includes register files, multipliers, and an internal adder assembly. Each register file can store an operand (input operand, weight operand, etc.) of the depthwise convolution. The operand includes a sequence of elements, each of which corresponds to a different depthwise channel. A multiplier can perform a sequence of multiplications on two operands, e.g., an input operand and a weight operand, and generate a product operand. The internal adder assembly can accumulate product operands and generate an output operand of the PE. The output operand includes output elements, each of which corresponds to a different depthwise channel. The operands may be reused in different rounds of operations by the multipliers. The external adder assembly can accumulate output operands of multiple PEs and generate an output operand of the PE column.
    Type: Application
    Filed: April 29, 2022
    Publication date: August 18, 2022
    Applicant: Intel Corporation
    Inventors: Raymond Jit-Hung Sung, Debabrata Mohapatra, Arnab Raha, Deepak Abraham Mathaikutty, Praveen Kumar Gupta
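A Python sketch of the depthwise dataflow above: each operand element belongs to a separate depthwise channel, the internal adder assembly accumulates product operands inside a PE, and the external adder assembly sums the outputs of the PEs in a column. The function names and toy operands are illustrative.

```python
def pe_depthwise(input_ops, weight_ops):
    """One PE: multiply input and weight operands element-wise (each
    element is a different depthwise channel) and accumulate the product
    operands with the internal adder assembly."""
    num_ch = len(input_ops[0])
    out = [0] * num_ch
    for in_op, w_op in zip(input_ops, weight_ops):
        for ch in range(num_ch):
            out[ch] += in_op[ch] * w_op[ch]   # channels never mix
    return out

def column_output(pe_outputs):
    """External adder assembly: accumulate the output operands of the
    PEs in a column, again channel by channel."""
    num_ch = len(pe_outputs[0])
    return [sum(out[ch] for out in pe_outputs) for ch in range(num_ch)]

pe0 = pe_depthwise([[1, 2, 3], [4, 5, 6]], [[1, 0, 2], [2, 1, 0]])
pe1 = pe_depthwise([[7, 8, 9], [1, 1, 1]], [[0, 1, 1], [1, 1, 1]])
print(column_output([pe0, pe1]))   # [10, 14, 16]
```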
  • Publication number: 20220188638
    Abstract: An apparatus for convolution operations is provided. The apparatus includes a PE array, a datastore, writing modules, reading modules, and a controlling module. The PE array performs MAC operations. The datastore includes databanks, each of which stores data to be used by a column of the PE array. The writing modules transfer data from a memory to the datastore. The reading modules transfer data from the datastore to the PE array. Each reading module may transfer data to a particular column of the PE array. The controlling module can determine the rounds of a convolution operation. Each round includes MAC operations based on a weight. The controlling module controls the writing modules and reading modules so that the same data in a databank can be reused in multiple rounds. For different rounds, the controlling module can provide a reading module access to different databanks.
    Type: Application
    Filed: March 2, 2022
    Publication date: June 16, 2022
    Applicant: Intel Corporation
    Inventors: Deepak Abraham Mathaikutty, Arnab Raha, Raymond Jit-Hung Sung, Debabrata Mohapatra
  • Publication number: 20220188075
    Abstract: An FPMAC operation has two operands: an input operand and a weight operand. The operands may have a format of FP16, BF16, or INT8. Each operand is split into two portions. The two portions are stored in separate storage units. The operands are then transferred to register files of a PE, with each register file storing bits of an operand sequentially. The PE performs the FPMAC operation based on the operands. The PE may include an FPMAC unit configured to compute an individual partial sum of the PE. The PE may also include an FP adder to accumulate the individual partial sum with other data, such as an output from another PE or an output from another PE array. The FP adder may be fused with the FPMAC unit in a single circuit that can do speculative alignment and has separate critical paths for alignment and normalization.
    Type: Application
    Filed: March 7, 2022
    Publication date: June 16, 2022
    Applicant: Intel Corporation
    Inventors: Arnab Raha, Mark A. Anders, Raymond Jit-Hung Sung, Debabrata Mohapatra, Deepak Abraham Mathaikutty, Ram K. Krishnamurthy, Himanshu Kaul
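The operand splitting described above can be illustrated for FP16 in Python using the standard `struct` half-precision format; the speculative-alignment fused adder is modelled here only as an ordinary multiply-add, and the helper names are hypothetical.

```python
import struct

def split_fp16(value):
    """Split an FP16 operand into two 8-bit portions, as when the two
    halves are kept in separate storage units."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    return raw & 0xFF, (raw >> 8) & 0xFF    # low byte, high byte

def join_fp16(lo, hi):
    """Reassemble the two portions before the FPMAC consumes them."""
    (value,) = struct.unpack("<e", struct.pack("<H", (hi << 8) | lo))
    return value

def fpmac(acc, act, wgt):
    """One FPMAC step: acc + act * wgt (plain float math stands in for
    the fused alignment/normalization hardware)."""
    return acc + act * wgt

lo, hi = split_fp16(1.5)
print(fpmac(0.0, join_fp16(lo, hi), 2.0))   # 3.0
```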
  • Publication number: 20220083843
    Abstract: An apparatus is provided to access a weight vector of a layer in a sequence of layers in a DNN. The weight vector includes a first sequence of weights having different values. A bitmap is generated based on the weight vector. The bitmap includes a second sequence of bitmap elements. Each bitmap element corresponds to a different weight and has a value determined based at least on the value of the corresponding weight. The index of each bitmap element in the second sequence matches the index of the corresponding weight in the first sequence. A new bitmap is generated by rearranging the bitmap elements in the second sequence based on the values of the bitmap elements. The weight vector is rearranged based on the new bitmap. The rearranged weight vector is divided into subsets, each of which is assigned to a different PE for a MAC operation.
    Type: Application
    Filed: November 24, 2021
    Publication date: March 17, 2022
    Applicant: Intel Corporation
    Inventors: Arnab Raha, Debabrata Mohapatra, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Cormac Michael Brick
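A Python sketch of the bitmap-driven rearrangement above. The abstract does not say which ordering the new bitmap uses, so grouping non-zero weights first and splitting into contiguous per-PE subsets are assumptions made for illustration.

```python
def rearrange_weights(weights, num_pes):
    """Build a sparsity bitmap for the weight vector, reorder the weights
    by their bitmap values (non-zero weights first here; the actual
    rearrangement policy is a design choice), then split the reordered
    vector into one subset per PE."""
    bitmap = [1 if w != 0 else 0 for w in weights]
    # A stable sort of the indices by descending bitmap value gives the
    # permutation; apply it to both the bitmap and the weight vector.
    order = sorted(range(len(weights)), key=lambda i: -bitmap[i])
    new_bitmap = [bitmap[i] for i in order]
    new_weights = [weights[i] for i in order]
    # Divide into contiguous subsets, one per PE.
    size = -(-len(new_weights) // num_pes)   # ceiling division
    subsets = [new_weights[i:i + size] for i in range(0, len(new_weights), size)]
    return new_bitmap, subsets

print(rearrange_weights([0, 3, 0, 0, 5, 1, 0, 2], num_pes=2))
# ([1, 1, 1, 1, 0, 0, 0, 0], [[3, 5, 1, 2], [0, 0, 0, 0]])
```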
  • Publication number: 20220075659
    Abstract: There is disclosed a system and method of performing an artificial intelligence (AI) inference, including: programming an AI accelerator circuit to solve an AI problem with a plurality of layer-specific register file (RF) size allocations, wherein the AI accelerator circuit comprises processing elements (PEs) with respective associated RFs, wherein the RFs individually are divided into K sub-banks of size B bytes, wherein B and K are integers, and wherein the RFs include circuitry to individually allocate a sub-bank to one of input feature (IF), output feature (OF), or filter weight (FL), and wherein programming the plurality of layer-specific RF size allocations comprises accounting for sparse data within the layer; and causing the AI accelerator circuit to execute the AI problem, including applying the layer-specific RF size allocations at run-time.
    Type: Application
    Filed: November 18, 2021
    Publication date: March 10, 2022
    Applicant: Intel Corporation
    Inventors: Debabrata Mohapatra, Arnab Raha, Deepak Abraham Mathaikutty, Raymond Jit-Hung Sung, Cormac Michael Brick
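A rough Python sketch of a layer-specific RF allocation in the spirit of the abstract above: K sub-banks of B bytes are shared among IF, OF, and FL in proportion to each tensor's size scaled by its expected density, so sparse tensors get fewer sub-banks. The proportional largest-remainder rule and all field names are illustrative assumptions.

```python
def allocate_subbanks(layer, K=8, B=512):
    """Share a PE's register file (K sub-banks of B bytes) among input
    features (IF), output features (OF), and filter weights (FL) for one
    layer, accounting for sparsity via a per-tensor density factor."""
    demand = {
        "IF": layer["if_bytes"] * layer["if_density"],
        "OF": layer["of_bytes"] * layer["of_density"],
        "FL": layer["fl_bytes"] * layer["fl_density"],
    }
    total = sum(demand.values())
    shares = {k: K * d / total for k, d in demand.items()}
    alloc = {k: int(s) for k, s in shares.items()}
    # Hand out any remaining sub-banks by largest fractional remainder.
    for k in sorted(shares, key=lambda k: shares[k] - alloc[k], reverse=True):
        if sum(alloc.values()) == K:
            break
        alloc[k] += 1
    return {k: n * B for k, n in alloc.items()}   # bytes per tensor type

layer = {"if_bytes": 8192, "if_density": 0.25,   # sparse input features
         "of_bytes": 4096, "of_density": 1.0,    # dense output features
         "fl_bytes": 4096, "fl_density": 0.5}    # moderately sparse weights
print(allocate_subbanks(layer))   # {'IF': 1024, 'OF': 2048, 'FL': 1024}
```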
  • Publication number: 20120195152
    Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio while migrating between different process nodes or within the same process generation, and while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache ways between the two halves. The wordline may be split between the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
    Type: Application
    Filed: July 26, 2011
    Publication date: August 2, 2012
    Applicant: BROADCOM CORPORATION
    Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
  • Publication number: 20100177586
    Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio while migrating between different process nodes or within the same process generation, and while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache ways between the two halves. The wordline may be split between the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
    Type: Application
    Filed: March 24, 2010
    Publication date: July 15, 2010
    Applicant: BROADCOM CORPORATION
    Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
  • Patent number: 7697364
    Abstract: Various embodiments are disclosed relating to a memory circuit architecture. In an example embodiment, which may accommodate a change to a new memory size or cell aspect ratio while migrating between different process nodes or within the same process generation, and while retaining at least a portion of the periphery circuitry, a memory circuit architecture may be employed in which the memory array is divided into an upper half and a lower half, thereby splitting the cache ways between the two halves. The wordline may be split between the two array halves, with each half driven by a half wordline driver. Also, in another embodiment, two sets of bitlines may be provided for each column, including a contacted set of bitlines and a feed-through set of bitlines.
    Type: Grant
    Filed: December 1, 2005
    Date of Patent: April 13, 2010
    Assignee: Broadcom Corporation
    Inventors: Raymond Jit-Hung Sung, Dongwook Suh, Daniel Rodriguez
  • Patent number: 7123056
    Abstract: A systematic method for single-rail domino logic circuits is provided, in which inverting and non-monotonic logic functions can be integrated into a pipelined system with almost zero overhead. This logic family, called Clock Logic (CL)-domino, is functionally complete while tolerating skew and minimizing the number of clock phases that must be distributed. Simulation results for a CL-domino ALU at 1 GHz under high-skew (1-FO4) conditions show a power reduction of 41% over the same ALU implemented in dual-rail skew-tolerant domino logic. This power reduction incurs no performance penalty over dual-rail techniques, although in some cases additional design effort is required.
    Type: Grant
    Filed: December 9, 2003
    Date of Patent: October 17, 2006
    Assignee: Mosaid Technologies Incorporated
    Inventors: Raymond Jit-Hung Sung, Duncan George Elliott
  • Patent number: 7046522
    Abstract: The design methods described enable three-dimensional integrated circuit systems in which all of the dies, in a vertically bonded stack of dies, are identical. Only one mask set and wafer type is required, since a single circuit design is produced for one die in the stack and reused for all the dies with little or no modification. The system scales directly as the level of stacking is increased while incurring no extra design effort beyond that required for the initial design.
    Type: Grant
    Filed: March 20, 2003
    Date of Patent: May 16, 2006
    Inventors: Raymond Jit-Hung Sung, Tyler Lee Brandon, John Conrad Koob, Duncan George Elliott, Daniel Arie Leder
  • Patent number: 6806737
    Abstract: A circuit and method for accelerating bus line communication in an integrated circuit is disclosed. High-speed transmission of signals along a bus line is achieved by driving a series of bus line segments with their own bi-directional bus amplification circuits. Because each bus line segment has less capacitive loading than longer non-segmented bus lines, voltage reversal, or data inversion, of a pair of complementary lines of a bus line segment is accomplished at high speed. Each bi-directional bus amplification circuit includes a precharge circuit for precharging each complementary pair of lines to known logic levels, and a drive circuit for changing the logic level of each line.
    Type: Grant
    Filed: March 21, 2003
    Date of Patent: October 19, 2004
    Inventors: Raymond Jit-Hung Sung, John Conrad Koob, Tyler Lee Brandon, Duncan George Elliot
  • Patent number: 6803782
    Abstract: A column redundancy architecture for arrayed parallel processor devices is disclosed. In particular, daisy-chained communication between processing elements is preserved after defective memory columns and their associated processing elements are disabled, by setting a bypass circuit within the processing element to be disabled. An address remapping circuit ensures that spare memory columns and associated processing elements replacing the defective memory columns and processing elements can be addressed in a linear column order. The column redundancy architecture is flexible as it permits replacement of arbitrary numbers of series-adjacent processing elements as well as non-adjacent processing elements with a minimal impact on device performance.
    Type: Grant
    Filed: March 21, 2003
    Date of Patent: October 12, 2004
    Inventors: John Conrad Koob, Raymond Jit-Hung Sung, Tyler Lee Brandon, Duncan George Elliot
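The address-remapping idea in this last abstract can be sketched in Python: logical column addresses are mapped onto the surviving physical columns (including spares) so they remain in linear order. The daisy-chain bypass circuitry itself is not modelled, and `build_remap` is a hypothetical helper.

```python
def build_remap(num_columns, defective, num_spares):
    """Map logical column addresses to physical columns, skipping the
    defective ones so that the surviving columns (including spares
    appended at the end of the array) appear in linear order."""
    physical = [c for c in range(num_columns + num_spares) if c not in defective]
    if len(physical) < num_columns:
        raise ValueError("not enough spare columns to cover the defects")
    return physical[:num_columns]   # remap[logical] -> physical

remap = build_remap(num_columns=8, defective={2, 5}, num_spares=2)
print(remap)   # [0, 1, 3, 4, 6, 7, 8, 9]
```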