Patents by Inventor Arnab Raha
Arnab Raha has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12242861
Abstract: Methods, apparatus, systems, and articles of manufacture to load data into an accelerator are disclosed. An example apparatus includes data provider circuitry to load a first section and an additional amount of compressed machine learning parameter data into a processor engine. Processor engine circuitry executes a machine learning operation using the first section of compressed machine learning parameter data. Compressed local data re-user circuitry determines if a second section is present in the additional amount of compressed machine learning parameter data. The processor engine circuitry executes a machine learning operation using the second section when the second section is present in the additional amount of compressed machine learning parameter data.
Type: Grant
Filed: January 18, 2024
Date of Patent: March 4, 2025
Assignee: Intel Corporation
Inventors: Arnab Raha, Deepak Mathaikutty, Debabrata Mohapatra, Sang Kyun Kim, Gautham Chinya, Cormac Brick
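A minimal Python sketch of the control flow described in this abstract, assuming fixed-size sections and a byte stream standing in for the compressed parameter data; the names (split_sections, run_layer, execute_mac) and sizes are hypothetical, not taken from the patent.

```python
# Hypothetical sketch: load a section plus an extra amount of compressed
# parameter data, and reuse a second section locally if it was over-fetched.

def split_sections(compressed: bytes, section_size: int):
    """Split a compressed parameter stream into fixed-size sections."""
    return [compressed[i:i + section_size]
            for i in range(0, len(compressed), section_size)]

def execute_mac(section: bytes):
    # Placeholder for the machine learning operation on one section.
    print(f"executing on {len(section)}-byte section")

def run_layer(compressed: bytes, section_size: int, extra: int):
    sections = split_sections(compressed, section_size)
    i = 0
    while i < len(sections):
        # Load the current section plus an additional amount of data.
        first = sections[i]
        additional = compressed[(i + 1) * section_size:
                                (i + 1) * section_size + extra]
        execute_mac(first)
        # If the next section is fully contained in the over-fetched data,
        # reuse it locally instead of issuing another load.
        if len(additional) >= section_size:
            execute_mac(additional[:section_size])
            i += 2
        else:
            i += 1

run_layer(bytes(range(100)) * 10, section_size=256, extra=256)
```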
-
Publication number: 20250060940
Abstract: A data processing unit may include a memory, processing elements (PEs), and a control unit. The memory may store weight blocks within a weight tensor of a neural network operation. Each weight block has an input channel (IC) dimension and an output channel (OC) dimension and includes subblocks. A subblock includes one or more weights having a first data precision and one or more other weights having a second data precision. The second data precision is lower than the first data precision. The control unit may distribute different ones of the subblocks to different ones of the PEs. A PE may receive a subblock and perform a first MAC operation on a weight having the first data precision and a second MAC operation on a weight having the second data precision. The first MAC operation may consume more computation cycles or more multipliers than the second MAC operation.
Type: Application
Filed: October 30, 2024
Publication date: February 20, 2025
Applicant: Intel Corporation
Inventors: Arnab Raha, Michael Wu, Deepak Abraham Mathaikutty, Daksha Sharma, Martin Langhammer
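The sketch below is a loose illustration of the mixed-precision idea: weights in a block are tagged as high or low precision, subblocks are handed to different PEs, and higher-precision weights are charged more MAC cycles. The cycle costs, block shapes, and the 25% high-precision ratio are assumptions made for the example.

```python
import numpy as np

# Hypothetical cycle costs: a high-precision MAC takes more cycles (or more
# multipliers) than a low-precision MAC.
CYCLES = {"high": 2, "low": 1}

def subblock_cycles(weights: np.ndarray, precision_mask: np.ndarray) -> int:
    """Count MAC cycles for one subblock, costing high- and low-precision
    weights differently."""
    high = int(precision_mask.sum())
    low = weights.size - high
    return high * CYCLES["high"] + low * CYCLES["low"]

# A toy 4x4 weight block (IC x OC) split into two 4x2 subblocks, with a
# random choice of which weights keep the higher precision.
rng = np.random.default_rng(0)
block = rng.standard_normal((4, 4))
mask = rng.random((4, 4)) < 0.25          # ~25% of weights stay high precision
subblocks = np.split(block, 2, axis=1)
masks = np.split(mask, 2, axis=1)

for pe_id, (sb, m) in enumerate(zip(subblocks, masks)):
    # Each subblock would be distributed to a different processing element.
    print(f"PE {pe_id}: {subblock_cycles(sb, m)} cycles")
```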
-
Patent number: 12229673
Abstract: Systems, apparatuses and methods may provide for technology that prefetches compressed data and a sparsity bitmap from a memory and stores the compressed data in a decode buffer, where the compressed data is associated with a plurality of tensors and is in a compressed format. The technology aligns the compressed data with the sparsity bitmap to generate decoded data, and provides the decoded data to a plurality of processing elements.
Type: Grant
Filed: November 11, 2021
Date of Patent: February 18, 2025
Assignee: Intel Corporation
Inventors: Deepak Mathaikutty, Arnab Raha, Raymond Sung, Debabrata Mohapatra, Cormac Brick
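A small sketch of aligning a zero-value-compressed vector with its sparsity bitmap to produce decoded data, which is one common way such a decode step can work; the bitmap layout and the decode_zvc helper are illustrative assumptions, not the patented circuitry.

```python
import numpy as np

def decode_zvc(compressed: np.ndarray, bitmap: np.ndarray) -> np.ndarray:
    """Expand a zero-value-compressed vector: the bitmap has one bit per
    output element, and `compressed` holds only the nonzero values in order."""
    out = np.zeros(bitmap.shape, dtype=compressed.dtype)
    out[bitmap.astype(bool)] = compressed
    return out

# Toy example: an 8-element tensor row with 3 nonzeros.
bitmap = np.array([1, 0, 0, 1, 0, 1, 0, 0])
compressed = np.array([5, -2, 7])
decoded = decode_zvc(compressed, bitmap)
print(decoded)   # [ 5  0  0 -2  0  7  0  0] -> fed to the processing elements
```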
-
Publication number: 20250036928
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations enhancing the performance of hardware (HW) accelerators. Disclosed embodiments include a static MAC scaling arrangement, which includes architectures and techniques for scaling the performance per unit of power and the performance per area of HW accelerators. Disclosed embodiments also include a dynamic MAC scaling arrangement, which includes architectures and techniques for dynamically scaling the number of active multiply-and-accumulate (MAC) units within an HW accelerator based on activation and weight sparsity. Other embodiments may be described and/or claimed.
Type: Application
Filed: October 7, 2024
Publication date: January 30, 2025
Applicant: Intel Corporation
Inventors: Arnab Raha, Debabrata Mohapatra, Gautham Chinya, Guruguhanathan Venkataramanan, Sang Kyun Kim, Deepak Mathaikutty, Raymond Sung, Cormac Brick
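As a back-of-the-envelope illustration of dynamic MAC scaling, the sketch below estimates how many MAC units need to stay active as activation and weight sparsity grow; the scaling rule (product of the two densities) is an assumption for the example, not the disclosed architecture.

```python
def active_macs(total_macs: int, act_sparsity: float, wgt_sparsity: float) -> int:
    """Toy model: a MAC only needs to fire when both its activation and its
    weight are nonzero, so the expected useful work scales with the product
    of the two densities."""
    density = (1.0 - act_sparsity) * (1.0 - wgt_sparsity)
    return max(1, round(total_macs * density))

for a, w in [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.9, 0.8)]:
    print(f"act sparsity {a:.1f}, weight sparsity {w:.1f} -> "
          f"{active_macs(1024, a, w)} active MACs out of 1024")
```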
-
Publication number: 20250028565
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations enhancing the performance of hardware (HW) accelerators. The present disclosure provides a schedule-aware, dynamically reconfigurable, tree-based partial sum accumulator architecture for HW accelerators, wherein the depth of an adder tree in the HW accelerator is set dynamically based on a dataflow schedule generated by a compiler. The adder tree depth is adjusted on a per-layer basis at runtime. Configuration registers, programmed via software, dynamically alter the adder tree depth for partial sum accumulation based on the dataflow schedule. By facilitating a variable-depth adder tree during runtime, the compiler can choose a compute-optimal dataflow schedule that minimizes the number of compute cycles needed to accumulate partial sums across multiple processing elements (PEs) within a PE array of a HW accelerator. Other embodiments may be described and/or claimed.
Type: Application
Filed: October 4, 2024
Publication date: January 23, 2025
Applicant: Intel Corporation
Inventors: Debabrata Mohapatra, Arnab Raha, Deepak Mathaikutty, Raymond Sung, Cormac Brick
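A toy version of a configurable-depth partial-sum adder tree: the depth parameter plays the role of the per-layer configuration, and a depth-d tree combines 2^d partial sums per output. The grouping rule here is an illustrative simplification, not the patented accumulator.

```python
def adder_tree_reduce(partial_sums, depth):
    """Reduce partial sums with a binary adder tree of the given depth.
    A depth-d tree combines 2**d inputs per output; leftovers remain as
    separate outputs for a later accumulation pass."""
    group = 2 ** depth
    return [sum(partial_sums[i:i + group])
            for i in range(0, len(partial_sums), group)]

psums = list(range(16))                    # partial sums from 16 PEs
print(adder_tree_reduce(psums, depth=2))   # groups of 4 -> 4 outputs
print(adder_tree_reduce(psums, depth=4))   # groups of 16 -> 1 output
```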
-
Publication number: 20240403616
Abstract: An activation function in a neural network may be approximated by one or more linear functions. A linear function may correspond to a segment of the input range of the activation function, e.g., a linear segment. A programmable look-up table may store slopes and intercepts of linear functions. A post processing engine (PPE) array executing the activation function may determine that an input data element of the activation function falls into the linear segment and compute an output of the linear function using the input data element. The output of the linear function may be used as the approximated output of the activation function. Alternatively, the PPE array may determine that the input data element is in a saturation segment and use a fixed value associated with the saturation segment as the approximated output of the activation function.
Type: Application
Filed: November 2, 2023
Publication date: December 5, 2024
Inventors: Umer Iftikhar Cheema, Kevin Brady, Robert Simofi, Colm O Faolain, Deepak Abraham Mathaikutty, Arnab Raha, Dinakar Kondru, Gary Baugh, Darren Crews, Fergal Connor
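The sketch below shows one way a piecewise-linear LUT approximation with saturation segments can be built and evaluated; the segment count, the endpoint-fitting rule, and the sigmoid example are assumptions, not details from the publication.

```python
import math
import numpy as np

def build_lut(func, lo, hi, n_segments):
    """Fit one linear function per equal-width segment by connecting the
    segment endpoints; store (start, slope, intercept) per segment."""
    edges = np.linspace(lo, hi, n_segments + 1)
    lut = []
    for a, b in zip(edges[:-1], edges[1:]):
        slope = (func(b) - func(a)) / (b - a)
        intercept = func(a) - slope * a
        lut.append((a, slope, intercept))
    return lut, (func(lo), func(hi))        # saturation values below/above range

def approx(x, lut, sat, lo, hi):
    if x < lo:                    # saturation segment: fixed value
        return sat[0]
    if x >= hi:
        return sat[1]
    width = (hi - lo) / len(lut)
    start, slope, intercept = lut[int((x - lo) / width)]
    return slope * x + intercept  # linear segment: y = m*x + c

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
lut, sat = build_lut(sigmoid, -4.0, 4.0, 16)
for x in (-6.0, -1.3, 0.0, 2.7, 9.0):
    print(x, round(approx(x, lut, sat, -4.0, 4.0), 4), round(sigmoid(x), 4))
```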
-
Patent number: 12147836
Abstract: Techniques and configurations enhancing the performance of hardware (HW) accelerators are provided. A schedule-aware, dynamically reconfigurable, tree-based partial sum accumulator architecture for HW accelerators is provided, where the depth of an adder tree in the HW accelerator is set dynamically based on a dataflow schedule generated by a compiler. The adder tree depth is adjusted on a per-layer basis at runtime. Configuration registers, programmed via software, dynamically alter the adder tree depth for partial sum accumulation based on the dataflow schedule. By facilitating a variable-depth adder tree during runtime, the compiler can choose a compute-optimal dataflow schedule that minimizes the number of compute cycles needed to accumulate partial sums across multiple processing elements (PEs) within a PE array of a HW accelerator.
Type: Grant
Filed: November 5, 2021
Date of Patent: November 19, 2024
Assignee: Intel Corporation
Inventors: Debabrata Mohapatra, Arnab Raha, Deepak Mathaikutty, Raymond Sung, Cormac Brick
-
Patent number: 12141683
Abstract: Embodiments of the present disclosure are directed toward techniques and configurations enhancing the performance of hardware (HW) accelerators. Disclosed embodiments include a static MAC scaling arrangement, which includes architectures and techniques for scaling the performance per unit of power and the performance per area of HW accelerators. Disclosed embodiments also include a dynamic MAC scaling arrangement, which includes architectures and techniques for dynamically scaling the number of active multiply-and-accumulate (MAC) units within an HW accelerator based on activation and weight sparsity. Other embodiments may be described and/or claimed.
Type: Grant
Filed: April 30, 2021
Date of Patent: November 12, 2024
Assignee: Intel Corporation
Inventors: Arnab Raha, Debabrata Mohapatra, Gautham Chinya, Guruguhanathan Venkataramanan, Sang Kyun Kim, Deepak Mathaikutty, Raymond Sung, Cormac Brick
-
Publication number: 20240296961
Abstract: A system and method for modeling neuronal synaptic functionality, at least partially instantiated on an optimized computation core of one or more high-speed processors. The synaptic model is preferably created as a neural net and includes, at least, a presynaptic component with a presynaptic target having, at least, a plasticity parameter and an activity spike strength. The model also includes a retrograde signaling component with a retrograde messenger that selectively generates a molecular uptake signal, and a postsynaptic receptor component. Based upon a molecular uptake calculated at the postsynaptic receptor component, the molecular uptake signal is generated and transmitted to the presynaptic component, where the retrograde messenger acts on the presynaptic target to modulate the plasticity parameter and activity spike strength of the presynaptic component.
Type: Application
Filed: June 13, 2022
Publication date: September 5, 2024
Inventors: Janet L Paluh, Rounak Chatterjee, Soham Mondal, Souradeep Chowdhury, Arnab Raha, Amitava Mukherjee
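The following is a very loose, hypothetical toy of the feedback loop described above: a postsynaptic uptake value drives a retrograde signal that modulates the presynaptic plasticity parameter and spike strength. All update rules and constants are invented for illustration and are not the published model.

```python
# Toy synapse with retrograde signaling; every formula here is an assumption.

class Synapse:
    def __init__(self, plasticity=0.5, spike_strength=1.0):
        self.plasticity = plasticity          # presynaptic plasticity parameter
        self.spike_strength = spike_strength  # presynaptic activity spike strength

    def step(self, presynaptic_spike: float) -> float:
        # Postsynaptic receptor response and calculated molecular uptake.
        postsynaptic = self.spike_strength * presynaptic_spike
        uptake = min(1.0, postsynaptic)
        # Retrograde messenger carries the uptake signal back to the
        # presynaptic component and modulates its parameters.
        retrograde = 0.1 * uptake
        self.plasticity += retrograde * (1.0 - self.plasticity)
        self.spike_strength *= (1.0 + retrograde * 0.5)
        return postsynaptic

syn = Synapse()
for t in range(5):
    out = syn.step(presynaptic_spike=0.8)
    print(f"t={t} out={out:.3f} plasticity={syn.plasticity:.3f} "
          f"strength={syn.spike_strength:.3f}")
```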
-
Publication number: 20240231839
Abstract: Methods, apparatus, systems, and articles of manufacture to load data into an accelerator are disclosed. An example apparatus includes data provider circuitry to load a first section and an additional amount of compressed machine learning parameter data into a processor engine. Processor engine circuitry executes a machine learning operation using the first section of compressed machine learning parameter data. Compressed local data re-user circuitry determines if a second section is present in the additional amount of compressed machine learning parameter data. The processor engine circuitry executes a machine learning operation using the second section when the second section is present in the additional amount of compressed machine learning parameter data.
Type: Application
Filed: January 18, 2024
Publication date: July 11, 2024
Applicant: Intel Corporation
Inventors: Arnab Raha, Deepak Mathaikutty, Debabrata Mohapatra, Sang Kyun Kim, Gautham Chinya, Cormac Brick
-
Publication number: 20240220785
Abstract: Methods and systems include a neural network system that includes a neural network accelerator. The neural network accelerator includes multiple processing engines coupled together to perform arithmetic operations in support of an inference performed using the deep neural network system. The neural network accelerator also includes schedule-aware tensor data distribution circuitry or software that is configured to load tensor data into the multiple processing engines in a load phase, extract output data from the multiple processing engines in an extraction phase, reorganize the extracted output data, and store the reorganized extracted output data to memory.
Type: Application
Filed: January 10, 2024
Publication date: July 4, 2024
Applicant: Intel Corporation
Inventors: Gautham Chinya, Huichu Liu, Arnab Raha, Debabrata Mohapatra, Cormac Brick
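A schematic load/compute/extract/reorganize/store loop, as a rough picture of the phases named above; the tile shapes, the interleaving used for reorganization, and the function names are assumptions for the example.

```python
import numpy as np

def run_inference_step(tensor_tiles, num_pes, process):
    """Toy load -> compute -> extract -> reorganize -> store pipeline.
    tensor_tiles: list of per-PE input tiles; process: per-PE compute."""
    # Load phase: distribute one tile to each processing engine.
    pe_inputs = [tensor_tiles[pe % len(tensor_tiles)] for pe in range(num_pes)]
    # Compute: each PE produces its own output tile.
    pe_outputs = [process(t) for t in pe_inputs]
    # Extraction phase: pull outputs back out of the PEs.
    extracted = np.stack(pe_outputs)
    # Reorganize the extracted data (here: interleave PE outputs row-wise)
    # before storing it in the layout the next layer expects.
    reorganized = extracted.transpose(1, 0, 2).copy()
    return reorganized

tiles = [np.full((2, 3), i) for i in range(4)]
out = run_inference_step(tiles, num_pes=4, process=lambda t: t * 2)
print(out.shape)   # (2, 4, 3): PE outputs interleaved along the middle axis
```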
-
Publication number: 20240160695
Abstract: A non-linear activation function may be approximated by linear functions. The input range of the activation function may be divided into input segments. One or more input segments may be selected based on statistical analysis of input data elements in the input range. A parameter of a first linear function that approximates the activation function for at least part of a selected input segment may be stored in a first portion of a first look-up table (LUT). The first portion of the first LUT is dedicated to a first group of post processing engines (PPEs). A parameter of a second linear function that approximates the activation function for at least part of an unselected input segment may be stored in a shared pool of LUT entries, which includes a second portion of the first LUT and a portion of a second LUT and is shared by multiple groups of PPEs.
Type: Application
Filed: December 21, 2023
Publication date: May 16, 2024
Applicant: Intel Corporation
Inventors: Dinakar Kondru, Deepak Abraham Mathaikutty, Arnab Raha, Umer Iftikhar Cheema
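A toy allocation of linear-segment parameters between a dedicated LUT region and a shared pool, roughly mirroring the split described above; the sizes and the "hot segment" selection rule are invented for the example.

```python
# Hypothetical split of segment parameters between a region dedicated to one
# group of PPEs and a pool of entries shared by multiple PPE groups.

def allocate_entries(segments, hot_ids, dedicated_size, shared_pool_size):
    dedicated, shared = {}, {}
    for seg_id, params in segments.items():
        if seg_id in hot_ids and len(dedicated) < dedicated_size:
            dedicated[seg_id] = params      # private to one group of PPEs
        elif len(shared) < shared_pool_size:
            shared[seg_id] = params         # shared across PPE groups
    return dedicated, shared

segments = {i: {"slope": 0.1 * i, "intercept": -0.05 * i} for i in range(8)}
hot = {2, 3}                                # segments selected by input statistics
dedicated, shared = allocate_entries(segments, hot, dedicated_size=4,
                                     shared_pool_size=8)
print(sorted(dedicated), sorted(shared))
```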
-
Publication number: 20240119269
Abstract: A deep neural network (DNN) accelerator may facilitate dynamic sparsity-based acceleration and operate in various sparsity modes, including a combined sparsity mode, a weight sparsity mode, an activation sparsity mode, and a dense mode. The DNN accelerator may receive a configuration parameter indicating whether to accelerate a layer based on sparsity in a weight tensor of the layer. The configuration parameter may be generated offline, e.g., before the execution of the DNN is started. The DNN accelerator computes one or more activations of the layer in a previous layer in the DNN. The one or more activations are one or more elements of an activation tensor of the layer. The DNN accelerator may determine a sparsity mode for the layer based on the configuration parameter and sparsity in the activation tensor. One or more sparse cells in the DNN accelerator may execute the layer in the sparsity mode.
Type: Application
Filed: December 18, 2023
Publication date: April 11, 2024
Inventors: Arnab Raha, Dinakar Kondru, Deepak Abraham Mathaikutty, Umer Iftikhar Cheema
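A small sketch of how a sparsity-mode decision could combine an offline configuration parameter with measured activation sparsity; the threshold and the decision rule are assumptions, not the accelerator's actual logic.

```python
def pick_sparsity_mode(weight_sparse_enabled: bool, act_sparsity: float,
                       threshold: float = 0.1) -> str:
    """Toy mode selection: an offline configuration parameter decides whether
    weight sparsity is exploited, and the measured activation sparsity decides
    at runtime whether activation sparsity is worth exploiting."""
    act_sparse = act_sparsity >= threshold
    if weight_sparse_enabled and act_sparse:
        return "combined"
    if weight_sparse_enabled:
        return "weight"
    if act_sparse:
        return "activation"
    return "dense"

print(pick_sparsity_mode(True, 0.45))    # combined
print(pick_sparsity_mode(False, 0.02))   # dense
```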
-
Publication number: 20240111830
Abstract: A non-linear activation function in a neural network may be approximated by one or more linear functions. The input range may be divided into input segments, each of which corresponds to a different exponent in the input range of the activation function and includes input data elements having that exponent. Target accuracies may be assigned to the identified exponents based on a statistical analysis of the input data elements. The target accuracy of an input segment is used to determine one or more linear functions that approximate the activation function for the input segment. An error of an approximation of the activation function by a linear function for the input segment may be within the target accuracy. The parameters of the linear functions may be stored in a look-up table (LUT). During the execution of the DNN, the LUT may be used to execute the activation function.
Type: Application
Filed: December 8, 2023
Publication date: April 4, 2024
Applicant: Intel Corporation
Inventors: Umer Iftikhar Cheema, Robert Simofi, Deepak Abraham Mathaikutty, Arnab Raha, Dinakar Kondru
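A rough sketch of exponent-based segmentation with frequency-driven target accuracies; the accuracy-assignment rule here is invented purely to illustrate the idea of tightening the error budget for exponents that occur often.

```python
import math
from collections import Counter

def exponent_segments(samples):
    """Group input samples by their binary exponent and assign a tighter
    target accuracy to exponents that occur more often (invented rule)."""
    exps = [math.frexp(abs(x))[1] for x in samples if x != 0.0]
    counts = Counter(exps)
    total = sum(counts.values())
    return {e: 0.01 * (1.0 - c / total) + 0.001   # smaller budget if frequent
            for e, c in counts.items()}

samples = [0.3, 0.4, 0.5, 1.2, 1.6, 2.5, 7.9, 0.05]
for exp, tol in sorted(exponent_segments(samples).items()):
    print(f"exponent {exp:+d}: target max error {tol:.4f}")
```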
-
Patent number: 11922178
Abstract: Methods, apparatus, systems, and articles of manufacture to load data into an accelerator are disclosed. An example apparatus includes data provider circuitry to load a first section and an additional amount of compressed machine learning parameter data into a processor engine. Processor engine circuitry executes a machine learning operation using the first section of compressed machine learning parameter data. Compressed local data re-user circuitry determines if a second section is present in the additional amount of compressed machine learning parameter data. The processor engine circuitry executes a machine learning operation using the second section when the second section is present in the additional amount of compressed machine learning parameter data.
Type: Grant
Filed: June 25, 2021
Date of Patent: March 5, 2024
Assignee: Intel Corporation
Inventors: Arnab Raha, Deepak Mathaikutty, Debabrata Mohapatra, Sang Kyun Kim, Gautham Chinya, Cormac Brick
-
Patent number: 11907827
Abstract: Methods and systems include a neural network system that includes a neural network accelerator. The neural network accelerator includes multiple processing engines coupled together to perform arithmetic operations in support of an inference performed using the deep neural network system. The neural network accelerator also includes schedule-aware tensor data distribution circuitry or software that is configured to load tensor data into the multiple processing engines in a load phase, extract output data from the multiple processing engines in an extraction phase, reorganize the extracted output data, and store the reorganized extracted output data to memory.
Type: Grant
Filed: June 28, 2019
Date of Patent: February 20, 2024
Assignee: Intel Corporation
Inventors: Gautham Chinya, Huichu Liu, Arnab Raha, Debabrata Mohapatra, Cormac Brick
-
Publication number: 20240028895
Abstract: A load module in a deep neural network (DNN) accelerator may receive a configuration parameter indicating a selection between an activation sparsity mode and a weight sparsity mode. The load module may read a sparse activation tensor, an activation sparsity bitmap, a sparse weight tensor, and a weight sparsity bitmap from a memory. The load module may densify one of the compressed tensors based on the sparsity mode and leave the other compressed tensor as is. The load module may load the dense tensor and the sparse tensor to a sparse cell. The sparse cell includes a sparsity module that may select one or more elements of the dense tensor based on the sparsity bitmap of the sparse tensor. The sparse cell also includes multiply-accumulate (MAC) units that perform MAC operations on the selected elements and the sparse tensor. MAC operations on unselected elements of the dense tensor are skipped.
Type: Application
Filed: September 28, 2023
Publication date: January 25, 2024
Inventors: Arnab Raha, Deepak Abraham Mathaikutty, Dinakar Kondru, Umer Iftikhar Cheema, Martin Power, Niall Hanrahan
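The sketch below illustrates the two pieces described above under an assumed activation-sparsity mode: densifying one compressed operand from its bitmap, then gathering only the dense elements that line up with the other operand's nonzeros so the remaining MACs are skipped. Function names and data are illustrative, not the accelerator's interface.

```python
import numpy as np

def densify(values: np.ndarray, bitmap: np.ndarray) -> np.ndarray:
    """Expand a compressed tensor back to dense form using its bitmap."""
    dense = np.zeros(bitmap.shape, dtype=values.dtype)
    dense[bitmap.astype(bool)] = values
    return dense

def sparse_dot(dense: np.ndarray, sparse_vals: np.ndarray,
               sparse_bitmap: np.ndarray) -> int:
    """Multiply-accumulate only where the sparse operand is nonzero:
    gather the matching dense elements and skip the rest."""
    selected = dense[sparse_bitmap.astype(bool)]
    return int(np.dot(selected, sparse_vals))

# Activation-sparsity mode: weights are densified, activations stay compressed.
wgt_vals, wgt_bitmap = np.array([2, -1, 3]), np.array([1, 0, 1, 0, 1, 0])
act_vals, act_bitmap = np.array([4, 5]),     np.array([0, 1, 0, 0, 1, 0])
dense_weights = densify(wgt_vals, wgt_bitmap)
print(sparse_dot(dense_weights, act_vals, act_bitmap))   # 15
```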
-
Publication number: 20240022259
Abstract: Methods, systems, articles of manufacture, and apparatus are disclosed to decode zero-value-compression data vectors. An example apparatus includes: a buffer monitor to monitor a buffer for a header including a value indicative of compressed data; a data controller to, when the buffer includes compressed data, determine a first value of a sparse select signal based on (1) a select signal and (2) a first position in a sparsity bitmap, the first value of the sparse select signal corresponding to a processing element that is to process a portion of the compressed data; and a write controller to, when the buffer includes compressed data, determine a second value of a write enable signal based on (1) the select signal and (2) a second position in the sparsity bitmap, the second value of the write enable signal corresponding to the processing element that is to process the portion of the compressed data.
Type: Application
Filed: September 12, 2023
Publication date: January 18, 2024
Applicant: Intel Corporation
Inventors: Gautham Chinya, Debabrata Mohapatra, Arnab Raha, Huichu Liu, Cormac Brick
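A toy derivation of a per-PE write-enable flag and compressed-stream index from positions in a sparsity bitmap, as one plausible reading of the select/write-enable logic described above; the exact signal semantics are assumptions.

```python
def route_compressed(bitmap, select):
    """Derive, for one processing element, a sparse-select index into the
    compressed stream and a write-enable flag from the sparsity bitmap.
    `select` is the PE's assigned bitmap position."""
    write_enable = bool(bitmap[select])      # only nonzero slots receive data
    sparse_select = sum(bitmap[:select])     # how many nonzeros precede it
    return sparse_select, write_enable

bitmap = [1, 0, 1, 1, 0, 0, 1, 0]
for pos in range(len(bitmap)):
    idx, we = route_compressed(bitmap, pos)
    print(f"PE {pos}: write_enable={we}, compressed index={idx if we else '-'}")
```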
-
Publication number: 20240013040
Abstract: A drain module may drain activations in an output tensor of a convolution from a PE array that performs the convolution. The drain module may extract activations generated in a collection of PE columns. The activations generated in the PE columns in the collection may be concatenated, e.g., activations generated in the first PE column of the collection may be followed by activations generated in the second PE column of the collection, and so on. The activations in the output tensor may be rearranged into activation vectors. Each activation vector may include activations in different output channels of the deep learning operation. The activations in each activation vector may have the same (X, Y) coordinate in the output tensor. The drain module may determine a memory address for an activation based on the activation's (X, Y, Z) coordinate in the output tensor and write the activation to the memory address.
Type: Application
Filed: September 26, 2023
Publication date: January 11, 2024
Applicant: Intel Corporation
Inventors: Arnab Raha, Deepak Abraham Mathaikutty, Umer Iftikhar Cheema, Dinakar Kondru
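A minimal sketch of a channel-major address calculation in which activations sharing an (X, Y) coordinate but differing in output channel Z land contiguously, so one activation vector maps to one contiguous memory run; the layout and strides are assumptions for the example.

```python
def activation_address(x, y, z, width, channels, bytes_per_elem=1, base=0):
    """Toy linear address: activations that share an (X, Y) coordinate but
    differ in output channel Z are stored back to back."""
    return base + ((y * width + x) * channels + z) * bytes_per_elem

# A 4x4 output tile with 8 output channels.
print(activation_address(x=0, y=0, z=0, width=4, channels=8))   # 0
print(activation_address(x=0, y=0, z=7, width=4, channels=8))   # 7  (same vector)
print(activation_address(x=1, y=0, z=0, width=4, channels=8))   # 8  (next vector)
```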
-
Publication number: 20230394312
Abstract: Activations (e.g., output activations) or weights of intermediate layers of deep neural networks (DNNs) can be pruned to increase sparsity and reduce the amount of computation required for performing the computations in the layers or subsequent layers. A pruning threshold may be determined, e.g., through an iterative process, and activations or weights having absolute values lower than the pruning threshold may be changed to zero. A first pruning threshold may be used to prune an output tensor or kernel of a layer. The loss in the accuracy of the DNN due to the pruning may be determined. A second pruning threshold may be determined based on the first pruning threshold and the accuracy loss. The DNN may be modified by adding a pruning operation to the layer. The pruning operation can prune output tensors or kernels of the layer based on the second pruning threshold.
Type: Application
Filed: August 22, 2023
Publication date: December 7, 2023
Applicant: Intel Corporation
Inventors: Soumendu Kumar Ghosh, Shamik Kundu, Arnab Raha, Deepak Abraham Mathaikutty
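A short sketch of threshold-based magnitude pruning with a second threshold derived from a measured accuracy loss; the adjustment rule and the accuracy budget are invented for illustration, and the measured loss would come from actually evaluating the DNN.

```python
import numpy as np

def prune(tensor: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out elements whose magnitude is below the pruning threshold."""
    return np.where(np.abs(tensor) < threshold, 0.0, tensor)

def adjust_threshold(first_threshold: float, accuracy_loss: float,
                     budget: float = 0.01) -> float:
    """Invented rule: shrink the threshold in proportion to how far the
    measured accuracy loss overshoots the allowed budget."""
    if accuracy_loss <= budget:
        return first_threshold          # keep the more aggressive threshold
    return first_threshold * budget / accuracy_loss

rng = np.random.default_rng(0)
kernel = rng.standard_normal(10)
t1 = 0.5
pruned = prune(kernel, t1)
measured_loss = 0.03                     # placeholder for an evaluated accuracy drop
t2 = adjust_threshold(t1, measured_loss)
print(f"sparsity at t1: {np.mean(pruned == 0):.2f}, second threshold: {t2:.3f}")
```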