Patents by Inventor Ali Shafiee

Ali Shafiee has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230103997
    Abstract: A vision transformer includes L layers, and H attention heads in each layer. An ĥ of the attention heads include an attention mask added before a Softmax operation, and an h of the attention heads include unmasked attention heads, in which H=ĥ+h. Each attention mask multiplies a Query vector and a Key vector to form element-wise products. At least one attention mask is a hard mask that selects closest neighbors of a patch and ignores patches further away than the closest neighbors of the patch. Alternatively, at least one attention mask includes a soft mask that multiplies weights of closest neighbors of a patch by a magnification factor and passes weights of patches that are further away than the closest neighbors of the patch. A learnable bias β may be added to diagonal elements of the at least one attention map.
    Type: Application
    Filed: January 11, 2022
    Publication date: April 6, 2023
    Inventors: Ling LI, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
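A minimal NumPy sketch of the hard-mask idea in the abstract above, for a 1-D sequence of patches. The neighborhood radius k, the array shapes, and the function name are illustrative assumptions; the layer/head split (ĥ masked, h unmasked), the soft mask, and the learnable bias β are not modeled.

```python
import numpy as np

def hard_masked_attention(Q, K, V, k=2):
    """One masked attention head: each patch attends only to patches
    within k positions (its closest neighbors); scores for patches
    further away are set to -inf before the Softmax, so their
    attention weights become exactly zero."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # query-key products
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    scores = np.where(dist <= k, scores, -np.inf)    # hard mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # Softmax
    return w @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((6, 8))              # 6 patches, dim 8
out = hard_masked_attention(Q, K, V)                 # (6, 8) output
```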
  • Publication number: 20230047273
    Abstract: A computation unit for performing a computation of a neural network layer is disclosed. A number of processing element (PE) units are arranged in an array. First input values are provided in parallel in an input dimension of the array during a first processing period, and second input values are provided in parallel in the input dimension during a second processing period. Computations are performed by the PE units based on stored weight values. An adder coupled to a first set of the PE units generates a first sum of results of the computations by the first set of PE units during the first processing period, and generates a second sum of results of the computations during the second processing period. A first accumulator coupled to the adder stores the first sum, and further shifts the first sum to a second accumulator prior to storing the second sum.
    Type: Application
    Filed: October 14, 2022
    Publication date: February 16, 2023
    Inventors: Hamzah Abdelaziz, Joseph Hassoun, Ali Shafiee Ardestani
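A toy Python model of the adder/accumulator behavior in the abstract above. The column width, the example weights and inputs, and the class name are assumptions; a real array would have many such columns operating in parallel.

```python
class ComputeColumn:
    """Toy model: PEs multiply parallel inputs by stored weights, an
    adder sums the products, and each period the previous sum is
    shifted from the first accumulator into a second accumulator
    before the new sum is stored."""
    def __init__(self, weights):
        self.weights = weights
        self.acc1 = 0                  # first accumulator
        self.acc2 = 0                  # second accumulator

    def period(self, inputs):
        total = sum(w * x for w, x in zip(self.weights, inputs))  # adder
        self.acc2 = self.acc1          # shift previous sum downstream
        self.acc1 = total              # store the new sum
        return self.acc1, self.acc2

col = ComputeColumn([2, -1, 3, 4])     # one stored weight per PE
print(col.period([1, 0, 2, 1]))        # first processing period -> (12, 0)
print(col.period([0, 1, 1, 3]))        # second processing period -> (14, 12)
```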
  • Publication number: 20230047025
    Abstract: A multichannel data packer includes a plurality of two-input multiplexers and a controller. The plurality of two-input multiplexers is arranged in 2N rows and N columns in which N is an integer greater than 1. Each input of a multiplexer in a first column receives a respective bit stream of 2N channels of bit streams. Each respective bit stream includes a bit-stream length based on data in the bit stream. The multiplexers in a last column output 2N channels of packed bit streams each having a same bit-stream length. The controller controls the plurality of multiplexers so that the multiplexers in the last column output the 2N channels of bit streams that each have the same bit-stream length.
    Type: Application
    Filed: October 19, 2022
    Publication date: February 16, 2023
    Inventors: Ilia OVSIANNIKOV, Ali SHAFIEE ARDESTANI, Lei WANG, Joseph H. HASSOUN
  • Patent number: 11579842
    Abstract: An N×N multiplier may include an N/2×N first multiplier, an N/2×N/2 second multiplier, and an N/2×N/2 third multiplier. The N×N multiplier receives two operands to multiply. The first, second and/or third multipliers are selectively disabled if an operand equals zero or has a small value. If the operands are both less than 2^(N/2), the second or the third multiplier is used to multiply the operands. If one operand is less than 2^(N/2) and the other operand is equal to or greater than 2^(N/2), the first multiplier is used or the second and third multipliers are used to multiply the operands. If both operands are equal to or greater than 2^(N/2), the first, second and third multipliers are used to multiply the operands.
    Type: Grant
    Filed: January 15, 2021
    Date of Patent: February 14, 2023
    Inventors: Ilia Ovsiannikov, Ali Shafiee Ardestani, Joseph Hassoun, Lei Wang
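A worked Python example of the decomposition in the abstract above for N = 8. The operand split and the disable conditions follow the abstract; gating unused multipliers to zero is a simplification of the hardware.

```python
N = 8
HALF = 1 << (N // 2)               # 2^(N/2) = 16

def nxn_multiply(a, b):
    """a*b via one N/2-by-N product and two N/2-by-N/2 products;
    a multiplier is 'disabled' (skipped) when its operand is zero."""
    a_hi, a_lo = divmod(a, HALF)
    b_hi, b_lo = divmod(b, HALF)
    p1 = a_hi * b if a_hi else 0               # N/2 x N   first multiplier
    p2 = a_lo * b_hi if a_lo and b_hi else 0   # N/2 x N/2 second multiplier
    p3 = a_lo * b_lo if a_lo and b_lo else 0   # N/2 x N/2 third multiplier
    return p1 * HALF + p2 * HALF + p3

assert nxn_multiply(200, 99) == 200 * 99   # both operands >= 2^(N/2)
assert nxn_multiply(9, 7) == 9 * 7         # both operands <  2^(N/2)
```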
  • Publication number: 20220413805
    Abstract: A method for performing a neural network operation. In some embodiments, the method includes: calculating a first plurality of products, each of the first plurality of products being the product of a weight and an activation; calculating a first partial sum, the first partial sum being the sum of the products; and compressing the first partial sum to form a first compressed partial sum.
    Type: Application
    Filed: August 19, 2021
    Publication date: December 29, 2022
    Inventors: Ling LI, Ali SHAFIEE ARDESTANI
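A minimal sketch of the flow in the abstract above. The abstract does not specify the compression scheme, so the compress() function below (a hypothetical truncation of low-order bits) is purely illustrative.

```python
def compress(psum, drop_bits=4):
    """Hypothetical lossy compression: keep only the high-order bits."""
    return psum >> drop_bits

weights     = [3, -2, 5]
activations = [4,  1, 2]
products    = [w * a for w, a in zip(weights, activations)]  # first products
partial_sum = sum(products)                # first partial sum: 20
compressed  = compress(partial_sum)        # first compressed partial sum: 1
```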
  • Publication number: 20220414421
    Abstract: A system and method for handling processing with outliers. In some embodiments, the method includes: reading a first activation and a second activation, each including a least significant part and a most significant part, multiplying a first weight and a second weight by the respective activations, the multiplying of the first weight by the first activation including multiplying the first weight by the least significant part of the first activation in a first multiplier, the multiplying of the second weight by the second activation including: multiplying the second weight by the least significant part of the second activation in a second multiplier, and multiplying the second weight by the most significant part of the second activation in a shared multiplier, the shared multiplier being associated with a plurality of rows of an array of activations.
    Type: Application
    Filed: October 4, 2021
    Publication date: December 29, 2022
    Inventors: Ali Shafiee Ardestani, Hamzah Ahmed Ali Abdelaziz, Joseph Hassoun
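A toy Python model of the outlier handling in the abstract above. The 4-bit split width, the names, and the zero test on the most significant part are assumptions; the sharing of one multiplier across a plurality of rows is not modeled.

```python
BITS = 4                                  # assumed split width

def split(act):
    """Split an activation into (most, least) significant parts."""
    return act >> BITS, act & ((1 << BITS) - 1)

def row_product(weight, act):
    msp, lsp = split(act)
    product = weight * lsp                # per-row multiplier
    if msp:                               # outlier: most significant part
        product += (weight * msp) << BITS # handled by the shared multiplier
    return product

assert row_product(3, 130) == 3 * 130     # outlier activation (MS part != 0)
assert row_product(3, 11) == 3 * 11       # small activation, shared mult idle
```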
  • Publication number: 20220405559
    Abstract: A neural network accelerator is disclosed that includes a multiplication unit, an adder-tree unit and an accumulator unit. The multiplication unit and the adder-tree unit are configured to perform lattice-multiplication operations. The accumulator unit is coupled to an output of the adder-tree unit to form dot-product values from the lattice-multiplication operations performed by the multiplication unit and the adder-tree unit. The multiplication unit includes n multiplier units that perform lattice-multiplication-based operations and output product values. Each multiplier unit includes a plurality of multipliers. Each multiplier unit receives first and second multiplicands that each include a most significant nibble (MSN) and a least significant nibble (LSN). The multipliers in each multiplier unit receive different combinations of the MSNs and the LSNs of the multiplicands. The multiplication unit and the adder-tree unit can provide mixed-precision dot-product computations.
    Type: Application
    Filed: August 31, 2021
    Publication date: December 22, 2022
    Inventors: Hamzah Ahmed Ali ABDELAZIZ, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
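A Python sketch of the nibble-wise lattice multiplication in the abstract above, assuming 8-bit multiplicands: four 4×4 multipliers each take one MSN/LSN combination, and the shifted partial products recombine into the full product.

```python
def lattice_multiply(x, y):
    x_msn, x_lsn = x >> 4, x & 0xF
    y_msn, y_lsn = y >> 4, y & 0xF
    # one 4x4 multiplier per MSN/LSN combination
    p_ll = x_lsn * y_lsn
    p_lh = x_lsn * y_msn
    p_hl = x_msn * y_lsn
    p_hh = x_msn * y_msn
    # the adder tree aligns and sums the partial products
    return p_ll + ((p_lh + p_hl) << 4) + (p_hh << 8)

assert lattice_multiply(0xB7, 0x3C) == 0xB7 * 0x3C
```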
  • Publication number: 20220405557
    Abstract: A system and a method are disclosed for processing input feature map (IFM) data of a current layer of a neural network model using an array of reconfigurable neural processing units (NPUs) and storing output feature map (OFM) data of the next layer of the neural network model at a location that does not involve a data transfer between memories of the NPUs according to the subject matter disclosed herein. The reconfigurable NPUs may be used to improve utilization of the NPUs of a neural processing system.
    Type: Application
    Filed: August 11, 2021
    Publication date: December 22, 2022
    Inventors: Jong Hoon SHIN, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
  • Publication number: 20220405558
    Abstract: A core of neural processing units is configured to efficiently process a depthwise convolution by maximizing spatial feature-map locality using adder trees. Data paths of activations and weights are inverted, and 2-to-1 multiplexers are placed every 2/9 multipliers along a row of multipliers. During a depthwise convolution operation, the core is operated using an RS×HW dataflow to maximize the locality of feature maps. For a normal convolution operation, the data paths of activations and weights may be configured for a normal convolution configuration in which the multiplexers are idle.
    Type: Application
    Filed: August 12, 2021
    Publication date: December 22, 2022
    Inventors: Jong Hoon SHIN, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
  • Patent number: 11507817
    Abstract: A computation unit for performing a computation of a neural network layer is disclosed. A number of processing element (PE) units are arranged in an array. First input values are provided in parallel in an input dimension of the array during a first processing period, and second input values are provided in parallel in the input dimension during a second processing period. Computations are performed by the PE units based on stored weight values. An adder coupled to a first set of the PE units generates a first sum of results of the computations by the first set of PE units during the first processing period, and generates a second sum of results of the computations during the second processing period. A first accumulator coupled to the adder stores the first sum, and further shifts the first sum to a second accumulator prior to storing the second sum.
    Type: Grant
    Filed: June 12, 2020
    Date of Patent: November 22, 2022
    Assignee: Samsung Electronics Co., Ltd.
    Inventors: Hamzah Abdelaziz, Joseph Hassoun, Ali Shafiee Ardestani
  • Publication number: 20220156569
    Abstract: A general matrix-matrix (GEMM) accelerator core includes first and second buffers, and a processing element (PE). The first buffer receives a elements of a matrix A of activation values. The second buffer receives b elements of a matrix B of weight values. The matrix B is preprocessed with a nonzero-valued b element replacing a zero-valued b element in a first row of the second buffer based on the zero-valued b element being in the first row of the second buffer. Metadata is generated that includes movement information of the nonzero-valued b element to replace the zero-valued b element. The PE receives b elements from a first row of the second buffer and a elements from the first buffer from locations in the first buffer that correspond to locations in the second buffer from where the b elements have been received by the PE as indicated by the metadata.
    Type: Application
    Filed: November 8, 2021
    Publication date: May 19, 2022
    Inventors: Jong Hoon SHIN, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
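A simplified Python model of the weight-side preprocessing in the abstract above: a zero b element in the front row of the B buffer is replaced by a nonzero b element from a deeper row of the same column, and metadata records the source row so the PE fetches the matching a element. The borrowing depth of one row and the function names are assumptions.

```python
def preprocess_b(b_buffer):
    """b_buffer[row][col] -> (front row of weights, per-column source rows)."""
    front, meta = [], []
    for col in range(len(b_buffer[0])):
        row = 0
        if b_buffer[0][col] == 0 and len(b_buffer) > 1 and b_buffer[1][col] != 0:
            row = 1                       # borrow the nonzero from behind
        front.append(b_buffer[row][col])
        meta.append(row)
    return front, meta

def pe_dot(a_buffer, front, meta):
    # each a element is read from the location its b element came from
    return sum(w * a_buffer[r][c] for c, (w, r) in enumerate(zip(front, meta)))

B = [[0, 2, 0],       # front row of the weight buffer (has zeros)
     [5, 7, 0]]       # next row
A = [[1, 3, 4],       # activations aligned with B's rows
     [6, 8, 9]]
front, meta = preprocess_b(B)
print(pe_dot(A, front, meta))   # 5*6 + 2*3 + 0*4 = 36
```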
  • Publication number: 20220156568
    Abstract: A general matrix-matrix (GEMM) accelerator core includes first and second buffers, a control logic circuit, and a first processing element (PE). The first buffer receives a elements of a first matrix A of activation values. The second buffer receives b elements of a second matrix B of weight values. The control logic circuit replaces a zero-valued a element in a first column of the first buffer with a nonzero-valued a element that is within a maximum borrowing distance of a location of the zero-valued a element in the first column of the first buffer. The PE receives a elements from the first column of the first buffer, including the nonzero-valued a element selected to replace the zero-valued a element, and receives b elements from locations in the second buffer that correspond to locations in the first buffer from where the a elements have been received by the PE.
    Type: Application
    Filed: November 8, 2021
    Publication date: May 19, 2022
    Inventors: Jong Hoon SHIN, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
  • Publication number: 20220147312
    Abstract: A processor for fine-grain sparse integer and floating-point operations and method of operation thereof. In some embodiments, the method includes forming a first set of products and forming a second set of products. The forming of the first set of products may include: multiplying, in a first multiplier, a first activation value by a least significant sub-word and a most significant sub-word of a first weight to form a first partial product and a second partial product; and adding the first partial product and the second partial product. The forming of the second set of products may include: multiplying, in the first multiplier, a second activation value by a first sub-word and a second sub-word of a mantissa to form a third partial product and a fourth partial product; and adding the third partial product and the fourth partial product.
    Type: Application
    Filed: December 22, 2020
    Publication date: May 12, 2022
    Inventors: Ali Shafiee Ardestani, Joseph Hassoun
  • Publication number: 20220147313
    Abstract: A processor for fine-grain sparse integer and floating-point operations and method of operation thereof. In some embodiments, the method includes forming a first set of products, and forming a second set of products. The forming of the first set of products may include: multiplying, in a first multiplier, a second multiplier, and a third multiplier, the first activation value by a first least significant sub-word, a second least significant sub-word, and a most significant sub-word; and adding a first resulting partial product and a second resulting partial product. The forming of the second set of products may include forming a first floating point product, the forming of the first floating point product including multiplying, in the first multiplier, a first sub-word of a mantissa of an activation value by a first sub-word of a mantissa of a weight, to form a third partial product.
    Type: Application
    Filed: December 23, 2020
    Publication date: May 12, 2022
    Inventors: Ali Shafiee Ardestani, Joseph H. Hassoun
  • Publication number: 20220114425
    Abstract: A system and method for performing sets of multiplications in a manner that accommodates outlier values. In some embodiments the method includes: forming a first set of products, each product of the first set of products being a product of a first activation value and a respective weight of a first plurality of weights. The forming of the first set of products may include multiplying, in a first multiplier, the first activation value and a least significant sub-word of a first weight to form a first partial product; multiplying, in a second multiplier, the first activation value and a least significant sub-word of a second weight; multiplying, in a third multiplier, the first activation value and a most significant sub-word of the first weight to form a second partial product; and adding the first partial product and the second partial product.
    Type: Application
    Filed: December 2, 2020
    Publication date: April 14, 2022
    Inventors: Ali Shafiee Ardestani, Joseph Hassoun
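A toy Python model of the sub-word multiplication running through the three abstracts above (20220147312, 20220147313, and 20220114425): each weight splits into 4-bit sub-words, dedicated multipliers handle the least significant sub-words, and a further multiplier handles the (often zero, hence skippable) most significant sub-word. The 4-bit width, the zero test, and the names are assumptions.

```python
BITS = 4
MASK = (1 << BITS) - 1

def sub_word_products(activation, w1, w2):
    """Products of one activation with two weights via sub-words."""
    w1_ms, w1_ls = w1 >> BITS, w1 & MASK
    w2_ms, w2_ls = w2 >> BITS, w2 & MASK
    p1 = activation * w1_ls                    # first multiplier
    p2 = activation * w2_ls                    # second multiplier
    if w1_ms:                                  # most significant sub-word is
        p1 += (activation * w1_ms) << BITS     # often zero and can be skipped
    if w2_ms:
        p2 += (activation * w2_ms) << BITS
    return p1, p2

assert sub_word_products(7, 0x93, 0x06) == (7 * 0x93, 7 * 0x06)
```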
  • Publication number: 20210357748
    Abstract: A system and method for weight preprocessing. In some embodiments, the method includes performing intra-tile preprocessing of a first weight tensor to form a first pre-processed weight tensor, and performing inter-tile preprocessing of the first pre-processed weight tensor, to form a second pre-processed weight tensor. The intra-tile preprocessing may include moving a first element of a first weight tile of the first weight tensor by one position, within the first weight tile, in a lookahead direction or in a lookaside direction. The inter-tile preprocessing may include moving a first row of a weight tile of the first pre-processed weight tensor by one position in a lookahead direction or by one position in a lookaside direction.
    Type: Application
    Filed: August 11, 2020
    Publication date: November 18, 2021
    Inventors: Jong Hoon Shin, Ali Shafiee Ardestani, Hamzah Ahmed Ali Abdelaziz, Joseph H. Hassoun
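A small Python sketch of intra-tile preprocessing as described in the abstract above: a nonzero weight moves one position, either from the next cycle in the same lane (lookahead) or from the next cycle in an adjacent lane (lookaside), to fill a zero slot. The tile layout (cycles × lanes) and the move priority are assumptions, and the inter-tile pass is not modeled.

```python
def intra_tile_preprocess(tile):
    """tile[cycle][lane] -> tile with one-position moves into zero slots."""
    for c in range(len(tile) - 1):
        for l in range(len(tile[0])):
            if tile[c][l] != 0:
                continue
            if tile[c + 1][l] != 0:            # lookahead: pull from next cycle
                tile[c][l], tile[c + 1][l] = tile[c + 1][l], 0
            elif l + 1 < len(tile[0]) and tile[c + 1][l + 1] != 0:
                tile[c][l], tile[c + 1][l + 1] = tile[c + 1][l + 1], 0  # lookaside
    return tile

print(intra_tile_preprocess([[0, 4], [3, 0]]))   # -> [[3, 4], [0, 0]]
```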
  • Publication number: 20210326682
    Abstract: A method of flattening channel data of an input feature map in an inference system includes retrieving pixel values of a channel of a plurality of channels of the input feature map from a memory and storing the pixel values in a buffer, extracting first values of a first region having a first size from among the pixel values stored in the buffer, the first region corresponding to an overlap region of a kernel of the inference system with channel data of the input feature map, rearranging second values corresponding to the overlap region of the kernel from among the first values in the first region, and identifying a first group of consecutive values from among the rearranged second values for supplying to a first dot-product circuit of the inference system.
    Type: Application
    Filed: June 12, 2020
    Publication date: October 21, 2021
    Inventors: Ali Shafiee Ardestani, Joseph Hassoun
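A NumPy sketch of the flattening flow in the abstract above: one channel's pixel values are buffered, the kernel-sized overlap region is extracted, and its values are rearranged into one consecutive run for a dot-product circuit. The 3×3 kernel, the row-major rearrangement, and the names are assumptions.

```python
import numpy as np

def flatten_overlap(channel, top, left, k=3):
    region = channel[top:top + k, left:left + k]   # first values (overlap region)
    return region.reshape(-1)                      # consecutive values

channel = np.arange(25).reshape(5, 5)              # one buffered channel
flat = flatten_overlap(channel, 1, 2)              # kernel overlap at (1, 2)
kernel = np.ones(9)
print(flat @ kernel)                               # one dot-product circuit input
```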
  • Publication number: 20210326686
    Abstract: A computation unit for performing a computation of a neural network layer is disclosed. A number of processing element (PE) units are arranged in an array. First input values are provided in parallel in an input dimension of the array during a first processing period, and second input values are provided in parallel in the input dimension during a second processing period. Computations are performed by the PE units based on stored weight values. An adder coupled to a first set of the PE units generates a first sum of results of the computations by the first set of PE units during the first processing period, and generates a second sum of results of the computations during the second processing period. A first accumulator coupled to the adder stores the first sum, and further shifts the first sum to a second accumulator prior to storing the second sum.
    Type: Application
    Filed: June 12, 2020
    Publication date: October 21, 2021
    Inventors: Hamzah Abdelaziz, Joseph Hassoun, Ali Shafiee Ardestani
  • Publication number: 20210319079
    Abstract: A dot-product architecture and method are disclosed for calculating floating-point dot-products of two vectors. The architecture includes an array of multiplier units that each include an integer logic that multiplies integer values of corresponding elements of the two vectors; an exponent logic that adds exponent values of the corresponding elements of the two vectors to form an unbiased exponent value, and a local shifter that forms a first shifted value by shifting a product-integer value by a number of bits in a predetermined direction based on a difference value between an unbiased exponent value corresponding to the product-integer value and a maximum unbiased exponent value for the array of multiplier units. An adder tree adds shifted values output from local shifters of the array of multiplier units to form an output, and an accumulator accumulates the output of the adder tree.
    Type: Application
    Filed: January 20, 2021
    Publication date: October 14, 2021
    Inventors: Hamzah Ahmed Ali ABDELAZIZ, Ali SHAFIEE ARDESTANI, Joseph H. HASSOUN
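A numeric Python sketch of the alignment scheme in the abstract above: integer products are shifted right by the gap between their unbiased exponent and the array-wide maximum exponent, so the adder tree can sum plain integers. Mantissa widths, rounding, and signs are ignored, and the (mantissa, exponent) pairs are hand-built examples.

```python
def fp_dot(xs, ys):
    """Each operand is (integer mantissa m, exponent e): value = m * 2**e."""
    # integer logic: mantissa products; exponent logic: exponent sums
    prods = [(mx * my, ex + ey) for (mx, ex), (my, ey) in zip(xs, ys)]
    e_max = max(e for _, e in prods)                # maximum unbiased exponent
    shifted = [p >> (e_max - e) for p, e in prods]  # local shifters
    return sum(shifted), e_max                      # adder tree + accumulator

x = [(3, 0), (5, 2)]    # values 3 and 20
y = [(2, 1), (7, 0)]    # values 4 and 7
total, e = fp_dot(x, y)
print(total * 2 ** e)   # 3*4 + 20*7 = 152
```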
  • Publication number: 20210312325
    Abstract: According to one general aspect, an apparatus may include a machine learning system. The machine learning system may include a precision determination circuit configured to: determine a precision level of data, and divide the data into a data subdivision. The machine learning system may exploit sparsity during the computation of each subdivision. The machine learning system may include a load balancing circuit configured to select a load balancing technique, wherein the load balancing technique includes alternately loading the computation circuit with at least a first data/weight subdivision combination and a second data/weight subdivision combination. The load balancing circuit may be configured to load a computation circuit with a selected data subdivision and a selected weight subdivision based, at least in part, upon the load balancing technique.
    Type: Application
    Filed: June 10, 2020
    Publication date: October 7, 2021
    Inventors: Hamzah ABDELAZIZ, Joseph HASSOUN, Ali SHAFIEE ARDESTANI