Patents by Inventor Ganesh Venkatesh

Ganesh Venkatesh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11392829
    Abstract: Approaches in accordance with various embodiments provide for the processing of sparse matrices for mathematical and programmatic operations. In particular, various embodiments enforce sparsity constraints for performing sparse matrix multiply-add instruction (MMA) operations. Deep neural networks can exhibit significant sparsity in the data used in operations, both in the activations and weights. The computational load can be reduced by excluding zero-valued data elements. A sparsity constraint is applied across all submatrices of a sparse matrix, providing fine-grained structured sparsity that is evenly distributed across the matrix. The matrix may then be compressed since a minimum number of elements of the matrix are known to have zero value. Matrix operations are then performed using these matrices.
    Type: Grant
    Filed: April 2, 2019
    Date of Patent: July 19, 2022
    Assignee: NVIDIA Corporation
    Inventors: Jeff Pool, Ganesh Venkatesh, Jorge Albericio Latorre, Jack Choquette, Ronny Krashinsky, John Tran, Feng Xie, Ming Y. Siu, Manan Patel
  • Patent number: 11379420
    Abstract: Compressed data is oftentimes beneficial for reducing the computing resources required, for example, to transmit and store data. The compression of data is particularly useful when dealing with sparse data (data that includes numerous zeros or near-zero values) and only non-zero values above a certain threshold have significance. When dealing with compressed data, oftentimes the data needs to be decompressed for processing (e.g., by deep learning networks or other applications configured to operate on sparse, or other uncompressed data). Instructions are disclosed for supporting the decompression of compressed data by a processing unit such as a CPU and GPU.
    Type: Grant
    Filed: March 20, 2019
    Date of Patent: July 5, 2022
    Assignee: NVIDIA CORPORATION
    Inventors: Jorge Albericio Latorre, Jack H. Choquette, Manan Maheshkumar Patel, Jeffrey Pool, Ming Y. Siu, Ronny Meir Krashinsky, Ganesh Venkatesh
  • Publication number: 20220207345
    Abstract: A machine-learning accelerator system, comprising: a plurality of controllers each configured to traverse a feature map with n-dimensions according to instructions that specify, for each of the n-dimensions, a respective traversal size, wherein each controller comprises: a counter stack comprising counters each associated with a respective dimension of the n-dimensions of the feature map, wherein each counter is configured to increment a respective count from a respective initial value to the respective traversal size associated with the respective dimension associated with that counter; a plurality of address generators each configured to use the respective counts of the counters to generate at least one memory address at which a portion of the feature map is stored; and a dependency controller computing module configured to (1) track conditional statuses for incrementing the counters and (2) allow or disallow each of the counters to increment based on the conditional statuses.
    Type: Application
    Filed: December 28, 2020
    Publication date: June 30, 2022
    Inventors: Harshit Khaitan, Ganesh Venkatesh, Simon James Hollis
  • Publication number: 20220164218
    Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
    Type: Application
    Filed: July 21, 2021
    Publication date: May 26, 2022
    Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD
  • Publication number: 20220129524
    Abstract: Disclosed herein includes a system, a method, and a device for improving computational efficiency of deconvolution by reducing a number of dot products. In one aspect, an input image having a set of pixels is received. A first dot product may be performed on a subset of the set of pixels of the input image and a portion of a kernel, to generate a first pixel of an output image. A number of multiplications performed for the first dot product performed may be less than a number of elements of the kernel. A second dot product on a remaining portion of the kernel to generate the first pixel of the output image may be bypassed.
    Type: Application
    Filed: January 10, 2022
    Publication date: April 28, 2022
    Inventor: Ganesh Venkatesh
  • Publication number: 20220083844
    Abstract: In one embodiment, a method for machine learning acceleration includes receiving, by a shared controller of a tensor processor cluster that includes multiple tensor processors, a multi-cycle instruction, determining, based on the instruction, a sequence of vector operations to be executed by the tensor processors and address information usable to determine a respective spatial partition of an input tensor on which each tensor processor is to operate when performing each vector operation. The method also includes, for each vector operation in the sequence, generating, based on the address information, a common address offset, relative to a respective base address associated with each tensor processor, at which each tensor processor is to retrieve the respective spatial partition on which the tensor processor is to operate, multicasting the common address offset to the tensor processors, and controlling the tensor processors to execute the vector operation in parallel and in lock step.
    Type: Application
    Filed: September 16, 2020
    Publication date: March 17, 2022
    Inventors: Harshit Khaitan, Ganesh Venkatesh, Vikas Chandra
  • Publication number: 20220058026
    Abstract: Disclosed herein includes improving computational efficiency of multiply-accumulate (MAC) operation. In one aspect, a computing device identifies, a first vector including non-zero elements of a base matrix, and a second vector indicating a location of each of the non-zero elements of the base matrix. In one aspect, the device determines a first element and a second element of the first vector. In one aspect, the device determines a third element and a fourth element of the second vector. In one aspect, the device determines i) a fifth element of an input vector according to the third element of the second vector, and ii) a sixth element of the input vector according to the fourth element of the second vector. In one aspect, the device causes a MAC circuitry to perform a dot product according to the first element, the second element, the fifth element, and the sixth element.
    Type: Application
    Filed: August 19, 2020
    Publication date: February 24, 2022
    Inventors: Alagappan Valliappan, Ganesh Venkatesh, Pierce I-Jen Chuang
  • Patent number: 11222092
    Abstract: Disclosed herein includes a system, a method, and a device for improving computational efficiency of deconvolution by reducing a number of dot products. In one aspect, an input image having a set of pixels is received. A first dot product may be performed on a subset of the set of pixels of the input image and a portion of a kernel, to generate a first pixel of an output image. A number of multiplications performed for the first dot product performed may be less than a number of elements of the kernel. A second dot product on a remaining portion of the kernel to generate the first pixel of the output image may be bypassed.
    Type: Grant
    Filed: July 16, 2019
    Date of Patent: January 11, 2022
    Assignee: FACEBOOK TECHNOLOGIES, LLC
    Inventor: Ganesh Venkatesh
  • Patent number: 11093277
    Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
    Type: Grant
    Filed: June 26, 2020
    Date of Patent: August 17, 2021
    Assignee: Intel Corporation
    Inventors: Rajesh M. Sankaran, Gilbert Neiger, Narayan Ranganathan, Stephen R. Van Doren, Joseph Nuzman, Niall D. McDonnell, Michael A. O'Hanlon, Lokpraveen B. Mosur, Tracy Garrett Drysdale, Eriko Nurvitadhi, Asit K. Mishra, Ganesh Venkatesh, Deborah T. Marr, Nicholas P. Carter, Jonathan D. Pearce, Edward T. Grochowski, Richard J. Greco, Robert Valentine, Jesus Corbal, Thomas D. Fletcher, Dennis R. Bradford, Dwight P. Manley, Mark J. Charney, Jeffrey J. Cook, Paul Caprioli, Koichi Yamada, Kent D. Glossop, David B. Sheffield
  • Patent number: 10977002
    Abstract: Disclosed herein includes a system, a method, and a device including shift circuitry and add circuitry for performing multiplication of a first value and a second value for a neural network. The first value has a predetermined format including a first bit, and two or more second bits to represent a value of zero or 2n where n is an integer greater than or equal to 0. The device shifts, when the two or more second bits represent the value of 2n, the second value by (n+1) bits via the shift circuitry to provide a first result, selectively outputs zero or the second value, based on a value of the first bit of the first value, to provide a second result, and adds the first result and the second results via the add circuitry to provide a result of the multiplication of the first and second values.
    Type: Grant
    Filed: July 15, 2019
    Date of Patent: April 13, 2021
    Assignee: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li, Vikas Chandra
  • Publication number: 20210019591
    Abstract: Disclosed herein includes a system, a method, and a device for receiving input data to generate a plurality of outputs for a layer of a neural network. The plurality of outputs are arranged in a first array. Dimensions of the first array may be compared with dimensions of a processing unit (PE) array including a plurality of PEs. According to a result of the comparing, the first array is partitioned into subarrays by the processor. Each of the subarrays has dimensions less than or equal to the dimensions of the PE array. A first group of PEs in the PE array is assigned to a first one of the subarrays. A corresponding output of the plurality of outputs is generated using a portion of the input data by each PE of the first group of PEs assigned to the first one of the subarrays.
    Type: Application
    Filed: July 15, 2019
    Publication date: January 21, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li
  • Publication number: 20210019633
    Abstract: Disclosed herein includes a system, a method, and a device for performing a convolution on data of a current layer of a neural network, including a plurality of channels arranged in a first order and partitioned into a plurality of first partitions according to the first order. Each first partition includes a result of a convolution on a corresponding partition of channels in data of a previous layer of the neural network. The device shifts the plurality of channels arranged in the first order to a second order, partition the shifted plurality of channels into a plurality of second partitions, according to the second order. For each of the plurality of second partitions, the device performs a convolution on channels of the shifted plurality of channels that are in the corresponding second partition.
    Type: Application
    Filed: July 15, 2019
    Publication date: January 21, 2021
    Applicant: Facebook Technologies, LLC
    Inventor: Ganesh Venkatesh
  • Publication number: 20210019363
    Abstract: Disclosed herein includes a system, a method, and a device for improving computational efficiency of deconvolution by reducing a number of dot products. In one aspect, an input image having a set of pixels is received. A first dot product may be performed on a subset of the set of pixels of the input image and a portion of a kernel, to generate a first pixel of an output image. A number of multiplications performed for the first dot product performed may be less than a number of elements of the kernel. A second dot product on a remaining portion of the kernel to generate the first pixel of the output image may be bypassed.
    Type: Application
    Filed: July 16, 2019
    Publication date: January 21, 2021
    Applicant: Facebook Technologies, LLC
    Inventor: Ganesh Venkatesh
  • Publication number: 20210019115
    Abstract: Disclosed herein includes a system, a method, and a device including shift circuitry and add circuitry for performing multiplication of a first value and a second value for a neural network. The first value has a predetermined format including a first bit, and two or more second bits to represent a value of zero or 2n where n is an integer greater than or equal to 0. The device shifts, when the two or more second bits represent the value of 2n, the second value by (n+1) bits via the shift circuitry to provide a first result, selectively outputs zero or the second value, based on a value of the first bit of the first value, to provide a second result, and adds the first result and the second results via the add circuitry to provide a result of the multiplication of the first and second values.
    Type: Application
    Filed: July 15, 2019
    Publication date: January 21, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li, Vikas Chandra
  • Publication number: 20210011288
    Abstract: Disclosed herein is a method for using a neural network across multiple devices. The method can include receiving, by a first device configured with a first one or more layers of a neural network, input data for processing via the neural network implemented across the first device and a second device. The method can include outputting, by the first one or more layers of the neural network implemented on the first device, a data set that is reduced in size relative to the input data while identifying one or more features of the input data for processing by a second one or more layers of the neural network. The method can include communicating, by the first device, the data set to the second device for processing via the second one or more layers of the neural network implemented on the second device.
    Type: Application
    Filed: July 9, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Liangzhen Lai, Pierce I-Jen Chuang, Vikas Chandra, Ganesh Venkatesh
  • Publication number: 20210012178
    Abstract: Disclosed herein includes a system, a method, and a device for early-exit from convolution. In some embodiments, at least one processing element (PE) circuit is configured to perform, for a node of a neural network corresponding to a dot-product operation with a set of operands, computation using a subset of the set of operands to generate a dot-product value of the subset of the set of operands. The at least one PE circuit can compare the dot-product value of the subset of the set of operands, to a threshold value. The at least one PE circuit can determine whether to activate the node of the neural network, based at least on a result of the comparing.
    Type: Application
    Filed: July 11, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang
  • Publication number: 20210012186
    Abstract: Disclosed herein includes a system, a method, and a device for pipelined parallelism to accelerate distributed learning network graph. First data for a first layer of a neural network may be stored in memory. First circuitry including a first plurality of processing element (PE) circuits may read the first data from the memory and perform computation for the first layer of the neural network using the first data to generate second data. The first circuitry includes a plurality of buffers for outputting the generated second data as input to second circuitry to perform computation for a second layer of the neural network. The second circuitry includes a second plurality of PE circuits configured to perform computation for the second layer of the neural network using the second data.
    Type: Application
    Filed: July 11, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai
  • Publication number: 20210011846
    Abstract: Disclosed herein includes a system, a method, and a device for reading and writing sparse data in a neural network accelerator. A plurality of slices can be established to access a memory having an access size of a data word. A first slice can be configured to access a first side of the data word in memory. Circuitry can access a mask identifying byte positions within the data word having non-zero values. The circuitry can modify the data word to have non-zero byte values stored starting at an end of the first side, and any zero byte values stored in a remainder of the data word. A determination can be made whether a number of non-zero byte values is less than or equal to a first access size of the first slice. The circuitry can write the modified data word to the memory via at least the first slice.
    Type: Application
    Filed: July 11, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li
  • Publication number: 20210012202
    Abstract: Disclosed herein includes a system, a method, and a device for asymmetrical scaling factor support for negative and positive values. A device can include a circuit having a shift circuitry and multiply circuitry. The circuit can be configured to perform computation for a neural network, including multiplying, via the multiply circuitry, a first value and a second value. The circuit can be configured to perform computation for a neural network, including shifting, via the shift circuitry, a result of the multiplying by a determined number of bits. The circuit can be configured to perform computation for a neural network, including outputting the result of the multiplying when a sign bit of the first value is negative, and a result of the shifting when the sign bit of the first value is positive.
    Type: Application
    Filed: July 12, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Pierce I-Jen Chuang
  • Publication number: 20200401440
    Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
    Type: Application
    Filed: June 26, 2020
    Publication date: December 24, 2020
    Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD