Patents by Inventor Ganesh Venkatesh

Ganesh Venkatesh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20210011846
    Abstract: Disclosed herein includes a system, a method, and a device for reading and writing sparse data in a neural network accelerator. A plurality of slices can be established to access a memory having an access size of a data word. A first slice can be configured to access a first side of the data word in memory. Circuitry can access a mask identifying byte positions within the data word having non-zero values. The circuitry can modify the data word to have non-zero byte values stored starting at an end of the first side, and any zero byte values stored in a remainder of the data word. A determination can be made whether a number of non-zero byte values is less than or equal to a first access size of the first slice. The circuitry can write the modified data word to the memory via at least the first slice.
    Type: Application
    Filed: July 11, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li
  • Publication number: 20210011288
    Abstract: Disclosed herein is a method for using a neural network across multiple devices. The method can include receiving, by a first device configured with a first one or more layers of a neural network, input data for processing via the neural network implemented across the first device and a second device. The method can include outputting, by the first one or more layers of the neural network implemented on the first device, a data set that is reduced in size relative to the input data while identifying one or more features of the input data for processing by a second one or more layers of the neural network. The method can include communicating, by the first device, the data set to the second device for processing via the second one or more layers of the neural network implemented on the second device.
    Type: Application
    Filed: July 9, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Liangzhen Lai, Pierce I-Jen Chuang, Vikas Chandra, Ganesh Venkatesh
  • Publication number: 20210012202
    Abstract: Disclosed herein includes a system, a method, and a device for asymmetrical scaling factor support for negative and positive values. A device can include a circuit having a shift circuitry and multiply circuitry. The circuit can be configured to perform computation for a neural network, including multiplying, via the multiply circuitry, a first value and a second value. The circuit can be configured to perform computation for a neural network, including shifting, via the shift circuitry, a result of the multiplying by a determined number of bits. The circuit can be configured to perform computation for a neural network, including outputting the result of the multiplying when a sign bit of the first value is negative, and a result of the shifting when the sign bit of the first value is positive.
    Type: Application
    Filed: July 12, 2019
    Publication date: January 14, 2021
    Applicant: Facebook Technologies, LLC
    Inventors: Ganesh Venkatesh, Pierce I-Jen Chuang
  • Publication number: 20200401440
    Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
    Type: Application
    Filed: June 26, 2020
    Publication date: December 24, 2020
    Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD
  • Publication number: 20200285618
    Abstract: Compressed data is oftentimes beneficial for reducing the computing resources required, for example, to transmit and store data. The compression of data is particularly useful when dealing with sparse data (data that includes numerous zeros or near-zero values) and only non-zero values above a certain threshold have significance. When dealing with compressed data, oftentimes the data needs to be decompressed for processing (e.g., by deep learning networks or other applications configured to operate on sparse, or other uncompressed data). Instructions are disclosed for supporting the decompression of compressed data by a processing unit such as a CPU and GPU.
    Type: Application
    Filed: March 20, 2019
    Publication date: September 10, 2020
    Inventors: Jorge Albericio Latorre, Jack H. Choquette, Manan Maheshkumar Patel, Jeffrey Pool, Ming Y. Siu, Ronny Meir Krashinsky, Ganesh Venkatesh
  • Publication number: 20190347125
    Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
    Type: Application
    Filed: December 31, 2016
    Publication date: November 14, 2019
    Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD
  • Patent number: 10452551
    Abstract: A processor may include a programmable memory prefetcher that includes a programmable hardware prefetch engine and a prefetch engine control register.
    Type: Grant
    Filed: December 12, 2016
    Date of Patent: October 22, 2019
    Assignee: Intel Corporation
    Inventors: Ganesh Venkatesh, Christopher B. Wilkerson, Seth H. Pugsley, Deborah T. Marr
  • Patent number: 10387037
    Abstract: Techniques for enabling enhanced parallelism for sparse linear algebra operations having write-to-read dependencies are disclosed. A hardware processor includes a plurality of processing elements, a memory that is heavily-banked into a plurality of banks, and an arbiter. The arbiter is to receive requests from threads executing at the plurality of processing elements seeking to perform operations involving the memory, and to maintain a plurality of lock buffers corresponding to the plurality of banks. Each of the lock buffers is able to track up to a plurality of memory addresses within the corresponding bank that are to be treated as locked in that the values stored at those memory addresses cannot be updated by those of the threads that did not cause the memory addresses to be locked until those memory addresses have been removed from being tracked by the plurality of lock buffers.
    Type: Grant
    Filed: December 31, 2016
    Date of Patent: August 20, 2019
    Assignee: Intel Corporation
    Inventors: Ganesh Venkatesh, Deborah Marr
  • Patent number: 10372507
    Abstract: Techniques involving a compute engine architecture to support data-parallel loops with reduction operations are described. In some embodiments, a hardware processor includes a memory unit and a plurality of processing elements (PEs). Each of the PEs is directly coupled via one or more neighbor-to-neighbor links with one or more neighboring PEs so that each PE can receive a value from a neighboring PE, provide a value to a neighboring PE, or both receive a value from one neighboring PE and also provide a value to another neighboring PE. The hardware processor also includes a control engine coupled with the plurality of PEs that is to cause the plurality of PEs to collectively perform a task to generate one or more output values by each performing one or more iterations of a same subtask of the task.
    Type: Grant
    Filed: December 31, 2016
    Date of Patent: August 6, 2019
    Assignee: Intel Corporation
    Inventors: Ganesh Venkatesh, Deborah Marr
  • Patent number: 10289752
    Abstract: A processor may include a gather-update-scatter accelerator, and an allocator comprising circuitry to direct an instruction to the accelerator for execution. The instruction may include a search index, an operation to be performed, and a scalar data value. The accelerator may include a content-addressable memory (CAM) storing multiple entries, each of which stores a respective index key and a data value associated with the index key. The accelerator may include a CAM controller, which includes circuitry. The CAM controller may be configured to select, based on the information in the instruction, one of the plurality of entries in the CAM on which to operate. The CAM controller may be configured to perform an arithmetic or logical operation on the selected entry dependent on the information in the instruction. The CAM controller may be configured to store a result of the operation in the selected entry in the CAM.
    Type: Grant
    Filed: December 12, 2016
    Date of Patent: May 14, 2019
    Assignee: Intel Corporation
    Inventors: Ganesh Venkatesh, Nicholas P. Carter, Deborah T. Marr
  • Publication number: 20180188961
    Abstract: Techniques for enabling enhanced parallelism for sparse linear algebra operations having write-to-read dependencies are disclosed. A hardware processor includes a plurality of processing elements, a memory that is heavily-banked into a plurality of banks, and an arbiter. The arbiter is to receive requests from threads executing at the plurality of processing elements seeking to perform operations involving the memory, and to maintain a plurality of lock buffers corresponding to the plurality of banks. Each of the lock buffers is able to track up to a plurality of memory addresses within the corresponding bank that are to be treated as locked in that the values stored at those memory addresses cannot be updated by those of the threads that did not cause the memory addresses to be locked until those memory addresses have been removed from being tracked by the plurality of lock buffers.
    Type: Application
    Filed: December 31, 2016
    Publication date: July 5, 2018
    Inventors: Ganesh VENKATESH, Deborah MARR
  • Publication number: 20180189110
    Abstract: Techniques involving a compute engine architecture to support data-parallel loops with reduction operations are described. In some embodiments, a hardware processor includes a memory unit and a plurality of processing elements (PEs). Each of the PEs is directly coupled via one or more neighbor-to-neighbor links with one or more neighboring PEs so that each PE can receive a value from a neighboring PE, provide a value to a neighboring PE, or both receive a value from one neighboring PE and also provide a value to another neighboring PE. The hardware processor also includes a control engine coupled with the plurality of PEs that is to cause the plurality of PEs to collectively perform a task to generate one or more output values by each performing one or more iterations of a same subtask of the task.
    Type: Application
    Filed: December 31, 2016
    Publication date: July 5, 2018
    Inventors: Ganesh VENKATESH, Deborah MARR
  • Publication number: 20180189675
    Abstract: Hardware accelerator architectures for clustering are described. A hardware accelerator includes sparse tiles and very/hyper sparse tiles. The sparse tile(s) execute operations for a clustering task involving a matrix. Each sparse tile includes a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the sparse tiles over a high bandwidth interface from a first memory unit. Each of the very/hyper sparse tiles are to execute operations for the clustering task involving the matrix. Each of the very/hyper sparse tiles includes a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
    Type: Application
    Filed: December 31, 2016
    Publication date: July 5, 2018
    Inventors: Eriko NURVITADHI, Ganesh VENKATESH, Srivatsan KRISHNAN, Suchit SUBHASCHANDRA, Deborah MARR
  • Publication number: 20180165204
    Abstract: A processor may include a programmable hardware prefetch engine and a prefetch engine control register. The processor may include circuitry to receive, during execution of an application, a first instruction for configuring the prefetch engine for prefetching multiple cache lines to be accessed in the future, at predictable locations, by the application; to store, in the prefetch engine control register, dependent on information in the first instruction, data representing an amount of prefetching to be performed and data representing a stride distance between consecutive cache lines to be prefetched; to receive a second instruction for prefetching a single cache line whose location is identified in the second instruction; and to initiate, in response to receiving the second instruction, prefetching of multiple cache lines by the prefetch engine, to be performed in parallel with execution of the application and in accordance with the data stored in the prefetch engine control register.
    Type: Application
    Filed: December 12, 2016
    Publication date: June 14, 2018
    Inventors: Ganesh Venkatesh, Christopher B. Wilkerson, Seth H. Pugsley, Deborah T. Marr
  • Publication number: 20180165381
    Abstract: A processor may include a gather-update-scatter accelerator, and circuitry to direct an instruction to the accelerator for execution. The instruction may include a search index, an operation to be performed, and a scalar data value. The accelerator may include a content-associative memory (CAM) storing multiple entries, each of which stores a respective index key and a data value associated with the index key. The accelerator may include a CAM controller, including circuitry to select, based on the information in the instruction, one of the plurality of entries in the CAM on which to operate, an arithmetic logic unit (ALU), including circuitry to perform an arithmetic or logical operation on the selected entry, the operation being dependent on the information in the instruction, and circuitry to store a result of the operation in the selected entry in the CAM.
    Type: Application
    Filed: December 12, 2016
    Publication date: June 14, 2018
    Inventors: Ganesh Venkatesh, Nicholas P. Carter, Deborah T. Marr
  • Publication number: 20160378465
    Abstract: In one embodiment, a processor includes at least one core to execute instructions and an accelerator coupled to the at least one core. The accelerator may include a plurality of walker logics, which may be adapted to fetch at least a portion of a first array block and at least a portion of a second array block, determine whether a first index of the first array block matches a second index of the second array block, and send a first value of the first array block associated with the first index and a second value of the second array block associated with the second index to an arithmetic unit, based at least in part on the determination. Other embodiments are described and claimed.
    Type: Application
    Filed: June 23, 2015
    Publication date: December 29, 2016
    Inventors: Ganesh Venkatesh, Tianlu C. Zhang, Deborah T. Marr