Patents by Inventor Ganesh Venkatesh

Ganesh Venkatesh has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

SYSTEMS AND METHODS FOR READING AND WRITING SPARSE DATA IN A NEURAL NETWORK ACCELERATOR

Publication number: 20210011846

Abstract: Disclosed herein includes a system, a method, and a device for reading and writing sparse data in a neural network accelerator. A plurality of slices can be established to access a memory having an access size of a data word. A first slice can be configured to access a first side of the data word in memory. Circuitry can access a mask identifying byte positions within the data word having non-zero values. The circuitry can modify the data word to have non-zero byte values stored starting at an end of the first side, and any zero byte values stored in a remainder of the data word. A determination can be made whether a number of non-zero byte values is less than or equal to a first access size of the first slice. The circuitry can write the modified data word to the memory via at least the first slice.

Type: Application

Filed: July 11, 2019

Publication date: January 14, 2021

Applicant: Facebook Technologies, LLC

Inventors: Ganesh Venkatesh, Liangzhen Lai, Pierce I-Jen Chuang, Meng Li
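
As a rough illustration of the write path in this abstract, the Python sketch below uses a byte mask to compact the non-zero bytes of a data word. It writes through the second slice only when the packed bytes overflow the first. The 8-byte word, the two equal 4-byte slices, packing toward byte 0, and the `read_sparse` helper are all assumptions for illustration, not details taken from the application.

```python
from typing import List, Tuple

WORD_BYTES = 8                  # hypothetical data-word size in bytes
SLICE_BYTES = WORD_BYTES // 2   # hypothetical: two slices, each covering half the word


def nonzero_mask(word: bytes) -> int:
    """Bit i is set when byte i of the data word is non-zero."""
    mask = 0
    for i, b in enumerate(word):
        if b != 0:
            mask |= 1 << i
    return mask


def compact(word: bytes, mask: int) -> bytes:
    """Pack the non-zero bytes at the low end of the first side; zero-fill the remainder."""
    packed = [word[i] for i in range(len(word)) if mask & (1 << i)]
    return bytes(packed + [0] * (len(word) - len(packed)))


def write_sparse(memory: bytearray, addr: int, word: bytes) -> Tuple[bytes, int, List[int]]:
    """Compact `word`, then write it through only as many slices as it needs."""
    mask = nonzero_mask(word)
    packed = compact(word, mask)
    nonzeros = bin(mask).count("1")
    # If every non-zero byte fits in the first slice's access size, skip the second slice.
    slices = [0] if nonzeros <= SLICE_BYTES else [0, 1]
    for s in slices:
        lo = s * SLICE_BYTES
        memory[addr + lo:addr + lo + SLICE_BYTES] = packed[lo:lo + SLICE_BYTES]
    return packed, mask, slices


def read_sparse(packed: bytes, mask: int) -> bytes:
    """Hypothetical read path: scatter packed bytes back to the positions in the mask."""
    out = bytearray(len(packed))
    j = 0
    for i in range(len(packed)):
        if mask & (1 << i):
            out[i] = packed[j]
            j += 1
    return bytes(out)
```

Only the first popcount(mask) packed bytes are ever read back. In this model, any stale upper half left in memory after a single-slice write is therefore harmless.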
SYSTEMS AND METHODS FOR DISTRIBUTING A NEURAL NETWORK ACROSS MULTIPLE COMPUTING DEVICES

Publication number: 20210011288

Abstract: Disclosed herein is a method for using a neural network across multiple devices. The method can include receiving, by a first device configured with a first one or more layers of a neural network, input data for processing via the neural network implemented across the first device and a second device. The method can include outputting, by the first one or more layers of the neural network implemented on the first device, a data set that is reduced in size relative to the input data while identifying one or more features of the input data for processing by a second one or more layers of the neural network. The method can include communicating, by the first device, the data set to the second device for processing via the second one or more layers of the neural network implemented on the second device.

Type: Application

Filed: July 9, 2019

Publication date: January 14, 2021

Applicant: Facebook Technologies, LLC

Inventors: Liangzhen Lai, Pierce I-Jen Chuang, Vikas Chandra, Ganesh Venkatesh
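
Below is a minimal sketch of the split-network idea using a toy pipeline. The "first device" applies ReLU and 4:1 average pooling, then serializes the smaller feature vector as float32. The "second device" applies one dense unit. The layer choices, the float32 wire format, and every function name here are hypothetical; the application does not specify them.

```python
import struct
from typing import List, Tuple

Vector = List[float]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def avg_pool(x: Vector, k: int) -> Vector:
    """Average non-overlapping windows of k values (len(x) assumed divisible by k)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def first_device_layers(input_data: Vector) -> Vector:
    # Hypothetical "first one or more layers": ReLU then 4:1 pooling, which shrinks the data.
    return avg_pool(relu(input_data), 4)


def encode(features: Vector) -> bytes:
    # Serialize as little-endian float32 for the link between the devices (assumed format).
    return struct.pack(f"<{len(features)}f", *features)


def decode(payload: bytes) -> Vector:
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


def second_device_layers(features: Vector, weights: Vector, bias: float) -> float:
    # Hypothetical "second one or more layers": a single dense output unit.
    return sum(w * f for w, f in zip(weights, features)) + bias


def run_split(input_data: Vector, weights: Vector, bias: float) -> Tuple[bytes, float]:
    payload = encode(first_device_layers(input_data))              # runs on device 1
    output = second_device_layers(decode(payload), weights, bias)  # runs on device 2
    return payload, output
```

With 4:1 pooling, the payload crossing the link is a quarter of the size of a float32 copy of the raw input.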
SYSTEMS AND METHODS FOR ASYMMETRICAL SCALING FACTOR SUPPORT FOR NEGATIVE AND POSITIVE VALUES

Publication number: 20210012202

Abstract: Disclosed herein includes a system, a method, and a device for asymmetrical scaling factor support for negative and positive values. A device can include a circuit having a shift circuitry and multiply circuitry. The circuit can be configured to perform computation for a neural network, including multiplying, via the multiply circuitry, a first value and a second value. The circuit can be configured to perform computation for a neural network, including shifting, via the shift circuitry, a result of the multiplying by a determined number of bits. The circuit can be configured to perform computation for a neural network, including outputting the result of the multiplying when a sign bit of the first value is negative, and a result of the shifting when the sign bit of the first value is positive.

Type: Application

Filed: July 12, 2019

Publication date: January 14, 2021

Applicant: Facebook Technologies, LLC

Inventors: Ganesh Venkatesh, Pierce I-Jen Chuang
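
The selection logic in this abstract can be sketched as follows, assuming 8-bit two's-complement operands and a left shift. The abstract says only that the product is shifted "by a determined number of bits," so the shift direction, operand width, and range check are illustrative assumptions.

```python
BITS = 8  # hypothetical signed operand width


def sign_bit(value: int, bits: int = BITS) -> int:
    """Return the two's-complement sign bit of a `bits`-wide signed value."""
    return (value >> (bits - 1)) & 1


def asym_multiply(a: int, b: int, shift: int, bits: int = BITS) -> int:
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in the signed operand width")
    product = a * b                 # multiply circuitry
    shifted = product << shift      # shift circuitry (left shift is an assumption)
    # Select by the first operand's sign: negative -> raw product, positive -> shifted.
    return product if sign_bit(a, bits) else shifted
```

With `shift=2`, a positive first operand is effectively scaled by 4 relative to a negative one. This is one way to give the positive and negative ranges different power-of-two scaling factors.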
SYSTEMS, METHODS, AND APPARATUSES FOR HETEROGENEOUS COMPUTING

Publication number: 20200401440

Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more of a plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.

Type: Application

Filed: June 26, 2020

Publication date: December 24, 2020

Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD
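
To make the dispatch idea concrete, here is a hedged Python model. Each processing element advertises one native instruction set, and the scheduler hands a code fragment to the first idle element that can run it natively. The ISA labels, the first-fit policy, and the `release` helper are assumptions; the abstract does not describe how the hardware scheduler chooses among elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessingElement:
    name: str
    native_isa: str                      # hypothetical ISA label, e.g. "x86" or "gpu"
    busy: bool = False
    executed: List[str] = field(default_factory=list)


@dataclass
class CodeFragment:
    name: str
    isa: str                             # ISA the fragment's instructions are native to
    instructions: List[str]


def dispatch(fragment: CodeFragment, pes: List[ProcessingElement]) -> Optional[ProcessingElement]:
    """Send the fragment to the first idle element that runs its instructions natively."""
    for pe in pes:
        if not pe.busy and pe.native_isa == fragment.isa:
            pe.busy = True
            pe.executed.extend(fragment.instructions)
            return pe
    return None  # no suitable element is free; a caller would retry or queue


def release(pe: ProcessingElement) -> None:
    # Hypothetical completion signal: the element becomes available again.
    pe.busy = False
```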
DECOMPRESSION TECHNIQUES FOR PROCESSING COMPRESSED DATA SUITABLE FOR ARTIFICIAL NEURAL NETWORKS

Publication number: 20200285618

Abstract: Compressed data is oftentimes beneficial for reducing the computing resources required, for example, to transmit and store data. The compression of data is particularly useful when dealing with sparse data (data that includes numerous zeros or near-zero values) and only non-zero values above a certain threshold have significance. When dealing with compressed data, oftentimes the data needs to be decompressed for processing (e.g., by deep learning networks or other applications configured to operate on sparse or other uncompressed data). Instructions are disclosed for supporting the decompression of compressed data by a processing unit such as a CPU or GPU.

Type: Application

Filed: March 20, 2019

Publication date: September 10, 2020

Inventors: Jorge Albericio Latorre, Jack H. Choquette, Manan Maheshkumar Patel, Jeffrey Pool, Ming Y. Siu, Ronny Meir Krashinsky, Ganesh Venkatesh
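
The abstract leaves the compressed format open. One common layout for sparse data is a bitmask of non-zero positions plus a packed list of the surviving values. The sketch below assumes that layout and applies a magnitude threshold at compression time. It illustrates decompression generically and is not the instruction encoding claimed in the application.

```python
from typing import List, Tuple


def compress(dense: List[float], threshold: float = 0.0) -> Tuple[int, List[float], int]:
    """Keep values with |v| > threshold; record their positions in a bitmask (assumed format)."""
    mask = 0
    values: List[float] = []
    for i, v in enumerate(dense):
        if abs(v) > threshold:
            mask |= 1 << i
            values.append(v)
    return mask, values, len(dense)


def decompress(mask: int, values: List[float], length: int) -> List[float]:
    """Expand back to a dense vector: zeros everywhere the mask bit is clear."""
    out = [0.0] * length
    packed = iter(values)
    for i in range(length):
        if (mask >> i) & 1:
            out[i] = next(packed)
    return out
```

Thresholding makes the round trip lossy: values whose magnitude does not exceed the threshold come back as zeros.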
SYSTEMS, METHODS, AND APPARATUSES FOR HETEROGENEOUS COMPUTING

Publication number: 20190347125

Abstract: Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more of a plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.

Type: Application

Filed: December 31, 2016

Publication date: November 14, 2019

Inventors: Rajesh M. SANKARAN, Gilbert NEIGER, Narayan RANGANATHAN, Stephen R. VAN DOREN, Joseph NUZMAN, Niall D. MCDONNELL, Michael A. O'HANLON, Lokpraveen B. MOSUR, Tracy Garrett DRYSDALE, Eriko NURVITADHI, Asit K. MISHRA, Ganesh VENKATESH, Deborah T. MARR, Nicholas P. CARTER, Jonathan D. PEARCE, Edward T. GROCHOWSKI, Richard J. GRECO, Robert VALENTINE, Jesus CORBAL, Thomas D. FLETCHER, Dennis R. BRADFORD, Dwight P. MANLEY, Mark J. CHARNEY, Jeffrey J. COOK, Paul CAPRIOLI, Koichi YAMADA, Kent D. GLOSSOP, David B. SHEFFIELD
Programmable memory prefetcher for prefetching multiple cache lines based on data in a prefetch engine control register

Patent number: 10452551

Abstract: A processor may include a programmable memory prefetcher that includes a programmable hardware prefetch engine and a prefetch engine control register.

Type: Grant

Filed: December 12, 2016

Date of Patent: October 22, 2019

Assignee: Intel Corporation

Inventors: Ganesh Venkatesh, Christopher B. Wilkerson, Seth H. Pugsley, Deborah T. Marr
Microarchitecture enabling enhanced parallelism for sparse linear algebra operations having write-to-read dependencies

Patent number: 10387037

Abstract: Techniques for enabling enhanced parallelism for sparse linear algebra operations having write-to-read dependencies are disclosed. A hardware processor includes a plurality of processing elements, a memory that is heavily-banked into a plurality of banks, and an arbiter. The arbiter is to receive requests from threads executing at the plurality of processing elements seeking to perform operations involving the memory, and to maintain a plurality of lock buffers corresponding to the plurality of banks. Each of the lock buffers is able to track up to a plurality of memory addresses within the corresponding bank that are to be treated as locked in that the values stored at those memory addresses cannot be updated by those of the threads that did not cause the memory addresses to be locked until those memory addresses have been removed from being tracked by the plurality of lock buffers.

Type: Grant

Filed: December 31, 2016

Date of Patent: August 20, 2019

Assignee: Intel Corporation

Inventors: Ganesh Venkatesh, Deborah Marr
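
A software model can help show what the per-bank lock buffers buy. In the sketch below, addresses are interleaved across banks by `addr % num_banks`. Each bank's buffer holds a fixed number of locked addresses together with their owning threads, and writes from other threads to a locked address are refused until it is unlocked. The bank mapping, buffer size, and retry-by-returning-`False` behavior are assumptions, not details from the patent.

```python
from typing import Dict, List


class BankedArbiter:
    """Per-bank lock buffers guarding a banked memory (sizes are hypothetical)."""

    def __init__(self, num_banks: int = 4, lock_slots: int = 2, words_per_bank: int = 16):
        self.num_banks = num_banks
        self.lock_slots = lock_slots
        self.memory = [0] * (num_banks * words_per_bank)
        # One lock buffer per bank: locked address -> id of the thread holding it.
        self.lock_buffers: List[Dict[int, int]] = [{} for _ in range(num_banks)]

    def _buffer(self, addr: int) -> Dict[int, int]:
        return self.lock_buffers[addr % self.num_banks]   # low-order interleaving (assumption)

    def lock(self, tid: int, addr: int) -> bool:
        buf = self._buffer(addr)
        if addr in buf:
            return buf[addr] == tid
        if len(buf) >= self.lock_slots:
            return False           # this bank's lock buffer is full; retry later
        buf[addr] = tid
        return True

    def write(self, tid: int, addr: int, value: int) -> bool:
        owner = self._buffer(addr).get(addr)
        if owner is not None and owner != tid:
            return False           # locked by another thread: the update is refused
        self.memory[addr] = value
        return True

    def unlock(self, tid: int, addr: int) -> None:
        buf = self._buffer(addr)
        if buf.get(addr) == tid:
            del buf[addr]

    def read(self, addr: int) -> int:
        return self.memory[addr]


def locked_accumulate(arb: BankedArbiter, tid: int, addr: int, delta: int) -> bool:
    """Read-modify-write (e.g. y[i] += delta in a sparse update) under the address lock."""
    if not arb.lock(tid, addr):
        return False
    arb.write(tid, addr, arb.read(addr) + delta)
    arb.unlock(tid, addr)
    return True
```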
Compute engine architecture to support data-parallel loops with reduction operations

Patent number: 10372507

Abstract: Techniques involving a compute engine architecture to support data-parallel loops with reduction operations are described. In some embodiments, a hardware processor includes a memory unit and a plurality of processing elements (PEs). Each of the PEs is directly coupled via one or more neighbor-to-neighbor links with one or more neighboring PEs so that each PE can receive a value from a neighboring PE, provide a value to a neighboring PE, or both receive a value from one neighboring PE and also provide a value to another neighboring PE. The hardware processor also includes a control engine coupled with the plurality of PEs that is to cause the plurality of PEs to collectively perform a task to generate one or more output values by each performing one or more iterations of a same subtask of the task.

Type: Grant

Filed: December 31, 2016

Date of Patent: August 6, 2019

Assignee: Intel Corporation

Inventors: Ganesh Venkatesh, Deborah Marr
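
The reduction pattern can be sketched in Python as a chain of PEs. Each PE runs iterations of the same loop body over a round-robin share of the data. Partial results then flow across neighbor links, with each PE combining the value it receives with its own before passing it on. The linear chain topology, round-robin distribution, and `combine`/`identity` parameters are illustrative assumptions; the patent's link topology and scheduling may differ.

```python
import operator
from typing import Callable, Iterable, List


class PE:
    def __init__(self, pe_id: int, identity: int):
        self.pe_id = pe_id
        self.partial = identity

    def run_iterations(self, items: Iterable[int], body: Callable[[int], int],
                       combine: Callable[[int, int], int]) -> None:
        # Every PE executes iterations of the same loop body over its share of the data.
        for x in items:
            self.partial = combine(self.partial, body(x))


def control_engine(data: List[int], num_pes: int, body: Callable[[int], int],
                   combine: Callable[[int, int], int] = operator.add,
                   identity: int = 0) -> List[int]:
    """Return the value sent over each neighbor link; the last entry is the reduced result."""
    pes = [PE(i, identity) for i in range(num_pes)]
    for i, pe in enumerate(pes):
        pe.run_iterations(data[i::num_pes], body, combine)   # round-robin split (assumption)
    link_values: List[int] = []
    carried = identity
    for pe in pes:
        # Receive from the left neighbor, fold in the local partial, pass to the right.
        carried = combine(carried, pe.partial)
        link_values.append(carried)
    return link_values
```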
Accelerator for gather-update-scatter operations including a content-addressable memory (CAM) and CAM controller

Patent number: 10289752

Abstract: A processor may include a gather-update-scatter accelerator, and an allocator comprising circuitry to direct an instruction to the accelerator for execution. The instruction may include a search index, an operation to be performed, and a scalar data value. The accelerator may include a content-addressable memory (CAM) storing multiple entries, each of which stores a respective index key and a data value associated with the index key. The accelerator may include a CAM controller, which includes circuitry. The CAM controller may be configured to select, based on the information in the instruction, one of the plurality of entries in the CAM on which to operate. The CAM controller may be configured to perform an arithmetic or logical operation on the selected entry dependent on the information in the instruction. The CAM controller may be configured to store a result of the operation in the selected entry in the CAM.

Type: Grant

Filed: December 12, 2016

Date of Patent: May 14, 2019

Assignee: Intel Corporation

Inventors: Ganesh Venkatesh, Nicholas P. Carter, Deborah T. Marr
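
The gather-update-scatter flow can be modeled with a small associative table. Each instruction carries a search index, an opcode, and a scalar. A hit applies the operation to the matching entry, and `scatter` returns the accumulated entries for write-back. The opcode set, the allocate-on-miss policy, and the error on a full CAM are assumptions made for this sketch.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical opcode table for the arithmetic/logic step.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": operator.add, "min": min, "max": max, "and": operator.and_, "or": operator.or_,
}


@dataclass
class Instruction:
    index: int      # search index (matched against the CAM keys)
    op: str         # operation to perform
    scalar: int     # scalar data value


class GatherUpdateScatterCAM:
    def __init__(self, capacity: int = 8):
        self.keys: List[Optional[int]] = [None] * capacity
        self.values: List[int] = [0] * capacity

    def _match(self, key: int) -> Optional[int]:
        # A real CAM compares every entry in parallel; this loop only models the result.
        for slot, k in enumerate(self.keys):
            if k == key:
                return slot
        return None

    def execute(self, instr: Instruction) -> int:
        slot = self._match(instr.index)
        if slot is None:
            # Miss: allocate a free entry and seed it with the scalar (assumed policy).
            if None not in self.keys:
                raise RuntimeError("CAM full; entries would need to be scattered first")
            slot = self.keys.index(None)
            self.keys[slot] = instr.index
            self.values[slot] = instr.scalar
        else:
            self.values[slot] = OPS[instr.op](self.values[slot], instr.scalar)
        return self.values[slot]

    def scatter(self) -> Dict[int, int]:
        """Return index -> value for every occupied entry, ready to write back to memory."""
        return {k: v for k, v in zip(self.keys, self.values) if k is not None}
```

A histogram is a natural fit. Issuing `add 1` for each observed index keeps repeated updates to the same bin inside the CAM instead of round-tripping each one through memory.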
MICROARCHITECTURE ENABLING ENHANCED PARALLELISM FOR SPARSE LINEAR ALGEBRA OPERATIONS HAVING WRITE-TO-READ DEPENDENCIES

Publication number: 20180188961

Abstract: Techniques for enabling enhanced parallelism for sparse linear algebra operations having write-to-read dependencies are disclosed. A hardware processor includes a plurality of processing elements, a memory that is heavily-banked into a plurality of banks, and an arbiter. The arbiter is to receive requests from threads executing at the plurality of processing elements seeking to perform operations involving the memory, and to maintain a plurality of lock buffers corresponding to the plurality of banks. Each of the lock buffers is able to track up to a plurality of memory addresses within the corresponding bank that are to be treated as locked in that the values stored at those memory addresses cannot be updated by those of the threads that did not cause the memory addresses to be locked until those memory addresses have been removed from being tracked by the plurality of lock buffers.

Type: Application

Filed: December 31, 2016

Publication date: July 5, 2018

Inventors: Ganesh VENKATESH, Deborah MARR
COMPUTE ENGINE ARCHITECTURE TO SUPPORT DATA-PARALLEL LOOPS WITH REDUCTION OPERATIONS

Publication number: 20180189110

Abstract: Techniques involving a compute engine architecture to support data-parallel loops with reduction operations are described. In some embodiments, a hardware processor includes a memory unit and a plurality of processing elements (PEs). Each of the PEs is directly coupled via one or more neighbor-to-neighbor links with one or more neighboring PEs so that each PE can receive a value from a neighboring PE, provide a value to a neighboring PE, or both receive a value from one neighboring PE and also provide a value to another neighboring PE. The hardware processor also includes a control engine coupled with the plurality of PEs that is to cause the plurality of PEs to collectively perform a task to generate one or more output values by each performing one or more iterations of a same subtask of the task.

Type: Application

Filed: December 31, 2016

Publication date: July 5, 2018

Inventors: Ganesh VENKATESH, Deborah MARR
HARDWARE ACCELERATOR ARCHITECTURE AND TEMPLATE FOR WEB-SCALE K-MEANS CLUSTERING

Publication number: 20180189675

Abstract: Hardware accelerator architectures for clustering are described. A hardware accelerator includes sparse tiles and very/hyper sparse tiles. The sparse tile(s) execute operations for a clustering task involving a matrix. Each sparse tile includes a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the sparse tiles over a high bandwidth interface from a first memory unit. Each of the very/hyper sparse tiles is to execute operations for the clustering task involving the matrix. Each of the very/hyper sparse tiles includes a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.

Type: Application

Filed: December 31, 2016

Publication date: July 5, 2018

Inventors: Eriko NURVITADHI, Ganesh VENKATESH, Srivatsan KRISHNAN, Suchit SUBHASCHANDRA, Deborah MARR
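
Two pieces of this design lend themselves to a sketch. The first is deciding which matrix blocks go to sparse tiles and which go to very/hyper-sparse tiles. The second is the k-means assignment step those tiles would compute. The density threshold, the block representation, and the norm-expansion trick for squared distance are assumptions for illustration; the application does not state how blocks are classified.

```python
from typing import Dict, List, Tuple

Block = Dict[Tuple[int, int], float]   # (row, col) -> value for the non-zeros in one block
SparseRow = Dict[int, float]           # col -> value


def route_blocks(blocks: List[Block], block_rows: int, block_cols: int,
                 threshold: float = 0.02) -> Tuple[List[int], List[int]]:
    """Split block ids between sparse tiles and very/hyper-sparse tiles by density (assumed rule)."""
    sparse_tiles: List[int] = []
    hyper_sparse_tiles: List[int] = []
    for bid, blk in enumerate(blocks):
        density = len(blk) / (block_rows * block_cols)
        if density >= threshold:
            sparse_tiles.append(bid)        # streamed over the high-bandwidth interface
        else:
            hyper_sparse_tiles.append(bid)  # randomly accessed over the low-latency interface
    return sparse_tiles, hyper_sparse_tiles


def assign_clusters(rows: List[SparseRow], centroids: List[List[float]]) -> List[int]:
    """k-means assignment step: index of the nearest centroid for each sparse row."""
    norms = [sum(c * c for c in cen) for cen in centroids]
    labels: List[int] = []
    for row in rows:
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; ||x||^2 is common to all c, so drop it.
        scores = [norms[k] - 2 * sum(v * centroids[k][j] for j, v in row.items())
                  for k in range(len(centroids))]
        labels.append(scores.index(min(scores)))
    return labels
```

Expanding the squared distance lets each row touch only its non-zero columns. That is why the dot product iterates over `row.items()` rather than over every dimension.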
Programmable Memory Prefetcher

Publication number: 20180165204

Abstract: A processor may include a programmable hardware prefetch engine and a prefetch engine control register. The processor may include circuitry to receive, during execution of an application, a first instruction for configuring the prefetch engine for prefetching multiple cache lines to be accessed in the future, at predictable locations, by the application; to store, in the prefetch engine control register, dependent on information in the first instruction, data representing an amount of prefetching to be performed and data representing a stride distance between consecutive cache lines to be prefetched; to receive a second instruction for prefetching a single cache line whose location is identified in the second instruction; and to initiate, in response to receiving the second instruction, prefetching of multiple cache lines by the prefetch engine, to be performed in parallel with execution of the application and in accordance with the data stored in the prefetch engine control register.

Type: Application

Filed: December 12, 2016

Publication date: June 14, 2018

Inventors: Ganesh Venkatesh, Christopher B. Wilkerson, Seth H. Pugsley, Deborah T. Marr
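
The two-instruction protocol described in this abstract can be modeled as below. `configure` writes a line count and stride into a control register. A later single-line `prefetch` expands into that many line requests, which `tick` drains a few at a time to mimic background operation. The 64-byte line size, the stride measured in cache lines, the drain rate, and the single-line fallback when unconfigured are assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Set

CACHE_LINE = 64  # bytes; assumed line size


@dataclass
class PrefetchControlRegister:
    count: int = 0    # how many lines to prefetch per trigger
    stride: int = 1   # distance between consecutive prefetched lines, in cache lines (assumption)


class ProgrammablePrefetcher:
    def __init__(self) -> None:
        self.ctrl = PrefetchControlRegister()
        self.pending: Deque[int] = deque()
        self.cache: Set[int] = set()

    def configure(self, count: int, stride: int) -> None:
        """First instruction: program the amount and stride into the control register."""
        self.ctrl = PrefetchControlRegister(count, stride)

    def prefetch(self, addr: int) -> None:
        """Second instruction: names one line, but queues `count` lines per the register."""
        base = addr - addr % CACHE_LINE
        n = max(1, self.ctrl.count)      # unconfigured: behave like a single-line prefetch
        for i in range(n):
            self.pending.append(base + i * self.ctrl.stride * CACHE_LINE)

    def tick(self, lines_per_cycle: int = 2) -> None:
        """Issue a few queued requests per cycle, alongside the application's own work."""
        for _ in range(min(lines_per_cycle, len(self.pending))):
            self.cache.add(self.pending.popleft())
```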
Accelerator for Gather-Update-Scatter Operations

Publication number: 20180165381

Abstract: A processor may include a gather-update-scatter accelerator, and circuitry to direct an instruction to the accelerator for execution. The instruction may include a search index, an operation to be performed, and a scalar data value. The accelerator may include a content-addressable memory (CAM) storing multiple entries, each of which stores a respective index key and a data value associated with the index key. The accelerator may include a CAM controller, including circuitry to select, based on the information in the instruction, one of the plurality of entries in the CAM on which to operate, an arithmetic logic unit (ALU), including circuitry to perform an arithmetic or logical operation on the selected entry, the operation being dependent on the information in the instruction, and circuitry to store a result of the operation in the selected entry in the CAM.

Type: Application

Filed: December 12, 2016

Publication date: June 14, 2018

Inventors: Ganesh Venkatesh, Nicholas P. Carter, Deborah T. Marr
EFFICIENT SPARSE ARRAY HANDLING IN A PROCESSOR

Publication number: 20160378465

Abstract: In one embodiment, a processor includes at least one core to execute instructions and an accelerator coupled to the at least one core. The accelerator may include a plurality of walker logics, which may be adapted to fetch at least a portion of a first array block and at least a portion of a second array block, determine whether a first index of the first array block matches a second index of the second array block, and send a first value of the first array block associated with the first index and a second value of the second array block associated with the second index to an arithmetic unit, based at least in part on the determination. Other embodiments are described and claimed.

Type: Application

Filed: June 23, 2015

Publication date: December 29, 2016

Inventors: Ganesh Venkatesh, Tianlu C. Zhang, Deborah T. Marr
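
The walker behavior resembles a two-pointer merge over index-sorted sparse blocks: advance whichever side has the smaller index and, on a match, send both values to the arithmetic unit. The sketch below assumes each block is a sorted list of `(index, value)` pairs. It uses multiplication as the arithmetic operation, which turns the walk into a sparse dot product. Both choices are illustrative and are not taken from the application.

```python
from typing import Callable, List, Tuple

SparseBlock = List[Tuple[int, float]]   # (index, value) pairs sorted by index (assumed layout)


def walk_and_match(a: SparseBlock, b: SparseBlock,
                   arith: Callable[[float, float], float] = lambda x, y: x * y) -> List[float]:
    """Walk both blocks in index order; forward matching-index value pairs to `arith`."""
    i = j = 0
    results: List[float] = []
    while i < len(a) and j < len(b):
        ia, va = a[i]
        ib, vb = b[j]
        if ia == ib:
            results.append(arith(va, vb))   # indices match: send both values onward
            i += 1
            j += 1
        elif ia < ib:
            i += 1                           # no partner for a[i]; skip it
        else:
            j += 1                           # no partner for b[j]; skip it
    return results


def sparse_dot(a: SparseBlock, b: SparseBlock) -> float:
    return sum(walk_and_match(a, b))
```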
