Abstract: According to exemplary embodiments, a method, processor, and system for accelerating a recurrent neural network are presented. A method of accelerating a recurrent neural network may include distributing from a first master core to each of a plurality of processing cores a same relative one or more columns of weight matrix data for each of a plurality of gates in the neural network, broadcasting a current input vector from the first master core to each of the processing cores, and processing each column of weight matrix data in parallel, at each of the respective processing cores.
Type: Grant
Filed: January 6, 2020
Date of Patent: June 22, 2021
Assignee: SIMPLEMACHINES INC.
Inventors: Karthikeyan Sankaralingam, Yunfeng Li, Vinay Gangadhar, Anthony Nowatzki
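The abstract above describes a column-parallel scheme: a master core gives every processing core the same relative column slice of each gate's weight matrix, broadcasts the current input vector, and the cores process their columns in parallel. The sketch below is a minimal software simulation of that partitioning; the gate set (an LSTM-style i/f/g/o), core count, matrix sizes, and helper names (distribute_columns, process_core) are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

NUM_CORES = 8
HIDDEN = 64     # columns per gate weight matrix (assumed)
INPUT = 32      # length of the input vector (assumed)

rng = np.random.default_rng(0)
# One weight matrix per gate (input, forget, cell, output).
gates = {g: rng.standard_normal((INPUT, HIDDEN)) for g in "ifgo"}

def distribute_columns(weights, num_cores):
    """Master core: give every core the same relative column slice of each gate."""
    shards = [dict() for _ in range(num_cores)]
    for name, W in weights.items():
        for core, cols in enumerate(np.array_split(np.arange(W.shape[1]), num_cores)):
            shards[core][name] = (cols, W[:, cols])
    return shards

def process_core(shard, x):
    """Each processing core multiplies the broadcast input against its columns."""
    return {name: (cols, x @ W_cols) for name, (cols, W_cols) in shard.items()}

shards = distribute_columns(gates, NUM_CORES)     # one-time weight distribution
x = rng.standard_normal(INPUT)                    # current input vector (broadcast)
partials = [process_core(s, x) for s in shards]   # runs in parallel on hardware

# Gather: stitch per-core column results back into full gate pre-activations.
out = {g: np.empty(HIDDEN) for g in gates}
for p in partials:
    for name, (cols, vals) in p.items():
        out[name][cols] = vals

for g in gates:
    assert np.allclose(out[g], x @ gates[g])      # matches the monolithic product
```

Because the weight columns stay resident on the processing cores across time steps, only the current input vector has to be broadcast each step, which is the apparent motivation for the distribution described in the abstract.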
Abstract: A method for performing acceleration of simultaneous access to shared data may include providing a plurality of groups of cores and a plurality of shared memory structures, providing a pod comprising the plurality of groups of cores linked by a common broadcast channel, and coordinating each shared memory structure to provide a logically unified memory structure. Each memory structure may be associated with a group of cores, and each group of cores may include one or more cores. The common broadcast channel may be operatively coupled to each shared memory structure. The coordinating each shared memory structure may include identifying a simultaneous read-reuse load to a first shared memory structure, fetching data corresponding to the simultaneous read-reuse load, and forwarding the data to shared memory structures other than the first shared memory structure and to groups of cores other than a first group of cores via the broadcast channel.
Type: Grant
Filed: December 18, 2019
Date of Patent: March 30, 2021
Assignee: SimpleMachines Inc.
Inventors: Karthikeyan Sankaralingam, Vinay Gangadhar, Anthony Nowatzki, Yunfeng Li
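The abstract above describes a pod of core groups, each with its own shared memory structure, joined by a common broadcast channel: when a simultaneous read-reuse load is identified at one structure, the fetched data is forwarded to the other structures and core groups. The following is a minimal software model of that idea under assumed names (Scratchpad, Pod, read_reuse); it is a sketch of the pattern, not the patented implementation.

```python
class Scratchpad:
    """One shared memory structure, associated with a group of cores."""
    def __init__(self):
        self.lines = {}                     # address -> data

class Pod:
    """Groups of cores linked by a common broadcast channel."""
    def __init__(self, num_groups, backing_memory):
        self.pads = [Scratchpad() for _ in range(num_groups)]
        self.memory = backing_memory        # models the backing storage
        self.fetches = 0                    # counts expensive backing fetches

    def load(self, group, addr, read_reuse=False):
        pad = self.pads[group]
        if addr in pad.lines:
            return pad.lines[addr]          # local hit, no traffic
        data = self.memory[addr]            # fetch once from backing memory
        self.fetches += 1
        pad.lines[addr] = data
        if read_reuse:
            # Broadcast channel: forward the same data to every other shared
            # memory structure so the other groups never re-fetch it.
            for other in self.pads:
                if other is not pad:
                    other.lines[addr] = data
        return data

memory = {a: a * 10 for a in range(16)}
pod = Pod(num_groups=4, backing_memory=memory)

pod.load(group=0, addr=3, read_reuse=True)      # one fetch, broadcast to all
for g in range(1, 4):
    assert pod.load(g, 3) == 30                 # local hits, no extra fetches
assert pod.fetches == 1
```

The broadcast turns what would otherwise be one backing-memory fetch per group into a single fetch for the whole pod, which is how the coordinated structures behave as a logically unified memory.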
Abstract: According to exemplary embodiments, a method, processor, and system for accelerating a recurrent neural network are presented. A method of accelerating a recurrent neural network may include distributing from a first master core to each of a plurality of processing cores a same relative one or more columns of weight matrix data for each of a plurality of gates in the neural network, broadcasting a current input vector from the first master core to each of the processing cores, and processing each column of weight matrix data in parallel, at each of the respective processing cores.
Type: Application
Filed: January 6, 2020
Publication date: July 9, 2020
Applicant: SimpleMachines Inc.
Inventors: Karthikeyan Sankaralingam, Yunfeng Li, Vinay Gangadhar, Anthony Nowatzki
Abstract: A method for performing acceleration of simultaneous access to shared data may include providing a plurality of groups of cores and a plurality of shared memory structures, providing a pod comprising the plurality of groups of cores linked by a common broadcast channel, and coordinating each shared memory structure to provide a logically unified memory structure. Each memory structure may be associated with a group of cores, and each group of cores may include one or more cores. The common broadcast channel may be operatively coupled to each shared memory structure. The coordinating each shared memory structure may include identifying a simultaneous read-reuse load to a first shared memory structure, fetching data corresponding to the simultaneous read-reuse load, and forwarding the data to shared memory structures other than the first shared memory structure and to groups of cores other than a first group of cores via the broadcast channel.
Type: Application
Filed: December 18, 2019
Publication date: June 25, 2020
Applicant: SimpleMachines Inc.
Inventors: Karthikeyan Sankaralingam, Vinay Gangadhar, Anthony Nowatzki, Yunfeng Li
Abstract: According to some embodiments, a dataflow accelerator comprises a control/command core, a scratchpad, and a coarse-grained reconfigurable array (CGRA). The scratchpad comprises a write controller to transmit data to an input vector port interface and to receive data from the input vector port interface. The CGRA receives data from the input vector port interface, wherein the CGRA comprises a plurality of interconnects and a plurality of functional units.
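The abstract above describes a scratchpad write controller streaming data through an input vector port interface into a CGRA built from functional units and interconnects. The sketch below models that organization in software with a toy two-unit dataflow (multiply, then add); the class names, port layout, and the chosen operations are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class FunctionalUnit:
    op: Callable[[float, float], float]

@dataclass
class VectorPort:
    """Input/output vector port interface modeled as a FIFO."""
    fifo: List[float] = field(default_factory=list)
    def push(self, values):     # written by the scratchpad write controller
        self.fifo.extend(values)
    def pop(self):
        return self.fifo.pop(0)

class CGRA:
    """Two functional units chained by an interconnect: (a*b) then (+c)."""
    def __init__(self):
        self.mul = FunctionalUnit(lambda a, b: a * b)
        self.add = FunctionalUnit(lambda a, b: a + b)
    def fire(self, port_a, port_b, port_c, out_port):
        a, b, c = port_a.pop(), port_b.pop(), port_c.pop()
        t = self.mul.op(a, b)                   # first functional unit
        out_port.push([self.add.op(t, c)])      # interconnect routes t onward

# The control/command core would configure the ports and stream scratchpad data.
pa, pb, pc, pout = VectorPort(), VectorPort(), VectorPort(), VectorPort()
pa.push([1.0, 2.0]); pb.push([3.0, 4.0]); pc.push([5.0, 6.0])

cgra = CGRA()
for _ in range(2):
    cgra.fire(pa, pb, pc, pout)

assert pout.fifo == [8.0, 14.0]                 # 1*3+5 and 2*4+6
```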