Reconfigurable Parallel Processing
Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) each comprising a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
This application claims priority to U.S. Provisional Application 62/471,340, filed Mar. 14, 2017, entitled “Reconfigurable Parallel Processing,” U.S. Provisional Application 62/471,367, filed Mar. 15, 2017, entitled “Circular Reconfiguration for Reconfigurable Parallel Processor,” U.S. Provisional Application 62/471,368, filed Mar. 15, 2017, entitled “Private Memory Structure for Reconfigurable Parallel Processor,” U.S. Provisional Application 62/471,372, filed Mar. 15, 2017, entitled “Shared Memory Structure for Reconfigurable Parallel Processor,” and U.S. Provisional Application 62/472,579, filed Mar. 17, 2017, entitled “Static Shared Memory Access for Reconfigurable Parallel Processor,” the contents of which are hereby incorporated by reference in their entireties.
TECHNICAL FIELD
The disclosure herein relates to computer architectures, and particularly to reconfigurable processors.
BACKGROUND
A reconfigurable computing architecture with a large processing array can meet the demand for computation power while keeping power and silicon area efficient. Unlike a field-programmable gate array (FPGA), a Coarse-Grained Reconfigurable Architecture (CGRA) utilizes larger processing elements, such as arithmetic logic units (ALUs), as its building blocks. It provides reconfigurability, using a high-level language to quickly program the processing element (PE) array. One typical design of CGRA is shown in
In general, CGRA is an approach to exploit loop-level parallelism. It is not specifically targeted at thread-level parallelism. With any data dependency from one iteration to the next, the parallelism is largely limited. Therefore, the 2D array size tends to be limited to an 8×8 PE array in most designs.
Graphics processing unit (GPU) architecture provides a way to execute parallel threads in a Single Instruction Multiple Threads (SIMT) fashion. It is especially suitable for massively parallel computing applications. In these applications, typically no dependency is assumed between threads. This type of parallelism goes beyond the loop-level parallelism within a software task that CGRA is designed for. Thread-level parallelism is easily scalable beyond single-core execution to multicore execution. Thread parallelism provides optimization opportunities, makes the PE array more efficient and more capable, and allows the array to be made larger than 8×8. A GPU, however, is not reconfigurable. Therefore, there is a need in the art to develop a next-generation processor that can harness the processing capabilities of both CGRA and GPU.
SUMMARY
The present disclosure describes apparatus, methods and systems for massively parallel data processing. A processor according to various embodiments of the present disclosure may be designed to take advantage of massive thread-level parallelism, similar to a GPU, using a programmable processor array similar to CGRA. In one embodiment, a processor may efficiently process threads that are identical to each other but operate on different data, similar to the SIMT architecture. A software program's data dependency graph may be mapped to a virtual data path of infinite length. The virtual data path may then be chopped into segments that fit into multiple physical data paths, where each physical data path may have its own configuration context. A sequencer may distribute the configurations of each PE into its configuration FIFO, and similarly for the switch boxes. A gasket memory may be used to temporarily store the outputs of one physical data path configuration and feed them back to the processing elements for the next configuration. Memory ports may be used to calculate addresses for reads and writes. FIFOs may allow each PE to operate independently. Data stored in a memory unit may be accessed through either a private or a shared memory access method. The same data can be accessed through different access methods in different parts of the software program to reduce data movement between memories.
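As a minimal, non-normative sketch (in Python) of the flow described above, the following fragment chops a virtual data path into segments and buffers cross-segment results in a gasket memory; the segment size PHYSICAL_COLUMNS, the GasketMemory class, and the run_segment callback are hypothetical names introduced here for illustration only.

    from collections import deque

    PHYSICAL_COLUMNS = 4  # assumed capacity of one physical data path

    def chop_virtual_data_path(virtual_data_path):
        # Split the ordered instruction list (the virtual data path) into
        # segments that each fit into one physical data path.
        return [virtual_data_path[i:i + PHYSICAL_COLUMNS]
                for i in range(0, len(virtual_data_path), PHYSICAL_COLUMNS)]

    class GasketMemory:
        # Temporarily stores the outputs of one physical data path
        # configuration and feeds them back for the next configuration.
        def __init__(self):
            self._fifo = deque()

        def store(self, results):
            self._fifo.append(results)

        def load(self):
            return self._fifo.popleft() if self._fifo else None

    def execute(virtual_data_path, run_segment):
        gasket = GasketMemory()
        for segment in chop_virtual_data_path(virtual_data_path):
            inputs = gasket.load()  # outputs of the previous physical data path
            gasket.store(run_segment(segment, inputs))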
In an exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each comprising a configuration buffer; a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs; and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
According to an embodiment, the processor may further comprise a plurality of switch boxes coupled to the sequencer to receive switch box configurations from the sequencer, each of the plurality of switch boxes may be associated with a respective PE of the plurality of PEs and configured to provide input data switching for the respective PE according to the switch box configurations.
According to an embodiment, the plurality of switch boxes and their associated PEs may be arranged in a plurality of columns, a first switch box in a first column of the plurality of columns may be coupled between the gasket memory and a first PE in the first column of the plurality of columns, and a second PE in a last column of the plurality of columns may be coupled to the gasket memory.
According to an embodiment, the processor may further comprise a memory unit for providing data storage for the plurality of PEs; and a plurality of memory ports each arranged in a separate column of the plurality of columns for the plurality of PEs to access the memory unit.
According to an embodiment, the processor may further comprise a plurality of inter-column switch boxes (ICSBs) coupled to the sequencer to receive ICSB configurations from the sequencer, the plurality of ICSBs may be configured to provide data switching between neighboring columns of the plurality of columns according to the ICSB configurations.
According to an embodiment, the plurality of memory ports (MPs) may be coupled to the sequencer to receive MP configurations from the sequencer and configured to operate in a private access mode or a shared access mode during one MP configuration.
According to an embodiment, a piece of data stored in the memory unit may be accessed through the private access mode and the shared access mode in different parts of a program without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of columns may comprise one PE, and the plurality of PEs may be identical and form one row of repetitive identical PEs.
According to an embodiment, each of the plurality of columns may comprise two or more PEs and the plurality of PEs may form two or more rows.
According to an embodiment, a first row of PEs may be configured to implement a first set of instructions and a second row of PEs may be configured to implement a second set of instructions, at least one instruction of the second set of instructions is not in the first set of instructions, and the plurality of columns may be identical and form repetitive columns.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers for the plurality of ALUs and may be configured to operate independently.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the private access mode, one address in the vector address may be routed to one memory bank of the memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared access mode, one address in the vector address may be routed in a defined region across memory banks regardless of the thread index and data shared to all threads may be spread in all memory banks.
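A minimal sketch, assuming eight memory banks and a low-address-bits bank-selection rule (both are illustrative assumptions, not specified by the disclosure), of how the two access modes may route each address of a vector address:

    NUM_BANKS = 8  # assumed number of memory banks

    def route_private(vector_address, thread_indexes):
        # Private access mode: each address is routed to the bank selected
        # by its thread index, so all private data of one thread stays in
        # a same memory bank.
        return [(tid % NUM_BANKS, addr)
                for tid, addr in zip(thread_indexes, vector_address)]

    def route_shared(vector_address):
        # Shared access mode: the bank is derived from the address itself,
        # regardless of thread index, so shared data is spread in all banks.
        return [(addr % NUM_BANKS, addr) for addr in vector_address]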
In another exemplary embodiment, there is provided a method comprising: mapping an execution kernel into a virtual data path at a processor, wherein the execution kernel includes a sequence of instructions to be executed by the processor, and the processor comprises various reconfigurable units that include a gasket memory; chopping the virtual data path into one or more physical data paths; delivering configurations to various reconfigurable units of the processor for the various reconfigurable units to form the one or more physical data paths to execute the sequence of instructions; and executing the processor to complete the one or more physical data paths by operating the various reconfigurable units according to the configurations, including routing data from one physical data path to the gasket memory to be used in a future physical data path as input.
According to an embodiment, the various reconfigurable units may further comprise a plurality of processing elements, a plurality of switch boxes each associated with a separate processing element, a plurality of memory ports that provide access to a memory unit for the plurality of processing elements, and a plurality of inter-column switch boxes, where each of the various reconfigurable units is reconfigured by applying a next configuration independently from other reconfigurable units.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of memory ports may be configured to operate in a private access mode or a shared access mode during one configuration.
According to an embodiment, the method may further comprise accessing a piece of data stored in the memory unit through the private access mode and the shared access mode in different physical data paths without the piece of data being moved in the memory unit.
According to an embodiment, each of the memory ports may be configured to access the memory unit using a vector address, and in the private access mode, one address in the vector address may be routed to one memory bank of the memory unit according to a thread index and all private data for one thread may be located in a same memory bank, and in the shared access mode, one address in the vector address may be routed in a defined region across memory banks regardless of the thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers for the plurality of ALUs and may be configured to operate independently during one physical data path.
According to an embodiment, the plurality of PEs may form a PE array and the execution kernel may be mapped into one or more physical data paths on the processor based on a size of the PE array, connections between the plurality of PEs, and memory access capability.
According to an embodiment, the various reconfigurable units may form multiple repetitive columns, and each of the one or more physical data paths may be fitted into the multiple repetitive columns, and data flows between the repetitive columns may be in one direction.
In yet another exemplary embodiment, there is provided a system comprising: a processor, comprising: a sequencer configured to map an execution kernel to be executed by the processor into a virtual data path and chop the virtual data path into one or more physical data paths; a plurality of processing elements (PEs) coupled to the sequencer, each of the plurality of PEs comprising a configuration buffer configured to receive PE configurations for the one or more physical data paths from the sequencer; and a gasket memory coupled to the plurality of PEs and being configured to store data from one of the one or more physical data paths to be used by another physical data path of the one or more physical data paths as input.
According to an embodiment, the processor may further comprise a plurality of switch boxes (SBs) coupled to the sequencer to receive SB configurations for the one or more physical data paths from the sequencer, each of the plurality of SBs being associated with a respective PE of the plurality of PEs and configured to provide input data switching for the respective PE according to the SB configurations.
According to an embodiment, the plurality of switch boxes and their associated PEs may be arranged in a plurality of columns, a first switch box in a first column of the plurality of columns may be coupled between the gasket memory and a first PE in the first column of the plurality of columns, and a second PE in a last column of the plurality of columns may be coupled to the gasket memory.
According to an embodiment, the processor may further comprise a memory unit for providing data storage for the plurality of PEs and a plurality of memory ports each arranged in a separate column of the plurality of columns for the plurality of PEs to access the memory unit.
According to an embodiment, the processor may further comprise a plurality of inter-column switch boxes (ICSBs) coupled to the sequencer to receive ICSB configurations for the one or more physical data paths from the sequencer, the plurality of ICSBs being configured to provide data switching between neighboring columns of the plurality of columns according to the ICSB configurations.
According to an embodiment, the plurality of memory ports (MPs) may be coupled to the sequencer to receive MP configurations for the one or more physical data paths from the sequencer and configured to operate in a private access mode or a shared access mode during one MP configuration.
According to an embodiment, a piece of data stored in the memory unit may be accessed through the private access mode and the shared access mode in different physical data paths of the one or more physical data paths without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of columns may comprise one PE and the plurality of PEs may be identical and may form one row of repetitive identical PEs.
According to an embodiment, each of the plurality of columns may comprise two or more PEs and the plurality of PEs may form two or more rows.
According to an embodiment, a first row of PEs may be configured to implement a first set of instructions and a second row of PEs may be configured to implement a second set of instructions, at least one instruction of the second set of instructions may not be in the first set of instructions, and the plurality of columns may be identical and may form repetitive columns.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers for the plurality of ALUs and may be configured to operate independently.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the private access mode, one address in the vector address may be routed to one memory bank of the memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared access mode, one address in the vector address may be routed in a defined region across memory banks regardless of the thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, the plurality of PEs may form a PE array and the execution kernel may be mapped into one or more physical data paths on the processor based on a size of the PE array, connections between the plurality of PEs, and memory access capability.
In yet another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs); a plurality of switch boxes arranged in a plurality of columns, each of the plurality of switch boxes being associated with a respective PE and configured to provide input data switching for the respective PE; a plurality of memory ports arranged in the plurality of columns and being coupled to a memory unit and a top switch box in each column of the plurality of columns, each of the plurality of memory ports being configured to provide data access to the memory unit for one or more switch boxes in a respective column; a plurality of inter-column switch boxes (ICSBs) each coupled to a bottom switch box in each column of the plurality of columns; and a gasket memory with its input coupled to a memory port, a PE, one or more switch boxes and an ICSB in a last column of the plurality of columns, and its output coupled to a memory port, one or more switch boxes and an ICSB in a first column of the plurality of columns.
According to an embodiment, the processor may further comprise a sequencer coupled to the plurality of PEs, the plurality of switch boxes, the plurality of ICSBs, the plurality of memory ports and the gasket memory to deliver configurations to these components.
According to an embodiment, the processor may further comprise a configuration memory coupled to the sequencer to store compiled configurations for the sequencer to decode and deliver.
According to an embodiment, the processor may further comprise a memory unit for providing data storage for the processor.
In another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each comprising a configuration buffer and a plurality of arithmetic logic units (ALUs), and each configured to operate independently according to respective PE configurations stored in the configuration buffer; and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
According to an embodiment, the processor may further comprise a plurality of switch boxes each comprising a configuration buffer configured to store switch box configurations, each of the plurality of switch boxes being associated with a respective PE of the plurality of PEs and configured to provide input data switching for the respective PE according to the switch box configurations.
According to an embodiment, the plurality of switch boxes and their associated PEs may be arranged in a plurality of columns, a first switch box in a first column of the plurality of columns may be coupled between the gasket memory and a first PE in the first column of the plurality of columns, and a second PE in a last column of the plurality of columns may be coupled to the gasket memory.
According to an embodiment, the processor may further comprise: a memory unit for providing data storage for the plurality of PEs; and a plurality of memory ports each arranged in a separate column of the plurality of columns for the plurality of PEs to access the memory unit.
According to an embodiment, the processor may further comprise a plurality of inter-column switch boxes (ICSBs) each comprising a configuration buffer configured to store ICSB configurations, the plurality of ICSBs may be configured to provide data switching between neighboring columns of the plurality of columns according to the ICSB configurations.
According to an embodiment, each of the plurality of memory ports (MPs) may comprise a configuration buffer to store MP configurations and may be configured to operate in a private access mode or a shared access mode during one MP configuration.
According to an embodiment, a piece of data stored in the memory unit may be accessed through the private access mode and the shared access mode in different parts of a program without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of columns may comprise one PE, and the plurality of PEs may be identical and form one row of repetitive identical PEs.
According to an embodiment, each of the plurality of columns may comprise two or more PEs and the plurality of PEs may form two or more rows.
According to an embodiment, a first row of PEs may be configured to implement a first set of instructions and a second row of PEs may be configured to implement a second set of instructions, at least one instruction of the second set of instructions is not in the first set of instructions, and the plurality of columns may be identical and form repetitive columns.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the private access mode, one address in the vector address may be routed to one memory bank of the memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared access mode, one address in the vector address may be routed in a defined region across memory banks regardless of the thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers for the plurality of ALUs and may be configured to operate independently.
In yet another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) arranged in a plurality of columns; a plurality of switch boxes (SBs) each associated with a separate PE of the plurality of PEs to provide data switching; and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be delivered to at least one of the plurality of PEs via a switch box for the PE execution result to be used as input data during a next PE configuration.
In another exemplary embodiment, there is provided a method comprising mapping an execution kernel into a virtual data path. The execution kernel may include a sequence of instructions to be executed by a processor, and the processor may comprise various reconfigurable units that form repetitive columns. The method may further comprise: chopping the virtual data path into one or more physical data paths to fit each physical data path into the repetitive columns respectively; and delivering configurations to various reconfigurable units of the processor for the various reconfigurable units to form the one or more physical data paths to execute the sequence of instructions.
In yet another exemplary embodiment, there is provided a method comprising mapping an execution kernel into a virtual data path for a processor to execute, the processor comprising various reconfigurable units that form repetitive columns; chopping the virtual data path into a plurality of physical data paths including a first physical data path that fits into the repetitive columns and a second physical data path that fits into the repetitive columns; and delivering configurations to various reconfigurable units for the repetitive columns to form the first physical data path to execute a first part of the execution kernel and form the second physical data path to execute a second part of the execution kernel.
In yet another exemplary embodiment, there is provided a processor comprising: a plurality of reconfigurable units including a plurality of processing elements (PEs) and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of reconfigurable units comprising a configuration buffer and a reconfiguration counter; and a sequencer coupled to the configuration buffer of each of the plurality of reconfigurable units and configured to distribute a plurality of configurations to the plurality of reconfigurable units for the plurality of PEs and the plurality of memory ports to execute a sequence of instructions.
According to an embodiment, each of the plurality of configurations may include a specified number, and the reconfiguration counter of each of the plurality of PEs and the plurality of memory ports may be configured to count for a respective PE or MP to repeat an instruction of the sequence of instructions the specified number of times.
According to an embodiment, the plurality of reconfigurable units may further include a plurality of data switching units, each of the plurality of data switching units may be configured to apply a data switching setting according to a current data switching configuration the specified number of times.
According to an embodiment, the plurality of reconfigurable units may further include a gasket memory, the gasket memory may comprise a plurality of data buffers, an input configuration buffer, an output configuration buffer, a plurality of input reconfiguration counters and a plurality of output reconfiguration counters, and the gasket memory may be configured to perform reconfiguration for input and output independently.
According to an embodiment, the plurality of configurations may include a first set of configurations for the plurality of reconfigurable units to form a first physical data path and a second set of configurations for the plurality of reconfigurable units to form a second physical data path, the gasket memory may be configured to store data from the first physical data path to be used as input to the second physical data path.
According to an embodiment, each of the plurality of reconfigurable units may be configured to switch to a next configuration independently after its reconfiguration counter reaches the specified number.
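A hedged sketch of the counter-driven reconfiguration described in the preceding embodiments: each unit repeats its current operation the specified number of times and then independently switches to its next configuration. The (operation, specified_number) tuple layout and the FIFO model are assumptions made for illustration.

    from collections import deque

    class ReconfigurableUnit:
        # Illustrative model of a reconfigurable unit with a configuration
        # FIFO and a reconfiguration counter.
        def __init__(self, configurations):
            # each configuration is a (operation, specified_number) pair
            self.config_fifo = deque(configurations)

        def run(self):
            while self.config_fifo:
                operation, specified_number = self.config_fifo.popleft()
                for _ in range(specified_number):  # reconfiguration counter
                    operation()
                # counter reached the specified number: the next loop
                # iteration applies the next configuration independently

    # usage: repeat each (hypothetical) instruction three times
    unit = ReconfigurableUnit([(lambda: print("add"), 3),
                               (lambda: print("mul"), 3)])
    unit.run()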
According to an embodiment, each of the plurality of memory ports may be configured to operate in a private memory access mode or a shared memory access mode during one configuration.
According to an embodiment, a piece of data stored in the memory unit may be accessed through the private memory access mode and the shared memory access mode in configurations for different physical data paths without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the private memory access mode, one address in the vector address may be routed to one memory bank of the memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared memory access mode, one address in the vector address may be routed in a defined region across memory banks regardless of thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers and may be configured to operate independently.
According to an embodiment, the plurality of PEs may form two or more rows.
According to an embodiment, a first row of PEs may be configured to implement a first set of instructions and a second row of PEs may be configured to implement a second set of instructions, at least one instruction of the second set of instructions is not in the first set of instructions.
According to an embodiment, the plurality of PEs and the plurality of memory ports (MPs) may be arranged in repetitive columns.
According to an embodiment, each of the sequence of instructions may be executed by one of the plurality of PEs or one of the plurality of memory ports as a stage of a pipeline according to a respective configuration.
In yet another exemplary embodiment, there is provided a method comprising: delivering a plurality of configurations to a plurality of reconfigurable units of a processor for the plurality of reconfigurable units to form a plurality of physical data paths to execute a sequence of instructions, each of the plurality of configurations including a specified number; repeating a respective operation at each of the plurality of reconfigurable units for the specified number of times, including executing a first instruction of the sequence of instructions at a first reconfigurable processing element (PE) the specified number of times according to a first configuration in a first physical data path; and reconfiguring each of the plurality of reconfigurable units to a new configuration after repeating the respective operation the specified number of times, including executing a second instruction of the sequence of instructions at the first reconfigurable PE the specified number of times according to a second configuration in a second physical data path.
According to an embodiment, the plurality of reconfigurable units may include a plurality of PEs and a plurality of memory ports, and at least one instruction of the sequence of instructions may be a memory access instruction and executed by a memory port the specified number of times before the memory port is reconfigured by applying a next memory port configuration.
According to an embodiment, the plurality of reconfigurable units may further include a plurality of data switching units, and each of the plurality of data switching units may be configured to repeat a respective operation by applying a data switching setting according to a current data switching configuration the specified number of times.
According to an embodiment, the processor may further comprise a gasket memory. The gasket memory may comprise a plurality of data buffers, an input configuration buffer, an output configuration buffer, a plurality of input reconfiguration counters and a plurality of output reconfiguration counters, and the gasket memory may be configured to perform reconfiguration for input and output independently.
According to an embodiment, the method may further comprise storing data from the first physical data path in the gasket memory to be used as input to the second physical data path.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of memory ports may be configured to operate in a private memory access mode or a shared memory access mode during one configuration.
According to an embodiment, the method may further comprise accessing a piece of data stored in a memory unit through the private memory access mode and the shared memory access mode in different physical data paths without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the private memory access mode, one address in the vector address may be routed to one memory bank of a memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared memory access mode, one address in the vector address may be routed in a defined region across memory banks regardless of thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers and may be configured to operate independently during one physical data path configuration.
According to an embodiment, the plurality of PEs may form a PE array and the sequence of instructions may be mapped into one or more physical data paths on the processor based on a size of the PE array, connections between the plurality of PEs, and memory access capability.
In yet another exemplary embodiment, there is provided a method comprising: delivering a first set of configurations to a plurality of reconfigurable units of a processor for the plurality of reconfigurable units to form a first physical data path to execute a first part of a sequence of instructions, each of the first set of configurations including a specified number; delivering a second set of configurations to the plurality of reconfigurable units for the plurality of reconfigurable units to form a second physical data path to execute a second part of the sequence of instructions, each of the second set of configurations including the specified number; applying the first set of configurations at the plurality of reconfigurable units for each of the plurality of reconfigurable units to repeat a respective operation the specified number of times to execute the first physical data path; storing data from the first physical data path to a gasket memory; and applying the second set of configurations at the plurality of reconfigurable units for each of the plurality of reconfigurable units to repeat a respective operation the specified number of times to execute the second physical data path, with the data stored in the gasket memory as input to the second physical data path.
According to an embodiment, the gasket memory may comprise a plurality of data buffers, an input configuration buffer, an output configuration buffer, a plurality of input reconfiguration counters and a plurality of output reconfiguration counters, and the gasket memory may be configured to perform reconfiguration for input and output independently.
According to an embodiment, the plurality of reconfigurable units may include a plurality of PEs and a plurality of memory ports, and at least one instruction of the sequence of instructions may be a memory access instruction and executed by a memory port the specified number of times before the memory port may be reconfigured by applying a next memory port configuration.
According to an embodiment, the plurality of reconfigurable units may further include a plurality of data switching units, and each of the plurality of data switching units may be configured to repeat a respective operation by applying a data switching setting according to a current data switching configuration the specified number of times.
According to an embodiment, each of the plurality of PEs may comprise a plurality of arithmetic logic units (ALUs) that may be configured to execute a same instruction in parallel threads.
According to an embodiment, each of the plurality of memory ports may be configured to operate in a private memory access mode or a shared memory access mode during one configuration.
According to an embodiment, the method may further comprise accessing a piece of data stored in a memory unit through the private memory access mode and the shared memory access mode in different physical data paths without the piece of data being moved in the memory unit.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, wherein in the private memory access mode, one address in the vector address may be routed to one memory bank of a memory unit according to a thread index and all private data for one thread may be located in a same memory bank.
According to an embodiment, each of the plurality of memory ports may be configured to access the memory unit using a vector address, and in the shared memory access mode, one address in the vector address may be routed in a defined region across memory banks regardless of thread index and data shared to all threads may be spread in all memory banks.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers and may be configured to operate independently during one physical data path configuration.
According to an embodiment, the plurality of PEs may form a PE array and the sequence of instructions may be mapped into one or more physical data paths on the processor based on a size of the PE array, connections between the plurality of PEs, and memory access capability.
In another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs comprising an address calculation unit configured to generate respective memory addresses for each thread to access a different memory bank in the memory unit.
According to an embodiment, the address calculation unit may have a first input coupled to a base address input that provides a base address common to all threads, a second input coupled to a vector address that provides address offsets for each thread individually, and a third input coupled to a counter that is configured to provide thread indexes.
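A sketch of the three-input address calculation just described; the numeric values and the (thread index, address) output pairing are illustrative assumptions:

    def calculate_addresses(base_address, vector_offsets, thread_counter):
        # Combine a base address common to all threads, per-thread offsets
        # from the vector address, and thread indexes from a counter.
        return [(thread_index, base_address + offset)
                for thread_index, offset in zip(thread_counter, vector_offsets)]

    # usage: four threads reading consecutive 4-byte words from one base
    print(calculate_addresses(0x1000, [0, 4, 8, 12], range(4)))
    # [(0, 4096), (1, 4100), (2, 4104), (3, 4108)]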
According to an embodiment, one address in the vector address may be routed to one memory bank according to a thread index.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank.
According to an embodiment, each of the plurality of memory ports may be coupled to the plurality of memory caches.
According to an embodiment, each memory bank may comprise a plurality of memory words and a cache miss in a memory cache may cause a word to be fetched from a memory bank associated with the memory cache.
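As a toy sketch of the per-bank caching described in the preceding embodiments, assuming a direct-mapped organization and a hypothetical line count (the disclosure specifies neither):

    class BankCache:
        # Toy direct-mapped cache in front of one memory bank; a miss
        # fetches the full memory word from the associated bank.
        def __init__(self, bank_words, num_lines=4):
            self.bank_words = bank_words  # the bank's memory words
            self.num_lines = num_lines
            self.lines = {}               # slot -> (word index, word)

        def read(self, word_index):
            slot = word_index % self.num_lines
            tag, word = self.lines.get(slot, (None, None))
            if tag != word_index:                   # cache miss
                word = self.bank_words[word_index]  # fetch word from bank
                self.lines[slot] = (word_index, word)
            return word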
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers to store data for each thread separately.
According to an embodiment, the processor may further comprise a sequencer coupled to the plurality of memory ports, and each of the plurality of memory ports may comprise a configuration buffer to receive one or more configurations from the sequencer, and each memory port may be configured to provide a same memory access pattern during one configuration.
According to an embodiment, consecutive data pieces for one thread may be located in one word of a memory bank and continue in a next word of the memory bank.
According to an embodiment, consecutive data pieces for one thread may be located in a same position of consecutive words of a memory bank.
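The two bank-internal layouts of the preceding embodiments can be expressed as indexing rules; the number of data pieces per memory word is an assumed constant:

    PIECES_PER_WORD = 4  # assumed data pieces per memory word

    def within_word_layout(piece_index):
        # Consecutive pieces of a thread fill one word and continue in the
        # next word of the same memory bank.
        word, position = divmod(piece_index, PIECES_PER_WORD)
        return word, position

    def across_words_layout(piece_index, fixed_position=0):
        # Consecutive pieces of a thread stay at a same position of
        # consecutive words of the same memory bank.
        return piece_index, fixed_position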
According to an embodiment, the plurality of MPs may be configured in a one column mode, in which one MP may be configured to access the memory unit for all concurrent threads in one PE and the address offsets may be independent for each thread.
According to an embodiment, the plurality of MPs may be configured in a linear mode, in which multiple MPs may be configured to access the memory unit, wherein a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, the address offsets in the second MP may be linear to the address offsets in the first MP.
According to an embodiment, the plurality of MPs may be configured in a reverse linear mode, in which multiple MPs may be configured to access the memory unit, wherein a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, the address offsets in the second MP may be reverse linear to the address offsets in the first MP.
According to an embodiment, the plurality of MPs may be configured in an overlap mode, in which multiple MPs may be configured to access the memory unit, wherein a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, the address offsets in the second MP may have overlap with the address offsets in the first MP.
According to an embodiment, the plurality of MPs may be configured in a non-unity stride mode, in which multiple MPs may be configured to access the memory unit, wherein a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, the address offsets in the second MP and the address offsets in the first MP may be spaced by a stride.
According to an embodiment, the plurality of MPs may be configured in a random mode, in which multiple MPs may be configured to access the memory unit, and address offsets in different MPs may be random numbers.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank and the random numbers may be within a range depending on a size of a memory cache.
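The access modes of the preceding embodiments may be made concrete by the per-MP address offsets they imply. The following sketch is illustrative only; the thread count, the stride, and the random range from C to C+R are assumed values:

    import random

    THREADS = 4  # assumed number of concurrent threads per PE

    def mp_offsets(mode, mp_index, stride=2, c=0, r=16):
        # Illustrative address offsets generated by one MP for each mode.
        base = mp_index * THREADS
        if mode == "linear":            # second MP linear to the first
            return [base + t for t in range(THREADS)]
        if mode == "reverse_linear":    # same span, descending order
            return [base + THREADS - 1 - t for t in range(THREADS)]
        if mode == "overlap":           # neighboring MPs share boundary offsets
            return [mp_index * (THREADS - 1) + t for t in range(THREADS)]
        if mode == "non_unity_stride":  # offsets spaced by a stride
            return [(base + t) * stride for t in range(THREADS)]
        if mode == "random":            # random offsets within C to C+R
            return [c + random.randrange(r) for _ in range(THREADS)]
        raise ValueError(mode)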
According to an embodiment, the memory unit may be configured to be used as registers to store spilled variables for register spilling.
In another exemplary embodiment, there is provided a method comprising: generating a plurality of memory addresses by an address calculation unit in a memory port of a plurality of memory ports, wherein the plurality of memory ports provide access to a memory unit for a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and accessing a plurality of memory banks in the memory unit using the plurality of memory addresses with each thread accessing a different memory bank in the memory unit.
According to an embodiment, the address calculation unit may have a first input coupled to a base address input that provides a base address common to all threads, a second input coupled to a vector address that provides address offsets for each thread individually, and a third input coupled to a counter that is configured to provide thread indexes, and the address calculation unit may be configured to generate the plurality of memory addresses using the first input, the second input and the third input.
According to an embodiment, one address in the vector address may be routed to one memory bank according to a thread index.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank, and accessing the plurality of memory banks in the memory unit may comprise accessing the plurality of memory caches.
According to an embodiment, each of the plurality of memory ports may be coupled to the plurality of memory caches.
According to an embodiment, the method may further comprise fetching a word from a plurality of words of a memory bank when there is a cache miss in a memory cache associated with the memory bank.
According to an embodiment, the method may further comprise storing data for each thread in a separate data buffer in each of the plurality of PEs.
According to an embodiment, the method may further comprise receiving one or more configurations by the memory port from a sequencer, and the memory port may be configured to provide a same memory access pattern during one configuration.
According to an embodiment, consecutive data pieces for one thread may be located in one word of a memory bank and continue in a next word of the memory bank.
According to an embodiment, consecutive data pieces for one thread may be located in a same position of consecutive words of a memory bank.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in a one column mode, in which one MP may be configured to access the memory unit for all concurrent threads in one PE and the address offsets are independent for each thread.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in a linear mode, in which multiple MPs may be configured to access the memory unit, and a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, address offsets in the second MP may be linear to address offsets in the first MP.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in a reverse linear mode, in which multiple MPs may be configured to access the memory unit, and a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, address offsets in the second MP may be reverse linear to address offsets in the first MP.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in an overlap mode, in which multiple MPs may be configured to access the memory unit, and a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, address offsets in the second MP may have overlap with address offsets in the first MP.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in a non-unity stride mode, in which multiple MPs may be configured to access the memory unit, and a first MP may be configured to access the memory unit for all concurrent threads in a first PE, and a second MP may be configured to access the memory unit for all concurrent threads in a second PE, address offsets in the second MP and address offsets in the first MP may be spaced by a stride.
According to an embodiment, accessing the plurality of memory banks in the memory unit may be in a random mode, in which multiple MPs may be configured to access the memory unit, and address offsets in different MPs may be random numbers.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank and the random numbers may be within a range depending on a size of a memory cache.
According to an embodiment, the method may further comprise storing spilled variables in the memory unit for register spilling.
In an exemplary embodiment, there is provided a processor comprising: a memory unit comprising a plurality of memory banks; a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a plurality of memory ports (MPs) for the plurality of PEs to access the memory unit, each of the plurality of MPs comprising an address calculation unit configured to generate respective memory addresses for each thread to access a different memory bank in the memory unit.
According to an embodiment, the address calculation unit may have a first input coupled to a base address input that provides a base address common to all threads, a second input coupled to a vector address that provides address offsets for each thread individually, and a third input coupled to a counter that is configured to provide thread indexes.
In an exemplary embodiment, there is provided a processor comprising: a processing element (PE) having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a memory port (MP) for the PE to access a memory unit, the MP comprising an address calculation unit configured to generate respective memory addresses for each thread to access a different memory bank in the memory unit.
According to an embodiment, the PE may be one of a plurality of PEs that each has a plurality of ALUs configured to execute a same instruction in parallel threads.
According to an embodiment, the MP may be one of a plurality of MPs that each has an address calculation unit configured to generate respective memory addresses for each thread in one of the plurality of PEs to access a different memory bank in the memory unit.
In yet another exemplary embodiment, there is provided a method comprising: generating a plurality of memory addresses by an address calculation unit in a memory port, wherein the memory port provides access to a memory unit for a processing element (PE) having a plurality of arithmetic logic units (ALUs) configured to execute a same instruction in parallel threads; and accessing a plurality of memory banks in the memory unit using the plurality of memory addresses with each thread accessing a different memory bank in the memory unit.
According to an embodiment, the PE may be one of a plurality of PEs that each may have a plurality of ALUs configured to execute a same instruction in parallel threads.
According to an embodiment, the MP may be one of a plurality of MPs that each may have an address calculation unit configured to generate respective memory addresses for each thread in one of the plurality of PEs to access a different memory bank in the memory unit.
In an exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs comprising an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit.
According to an embodiment, the address calculation unit may have a first input coupled to a base address input that provides a base address common to all threads, and a second input coupled to a vector address that provides address offsets for each thread individually.
According to an embodiment, the address calculation unit may be configured to generate a number of memory addresses that match a number of threads in a PE.
According to an embodiment, each of the plurality of MPs may further comprise a plurality of selection units coupled to the number of memory addresses, each of the plurality of selection units may be configured to select zero or more memory addresses to be routed to one memory bank of the memory unit.
According to an embodiment, each selection unit may be configured with a mask for a different memory bank of the memory unit.
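A minimal sketch of the mask-based selection just described, assuming a low-address-bits bank-assignment rule (an assumption; the disclosure only states that each selection unit is configured with a mask for a different memory bank):

    NUM_BANKS = 8  # assumed number of memory banks

    def bank_of(address):
        # Hypothetical bank-assignment rule: low address bits pick the bank.
        return address % NUM_BANKS

    def selection_unit(bank, vector_address):
        # Selection unit for one bank: selects zero or more addresses of
        # the vector address to be routed to that bank.
        mask = [bank_of(addr) == bank for addr in vector_address]
        return [addr for addr, selected in zip(vector_address, mask) if selected]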
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE and the address offsets may be the same for all threads.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, the address offsets may be the same within one MP but different for different MPs.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE and the address offsets may be sequential in the MP.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, the address offsets may be sequential within each MP respectively.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE, the address offsets may be sequential with discontinuity.
According to an embodiment, the plurality of MPs may be configured to access the memory unit for different threads in different PEs, the address offsets may be sequential with discontinuity in each of the MPs respectively.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE, the address offsets may be linear with non-unity stride.
According to an embodiment, multiple MPs may be configured to access the memory unit for all threads in one PE, the address offsets may be random but within a small range from C to C+R depending on a size of the memory cache.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, the address offsets may be random but within a small range from C to C+R depending on a size of the memory cache.
According to an embodiment, the common area may include all memory banks of the memory unit.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank.
According to an embodiment, each of the plurality of memory ports may be coupled to the plurality of memory caches.
According to an embodiment, each memory bank may comprise a plurality of memory words and a cache miss in a memory cache may cause a memory word to be fetched from a memory bank associated with the memory cache.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers to store data for each thread separately.
According to an embodiment, the processor may comprise a sequencer coupled to the plurality of memory ports, and each of the plurality of memory ports may comprise a configuration buffer to receive one or more configurations from the sequencer, and each memory port may be configured to provide a same memory access pattern during one configuration.
In yet another exemplary embodiment, there is provided a method comprising: generating a plurality of memory addresses by an address calculation unit in a memory port of a plurality of memory ports, wherein the plurality of memory ports provide access to a memory unit for a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and accessing a plurality of memory banks in the memory unit using the plurality of memory addresses with all threads accessing a common area in the memory unit.
According to an embodiment, the address calculation unit may take a base address common to all threads as a first input and a vector address that provides address offsets for each thread individually as a second input to generate the plurality of memory addresses.
According to an embodiment, the address calculation unit may be configured to generate a number of memory addresses that match a number of threads in a PE.
According to an embodiment, accessing the plurality of memory banks may comprise selecting zero or more memory addresses to be routed to one memory bank of the memory unit using a plurality of selection units respectively.
According to an embodiment, each selection unit may be configured with a mask for a different memory bank of the memory unit.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE and the address offsets may be the same for all threads.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, the address offsets may be the same within one MP but different for different MPs.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE and the address offsets may be sequential in the MP.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, the address offsets may be sequential within each MP respectively.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE, the address offsets may be sequential with discontinuity.
According to an embodiment, the plurality of MPs may be configured to access the memory unit for different threads in different PEs, the address offsets may be sequential with discontinuity in each of the MPs respectively.
According to an embodiment, one MP may be configured to access the memory unit for all threads in one PE, the address offsets may be linear with non-unity stride.
According to an embodiment, multiple MPs may be configured to access the memory unit for all threads in one PE, the address offsets may be random but within a small range C to C+R dependent on a size of the memory cache.
According to an embodiment, multiple MPs may be configured to access the memory unit for threads in different PEs, and the address offsets may be random but within a small range C to C+R dependent on a size of the memory cache.
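For illustration only, the address-offset patterns enumerated above (identical offsets, sequential offsets, linear offsets with non-unity stride, and random offsets within a range C to C+R) might be modeled with a short sketch such as the following; the function name and parameter values are illustrative assumptions, not elements of the disclosure.

```python
import random

# Hypothetical sketch of the per-thread address patterns enumerated above:
# every pattern is a base address common to all threads plus a per-thread
# offset vector (names and values are illustrative assumptions).

def gen_addresses(base, offsets):
    """One vector-address generation step: one memory address per thread."""
    return [base + off for off in offsets]

num_threads = 8
same_offset = gen_addresses(0x1000, [0] * num_threads)                    # same offset for all threads
sequential  = gen_addresses(0x1000, list(range(num_threads)))             # sequential offsets
strided     = gen_addresses(0x1000, [4 * t for t in range(num_threads)])  # linear, non-unity stride
C, R = 0x40, 0x20
banded      = gen_addresses(0x1000, [random.randrange(C, C + R)           # random within C to C+R
                                     for _ in range(num_threads)])
```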
According to an embodiment, the common area may include all memory banks of the memory unit.
According to an embodiment, the memory unit may comprise a plurality of memory caches each associated with a different memory bank.
According to an embodiment, each of the plurality of memory ports may be coupled to the plurality of memory caches.
According to an embodiment, each memory bank may comprise a plurality of memory words and a cache miss in a memory cache may cause a memory word to be fetched from a memory bank associated with the memory cache.
According to an embodiment, each of the plurality of PEs may comprise a plurality of data buffers to store data for each thread separately.
According to an embodiment, the method may further comprise receiving one or more configurations from a sequencer for each of the plurality of memory ports, wherein each memory port is configured to provide a same memory access pattern during one configuration.
In yet another exemplary embodiment, there is provided a processor comprising: a memory unit comprising a plurality of memory banks; a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a plurality of memory ports (MPs) for the plurality of PEs to access the memory unit, each of the plurality of MPs comprising an address calculation unit configured to generate respective memory addresses for each thread to access a common area across the plurality of memory banks in the memory unit.
In an exemplary embodiment, there is provided a processor comprising: a processing element (PE) having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads; and a memory port (MP) for the PE to access a memory unit, the MP comprising an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit.
According to an embodiment, the PE may be one of a plurality of PEs that each may have a plurality of ALUs configured to execute a same instruction in parallel threads.
According to an embodiment, the MP may be one of a plurality of MPs that each may have an address calculation unit configured to generate respective memory addresses for each thread in one of the plurality of PEs to access the common area in the memory unit.
In yet another exemplary embodiment, there is provided a method comprising: generating a plurality of memory addresses by an address calculation unit in a memory port, wherein the memory port provides access to a memory unit for a processing element (PE) having a plurality of arithmetic logic units (ALUs) configured to execute a same instruction in parallel threads; and accessing a plurality of memory banks in the memory unit using the plurality of memory addresses with each thread accessing a common area in the memory unit.
According to an embodiment, the PE may be one of a plurality of PEs that each may have a plurality of ALUs configured to execute a same instruction in parallel threads.
According to an embodiment, the MP may be one of a plurality of MPs that each may have an address calculation unit configured to generate respective memory addresses for each thread in one of the plurality of PEs to access the common area in the memory unit.
In yet another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each comprising: an arithmetic logic unit (ALU); a data buffer associated with the ALU; and an indicator associated with the data buffer to indicate whether a piece of data inside the data buffer is to be reused for repeated execution of a same instruction as a pipeline stage.
According to an embodiment, the processor may further comprise a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit, and one MP of the plurality of MPs responsible for loading a piece of data from the memory unit for the piece of data to be reused at a PE may be configured to load the piece of data only once.
According to an embodiment, the MP responsible for loading the piece of data to be reused may be configured to determine that the piece of data is to be reused by determining that multiple threads to be executed at the PE are loading the piece of data using a same memory address.
According to an embodiment, at least one piece of data to be reused may be an execution result generated by one PE of the plurality of PEs.
According to an embodiment, each PE of the plurality of PEs may further comprise a configuration buffer to store configurations for each PE and a reconfiguration counter to count a number of repeated executions, each configuration may specify an instruction to be executed by a respective PE and the number for the instruction to be repeated during a respective configuration.
According to an embodiment, the ALU may be a vector ALU and the data buffer may be a vector data buffer, each data buffer of the vector data buffer may be associated with one ALU of the vector ALU.
According to an embodiment, the processor may further comprise a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit, and each MP of the plurality of MPs may comprise at least one data buffer for temporarily storing data loaded from the memory unit and each of the at least one data buffer may have an indicator associated therewith to indicate whether a piece of data stored therein is to be reused for other load operations.
According to an embodiment, each PE may comprise a plurality of data buffers associated with the ALU, each of the plurality of data buffers may be configured to store a separate input for the ALU and may have an associated indicator to indicate whether a respective input is to be reused for repeated execution.
In yet another exemplary embodiment, there is provided a method comprising: determining that a piece of data is to be shared and reused by all threads at a processing element (PE) of a processor during one configuration applied at the PE; loading the piece of data once into a data buffer of the PE; setting an indicator associated with the data buffer to indicate that the piece of data is to be reused; and executing a same instruction with the piece of data as an input at the PE repeatedly a number of times as a pipeline stage, the same instruction and the number being specified by the configuration.
According to an embodiment, the method may further comprise loading the piece of data from a memory unit for the piece of data to be loaded into the data buffer of the PE, wherein the processor comprises a plurality of PEs and a plurality of memory ports (MPs) for the plurality of PEs to access the memory unit, wherein one MP of the plurality of MPs responsible for loading the piece of data from the memory unit is configured to load the piece of data only once.
According to an embodiment, the method may further comprise generating the piece of data as an execution result by one of a plurality of PEs of the processor.
According to an embodiment, the method may further comprise receiving the configuration and storing the configuration in a configuration buffer of the PE, wherein the configuration may specify an instruction to be executed by the PE and the number for the instruction to be repeated.
According to an embodiment, determining that the piece of data is to be shared and reused by all threads at the PE may comprise determining that all threads are using a same memory address to access the piece of data.
According to an embodiment, the method may further comprise loading the piece of data once into a data buffer of a memory port, the memory port providing access to a memory unit for the PE; setting an indicator associated with the data buffer of the memory port to indicate that the piece of data is to be reused for other load operations accessing a same memory address.
In yet another exemplary embodiment, there is provided a processor comprising: a plurality of processing elements (PEs) each comprising: a vector arithmetic logic unit (ALU) including a plurality of ALUs; a plurality of data buffers associated with each of the plurality of ALUs; and a plurality of indicators each associated with a separate data buffer to indicate whether a piece of data inside a respective data buffer is to be reused for repeated execution of a same instruction as a pipeline stage at a respective PE.
According to an embodiment, the processor may further comprise a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit, wherein one MP of the plurality of MPs responsible for loading a piece of data from the memory unit for the piece of data to be reused at a PE may be configured to load the piece of data only once.
According to an embodiment, the MP responsible for loading the piece of data to be reused may be configured to determine that the piece of data is to be reused by determining that multiple threads to be executed at the PE are loading the piece of data using a same memory address.
According to an embodiment, at least one piece of data to be reused may be an execution result generated by one PE of the plurality of PEs.
According to an embodiment, the processor may further comprise a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit, each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit, and each MP of the plurality of MPs may comprise at least one data buffer for temporarily storing data loaded from the memory unit and each of the at least one data buffer may have an indicator associated therewith to indicate whether a piece of data stored therein is to be reused for other load operations.
According to an embodiment, each of the plurality of data buffers may be a vector data buffer having a plurality of data buffer units and a piece of data to be reused for repeated execution may be duplicated in all data buffer units of one vector data buffer.
According to an embodiment, each of the plurality of data buffers may be a vector data buffer having a plurality of data buffer units and a piece of data to be reused for repeated execution may be stored only in one data buffer unit of one vector data buffer.
In yet another exemplary embodiment, there is provided a method comprising: receiving a first configuration and a second configuration at a reconfigurable unit of a processor, the reconfigurable unit having a configuration buffer to store the first configuration and the second configuration; executing a first operation a first number of times according to the first configuration, the first configuration being part of a first physical data path for executing a first part of a sequence of instructions; and reconfiguring the reconfigurable unit to execute a second operation a second number of times according to the second configuration, the second configuration being part of a second physical data path for executing a second part of the sequence of instructions.
In another exemplary embodiment, there is provided a method comprising: executing a first instruction at a reconfigurable processing element a number of times according to a first configuration, the reconfigurable processing element being configured to be part of a first physical data path during the first configuration; delivering an execution result from the reconfigurable processing element to a gasket memory to temporarily store the execution result after each execution of the first instruction; and feeding the execution result stored in the gasket memory to a second physical data path.
Reference will now be made in detail to the embodiments of the present teaching, examples of which are illustrated in the accompanying drawings. Like elements in the various figures are denoted by like reference numerals for consistency. While the present teaching will be described in conjunction with the embodiments, it will be understood that they are not intended to limit the present teaching to these embodiments. On the contrary, the present teaching is intended to cover alternatives, modifications, and equivalents, which may be included within the spirit and scope of the present teaching as defined by the appended claims.
In addition, in the following detailed description of embodiments of the present teaching, numerous specific details are set forth in order to provide a thorough understanding of the present teaching. However, it will be recognized by one of ordinary skill in the art that the present teaching may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments of the present teaching.
It should be noted that as used herein, a “coupling” between two components, such as one component being “coupled” to another component, may refer to an electronic connection between the two components, which may include but is not limited to coupling by electronic wiring, through an electronic element (e.g., a resistor, a transistor), etc. Moreover, in some embodiments, the processor 200 may be configured for massive thread level parallel processing. For example, one processing element (PE) in the PE array 214 may comprise multiple arithmetic logic units (ALUs) that may be configured to perform the same operation but on different data (e.g., each in a separate thread). That is, in these embodiments with multiple ALUs, each PE may be configured to operate in a Single Instruction Multiple Threads (SIMT) fashion. In one embodiment, a PE with a vector address and a vector data input may generate vector data output. In some embodiments, a thread may also be referred to as a stream.
To provide data for multiple threads to be executed concurrently, in some embodiments, some relevant electronic connections between components of the processor 200 may be in vectors. For example, a vector address of H×G may have H number of G-bit addresses, and a vector data connection of K×W may have K number of W-bit data. It should also be noted that although not shown in any of the figures, data or address connections between different components may be accompanied by one or more signal lines. For example, a busy signal line may exist between a first component and a second component, and may be used by the first component to send a busy signal to the second component indicating that the first component is not ready to accept valid data or address signals. Moreover, a valid signal line may also exist between the first and second components, and may be used by the second component to send a valid signal to the first component indicating that valid data or address signals have been put on the connection wires.
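As a rough illustration of the busy/valid signaling just described, the following minimal model assumes the data sender drives the valid signal and the receiver drives the busy signal; the class and method names are assumptions for illustration, not part of the disclosure.

```python
# A minimal sketch of a busy/valid handshake on one data connection,
# assuming one sender and one receiver per connection (names assumed).

class Wire:
    def __init__(self):
        self.valid = False  # driven by the sender: data on the wires is valid
        self.busy = False   # driven by the receiver: not ready to accept data
        self.data = None

    def send(self, data):
        """Sender places data only when the receiver is not busy."""
        if not self.busy:
            self.data = data
            self.valid = True
            return True
        return False        # stall: retry on a later cycle

    def receive(self):
        """Receiver takes data only when the sender asserts valid."""
        if self.valid:
            self.valid = False
            return self.data
        return None
```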
The configuration memory 204 may store data path programs consisting of executable instructions and/or data loading instructions for one or more data paths. In one embodiment, the data path programs stored in the configuration memory 204 may be sequence(s) of compiled instructions. For example, a data path program may include instructions to be executed by the PE array 214, which represent configuration information: which instructions are to be executed by which PEs when conditions are met, and how each data path component may hold or transmit data.
The sequencer 206 may decode the instruction stored in the configuration memory 204 and move a decoded instruction into the memory unit 212 and a physical data path. The physical data path may include various components of the PE array 214 (e.g., components of the PE array 214 that will be involved in the execution, staging and/or movement of data) and the gasket memory 216. The decoded instruction may be delivered to various components in a package, which may be referred to as a configuration package or simply a configuration. In addition to the decoded instruction, a configuration package for one component may include some other parameters (e.g., a number specifying how many times an instruction is to be repeatedly executed or how many times data passes through a data switching unit in one configuration setting). In one embodiment, a physical data path configuration may be referred to as a physical data path program, which may comprise individual configurations for various components included in a physical data path. Although not shown, there may be a configuration bus connecting the sequencer 206 to the components of a data path for individual configurations to be delivered to these components via the bus respectively.
The memory unit 212 may be a data staging area to store data received from the external bus 230 and store execution result data generated by the PE array 214 (before these results may be transmitted away via the external bus 230). In some embodiments, the memory unit 212 may be an in-processor cache for a large memory system external to the processor 200. The PE array 214 may comprise a plurality of memory ports (MPs) 220.1-220.N, a plurality of switch boxes (SBs) 222.1-222.N, a plurality of processing elements (PEs) 218.1-218.N and a plurality of inter-column switch boxes (ICSBs) 224.1-224.N.
The plurality of MPs 220.1-220.N may be gateways for data flow between the PE array 214 and the memory unit 212. Each MP 220.1-220.N may be coupled to the memory unit 212 respectively to read from and write to the memory unit 212. With the exception of MP 220.1 and MP 220.N, all MPs may be coupled to two adjacent MPs such that each MP may be configured to receive data from a first MP and/or transmit data to a second MP. The electronic coupling between MPs may provide a one-way flow of data (e.g., if one computation configuration specifies that data may flow from one MP to a next MP). For example, as shown in FIG. 2, MP 220.1 may be coupled to MP 220.2 for one-way flow of data, MP 220.2 may be coupled to MP 220.3 for one-way flow of data. The last MP 220.N may be an exception and coupled to the gasket memory 216, which may provide a temporary storage for data. The first MP 220.1 may be another exception in that it may receive one-way flow of data from the gasket memory 216. In some embodiments, the MPs 220.1-220.N may form a data routing bus along a PE row direction. That is, data may be routed between MPs in a direction that is parallel to the direction that data may be routed between PEs. In embodiments with a two-dimensional PE array 214, each MP 220.1-220.N may be shared by one column of PEs. In one embodiment, the gasket memory 216 may be used as a data buffer, for example, first-in-first-out (FIFO), to collect data from the PE array and feed it back to the PE array for a new configuration.
In some embodiments, the PEs and MPs may be statically programmed with instructions for one configuration. For example, the instructions may be programmed to the PEs and MPs as a stage of pipeline and no instructions are changed during one configuration. The address computation instructions and memory access instructions like read or store may be mapped to the memory ports (MP) and other instructions may be mapped to PEs.
The SBs 222.1-222.N may be configured to provide data switching for neighboring PEs, PEs to data routing buses, and the data routing bus formed by the MPs 220.1-220.N and the data routing bus formed by the ICSBs 224.1-224.N. For example, the switch box 222.1 may be configured to provide data switching for data to be delivered to the processing element 218.1 from the gasket memory 216, the MP 220.1 and the ICSB 224.1. Moreover, the switch box 222.1 may be configured to route data between the gasket memory 216, the MP 220.1 and the ICSB 224.1. As another example, the switch box 222.2 may be configured to provide data switching for data to be delivered to the processing element 218.2 from the processing element 218.1, the MP 220.2 and the ICSB 224.2. Moreover, the switch box 222.2 may be configured to route data between the processing element 218.2, the MP 220.2 and the ICSB 224.2. In yet another example, the switch box 222.N may be configured to provide data switching for data to be delivered to the processing element 218.N from the PE 218.N−1, the MP 220.N and the ICSB 224.N. Moreover, the switch box 222.N may be configured to route data between PE 218.N−1, MP 220.N and ICSB 224.N. A SB may also be referred to as a data switching unit.
An exemplary data path may be illustrated by the exemplary internal connections of the SBs 222.1 to 222.N.
To simplify wording, a MP 220 may refer to one of the MPs 220.1-220.N, a SB 222 may refer to one of the SBs 222.1-222.N, a PE 218 may refer to one of the PEs 218.1-218.N and an ICSB 224 may refer to one of the ICSB 224.1-224.N.
In addition to being individually coupled to all caches 304.1-304.N of the memory unit 300, the MPs 220.1-220.N may be chained to form the row direction data routing bus, with the MP 220.1 and the MP 220.N each being coupled at one end to the gasket memory 216.
The memory unit 300 and MPs 220.1-220.N may support two accessing modes: a private memory access mode and a shared memory access mode, which may also be referred to as the private memory access method and shared memory access method. In one MP, multiple data units may be read or written using a vector address. These addresses of one vector may be different from each other. In the private memory access mode, one address in a vector address may be routed to one memory bank according to the thread index. All private data for one thread may be located in the same memory bank. In shared memory access mode, each MP may access anywhere in the defined region regardless of thread index. Data shared to all threads may be spread in all memory banks.
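The difference between the two modes might be sketched as follows, under the stated rule that private data for a thread resides in one bank selected by the thread index, while shared data may be spread across all banks; the bank-selection formulas are assumptions inferred from the surrounding description, not the disclosed circuitry.

```python
N_BANKS = 32

def private_bank(thread_idx):
    # Private mode: the target bank is fixed by the thread index, so all
    # private data of one thread stays in the same memory bank.
    return thread_idx % N_BANKS

def shared_bank(address, units_per_word=32):
    # Shared mode (assumed formula): the bank follows from the address
    # itself, so any thread may reach any bank in the defined region.
    return (address // units_per_word) % N_BANKS
```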
The memory unit structure is illustrated as one example. Each column of a PE array may have one MP with multiple buses going through. The memory port may be configured as shared (e.g., shared memory access mode) or private (e.g., private memory access mode). Each memory port may be further coupled to a data cache network.
In one embodiment of this first memory mapping, data units for different threads may be intended to be stored in different memory banks and wrap back to the first bank for thread N. For example, for N equal to 32, data units for the 32nd thread may be stored to memory bank 0 (e.g., data units S32(0) through S32(31) in memory bank 0), data units for the 33rd thread may be stored to memory bank 1 (e.g., data units S33(0) through S33(31) in memory bank 1), data units for the 63rd thread may be stored to memory bank N−1 (e.g., data units S63(0) through S63(31) in memory bank N−1), and so on.
In one embodiment of this second memory mapping, data units for different threads may be intended to be stored in different memory banks and wrap back to the first bank for thread N and integer multiples of N (e.g., 2N, 3N, etc.). Moreover, data units of a group of different threads with the same index may be mapped to the same word of a memory bank. For example, for N equal to 32, data units for the 32nd thread may be stored to memory bank 302.1 in different words (e.g., data units S32(0) through S32(99) in memory bank 302.1 in a second column, with data units S0(m) and S32(m) in the same word, m being the index of the data unit in a thread), data units for the 33rd thread may be stored to memory bank 302.2 in different words (e.g., data units S33(0) through S33(99) in memory bank 302.2 in a second column, with data units S1(m) and S33(m) in the same word, m being the index of the data unit in a thread), data units for the 63rd thread may be stored to memory bank 302.N (e.g., data units S63(0) through S63(99) in memory bank 302.N, with data units S31(m) and S63(m) in the same word, m being the index of the data unit in a thread), and so on. Because each word has 32 data units, the last data unit in the first row of memory bank 302.1 may be the first data unit S992(0) of the thread 992, the last data unit in the first row of memory bank 302.2 may be the first data unit S993(0) of the thread 993, and so on until the last data unit in the first row of memory bank 302.N may be the first data unit S1023(0) of the thread 1023. It should be noted that a thread may have more than 99 data units, that Si(99) (e.g., S0(99), etc.) may not be the last data unit of a thread, and that dotted lines may represent that more data units may exist and be stored in a memory bank.
The data units for thread 1024 and higher number of threads may be wrapped from the first column of memory bank 0 and so on. For example, with m being the index, data units for threads 1024, 1056 and so on until 2016 (e.g., S1024(m), S1056(m) and so on until S2016(m)) may be in one word of the memory bank 0; data units for threads 1025, 1057 and so on until 2017 (e.g., S1025(m), S1057(m) and so on until S2017(m)) may be in one word of the memory bank 1; and data units for threads 1055, 1087 and so on until 2047 (e.g., S1055(m), S1087(m) and so on until S2047(m)) may be in one word of the memory bank N−1.
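The two mappings worked through above might be reconstructed as the following index formulas. They are inferred from the worked examples and should be read as assumptions; the sketch also ignores the wrap of threads 1024 and higher into additional rows.

```python
N = 32  # number of memory banks, and data units per memory word in this example

def first_mapping(thread, m):
    # First mapping: all data units of a thread stay in bank (thread % N);
    # unit m of the thread occupies position m in that bank.
    return thread % N, m

def second_mapping(thread, m):
    # Second mapping: same bank choice, but unit m of threads k, k+N, k+2N, ...
    # shares one word, one column per wrap group (e.g., S0(m) and S32(m)
    # sit in the same word of memory bank 0).
    bank = thread % N
    word = m
    column = (thread // N) % N
    return bank, word, column
```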
Regardless of private or shared memory access modes, each of the caches 304.1-304.N of a memory unit 300 may comprise multiple cache lines that each may temporarily store a memory word from a corresponding memory bank. For example, cache 304.1 may comprise multiple cache lines that each may be configured to temporarily store one word retrieved from the memory bank 302.1 (e.g., memory bank 0), cache 304.2 may comprise multiple cache lines each configured to temporarily store one word retrieved from the memory bank 302.2 (e.g., memory bank 1), cache 304.N may comprise multiple cache lines each configured to temporarily store one word retrieved from the memory bank 302.N (e.g., memory bank N−1), and so on. A cache miss may be generated when one or more data pieces (e.g., one or more data units) requested are not in the cache.
Data storage in the memory unit 212 may be accessed by the MPs 220.1-220.N via the caches 304.1-304.N. The memory ports (MP) at each column may be configured with same components to carry out the memory operations, for example, calculating addresses and issuing read and/or store operations. In some embodiments, one cache 304 may be accessed by multiple MPs at the same time. Each of the MPs may be configured to provide the two accessing modes: the private memory access mode and the shared memory access mode. Due to the nature of SIMT, memory read or write instructions mapped to a MP for different threads belong to the same type, either shared or private. Moreover, a MP may be configured for private or shared memory access mode for a duration of a configuration.
The third input from the counter 404 may provide thread numbers (e.g., indexes) for the address calculation unit 402 and therefore, the counter 404 may be referred to as a thread counter. In one embodiment, the address vector, read data vector and write data vector may be simply split into each memory bank with a one-to-one mapping so that the data of different threads may be mapped into different memory banks. For example, the i-th address in the vector address may be for thread i (lower case letter “i” to denote a thread number, which may start from zero for the first thread), and the counter 404 may provide a thread number vector to the address calculation unit 402 so the address calculation unit 402 may generate N addresses as A_0, A_1, . . . , A_N−1 in this example corresponding to the vector size of the ALU. Each address in the vector address may be mapped to an address A_i and a corresponding address output for a corresponding memory bank (e.g., A_0 coupled to the address port 410.1 for the memory bank 0 cache 304.1, A_N−1 coupled to the address port 410.N for memory bank N−1 cache 304.N, etc.). The i-th data lines in the vector write data port WData 406 may be mapped to WD_i (e.g., WD_0 coupled to the write data port 412.1 for memory bank 0 cache 304.1, WD_N−1 coupled to the write data port 412.N for memory bank N−1 cache 304.N, etc.). The i-th data lines in the vector read data port RData 408 may be mapped to RD_i (e.g., RD_0 coupled to the read data port 414.1 for memory bank 0 cache 304.1, RD_N−1 coupled to the read data port 414.N for memory bank N−1 cache 304.N, etc.). No bus switch may be needed for this configuration and there may be no memory contention at this level.
It should be noted that the number of memory banks does not need to be identical to the vector size. For example, a vector (e.g., vector ALU, vector address, vector data ports) may have a vector size=V, a PE array may have a number of columns=N, and a memory unit may have a number of memory banks=M, and V, N and M may be all different. For convenience, the capital letter N may be used herein to denote the vector size, the number of columns of PEs, and the number of memory banks, but the number represented by N may be equal or different in different components.
For thread numbers larger than the number N, the address calculation unit 402 and the counter 404 may generate a memory mapping that wraps around to N memory banks. For example, thread 32 may be mapped to memory bank 0 cache 304.1 (e.g., S32(0) to memory bank 302.1).
Because more than one address may be selected for one memory bank, write data selection units (e.g., “Select 2” units 418.1 through 418.N) and read data selection units (e.g., “Select” units 420.1 through 420.N) may be provided to map multiple data ports from the vector data ports WData 406 and RData 408 to one memory bank. Each of the write data selection units 418.1 through 418.N may take an input from a corresponding data selection unit 416.1 through 416.N, and map multiple write data lines from the write data lines WD_0 through WD_N−1 to a corresponding write data port for a selected memory bank (e.g., write data port 422.1 for memory bank 0 cache 304.1, write data port 422.N for memory bank N−1 cache 304.N). Each of the read data selection units 420.1 through 420.N may take an input from a corresponding data selection unit 416.1 through 416.N passed over by a corresponding selection unit 418.1 through 418.N, and map multiple read data lines from the read data lines RD_0 through RD_N−1 to a corresponding read data port for a selected memory bank (e.g., read data port 424.1 for memory bank 0 cache 304.1, read data port 424.N for memory bank N−1 cache 304.N). In an embodiment in which up to two addresses may be selected from N addresses, the width of the address ports 426.1 through 426.N, the write data ports 422.1 through 422.N and the read data ports 424.1 through 424.N may be double that of the address ports 410.1 through 410.N, the write data ports 412.1 through 412.N and the read data ports 414.1 through 414.N.
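For illustration, each bank's selection step might look like the sketch below, where the "mask" test is assumed to be a check of which bank an address maps to, and at most two addresses are accepted per bank per cycle, matching the doubled port width described above.

```python
def select_for_bank(addresses, bank, n_banks, max_picks=2):
    # Pick the (zero or more, here up to two) thread addresses that fall
    # into this bank; the modulo test stands in for the per-bank mask.
    picked = []
    for thread_idx, addr in enumerate(addresses):
        if addr % n_banks == bank:
            picked.append((thread_idx, addr))
            if len(picked) == max_picks:  # doubled port width: two per cycle
                break
    return picked
```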
Embodiments of a processor may comprise a large number of ALUs and support a massive number of parallel threads. The memory access could be very busy. It may be extremely expensive to use multiport memory to meet the requirement. The complexity may also become very high if a large number of memory banks is used. The example private memory access may reduce the complexity of the memory structure and support many typical memory access patterns for parallel processing. Some typical private memory access patterns are listed below.
In some embodiments, the private memory access may allow random data access from all threads at the same time but to different memory area for each thread. This enables programmers to write software in conventional style, without complicated data vectorization and detailed knowledge of underlying processor hardware architecture. This may enable same-instruction-multiple-thread (SIMT) programming to be applicable to an embodiment of a PE array. That is, one instruction may be concurrently executed by multiple threads in one PE.
Due to the non-overlapping nature, the total throughput may be the sum of throughputs of all threads. Embodiments of the private memory access mode may support large throughput from simultaneous access from each thread. The first and second memory data mapping may allow minimum memory contention in typical private data access patterns. Embodiments of private memory access may also reduce the complexity of memory system. The number of memory banks may be significantly reduced. The parallel cache structure may also reduce the total cache size since each content in the cache may be unique. Moreover, embodiments of private memory access may significantly reduce access to the memory banks by allowing simultaneous cache access from multiple memory ports.
In one embodiment, for a PE array size with 32×32 ALUs, only 32 memory banks may be required using the private memory access configuration.
Different memory access patterns may use different mapping methods.
In some embodiments, register spilling may occur. Register spilling may refer to scenarios in which, when a compiler is generating machine code, there are more live variables than the number of registers the machine may have, and thus some variables may be transferred or spilled to memory. Because memory for register spilling is private to each thread, these spilled variables may need to be stored in private memory. Due to the fact that all address offsets for register spilling may be identical for each thread, they are similar to the non-unity stride mode in case-5 of Table 1 and the spilled variables may be stored using the second memory mapping.
The example shared memory access mode may also reduce the complexity of memory structure and support many typical memory access patterns for parallel processing. Some typical shared memory access patterns are listed below.
In some embodiments, the shared memory access may allow random data accesses from each parallel thread at the same time. All threads may access anywhere in a common area in the memory unit. In one embodiment, the common area may be a shared memory space that includes all memory banks. In another embodiment, the common area may be a shared memory space across a plurality of memory banks. This may enable programmers to write software in conventional style, without complicated data vectorization and detailed knowledge of underlying processor hardware architecture. This may also enable SIMT programming to be applicable to an embodiment of a PE array.
Embodiments of shared memory access may reduce the complexity of memory system. The number of memory banks may be significantly reduced. The parallel cache structure may also reduce the total cache size since each content in the cache may be unique. Moreover, embodiments of shared memory access may significantly reduce access to the memory banks by allowing simultaneous cache access from multiple memory ports.
In one embodiment, for a PE array size with 32×32 ALUs, only 32 memory banks may be needed using the shared memory access configuration.
Each data input of the SB 500 may be coupled to some data outputs. For example, the data input 502.1 may be coupled to the data outputs 506.1, 506.2, 508.2, 510.1 and 510.2; the data input 502.2 may be coupled to the data outputs 506.1, 506.2, 508.1, 510.1 and 510.2; the data input 512.1 may be coupled to the data outputs 504.1, 504.2, 506.1, 506.2, and 508.1; the data input 512.2 may be coupled to the data outputs 504.1, 504.2, 506.1, 506.2, and 508.2; the data input 514.1 may be coupled to the data outputs 504.1, 506.1, 506.2, 508.1, and 510.2; and the data input 514.2 may be coupled to the data outputs 504.2, 506.1, 506.2, 508.2, and 510.1.
Externally, depending on the location of the SB 500 in the PE array 214, the data inputs 502.1 and 502.2, and data outputs 504.1 and 504.2 may be coupled to a MP 220, or another SB 222 (e.g., in a multi-row PE array). The data inputs 514.1 and 514.2 may be coupled to a PE 218 or the gasket memory 216. The data inputs 512.1 and 512.2, and data outputs 510.1 and 510.2 may be coupled to another SB 222 (e.g., in a multi-row PE array) or an ICSB 224. The data outputs 506.1, 506.2, 508.1 and 508.2 may be coupled to a PE 218. Data signals output from the data outputs 506.1, 506.2, 508.1 and 508.2 may be denoted as A, B, C, D, and data signals input from the data inputs 514.1 and 514.2 may be denoted as X, Y. These data signals A, B, C, D, and X, Y may be the input data signals to a PE 218 and output data signals from a PE 218 as described herein.
Each of the counters 520.1-520.8 at the data outputs may be independently responsible for counting data passed. When one or more configurations may be loaded into the C-FIFO 518, each configuration may specify a number of counts. During execution of one configuration, all counters may independently count how many times data has passed through. When all the counters reach the number of counts specified in the configuration, a next configuration may be applied. A similar approach may be applied inside an ICSB 224, a PE 218, the gasket memory 216 and a memory port 220. Because these counters may facilitate configuration and reconfiguration of each component that may have such a counter, these counters may be referred to as reconfiguration counters and a component that has such a counter may be referred to as a reconfigurable unit. An embodiment of a processor 200 may provide massive parallel data processing using the various reconfigurable units and may be referred to as a reconfigurable parallel processor (RPP).
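A reconfiguration counter of this kind might be modeled minimally as below; the class and field names are illustrative assumptions.

```python
class ReconfigCounter:
    """Counts data passes at one output; signals when the configured
    number of counts is reached so the next configuration may apply."""

    def __init__(self, num_counts):
        self.num_counts = num_counts  # specified by the configuration
        self.count = 0

    def on_data_passed(self):
        self.count += 1
        return self.count >= self.num_counts  # True: ready to reconfigure
```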
Data signals received from the data inputs 610.1, 610.2, 610.3 and 610.4 may be denoted as A, B, C, D, and data signals output from the data outputs 608.1 and 608.2 may be denoted as X, Y. In an embodiment in which the ALU 602 may be one ALU, each data input 610.1, 610.2, 610.3 or 610.4 and each data output 608.1 or 608.2 may have a width of M bits that may match the width of the ALU. For example, for an 8-bit ALU, each input and output may be 8-bit; for a 16-bit ALU, each input and output may be 16-bit; for a 32-bit ALU, each input and output may be 32-bit; and so on. And each input data signal A, B, C, D and each output signal X, Y may be M bits. In an embodiment in which the ALU 602 may be a vector of ALUs, each data input 610.1, 610.2, 610.3 or 610.4 may be a vector of N M-bit inputs, and each data output 608.1 or 608.2 may be a vector of N M-bit outputs. And each input data signal A, B, C, D and each output data signal X, Y may be N×M bits.
The data buffers 604.1-604.4 may be coupled to the inputs 610.1, 610.2, 610.3 and 610.4 to temporarily store data pieces. In some embodiments, however, the data buffers may be located at the output. The D-FIFOs 604.1-604.4 may be used to decouple the timing of PEs to allow PEs to work independently. In one embodiment, the buffers may be implemented as FIFOs (e.g., a D-FIFO for a data buffer, a C-FIFO for a configuration buffer).
The configuration buffer C-FIFO 614 may receive configurations from the configuration input 612, which may be coupled externally to the sequencer 206 via the configuration bus, and store the received configurations before any execution of a data path starts. The configurations for the PE 600 may be referred to as PE configurations. The PE 600 may be statically programmed with instructions for one configuration, e.g., the instructions may be programmed to the PE 600 as a stage of pipeline. No instructions may be changed during one configuration. Once configured, the operation of the ALU 602 (e.g., one ALU or vector of ALUs depending on a particular embodiment) may be triggered if the D-FIFOs 610.1, 610.2, 610.3 and 610.4 have data and the output ports 608.1 and 608.2 are not busy. One of the configuration parameters may be a number for a specified number of executions for an instruction. The counter 606 may be programmed with the specified number and used to count the number of times data has been processed by executing an instruction. When the number of executions has reached the specified number, a new configuration may be applied. Therefore, reconfiguration capability may be provided in each PE. In one embodiment, the specified number of executions for an instruction may be referred to as NUM_EXEC and this NUM_EXEC may be used across a data path for one configuration.
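The trigger rule just stated (all input D-FIFOs hold data and the output ports are not busy, with NUM_EXEC bounding the repetitions) might be sketched as follows, reusing the ReconfigCounter sketch above; this is an assumed simplification, not the actual control logic.

```python
def pe_step(d_fifos, outputs_busy, counter):
    # Fire only when every input D-FIFO has data and no output is busy.
    if all(len(f) > 0 for f in d_fifos) and not outputs_busy:
        operands = [f.pop(0) for f in d_fifos]  # one data item per input
        # ... the ALU would execute the configured instruction here ...
        return counter.on_data_passed()         # True once NUM_EXEC reached
    return False                                # stall this cycle
```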
In one embodiment with a multi-row PE array 214, the PEs within each column may be functionally different from each other but the PEs along each row follow a repetitive pattern (e.g., functionally duplicative). For example, ALUs in a first row of PEs may implement a first set of instructions and ALUs in a second row of PEs may implement a second set of instructions that may be different from the first set. That is, ALU 602 in different embodiments of the PE 600 may comprise different structures or different functional components. In some embodiments, one or more rows of PEs of a processor may comprise ALUs that may be relatively simple and use less space and another row of PEs of the same processor may comprise ALUs that may be relatively more complex and use more space. The relatively simple ALUs may implement a set of instructions that may be different from a set of instructions implemented by the relatively more complex ALUs. For example, one embodiment of PE 600 may have an ALU 602 (e.g., one ALU or a vector of ALUs) that implements a set of instructions that require a relatively simple structure, such as, but not limited to, ADDITION (e.g., A+B), SUBTRACTION (e.g., A−B), etc.; while another embodiment of PE 600 may have an ALU 602 that implements instructions that require a relatively more complex structure, such as, but not limited to, MULTIPLICATION (e.g., A times B (A*B)), MAD (for multiply-accumulate (MAC) operation) (e.g., A*B+C).
Each data input of the ICSB 700 may be coupled to some selected data outputs. For example, the data input 704.1 may be coupled to the data outputs 708.1-708.4; the data input 704.2 may be coupled to the data outputs 708.1-708.4; the data input 710.1 may be coupled to the data outputs 706.1-706.2, and 708.1; the data input 710.2 may be coupled to the data outputs 706.1-706.2, and 708.2; the data input 710.3 may be coupled to the data outputs 706.1-706.2, and 708.3; and the data input 710.4 may be coupled to the data outputs 706.1-706.2, and 708.4.
Externally, the data inputs 704.1 and 704.2, and data outputs 706.1 and 706.2 may be coupled to a SB 222. The data inputs 710.1-710.4 may be coupled to a neighboring ICSB 224 or the gasket memory 216. The data outputs 708.1-708.4 may be coupled to another neighboring ICSB 224 or the gasket memory 216.
Each of the counters 714.1-714.6 at the data outputs may be independently responsible for counting data passed. When one or more configurations may be loaded into the C-FIFO 702, each configuration may specify a number of counts. The configurations for the ICSB 700 may be referred to as ICSB configurations. During execution of one configuration of the PE array 214, all counters may independently count how many times data has passed through. When all the counters reach the number of counts specified in the configuration, a next configuration may be applied. This implementation may be similar to what may be applied inside a SB 222, a PE 218, the gasket memory 216 and a memory port 220.
External connections from the MP 220.N, PE 218.N and ICSB 224.N may be taken as inputs at the data inputs 814.1-814.2, 816.1-816.2, and 818.1-818.4, respectively. And external connections to the MP 220.1, SB 222.1 and ICSB 224.1 may generate outputs at the data outputs 808.1-808.2, 810.1-810.2, and 812.1-812.4, respectively. The configuration input 816 may be externally coupled to the sequencer 206 via the configuration bus for the gasket memory 800 to receive configurations from the sequencer 206. The configurations for the gasket memory 800 may be referred to as gasket memory configurations. Two types of configurations may be received from the sequencer 206: input configurations and output configurations. The input C-FIFO 804 may store input configurations for input ICSB ports 818.1-818.4 to be coupled to some data FIFOs selected from L D-FIFOs 802.5-802.F as inputs to these selected D-FIFOs. The output C-FIFO 806 may store configurations for some data FIFOs selected from L D-FIFOs 802.5-802.F to be coupled to the ICSB ports 812.1-812.4.
The number of gasket D-FIFOs 802.5 through 802.F storing ICSB inputs may be greater than or equal to the number of input or output ICSB ports. In some embodiments, as described herein, there may be a data connection that may bypass at least a portion of a physical data path. For example, an execution result generated by one PE 218 may not be needed by another PE 218 in the same physical data path configuration but may be used in a future configuration. These data signals for the execution result may be routed via a SB 222 and an ICSB 224 to the gasket memory 216 and stored in the D-FIFOs of the gasket memory 216 for the future configuration. Therefore, in some embodiments, the gasket memory 800 may have more D-FIFOs than the number of input or output ports.
Each of the input counters 820.1-820.L at the data inputs and each of the output counters 822.1-822.4 at the data outputs may be independently responsible for counting data passed. When one or more input configurations and output configurations may be loaded into the input C-FIFO 804 and output C-FIFO 806, each configuration may specify a number of counts. During execution of one configuration, all counters may independently count how many times data has passed through. When all the counters reach the number of counts specified in the configuration, a next configuration may be applied.
During operation, all concurrent threads in one PE may execute the same instruction and each instruction may be executed multiple times in one PE as a pipeline stage. That is, each PE may be configured to execute an instruction NUM_EXEC times as a pipeline stage. For example, in an embodiment in which each PE may comprise an ALU vector with a vector size of one, each instruction may be configured to execute 4 times by the ALU vector at each PE. The 4 times of execution may be represented by four threads processed with each thread in a different shade. For example, in PDP1, PE0 may be configured to execute instruction A four times, PE1 may be configured to execute instruction B four times, PE2 may be configured to execute instruction C four times and PE3 may be configured to execute instruction D four times. In PDP2, PE0 may be configured to execute instruction E four times, PE1 may be configured to execute instruction F four times, PE2 may be configured to execute instruction G four times and PE3 may be configured to execute instruction H four times. In PDP3, PE0 may be configured to execute instruction I four times, PE1 may be configured to execute instruction J four times, PE2 may be configured to execute instruction K four times and PE3 may have no instruction configured. In this embodiment, because there may be data dependency between different instructions, a thread executing an instruction that depends on another instruction may be executed later in time. For example, instruction B may depend on data from instruction A's execution result and therefore, the first thread executing instruction B may follow the first thread executing instruction A in a later cycle, the second thread executing instruction B may follow the second thread executing instruction A in a later cycle, the third thread executing instruction B may follow the third thread executing instruction A in a later cycle, and the fourth thread executing instruction B may follow the fourth thread executing instruction A in a later cycle. Due to the static reconfiguration scheme and dependency of the instructions, there could be some time lost during PDP reconfiguration, e.g., PE2 may have one idle cycle during the PDP1 to PDP2 transition. In an embodiment in which each PE has a vector ALU with the vector size N larger than 1, each PE may execute N concurrent threads at a time.
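The staggered schedule described above (four dependent instructions, each executed four times, with each thread of a dependent instruction trailing the corresponding thread of its producer by one cycle) can be reproduced by a toy loop such as the following; the output format is purely illustrative.

```python
NUM_EXEC = 4
instructions = ["A", "B", "C", "D"]  # one instruction per PE in PDP1

# Thread t of pipeline stage s runs at cycle t + s, so instruction B's
# first thread follows instruction A's first thread by one cycle, etc.
for cycle in range(NUM_EXEC + len(instructions) - 1):
    active = [f"PE{s}:{ins}(thread {cycle - s})"
              for s, ins in enumerate(instructions)
              if 0 <= cycle - s < NUM_EXEC]
    print(f"cycle {cycle}: " + ", ".join(active))
```

Running the loop prints a diagonal wavefront in which all four PEs are busy in the middle cycles, matching the pipelined picture given above.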
In various embodiments, the gasket memory may provide a way to reduce the efficiency loss during reconfiguration. For example, even if there may be some idle slots during reconfiguration (e.g., reconfiguration of PE2 between instruction C of PDP1 and instruction G in PDP2), if a larger number of threads is used, the idle slots may be insignificant compared to the total busy cycles.
Because the output from PE_0 1104 may only be needed by PE_1 1106 in the first PDP, at this moment, no data may need to pass through ICSB_1 1114. Thus, although ICSB_1 1114's configuration may be programmed already (e.g., its internal connection shown in a dash-dotted line), there is no data coming to ICSB_1 1114 (e.g., its connection to SB_1 1110 shown in a dotted line) and ICSB_1 1114 may stay still.
Because in the first PDP, the input to PE_2 1108 may only come from PE_1 1106, at this moment, no data may need to pass through ICSB_2 1116. Thus, although ICSB_2 1116's configuration may be programmed already (e.g., its internal connection shown in a dash-dotted line), there is no data passing through ICSB_2 1116 (e.g., its connection to SB_2 1112 shown in a dotted line) and ICSB_2 1116 may stay still.
In some embodiments, configurations for PDPs of a VDP (e.g., for a dependency graph of an execution kernel) may be sent to the components independently while each component may be operating according to a current configuration. For example, while the PEs (e.g., PE_0 1104, PE_1 1106 and PE_2 1108), SBs (e.g., SB_1 1110 and SB_2 1112) and ICSBs (e.g., ICSB_1 1114 and ICSB_2 1116) may be operating under their respective first configuration for PDP_1, subsequent configurations of other PDPs of the same VDP for each of these components may be received from the SEQ 1102. In one embodiment, a plurality of configurations for one component may be sent via the configuration bus from a sequencer 206 in a batch as long as sending multiple configurations for one component will not slow down or block the operation of any other components.
Therefore, while PDP_1 may be carried out, all the configurations for PDP_2 may have been received by the components.
Instruction Ins_1 may be a data loading instruction “Load a[k][j]” and a memory port may be configured to execute Ins_1 three times as a pipeline stage 1204. The data piece to be loaded by Ins_1 may be different for different threads and may be loaded from different addresses for different threads. For example, a[k][j] may be a j-th data piece for a k-th thread, where k may be an integer between 0 and N−1 (inclusive) for each thread in the first block of threads, between N and 2N−1 (inclusive) for each thread in the second block of threads, and between 2N and 3N−1 (inclusive) for each thread in the third block of threads.
In one embodiment, the pipeline stages 1202 and 1204 may be performed at a same memory port if the memory port is configured to carry out two data loading instructions in parallel. For example, there may be two parallel read data lines and two parallel write data lines between each of the MPs 220 and the memory unit 212.
Instruction Ins_2 may be a multiplication instruction “y=a[k][j]*x[j]” with the data piece x[j] being loaded by Ins_0 and a[k][j] being loaded by Ins_1, and a PE may be configured to execute Ins_2 three times (e.g., NUM_EXEC being 3, with a total of 3×N executions for all threads) as a pipeline stage 1206. Therefore, each PE or MP may be configured to execute its instruction NUM_EXEC times as a pipeline stage.
Instruction Ins_4 may be a data loading instruction “Load x[j+1]” and a memory port may be configured to execute Ins_4 three times as a pipeline stage 1208. The data piece x[j+1] may be common to all threads and loaded from the same address. For example, the data piece x[j+1] may be the (j+1)-th data piece in the vector x, and this (j+1)-th data piece may be used by all threads. Instruction Ins_5 may be a data loading instruction “Load a[k][j+1]” and a memory port may be configured to execute Ins_5 three times as a pipeline stage 1210. The data piece to be loaded by Ins_5 may be different for different threads and may be loaded from different addresses for different threads. For example, a[k][j+1] may be the (j+1)-th data piece for a k-th thread, where k may be an integer between 0 and N−1 (inclusive) for each thread in the first block of threads, between N and 2N−1 (inclusive) for each thread in the second block of threads, and between 2N and 3N−1 (inclusive) for each thread in the third block of threads. In one embodiment, the pipeline stages 1208 and 1210 may be performed at a same memory port if the memory port is configured to carry out two data loading instructions in parallel. In another embodiment, the pipeline stages 1208 and 1210 may be performed at two different memory ports.
Instruction Ins_6 may be a multiplication instruction “y=a[k][j+1]*x[j+1]” with the data piece x[j+1] being loaded by Ins_4 and a[k][j+1] being loaded by Ins_5 and a PE may be configured to execute Ins_6 three times as a pipeline stage 1212.
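Taken together, the loads of x[j] and a[k][j] (Ins_0 and Ins_1), the multiply (Ins_2), and the j+1 counterparts (Ins_4 through Ins_6) suggest that the kernel being mapped is an inner product of roughly the following shape, where k is the thread index; the accumulation step is an assumption, as the corresponding instruction is not shown in this excerpt.

```python
def kernel(a, x, k):
    # Per-thread inner product: Ins_0/Ins_4 load x[j]; Ins_1/Ins_5 load
    # a[k][j]; Ins_2/Ins_6 multiply. The += accumulation is assumed.
    y = 0
    for j in range(len(x)):
        y += a[k][j] * x[j]
    return y
```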
In some embodiments, this operation mode with a reduced pipeline stage may be generalized to other instructions. In one embodiment, for an instruction that may generate the same result for different threads, the same approach can be used to reduce power consumption. For example, where a result from one PE may be used as an input for different threads in another PE in the same physical data path, or a result from a PE of one physical data path may be used as an input for different threads in a PE in another physical data path, the result may be loaded only once with the indication S set for a corresponding D-FIFO and reused.
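One way to picture the indicator S is a FIFO whose head entry, when flagged for reuse, is read without being consumed, so a single load serves every thread of the configuration; the class below is an illustrative assumption, not the disclosed hardware.

```python
class ReusableFifo:
    """D-FIFO sketch with a reuse indicator per entry."""

    def __init__(self):
        self.items = []  # list of [data, reuse_flag] entries

    def push(self, data, reuse=False):
        self.items.append([data, reuse])

    def read(self):
        data, reuse = self.items[0]
        if not reuse:            # normal entry: consume it
            self.items.pop(0)
        return data              # flagged entry stays for the next thread
```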
The present disclosure provides apparatus, systems and methods for reconfigurable parallel processing. For example, an embodiment of a RPP may utilize a 1-D or 2-D data path that consists of a processing element (PE) array and interconnections to process massively parallel data. The data path may be made identical in each section (e.g., one column of PE(s), MP and data routing units), which may allow the dependency graph of a kernel to be mapped to a virtual data path that may be an infinite repetition of the data path sections in one dimension.
An embodiment of a RPP may also utilize a gasket memory to temporarily store data output of data paths where the virtual data path is segmented into physical data paths. The gasket memory may function like a data buffer (e.g., FIFO) to feed data back into the physical data path of the next configuration.
An embodiment of a RPP may also have a one-dimensional memory unit with memory ports (MPs) connected to each column of the data path. All data accessed throughout the virtual data path may be stored in the memory unit. Each time for a new configuration, a MP may be reconfigured to access the memory unit differently while the data could stay the same. An embodiment of a RPP may separate types of memory access into private memory access and shared memory access. Private memory access may be dedicated to a particular thread with no overlapping access allowed between different threads. Shared memory access may allow all threads to access a common area. Instead of defining different memories for shared and private types, an embodiment of a RPP may store data into the same memory space but provide different access methods. This eliminates unnecessary data movement from private memory to shared memory and vice versa.
Embodiments of a RPP may be optimized to allow massive parallelism for multithread processing. In one example, with one row of 32 PEs and each PE having 32 arithmetic and logic units (ALUs), 1024 ALUs may be included in one RPP core. In some embodiments, a multi-core processor may comprise multiple RPPs.
Embodiments of a RPP may be reconfigured according to a reconfiguration mechanism. The various components of a RPP that include one or more reconfiguration counters may be referred to as reconfigurable units. For example, each of the PEs (e.g., PE 218), the switching units (e.g., SB 222 and ICSB 224) and memory units (e.g., MP 220, gasket memory 216) may comprise one or more reconfiguration counters, such as the counter 606 in a PE, the counters 520 in a SB, the counters 714 in an ICSB, the counters 820 and 822 in a gasket memory, and similar counters in a MP (not shown).
The exemplary reconfiguration mechanism may reduce the power spent on configuration because the configuration is only switched once after all threads have been processed. This may also reduce idle time between configurations by switching each PE independently at its earliest time. By doing so, the memory required to store intermediate data may also be reduced.
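The counter-based switching may be pictured as follows; this is a simplified model with illustrative names, assuming each unit holds its configurations in FIFO order and counts threads as they pass.

```python
class ReconfigurableUnit:
    """Toy reconfigurable unit: applies its current configuration to each
    thread and advances to the next configuration on its own, as soon as
    its counter shows all threads have passed."""
    def __init__(self, configs, num_threads):
        self.configs = configs            # configuration buffer (FIFO order)
        self.num_threads = num_threads
        self.counter = 0                  # reconfiguration counter
        self.current = 0                  # index of the active configuration

    def execute(self, value):
        result = self.configs[self.current](value)
        self.counter += 1
        if self.counter == self.num_threads:   # all threads processed:
            self.counter = 0                   # switch exactly once, at the
            self.current += 1                  # unit's earliest possible time
        return result

unit = ReconfigurableUnit([lambda v: v + 1, lambda v: v * 2], num_threads=3)
print([unit.execute(v) for v in (0, 1, 2)])   # configuration 0: [1, 2, 3]
print([unit.execute(v) for v in (0, 1, 2)])   # configuration 1: [0, 2, 4]
```

Because each unit switches on its own counter, a downstream unit may still be finishing the last threads of one configuration while an upstream unit has already begun the next, which is what shrinks the idle time between configurations.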
In some embodiments, all threads may load data using the same address in a shared memory access mode. Due to the pipelined nature of operation, only the first data load instruction among all threads may need to be performed. The data loaded may then be shared with all threads to reduce memory access traffic and power consumption.
The techniques described herein may be implemented in one or more application specific integrated circuits (ASICs) in digital logic gates, or by a processor that executes instructions stored in a tangible processor-readable storage medium.
In one embodiment, any of the disclosed methods and operations may be implemented in software comprising computer-executable instructions stored on one or more computer-readable storage media. The one or more computer-readable storage media may include non-transitory computer-readable media (such as removable or non-removable magnetic disks, magnetic tapes or cassettes, solid state drives (SSDs), hybrid hard drives, CD-ROMs, CD-RWs, DVDs, or any other tangible storage medium), volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives). The computer-executable instructions may be executed on a processor (e.g., a microcontroller, a microprocessor, a digital signal processor, etc.). Moreover, an embodiment of the present disclosure may be used as a general-purpose processor, a graphics processor, a microcontroller, a microprocessor, or a digital signal processor.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Claims
1-38. (canceled)
39. A processor, comprising:
- a plurality of processing elements (PEs);
- a plurality of switch boxes arranged in a plurality of columns, each of the plurality of switch boxes being associated with a respective PE and configured to provide input data switching for the respective PE;
- a plurality of memory ports arranged in the plurality of columns and being coupled to a memory unit and a top switch box in each column of the plurality of columns, each of the plurality of memory ports being configured to provide data access to the memory unit for one or more switch boxes in a respective column;
- a plurality of inter-column switch boxes (ICSBs) each coupled to a bottom switch box in each column of the plurality of columns; and
- a gasket memory with its input coupled to a memory port, a PE, one or more switch boxes and an ICSB in a last column of the plurality of columns, and its output coupled to a memory port, one or more switch boxes and an ICSB in a first column of the plurality of columns.
40. The processor of claim 39, further comprising a sequencer coupled to the plurality of PEs, the plurality of switch boxes, the plurality of ICSBs, the plurality of memory ports and the gasket memory to deliver configurations to these components.
41. The processor of claim 40, further comprising a configuration memory coupled to the sequencer to store compiled configurations for the sequencer to decode and deliver.
42. The processor of claim 39, further comprising a memory unit for providing data storage for the processor.
43. A processor, comprising:
- a plurality of processing elements (PEs) each comprising a configuration buffer and a plurality of arithmetic logic units (ALUs), and each configured to operate independently according to respective PE configurations stored in the configuration buffer; and
- a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
44. The processor of claim 43, further comprising a plurality of switch boxes each comprising a configuration buffer configured to store switch box configurations, each of the plurality of switch boxes being associated with a respective PE of the plurality of PEs and configured to provide input data switching for the respective PE according to the switch box configurations.
45. The processor of claim 44, wherein the plurality of switch boxes and their associated PEs are arranged in a plurality of columns, a first switch box in a first column of the plurality of columns is coupled between the gasket memory and a first PE in the first column of the plurality of columns, and a second PE in a last column of the plurality of columns is coupled to the gasket memory.
46. The processor of claim 45, further comprising:
- a memory unit for providing data storage for the plurality of PEs; and
- a plurality of memory ports each arranged in a separate column of the plurality of columns for the plurality of PEs to access the memory unit.
47. The processor of claim 46, further comprising a plurality of inter-column switch boxes (ICSBs) each comprising a configuration buffer configured to store ICSB configurations, the plurality of ICSBs being configured to provide data switching between neighboring columns of the plurality of columns according to the ICSB configurations.
48. The processor of claim 46, wherein each of the plurality of memory ports (MPs) comprises a configuration buffer to store MP configurations and is configured to operate in a private access mode or a shared access mode during one MP configuration.
49. The processor of claim 48, wherein a piece of data stored in the memory unit is accessed through the private access mode and the shared access mode in different parts of a program without the piece of data being moved in the memory unit.
50. The processor of claim 46, wherein each of the plurality of columns comprises one PE, and the plurality of PEs are identical and form one row of repetitive identical PEs.
51. The processor of claim 46, wherein each of the plurality of columns comprises two or more PEs and the plurality of PEs form two or more rows.
52. The processor of claim 51, wherein a first row of PEs are configured to implement a first set of instructions and a second row of PEs are configured to implement a second set of instructions, at least one instruction of the second set of instructions is not in the first set of instructions, wherein the plurality of columns are identical and form repetitive columns.
53. The processor of claim 48, wherein each of the plurality of memory ports is configured to access the memory unit using a vector address, wherein in the private access mode, one address in the vector address is routed to one memory bank of the memory unit according to a thread index and all private data for one thread are located in a same memory bank.
54. The processor of claim 48, wherein each of the plurality of memory ports is configured to access the memory unit using a vector address, wherein in the shared access mode, one address in the vector address is routed in a defined region across memory banks regardless of the thread index and data shared to all threads are spread in all memory banks.
55. The processor of claim 43, wherein each of the plurality of PEs comprises a plurality of data buffers for the plurality of ALUs and is configured to operate independently.
56-210. (canceled)
211. A method, comprising:
- executing a first instruction at a reconfigurable processing element a number of times according to a first configuration, the reconfigurable processing element being configured to be part of a first physical data path during the first configuration;
- delivering an execution result from the reconfigurable processing element to a gasket memory to temporarily store the execution result after each execution of the first instruction; and
- feeding the execution result stored in the gasket memory to a second physical data path.
212. The method of claim 211, further comprising:
- mapping an execution kernel into a virtual data path at a processor, wherein the execution kernel includes a sequence of instructions to be executed by the processor, and the processor comprises the reconfigurable processing element and the gasket memory;
- chopping the virtual data path into a plurality of physical data paths that includes the first physical data path and the second physical data path; and
- delivering configurations including the first configuration to the reconfigurable processing element and the gasket memory.
213. The method of claim 211, further comprising accessing a piece of data stored in a memory unit through a private memory access mode and a shared memory access mode in the first and second physical data paths without the piece of data being moved in the memory unit.
Type: Application
Filed: Sep 13, 2019
Publication Date: Dec 9, 2021
Patent Grant number: 11226927
Inventors: Yuan Li (San Diego, CA), Jianbin Zhu (San Diego, CA)
Application Number: 16/569,749