Buffer Fusion and Layout Optimization

- SambaNova Systems, Inc.

Buffer assignment in a contiguous area in a coarse-grained reconfigurable (CGR) array is optimized by temporarily assigning a first buffer portion and a second buffer portion to first and second physical memory units, routing connections in the contiguous area, and calculating a first cost. A list of candidates for a third physical memory unit is created, and a best cost and a best candidate are initialized. For each candidate, the first and second buffer portions are reassigned to the candidate, connections for data and dataflow control information in the contiguous area are routed, and a second cost is calculated. If the second cost is better than the best cost, the best cost and the best candidate are updated.

Description
CROSS-REFERENCES

This application claims the benefit of U.S. provisional patent application No. 63/345,751, entitled, “Buffer and Layout Optimizations,” filed on 25 May 2022. The priority application is hereby incorporated by reference herein for all purposes.

The following are incorporated by reference for all purposes:

  • Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA '17, Jun. 24-28, 2017, Toronto, ON, Canada; and
  • Koeplinger et al., “Spatial: A Language and Compiler for Application Accelerators,” Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Proceedings of the 43rd International Symposium on Computer Architecture, 2018.

BACKGROUND

Technical Field

The technology disclosed relates to placement and routing of coarse-grained reconfigurable processor integrated circuits. In particular, it relates to optimization of buffer implementations in the placement and routing of coarse-grained reconfigurable architecture (CGRA) chips.

Context

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

A CGRA may include one or more arrays of coarse-grained reconfigurable (CGR) units, including, for example, fused compute and memory units (FCMUs), or pattern memory units (PMUs) and pattern compute units (PCUs). The CGR units provide programmable functionality rather than programmable circuitry, such as in a field-programmable gate array (FPGA), and their efficiency of die area usage is much higher than that of an FPGA. As a result, CGRAs provide much more processing power for highly parallel and interdependent compute operations, such as needed for machine learning and artificial intelligence applications.

To use a CGR processor for a highly parallelized application program, the application program is converted to an executable configuration file that contains the configuration and initialization data for the CGR processor, including a mapping of logical functions to physical units (PMUs and PCUs) and including routing information for the data and control information that must flow between the various physical units. The data may flow, for example, via the routing channels offered by an array-level network (ALN) and/or top-level network (TLN). The translation to the configuration file is compute intensive and may be performed prior to runtime by a compiler.

SUMMARY

Application programs such as for artificial intelligence and machine learning may be translated to executable configuration files for coarse-grained reconfigurable architecture (CGRA) processors. The translation may be performed by a compiler. The compiler may perform optimizations, including buffer fusion of multiple pattern memory units (PMUs) into a single PMU, and mapping large-depth buffers in meta-pipelined CGRAs.

In a first aspect, an implementation provides a method to optimize buffer allocation within a contiguous area in an array of reconfigurable units or coarse-grained reconfigurable (CGR) units. The array includes at least a first physical memory unit and a second physical memory unit. The method includes assigning a first buffer portion to a first physical memory unit and a second buffer portion to a second physical memory unit. It temporarily routes connections between the first physical memory unit, the second physical memory unit, and previously mapped reconfigurable units. It uses a cost function to calculate a first cost of the assignment. The method determines one or more candidate third physical memory units within the contiguous area, and initializes a best cost and a best candidate physical memory unit.

For each candidate third physical memory unit, the method temporarily re-assigns the first buffer portion and the second buffer portion to the candidate third physical memory unit, and temporarily routes connections between the candidate third physical memory unit and previously mapped CGR units within the contiguous area. It uses the cost function to calculate a second cost related to assigning the first buffer portion and the second buffer portion to the candidate third physical memory unit. It determines if the second cost is better (e.g., lower) than the best cost. If so, it updates the best cost to equal the second cost, and the best candidate physical memory unit to equal the candidate third physical memory unit.

The method may conclude by creating a configuration file that assigns both the first buffer portion and the second buffer portion to the best candidate physical memory unit and storing the configuration file in a non-transitory computer readable storage medium.
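The candidate loop of the first aspect can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the description: units are identified by (row, column) coordinates, total Manhattan distance stands in for temporary routing plus the cost function, the best cost is initialized to the first (unfused) cost, and the names `route_cost`, `fuse_buffers`, `buf_a`, and `buf_b` are hypothetical.

```python
def manhattan(a, b):
    """Grid distance between two (row, column) coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_cost(placement, nets):
    """Hypothetical cost function: total Manhattan length of all connections.

    placement maps a buffer portion name to the coordinate of the memory unit
    that holds it; nets lists (buffer portion, previously mapped unit) pairs.
    A real flow would route the connections and evaluate the routed result.
    """
    return sum(manhattan(placement[portion], peer) for portion, peer in nets)


def fuse_buffers(first_pmu, second_pmu, candidates, nets):
    """Return (first_cost, best_cost, best_candidate).

    best_candidate is None when no candidate improves on the unfused cost
    (an assumption about how the best cost is initialized).
    """
    # Cost with the two portions on separate memory units.
    first_cost = route_cost({"buf_a": first_pmu, "buf_b": second_pmu}, nets)
    best_cost, best_candidate = first_cost, None

    for candidate in candidates:
        # Temporarily place both portions on the candidate third memory unit.
        second_cost = route_cost({"buf_a": candidate, "buf_b": candidate}, nets)
        if second_cost < best_cost:  # "better" is taken to mean "lower"
            best_cost, best_candidate = second_cost, candidate

    return first_cost, best_cost, best_candidate


# Both buffer portions feed the same previously mapped unit at (2, 2).
nets = [("buf_a", (2, 2)), ("buf_b", (2, 2))]
print(fuse_buffers((0, 0), (0, 4), [(0, 2), (2, 1), (1, 2)], nets))
# prints (8, 2, (2, 1))
```

In this toy case the candidate at (2, 1) wins because it is adjacent to the shared consumer; the later candidate at (1, 2) ties but does not replace it, since only a strictly lower cost updates the best candidate.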

In a second aspect, an implementation provides a non-transitory computer readable storage medium storing computer program instructions to optimize buffer allocation in a contiguous area of an array of reconfigurable units or CGR units including at least a first physical memory unit and a second physical memory unit, wherein the computer program instructions, when executed on a processor, implement the method described in the first aspect.

In a third aspect, an implementation provides a system, including one or more processors coupled to a memory. The memory is loaded with a computer program whose instructions optimize buffer allocation in a contiguous area of an array of reconfigurable units or CGR units. The computer program instructions implement the method described in the first aspect.

In a fourth aspect, an implementation provides a computer-implemented method for buffer assignment within a contiguous area in an array of reconfigurable units that includes multiple physical memory units and other reconfigurable units. The method comprises assigning logical reconfigurable units other than buffers to physical reconfigurable units in the contiguous area; determining a first list of buffer portions to assign to physical memory units, wherein a buffer portion includes at least a part of a buffer; determining a second list of candidate physical memory units within the contiguous area; and iterating buffer portions from the first list. For each iterated buffer portion, the method determines a third list with candidate physical memory units from the second list and orders the third list based on an associated cost calculated for each candidate physical memory unit. The method proceeds with ordering the first list based on a best cost of buffer portions, and iterates buffer portions from the first list. For each iterated buffer portion, the method iterates candidate physical memory units from the third list. For each candidate physical memory unit, the method determines if the iterated candidate physical memory unit is available. If it is available, the method assigns the iterated buffer portion to the iterated candidate physical memory unit.
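The two passes of the fourth aspect can be sketched as follows. This is an illustration under stated assumptions rather than the described implementation: the cost of placing a portion on a memory unit is supplied by a caller-provided function, the first list is ordered with the lowest best cost first (the description does not fix the direction), a memory unit counts as available when its remaining capacity covers the portion size, and the name `assign_buffers` is hypothetical.

```python
def assign_buffers(portions, pmus, cost):
    """Greedy buffer assignment.

    portions: dict of buffer portion name -> size (the first list)
    pmus:     dict of memory unit id -> free capacity (the second list)
    cost:     function (portion, pmu) -> number; lower is better
    Returns a dict of portion -> memory unit for every portion that fits.
    """
    if not pmus:
        return {}

    # First pass: per portion, build the third list (candidates ordered by cost).
    ranked = {p: sorted(pmus, key=lambda m: cost(p, m)) for p in portions}

    # Order the first list by each portion's best cost (assumed: ascending).
    order = sorted(portions, key=lambda p: cost(p, ranked[p][0]))

    # Second pass: give each portion its cheapest memory unit that is available.
    free = dict(pmus)
    assignment = {}
    for p in order:
        for m in ranked[p]:
            if free[m] >= portions[p]:  # assumed availability test
                assignment[p] = m
                free[m] -= portions[p]
                break
    return assignment


# Hypothetical cost table for two portions and two single-slot memory units.
table = {("a", "pmu0"): 1, ("a", "pmu1"): 5, ("b", "pmu0"): 2, ("b", "pmu1"): 3}
print(assign_buffers({"a": 1, "b": 1}, {"pmu0": 1, "pmu1": 1},
                     lambda p, m: table[(p, m)]))
# prints {'a': 'pmu0', 'b': 'pmu1'}
```

Portion `b` prefers `pmu0` as well, but `a` is placed first and fills it, so `b` falls back to the next entry in its ordered candidate list.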

In a fifth aspect, an implementation provides a non-transitory computer readable storage medium storing computer program instructions for buffer assignment within a contiguous area in an array of reconfigurable units that includes multiple physical memory units and other reconfigurable units, wherein the computer program instructions, when executed on a processor, implement the method described in the fourth aspect.

In a sixth aspect, an implementation provides a system, including one or more processors coupled to a memory. The memory is loaded with a computer program whose instructions provide buffer assignment within a contiguous area in an array of reconfigurable units that includes multiple physical memory units and other reconfigurable units. The computer program instructions implement the method described in the fourth aspect.

In a seventh aspect, an implementation provides a computer-implemented method for buffer fusion. The method assigns buffer portions to memory units in a contiguous area of an array of reconfigurable units. The method has the following steps. It determines partially filled memory units in the contiguous area. For at least one set of two or more partially filled memory units, it determines if fusion of the set of two or more partially filled memory units is possible and desirable. If so, the method changes an assignment of a buffer portion assigned to a first memory unit in the set of two or more partially filled memory units from the first memory unit to a second memory unit in the set of two or more partially filled memory units; creates a configuration file that includes the changed assignment of the buffer portion and stores the configuration file in a computer-readable memory.
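A minimal sketch of the fusion step of the seventh aspect follows. The assumptions are illustrative only: every memory unit has the same capacity, fusion is "possible" when the combined contents of two units fit in one, "desirable" is a caller-supplied predicate that defaults to always true, only pairs of units are considered, and the name `fuse_partially_filled` is hypothetical. Writing the result to a configuration file is omitted.

```python
from itertools import combinations


def fuse_partially_filled(assignment, sizes, capacity, desirable=lambda a, b: True):
    """Move buffer portions between partially filled memory units.

    assignment: dict of buffer portion -> memory unit
    sizes:      dict of buffer portion -> size
    capacity:   capacity of each memory unit (assumed uniform)
    Returns a new assignment dict; the input is left unchanged.
    """
    assignment = dict(assignment)

    def used(unit):
        return sum(sizes[p] for p, u in assignment.items() if u == unit)

    # Partially filled: holds something, but is not full.
    units = sorted(set(assignment.values()))
    partial = [u for u in units if 0 < used(u) < capacity]

    for first, second in combinations(partial, 2):
        if used(first) == 0 or used(second) == 0:
            continue  # one unit was already emptied by an earlier fusion
        possible = used(first) + used(second) <= capacity
        if possible and desirable(first, second):
            # Change the assignment from the first unit to the second unit.
            for p, u in list(assignment.items()):
                if u == first:
                    assignment[p] = second
    return assignment


before = {"a": "pmu0", "b": "pmu1", "c": "pmu2"}
print(fuse_partially_filled(before, {"a": 3, "b": 4, "c": 6}, capacity=8))
# prints {'a': 'pmu1', 'b': 'pmu1', 'c': 'pmu2'}
```

Here `pmu0` and `pmu1` together need 7 of 8 entries, so portion `a` moves to `pmu1` and `pmu0` becomes free; `pmu2` cannot absorb anything further.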

In an eighth aspect, an implementation provides a non-transitory computer readable storage medium storing computer program instructions for buffer fusion, wherein the computer program instructions, when executed on a processor, implement the method described in the seventh aspect.

In a ninth aspect, an implementation provides a system, including one or more processors coupled to a memory. The memory is loaded with a computer program whose instructions provide a method for buffer fusion. The computer program instructions implement the method described in the seventh aspect.

Particular aspects of the technology disclosed are described in the claims, specification and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The technology will be described with reference to the drawings, in which:

FIG. 1 illustrates an example system including a coarse-grained reconfigurable (CGR) processor, a host, and a memory.

FIG. 2 illustrates an example of a computer, including an input device, a processor, a storage device, and an output device.

FIG. 3 illustrates example details of a CGR architecture including a top-level network (TLN) and two CGR arrays.

FIG. 4 illustrates an example CGR array, including an array of configurable nodes in an array-level network (ALN).

FIG. 5 illustrates an example of a pattern memory unit (PMU) and a pattern compute unit (PCU), which may be combined in a fused-control memory unit (FCMU).

FIG. 6 shows an example of a computation graph.

FIG. 7 shows an example of a dataflow graph.

FIG. 8 shows the dataflow graph of FIG. 7 with buffers and dataflow control information added.

FIG. 9 is a block diagram of a compiler stack implementation suitable for generating a configuration file for a CGR processor.

FIG. 10 shows details of an example array of CGR units.

FIG. 11 clarifies the assignment of multiple buffers or buffer portions to a single physical memory unit.

FIG. 12 illustrates an example of buffer fusion in case of large buffers.

FIG. 13 illustrates an example method of buffer optimization.

FIG. 14 illustrates an example method for buffer assignment within a contiguous area in an array of CGR units (a CGR array).

FIG. 15 illustrates another example method of buffer optimization.

In the figures, like reference numbers may indicate functionally similar elements. The systems and methods illustrated in the figures, and described in the Detailed Description below, may be arranged and designed in a wide variety of different implementations. Neither the figures nor the Detailed Description are intended to limit the scope of the claims. Instead, they merely represent examples of different implementations of the disclosed technology.

DETAILED DESCRIPTION

High-level programs incorporating artificial intelligence, machine learning, and other algorithms that require highly parallel pipelined processing are not well served by traditional complex instruction set computer (CISC) and reduced instruction set computer (RISC) machines, whose central processing unit (CPU) chips are slow and large and consume too much energy for these types of algorithms. Field-programmable gate arrays (FPGAs) could be configured with logic circuits that would optimally process such algorithms. However, the fine-grained physical implementation of such circuits is highly inefficient, wasting valuable semiconductor die area. Additionally, the fine-grained physical implementation results in excessive power consumption and suboptimal speed. If the circuits were implemented on an application-specific integrated circuit (ASIC), the semiconductor die area would be much smaller and the speed much higher, but the circuits would no longer be configurable.

A coarse-grained reconfigurable (CGR) processor is configured for its functionality (optimized for dataflow graphs) rather than for individual logic circuits. It maintains reconfigurability without sacrificing much die area, power, or speed. A CGRA configuration physically implements a dataflow graph derived from or included in the high-level programs, rather than individual logic gates. A dataflow graph may include multiple pipelines of interdependent processes that run asynchronously. A CGRA synchronizes the asynchronous processes using a dataflow control messaging system.

A compiler translates the high-level program to a configuration file that defines the physical implementation of the dataflow graph, as well as the dataflow control message routing, and the initialization of parameters. The translation includes steps with intermediate results, and optimizations for physical implementation in the coarse reconfigurable units and data channels available in the CGRA.

This document introduces methods for optimizing the physical implementation of buffers that a dataflow graph uses, for example, to synchronize the various processes.

Terminology

As used herein, the phrase “one of” should be interpreted to mean exactly one of the listed items. For example, the phrase “one of A, B, and C” should be interpreted to mean any of: only A, only B, or only C.

As used herein, the phrases “at least one of” and “one or more of” should be interpreted to mean one or more items. For example, the phrase “at least one of A, B, and C” or the phrase “at least one of A, B, or C” should be interpreted to mean any combination of A, B, and/or C. The phrase “at least one of each of A, B, and C” means at least one of A and at least one of B and at least one of C.

Unless otherwise specified, the use of ordinal adjectives first, second, third, etc., to describe an object, merely refers to different instances or classes of the object and does not imply any ranking or sequence.

The following terms or acronyms used herein are defined at least in part as follows:

AGCU—address generator (AG) and coalescing unit (CU).

AI—artificial intelligence.

AIR—arithmetic or algebraic intermediate representation.

ALN—array-level network.

Assigning—a (part of a) logical CGR unit is assigned to a (part of a) physical CGR unit. Various parts of a logical CGR unit may be assigned to various parts of multiple physical CGR units. Likewise, parts or all of various logical CGR units may be assigned to various parts of one physical CGR unit. Once assigned, a part of a logical CGR unit and a part of a physical CGR unit may be said to be mapped to each other.

Buffer—an intermediate storage of data.

CGR—coarse-grained reconfigurable. A property of, for example, a system, a processor, an architecture (see CGRA), an array, or a unit in an array. This property distinguishes the system, etc., from field-programmable gate arrays (FPGAs), which can implement digital circuits at the gate level and are therefore fine-grained configurable.

CGRA—coarse-grained reconfigurable architecture. A data processor architecture that includes one or more arrays (CGR arrays) of CGR units.

Compiler—a translator that processes statements written in a programming language to machine language instructions for a computer processor. A compiler may include multiple stages to operate in multiple steps. Each stage may create or update an intermediate representation (IR) of the translated statements. Compiler stages are illustrated with reference to FIG. 9.

Computation graph—some algorithms can be represented as computation graphs. As used herein, computation graphs are a type of directed graphs comprising nodes that represent mathematical operations/expressions and edges that indicate dependencies between the operations/expressions. For example, with machine learning (ML) algorithms, input layer nodes assign variables, output layer nodes represent algorithm outcomes, and hidden layer nodes perform operations on the variables. Edges represent data (e.g., scalars, vectors, tensors) flowing between operations. In addition to dependencies, the computation graph reveals which operations and/or expressions can be executed concurrently.

CGR unit—a circuit that can be configured and reconfigured to locally store data (e.g., a memory unit or a PMU), or to execute a programmable function (e.g., a compute unit or a PCU). A CGR unit includes hardwired functionality that performs a limited number of functions used in computation graphs and dataflow graphs. Further examples of CGR units include a CU and an AG, which may be combined in an AGCU. Some implementations include CGR switches, whereas other implementations may include regular switches.

CU—coalescing unit.

Data Flow Graph—a computation graph that includes one or more loops that may be nested, and wherein nodes can send messages to nodes in earlier layers to control the dataflow between the layers.

Datapath—a collection of functional units that perform data processing operations. The functional units may include memory, multiplexers, ALUs, SIMDs, multipliers, registers, buses, etc.

FCMU—fused compute and memory unit—a circuit that includes both a memory unit and a compute unit.

Graph—a collection of nodes connected by edges. Nodes may represent various kinds of items or operations, dependent on the type of graph. Edges may represent relationships, directions, dependencies, etc.

IC—integrated circuit—a monolithically integrated circuit, i.e., a single semiconductor die which may be delivered as a bare die or as a packaged circuit. For the purposes of this document, the term integrated circuit also includes packaged circuits that include multiple semiconductor dies, stacked dies, or multiple-die substrates. Such constructions are now common in the industry, produced by the same supply chains, and for the average user often indistinguishable from monolithic circuits.

A logical CGR array or logical CGR unit—a CGR array or a CGR unit that is physically realizable, but that may not have been assigned to a physical CGR array or to a physical CGR unit on an IC.

ML—machine learning.

PCU—pattern compute unit—a compute unit that can be configured to perform one or more operations.

PEF—processor-executable format—a file format suitable for configuring a configurable data processor.

Pipeline—a staggered flow of operations through a chain of pipeline stages. The operations may be executed in parallel and in a time-sliced fashion. Pipelining increases overall instruction throughput. CGR processors may include pipelines at different levels. For example, a compute unit may include a pipeline at the gate level to enable correct timing of gate-level operations in a synchronous logic implementation of the compute unit, and a meta-pipeline at the graph execution level to enable correct timing of node-level operations of the configured graph. Gate-level pipelines are usually hard wired and unchangeable, whereas meta-pipelines are configured at the CGR processor, CGR array, and/or CGR unit level.

Pipeline Stages—a pipeline is divided into stages that are coupled with one another to form a pipe topology.

PMU—pattern memory unit—a memory unit that can locally store data.

PNR—place and route—the assignment of logical CGR units and associated processing/operations to physical CGR units in an array, and the configuration of communication paths between the physical CGR units.

RAIL—reconfigurable dataflow unit (RDU) abstract intermediate language.

CGR Array—an array of CGR units, coupled with each other through an array-level network (ALN), and coupled with external elements via a top-level network (TLN). A CGR array can physically implement the nodes and edges of a dataflow graph and is sometimes referred to as a reconfigurable dataflow unit (RDU).

SIMD—single-instruction multiple-data—an arithmetic logic unit (ALU) that simultaneously performs a single programmable operation on multiple data elements delivering multiple output results.

TLIR—template library intermediate representation.

TLN—top-level network.

Implementations

The architecture, configurability and dataflow capabilities of an array of CGR units enable increased compute power that supports both parallel and pipelined computation. A CGR processor, which includes one or more CGR arrays (arrays of CGR units), can be programmed to simultaneously execute multiple independent and interdependent dataflow graphs. To enable simultaneous execution, the dataflow graphs may need to be distilled from a high-level program and translated to a configuration file for the CGR processor. A high-level program is source code written in programming languages like Spatial, Python, C++, and C, and may use computation libraries for scientific computing, ML, AI, and the like. The high-level program and referenced libraries can implement computing structures and algorithms of machine learning models like AlexNet, VGG Net, GoogleNet, ResNet, ResNeXt, RCNN, YOLO, SqueezeNet, SegNet, GAN, BERT, ELMo, USE, Transformer, and Transformer-XL.

Translation of high-level programs to executable bit files is performed by a compiler. While traditional compilers sequentially map operations to processor instructions, typically without regard to pipeline utilization and duration (a task usually handled by the hardware), an array of CGR units requires mapping operations to processor instructions in both space (for parallelism) and time (for synchronization of interdependent computation graphs or dataflow graphs). This requirement implies that a compiler for a CGRA must decide which operation of a computation graph or dataflow graph is assigned to which of the CGR units, and how both data and, related to the support of dataflow graphs, control information flows among CGR units, and to and from external hosts and storage. This process, known as “place and route”, is one of many new challenges posed to compilers for arrays of CGR units.

FIG. 1 illustrates an example system 100 including a CGR processor 110, a host 180, and a memory 190. CGR processor 110 has a coarse-grained reconfigurable architecture (CGRA) and includes an array of CGR units 120 such as a CGR array. CGR processor 110 further includes an IO interface 138, and a memory interface 139. Array of CGR units 120 is coupled with IO interface 138 and memory interface 139 via databus 130 which may be part of a top-level network (TLN). Host 180 communicates with IO interface 138 via system databus 185, and memory interface 139 communicates with memory 190 via memory bus 195. Array of CGR units 120 may further include compute units and memory units that are connected with an array-level network (ALN) to provide the circuitry for execution of a computation graph or a dataflow graph that may have been derived from a high-level program with user algorithms and functions. The high-level program may include a set of procedures, such as learning or inferencing in an AI or ML system. More specifically, the high-level program may include applications, graphs, application graphs, user applications, computation graphs, control flow graphs, dataflow graphs, models, deep learning applications, deep learning neural networks, programs, program images, jobs, tasks and/or any other procedures and functions that may need serial and/or parallel processing. In some implementations, execution of the graph(s) may involve using multiple units of CGR processor 110. In some implementations, CGR processor 110 may include one or more ICs. In other implementations, a single IC may span multiple CGR processors. In further implementations, CGR processor 110 may include one or more units of array of CGR units 120.

Host 180 may be, or include, a computer such as further described with reference to FIG. 2. Host 180 runs runtime processes, as further referenced herein, and may also be used to run computer programs, such as the compiler further described herein with reference to FIG. 9. In some implementations, the compiler may run on a computer that is similar to the computer described with reference to FIG. 2, but separate from host 180.

CGR processor 110 may accomplish computational tasks by executing a configuration file (for example, a PEF file). For the purposes of this description, a configuration file corresponds to a dataflow graph, or a translation of a dataflow graph, and may further include initialization data. A compiler compiles the high-level program to provide the configuration file. In some implementations described herein, a CGR array is configured by programming one or more configuration stores with all or parts of the configuration file. A single configuration store may be at the level of the CGR processor or the CGR array, or a CGR unit may include an individual configuration store. The configuration file may include configuration data for the CGR array and CGR units in the CGR array, and link the computation graph to the CGR array. Execution of the configuration file by CGR processor 110 causes the CGR array(s) to implement the user algorithms and functions in the dataflow graph.

CGR processor 110 can be implemented on a single integrated circuit die or on a multichip module (MCM). An IC can be packaged in a single chip module or a multichip module. An MCM is an electronic package that may comprise multiple IC dies and other devices, assembled into a single module as if it were a single device. The various dies of an MCM may be mounted on a substrate, and the bare dies of the substrate are electrically coupled to the surface or to each other using, for example, wire bonding, tape bonding, or flip-chip bonding.

FIG. 2 illustrates an example of a computer 200, including an input device 210, a processor 220, a storage device 230, and an output device 240. Although the example computer 200 is drawn with a single processor, other implementations may have multiple processors. Input device 210 may comprise a mouse, a keyboard, a sensor, an input port (for example, a universal serial bus (USB) port), and any other input device known in the art. Output device 240 may comprise a monitor, printer, and any other output device known in the art. Furthermore, part or all of input device 210 and output device 240 may be combined in a network interface, such as a Peripheral Component Interconnect Express (PCIe) interface suitable for communicating with CGR processor 110. Input device 210 is coupled with processor 220 to provide input data, which an implementation may store in memory 226. Processor 220 is coupled with output device 240 to provide output data from memory 226 to output device 240. Processor 220 further includes control logic 222, operable to control memory 226 and arithmetic and logic unit (ALU) 224, and to receive program and configuration data from memory 226. Control logic 222 further controls exchange of data between memory 226 and storage device 230. Memory 226 typically comprises memory with fast access, such as static random-access memory (SRAM), whereas storage device 230 typically comprises memory with slow access, such as dynamic random-access memory (DRAM), flash memory, magnetic disks, optical disks, and any other memory type known in the art. At least a part of the memory in storage device 230 includes a non-transitory computer-readable medium (CRM 235), such as used for storing computer programs.

FIG. 3 illustrates example details of a CGR architecture 300 including a top-level network (TLN 330) and two CGR arrays (CGR array 310 and CGR array 320). A CGR array comprises an array of CGR units (e.g., PMUs, PCUs, FCMUs) coupled via an array-level network (ALN), e.g., a bus system. The ALN is coupled with the TLN 330 through several AGCUs, and consequently with I/O interface 338 (or any number of interfaces) and memory interface 339. Other implementations may use different bus or communication architectures.

Circuits on the TLN in this example include one or more external I/O interfaces, including I/O interface 338 and memory interface 339. The interfaces to external devices include circuits for routing data among circuits coupled with the TLN and external devices, such as high-capacity memory, host processors, other CGR processors, FPGA devices, and so on, that are coupled with the interfaces.

Each depicted CGR array has four AGCUs (e.g., MAGCU1, AGCU12, AGCU13, and AGCU14 in CGR array 310). The AGCUs interface the TLN to the ALNs and route data from the TLN to the ALN or vice versa.

One of the AGCUs in each CGR array in this example is configured to be a master AGCU (MAGCU), which includes an array configuration load/unload controller for the CGR array. The MAGCU1 includes a configuration load/unload controller for CGR array 310, and MAGCU2 includes a configuration load/unload controller for CGR array 320. Some implementations may include more than one array configuration load/unload controller. In other implementations, an array configuration load/unload controller may be implemented by logic distributed among more than one AGCU. In yet other implementations, a configuration load/unload controller can be designed for loading and unloading configuration of more than one CGR array. In further implementations, more than one configuration controller can be designed for configuration of a single CGR array. Also, the configuration load/unload controller can be implemented in other portions of the system, including as a stand-alone circuit on the TLN and the ALN or ALNs.

The TLN is constructed using top-level switches (switch 311, switch 312, switch 313, switch 314, switch 315, and switch 316) coupled with each other as well as with other circuits on the TLN, including the AGCUs, and external I/O interface 338. The TLN includes links (e.g., L11, L12, L21, L22) coupling the top-level switches. Data may travel in packets between the top-level switches on the links, and from the switches to the circuits on the network coupled with the switches. For example, switch 311 and switch 312 are coupled by link L11, switch 314 and switch 315 are coupled by link L12, switch 311 and switch 314 are coupled by link L13, and switch 312 and switch 313 are coupled by link L21. The links can include one or more buses and supporting control lines, including for example a chunk-wide bus (vector bus). For example, the top-level network can include data, request and response channels operable in coordination for transfer of data in any manner known in the art.
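To make the switch-and-link structure concrete, the following Python sketch models only the four link endpoints named above and finds a hop sequence between two top-level switches. It assumes that links are bidirectional and that a packet may take any shortest path; the TLN of FIG. 3 may contain further links, and the name `switch_path` is hypothetical.

```python
from collections import deque

# Link endpoints as named in the text; other links of FIG. 3 are not modeled.
LINKS = {
    "L11": (311, 312),
    "L12": (314, 315),
    "L13": (311, 314),
    "L21": (312, 313),
}


def switch_path(src, dst, links=LINKS):
    """Breadth-first search for a shortest switch-to-switch path, or None."""
    neighbors = {}
    for a, b in links.values():
        # Assumption: every link carries packets in both directions.
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(switch_path(313, 315))  # prints [313, 312, 311, 314, 315]
```

With only these four links, a packet from switch 313 to switch 315 crosses L21, L11, L13, and L12 in turn, and switch 316 is unreachable because none of the named links touches it.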

FIG. 4 illustrates an example CGR array 400, including an array of CGR units in an ALN. CGR array 400 may include several types of CGR unit 401, such as FCMUs, PMUs, PCUs, memory units, and/or compute units. For examples of the functions of these types of CGR units, see Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns”, ISCA 2017, Jun. 24-28, 2017, Toronto, ON, Canada. Each of the CGR units may include a configuration store 402 comprising a set of registers or flip-flops storing configuration data that represents the setup and/or the sequence to run a program, and that can include the number of nested loops, the limits of each loop iterator, the instructions to be executed for each stage, the source of operands, and the network parameters for the input and output interfaces. In some implementations, each CGR unit 401 comprises an FCMU. In other implementations, the array comprises both PMUs and PCUs, or memory units and compute units, arranged in a checkerboard pattern. In yet other implementations, CGR units may be arranged in different patterns. The ALN includes switch units 403 (S), and AGCUs (each including two address generators 405 (AG) and a shared coalescing unit 404 (CU)). Switch units 403 are connected among themselves via interconnects 421 and to a CGR unit 401 with interconnects 422. Switch units 403 may be coupled with address generators 405 via interconnects 420. In some implementations, communication channels can be configured as end-to-end connections, and switch units 403 are CGR units. In other implementations, switches route data via the available links based on address information in packet headers, and communication channels are established as and when needed.

A configuration file may include configuration data representing an initial configuration, or starting state, of each of the CGR units that execute a high-level program with user algorithms and functions. Program load is the process of setting up the configuration stores in the CGR array based on the configuration data to allow the CGR units to execute the high-level program. Program load may also require loading memory units and/or PMUs.

The ALN includes one or more kinds of physical data buses, for example a chunk-level vector bus (e.g., 512 bits of data), a word-level scalar bus (e.g., 32 bits of data), and a control bus. For instance, interconnects 421 between two switches may include a vector bus interconnect with a bus width of 512 bits, and a scalar bus interconnect with a bus width of 32 bits. A control bus can comprise a configurable interconnect that carries multiple control bits on signal routes designated by configuration bits in the CGR array's configuration file. The control bus can comprise physical lines separate from the data buses in some implementations. In other implementations, the control bus can be implemented using the same physical lines with a separate protocol or in a time-sharing procedure.

Physical data buses may differ in the granularity of data being transferred. In one implementation, a vector bus can carry a chunk that includes 16 channels of 32-bit floating-point data or 32 channels of 16-bit floating-point data (i.e., 512 bits) as its payload. A scalar bus can have a 32-bit payload and carry scalar operands or control information. The control bus can carry control handshakes such as tokens and other signals. The vector and scalar buses can be packet-switched, including headers that indicate a destination of each packet and other information such as sequence numbers that can be used to reassemble a file when the packets are received out of order. Each packet header can contain a destination identifier that identifies the geographical coordinates of the destination switch unit (e.g., the row and column in the array), and an interface identifier that identifies the interface on the destination switch (e.g., North, South, East, West, etc.) used to reach the destination unit.
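As a rough illustration of the payload arithmetic and header fields described above, the following Python sketch models a packet header and checks both vector-chunk layouts. The class, field, and function names (PacketHeader, Interface, lanes_per_chunk, reassemble) are hypothetical and are not taken from any actual hardware definition; only the 512-bit and 32-bit widths come from the text.

```python
from dataclasses import dataclass
from enum import Enum

VECTOR_BUS_BITS = 512  # chunk-level vector bus width given in the text
SCALAR_BUS_BITS = 32   # word-level scalar bus width given in the text


class Interface(Enum):
    """Interface on the destination switch (hypothetical encoding)."""
    NORTH = 0
    SOUTH = 1
    EAST = 2
    WEST = 3


@dataclass(frozen=True)
class PacketHeader:
    """Illustrative header: destination switch coordinates, interface, sequence number."""
    dest_row: int
    dest_col: int
    interface: Interface
    sequence: int


def lanes_per_chunk(element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector chunk."""
    if VECTOR_BUS_BITS % element_bits != 0:
        raise ValueError("element width must divide the vector bus width")
    return VECTOR_BUS_BITS // element_bits


def reassemble(packets):
    """Restore payload order from (header, payload) pairs received out of order."""
    return [payload for _, payload in sorted(packets, key=lambda p: p[0].sequence)]
```

With these assumptions, lanes_per_chunk(32) gives the 16 channels of 32-bit data and lanes_per_chunk(16) gives the 32 channels of 16-bit data mentioned above.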

A CGR unit 401 may have four ports (as drawn) to interface with switch units 403, or any other number of ports suitable for an ALN. Each port may be suitable for receiving and transmitting data, or a port may be suitable for only receiving or only transmitting data.

A switch unit, as shown in the example of FIG. 4, may have eight interfaces. The North, South, East and West interfaces of a switch unit may be used for links between switch units using interconnects 421. The Northeast, Southeast, Northwest and Southwest interfaces of a switch unit may each be used to make a link with an FCMU, PCU or PMU instance using one of the interconnects 422. Two switch units in each CGR array quadrant have links to an AGCU using interconnects 420. The AGCU coalescing unit arbitrates between the AGs and processes memory requests. Each of the eight interfaces of a switch unit can include a vector interface, a scalar interface, and a control interface to communicate with the vector network, the scalar network, and the control network. In other implementations, a switch unit may have any number of interfaces.

During execution of a graph or subgraph in a CGR array after configuration, data can be sent via one or more switch units and one or more links between the switch units to the CGR units using the vector bus and vector interface(s) of the one or more switch units on the ALN. A CGR array may comprise at least a part of CGR array 400, and any number of other CGR arrays coupled with CGR array 400.

A data processing operation implemented by CGR array configuration may comprise multiple graphs or subgraphs specifying data processing operations that are distributed among and executed by corresponding CGR units (e.g., FCMUs, PMUs, PCUs, AGs, and CUs).

FIG. 5 illustrates an example 500 of a PMU 510 and a PCU 520, which may be combined in an FCMU 530. PMU 510 may be directly coupled to PCU 520, or optionally via one or more switches. PMU 510 includes a scratchpad memory 515, which may receive external data, memory addresses, and memory control information (write enable, read enable) via one or more buses included in the ALN. PCU 520 includes two or more processor stages, such as SIMD 521 through SIMD 526, and configuration store 528. The processor stages may include ALUs, or SIMDs, as drawn, or any other reconfigurable stages that can process data.

Each stage in PCU 520 may also hold one or more registers (not drawn) for short-term storage of parameters. Short-term storage, for example during one to several clock cycles or unit delays, allows for synchronization of data in the PCU pipeline.

FIG. 6 shows an example of a computation graph 600. Computation graphs represent mathematical expressions, and comprise nodes and directed edges. In FIG. 6, nodes are drawn as circles and directed edges are drawn as arrows. A node can represent a constant, a variable, for example from an input, an operation, an equation, or an output value. A directed edge can represent a dependency. Node 610 represents a variable A1, whose present value equals 12. Node 611 represents a variable A2, whose present value equals 251. Node 612 represents the constant π. Node 613 represents a multiplication operation. It receives its input data from node 611 via directed edge 621 and from node 612 via directed edge 622. Node 614 represents an addition operation. Node 614 receives its input data from node 610 via directed edge 620 and from node 613 via directed edge 623. Node 614 outputs its result in output node 615 via directed edge 624. Computation graph 600 as a whole represents the equation Output = A1 + π × A2.
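The graph of FIG. 6 can be sketched in a few lines of Python. This is only an illustration of how nodes and directed edges encode an expression; the dictionary layout, the node labels (n610 and so on), and the evaluate function are assumptions made for this sketch.

```python
import math

# Each node maps to (kind, payload). Operation nodes list their input nodes,
# mirroring directed edges 620-624 of FIG. 6.
GRAPH = {
    "n610": ("var", "A1"),
    "n611": ("var", "A2"),
    "n612": ("const", math.pi),
    "n613": ("mul", ["n611", "n612"]),
    "n614": ("add", ["n610", "n613"]),
    "n615": ("out", ["n614"]),
}


def evaluate(graph, node, env):
    """Recursively evaluate a node after evaluating all of its inputs."""
    kind, payload = graph[node]
    if kind == "var":
        return env[payload]
    if kind == "const":
        return payload
    inputs = [evaluate(graph, src, env) for src in payload]
    if kind == "mul":
        return inputs[0] * inputs[1]
    if kind == "add":
        return inputs[0] + inputs[1]
    if kind == "out":
        return inputs[0]
    raise ValueError(f"unknown node kind: {kind}")


# Present values from the text: A1 = 12 and A2 = 251.
result = evaluate(GRAPH, "n615", {"A1": 12, "A2": 251})
```

For the present values given in the text, the result is 12 + 251π, or approximately 800.54.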

The depicted computation graph 600 is very simple and could be implemented electronically in many ways. For example, it could be hardwired as a circuit of digital gates in an application-specific IC (ASIC), or an FPGA could be configured to emulate the circuit of digital gates, or a CGR processor could be configured to perform the addition and multiplication functions, or a CPU could run a conventional computer program to perform the functions. In all implementations, the timing is important. Node 614 is not able to calculate a valid output value until all its input values are valid. That means node 613 must be finished first. Most digital circuits are implemented as pipelines of clocked stages. If the add operation of node 614 is in a later stage than the multiplication operation of node 613, then a fixed-delay buffer may need to be inserted between node 610 and node 614 to synchronize the value of variable A1 with the result of the multiplication in node 613. The fixed-delay buffer can be added to the graph to make it physically implementable.
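The need for a fixed-delay buffer can be illustrated with a small sketch that assigns each node of FIG. 6 to a pipeline stage and computes how many stages of delay each edge needs. It assumes, purely for illustration, that every operation takes exactly one stage; the names INPUTS, stage_of, and delay_buffers are hypothetical.

```python
# Operation nodes and their inputs (edges of FIG. 6); source nodes have no inputs.
INPUTS = {
    "n610": [],
    "n611": [],
    "n612": [],
    "n613": ["n611", "n612"],
    "n614": ["n610", "n613"],
}


def stage_of(node, inputs=INPUTS):
    """Pipeline stage of a node: one stage after its latest-arriving input."""
    preds = inputs[node]
    return 0 if not preds else 1 + max(stage_of(p, inputs) for p in preds)


def delay_buffers(inputs=INPUTS):
    """Delay (in stages) to insert on each edge so all inputs arrive together."""
    return {
        (src, dst): stage_of(dst, inputs) - stage_of(src, inputs) - 1
        for dst, preds in inputs.items()
        for src in preds
    }
```

Under this one-stage-per-operation assumption, the edge from node 610 to node 614 needs one stage of delay, while the edge from node 613 to node 614 needs none.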

Most computation graphs are acyclic, i.e., they don't include loops. One class of computation graphs, dataflow graphs, may include loops, and even nested loops. This can make delays of operations performed by nodes variable, dependent on the data flowing through a pipeline of operations. When a high-level program includes multiple pipelines of parallel, interdependent operations, then synchronization can become highly complex. Synchronization can be further complicated when directed edges are implemented as data channels in a network, since the data channels can become congested. A CGR processor may resolve both problems by using dataflow control information, sent as messages from consuming nodes to producing nodes to indicate that the consuming node is ready to receive the information, and a credit token system that prevents congestion of the data channels between the producing and consuming nodes.
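A credit token scheme of this general kind can be sketched as follows. This is a minimal software model, not the hardware protocol: the CreditChannel class and its methods are hypothetical, and the sketch assumes one credit per free slot at the consumer.

```python
from collections import deque


class CreditChannel:
    """Data channel whose producer may send only while it holds credits."""

    def __init__(self, capacity: int):
        self.credits = capacity      # one credit per free slot at the consumer
        self.in_flight = deque()     # items queued between producer and consumer

    def try_send(self, item) -> bool:
        """Producer side: send only if a credit is available."""
        if self.credits == 0:
            return False             # back-pressure: the producer must stall
        self.credits -= 1
        self.in_flight.append(item)
        return True

    def receive(self):
        """Consumer side: take one item and return a credit upstream."""
        item = self.in_flight.popleft()
        self.credits += 1
        return item
```

Because the producer can never have more items outstanding than it has credits, the channel cannot be overrun, and each returned credit plays the role of the upstream message indicating that the consumer is ready for more data.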

FIG. 7 shows an example of a dataflow graph 700. This example, one head of a multi-head attention module in the Transformer model first published by Vaswani, et al., “Attention Is All You Need,” 31st Conference on Neural Information Processing Systems, 2017, is well known in the industry. It includes a loop 709. Dataflow graph 700 includes four general matrix multiplications, GeMM 702, GeMM 712, GeMM 722, and GeMM 708. Loop 709 includes an ingress matrix multiplication GeMM 703, mask fill node 704, softmax node 705, dropout node 706, and egress matrix multiplication node 707.

To physically implement dataflow graph 700, an implementation may insert three types of stage buffers: (1) inter-stage buffers, (2) intra-stage buffers, and (3) interface buffers. The interface buffers are used because the granularity of communication (i.e., the size of tensors or data produced or consumed) varies between loops at different levels. Further, an implementation must add dataflow control information, to synchronize the various stages of asynchronous computation.

FIG. 8 shows the dataflow graph of FIG. 7 with buffers and dataflow control information added. A compiler in the technology presented herein can create graph 800 from dataflow graph 700, assign the nodes to compute units and memory units in a CGR array, and assign edges and dataflow control information to data channels in an array-level network that connects the compute units and memory units.

To get from dataflow graph 700 to graph 800, one compiler implementation divides the dataflow graph in stages (stages 0, 1, and 2 are shown in this example), and where there are nested loops also in substages (substages 1.0 through 1.4 are shown). The implementation inserts buffers between the stages to allow for pipelined processing in one or more parallel meta-pipelines that may interact. The buffers are shown as blocks labeled A . . . L. They are different from buffers at the electrical level, which may be single or double inverters used to boost the energy level of digital signals that need to travel through long wires or that need to drive high-capacitance loads, or which may be flip-flops operated by a system clock and used to implement synchronous logic. The buffers at the meta-pipeline level may be memories, register files, shift registers, or first-in-first-out (FIFO) memories of fixed or variable length, storing one or more data items (e.g., scalars, vectors, or tensors). They may be clocked by a producer node to store data or by a consumer node to release data. They may further be controlled by dataflow control information coming from, for example, downstream nodes. FIG. 8 shows the same operation nodes as FIG. 7 (with like numbering), but the edges (solid arrows), where data flows, are interrupted by the buffers to partition the graph into stages, and dataflow control information is added (shown as dashed arrows for the main loop and dash-dot arrows for loop 809). In the example shown, data travels downstream (solid arrows from the left to the right) and dataflow control information travels upstream (dashed arrows from the right to the left).

In further preparation for a physical implementation of graph 800, an implementation may assign each operation node to one or more logical compute units or memory units, and each buffer to one or more logical memory units. Some implementations may perform further preparations and optimizations. All implementations proceed to place and route, i.e., assign the logical units to physical units in a layout of a CGR array, and (in some implementations) assign the data connections and the dataflow control information connections to data channels in the ALN in the CGR array.

FIG. 9 is a block diagram of a compiler stack 900 implementation suitable for generating a configuration file for a CGR processor. As depicted, compiler stack 900 includes several stages to convert a high-level program with user algorithms and functions, e.g., algebraic expressions and functions, to configuration data for the CGR units. Compiler stack 900 may take its input from application platform 910, or any other source of high-level program statements suitable for parallel processing, which provides a user interface for general users. It may further receive hardware description 915, for example defining the physical units in a reconfigurable data processor or CGRA processor. Application platform 910 may include libraries such as PyTorch, TensorFlow, ONNX, Caffe, and Keras to provide user-selected and configured algorithms. Application platform 910 outputs a high-level program to compiler 920, which in turn outputs a configuration file to the reconfigurable data processor or CGRA processor where it is executed in runtime processes 930. Compiler 920 may include dataflow graph compiler 921, which may handle a dataflow graph, algebraic graph compiler 922, template graph compiler 923, template library 924, and placer and router PNR 925. In some implementations, template library 924 includes RDU abstract intermediate language (RAIL) and/or assembly language interfaces for power users.

Dataflow graph compiler 921 converts the high-level program with user algorithms and functions from application platform 910 to one or more dataflow graphs. The high-level program may be suitable for parallel processing, and therefore parts of the nodes of the dataflow graphs may be intrinsically parallel unless an edge in the graph indicates a dependency. Dataflow graph compiler 921 may provide code optimization steps like false data dependency elimination, dead-code elimination, and constant folding. The dataflow graphs encode the data and control dependencies of the high-level program. Dataflow graph compiler 921 may support programming a reconfigurable data processor in higher-level or lower-level programming languages, for example from an application platform 910 to C++ and assembly language. In some implementations, dataflow graph compiler 921 allows programmers to provide code that runs directly on the reconfigurable data processor. In other implementations, dataflow graph compiler 921 provides one or more libraries that include predefined functions like linear algebra operations, element-wise tensor operations, non-linearities, and reductions required for creating, executing, and profiling the dataflow graphs on the reconfigurable processors. Dataflow graph compiler 921 may provide an application programming interface (API) to enhance functionality available via the application platform 910.
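Of the optimization steps named above, constant folding is the simplest to illustrate. The following sketch folds constant subexpressions in a small tuple-encoded expression tree; the encoding and the fold_constants function are assumptions made for this illustration and do not describe the internals of dataflow graph compiler 921.

```python
def fold_constants(expr):
    """Fold constant subexpressions in a tuple-encoded expression tree.

    An expression is a number, a variable name (str), or a tuple
    (op, left, right) with op in {"add", "mul"}.
    """
    if not isinstance(expr, tuple):
        return expr  # a leaf: a number or a variable name
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        # Both operands are known at compile time, so compute the result now.
        return left + right if op == "add" else left * right
    return (op, left, right)
```

For example, the tree ("add", "x", ("mul", 2, 3)) is rewritten to ("add", "x", 6), which removes one operation node from the resulting graph.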

Algebraic graph compiler 922 may include a model analyzer and compiler (MAC) level that makes high-level mapping decisions for (sub-graphs of the) dataflow graph based on hardware constraints. It may support various application frontends such as Samba, JAX, and TensorFlow/HLO. Algebraic graph compiler 922 may also transform the graphs via autodiff and GradNorm, perform stitching between sub-graphs, interface with template generators for performance and latency estimation, convert dataflow graph operations to AIR operations, perform tiling, sharding (database partitioning) and other operations, and model or estimate the parallelism that can be achieved on the dataflow graphs.

Algebraic graph compiler 922 may further include an arithmetic or algebraic intermediate representation (AIR) level that translates high-level graph and mapping decisions provided by the MAC level into explicit AIR graphs. Key responsibilities of the AIR level include legalizing the graph and mapping decisions of the MAC, expanding data parallel, tiling, metapipe, region instructions provided by the MAC, inserting stage buffers and skip buffers, eliminating redundant operations, buffers and sections, and optimizing for resource use, latency, and throughput.

Template graph compiler 923 may translate AIR graphs into TLIR graphs, optimizing for the target hardware architecture and/or into unplaced units suitable for PNR 925. Template graph compiler 923 may add further information (name, inputs, input names and dataflow description) for PNR 925 and make the graph physically realizable through each performed step. Template graph compiler 923 may for example provide translation of AIR graphs to specific model operation templates such as for general matrix multiplication (GeMM). An implementation may convert part or all intermediate representation operations to templates, stitch templates into the dataflow and control flow, insert necessary buffers and layout transforms, generate test data and optimize for hardware use, latency, and throughput.

Implementations may use templates for common operations. Templates may be implemented using assembly language, RAIL, or similar. RAIL is comparable to assembly language in that memory units and compute units are separately programmed, but it can provide a higher level of abstraction and compiler intelligence via a concise performance-oriented domain-specific language for CGR array templates. RAIL enables template writers and external power users to control interactions between logical compute units and memory units with high-level expressions without the need to manually program capacity splitting, register allocation, etc. The logical compute units and memory units also enable stage/register allocation, context splitting, transpose slotting, resource virtualization and mapping to multiple physical compute units and memory units (e.g., PCUs and PMUs).

Template library 924 may include an assembler that provides an architecture-independent low-level programming interface as well as optimization and code generation for the target hardware. Responsibilities of the assembler may include address expression compilation, intra-unit resource allocation and management, making a template graph physically realizable with target-specific rules, low-level architecture-specific transformations and optimizations, and architecture-specific code generation.

PNR 925 translates and maps logical (i.e., unplaced physically realizable) CGR units to the physical chip level (e.g., a physical array of CGR units), determines physical data channels to allow for communication among the CGR units and between the CGR units and circuits coupled via the TLN, allocates ports on the CGR units and switches, provides configuration data and initialization data for the target hardware, and produces configuration files, e.g., processor-executable format (PEF) files. It may further provide bandwidth calculations, allocate network interfaces such as AGCUs and virtual address generators (VAGs), provide configuration data that allows AGCUs and/or VAGs to perform address translation, and control ALN switches and data routing. PNR 925 may provide its functionality in multiple steps and may include multiple modules (not shown in FIG. 9) to provide the multiple steps, e.g., a placer, a router, a port allocator, and a PEF file generator. PNR 925 may receive its input data in various ways. For example, it may receive parts of its input data from any of the earlier modules (dataflow graph compiler 921, algebraic graph compiler 922, template graph compiler 923, and/or template library 924). In some implementations, an earlier module, such as template graph compiler 923, may have the task of preparing all information for PNR 925 and no other units provide PNR input data directly.

Further implementations of compiler 920 provide for an iterative process, for example by feeding information from PNR 925 back to an earlier module, so that the earlier module can execute a new compilation step in which it uses physically realized results rather than estimates of or placeholders for physically realizable circuits. For example, PNR 925 may feed information regarding the physically realized circuits back to algebraic graph compiler 922.

Memory allocations represent the creation of logical memory spaces in on-chip and/or off-chip memories for data required to implement the dataflow graph, and these memory allocations are specified in the configuration file. Memory allocations define the type and the number of hardware circuits (functional units, storage, or connectivity components). Main memory (e.g., DRAM) may be off-chip memory, and scratchpad memory (e.g., SRAM) is on-chip memory inside an RDU. Other memory types for which the memory allocations can be made for various access patterns and layouts include cache, read-only look-up tables (LUTs), serial memories (e.g., FIFOs), and register files.

Compiler 920 binds memory allocations to unplaced memory units and binds operations specified by operation nodes in the dataflow graph to unplaced compute units, and these bindings may be specified in the configuration data. In some implementations, compiler 920 partitions parts of a dataflow graph into memory subgraphs and compute subgraphs, and specifies these subgraphs in the PEF file. A memory subgraph may comprise address calculations leading up to a memory access. A compute subgraph may comprise all other operations in the parent graph. In one implementation, a parent graph is broken up into multiple memory subgraphs and exactly one compute subgraph. A single parent graph can produce one or more memory subgraphs, depending on how many memory accesses exist in the original loop body. In cases where the same memory addressing logic is shared across multiple memory accesses, address calculation may be duplicated to create multiple memory subgraphs from the same parent graph.
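One way to picture this partitioning is the sketch below, which collects, for each memory access, the address calculations that lead up to it, and treats everything else as the compute subgraph. The graph encoding and the names ancestors, partition, inputs, and address_of are hypothetical, and the sketch assumes that address calculations feed only memory accesses.

```python
def ancestors(node, inputs):
    """All nodes that feed `node`, directly or transitively."""
    seen = set()
    stack = list(inputs.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(inputs.get(current, []))
    return seen


def partition(inputs, address_of):
    """Split a parent graph into per-access memory subgraphs and one compute subgraph.

    inputs: node -> list of input nodes (every node appears as a key).
    address_of: memory access node -> node that produces its address.
    """
    memory_subgraphs = {
        access: ancestors(addr, inputs) | {addr}
        for access, addr in address_of.items()
    }
    address_nodes = set().union(*memory_subgraphs.values())
    compute_subgraph = set(inputs) - address_nodes
    return memory_subgraphs, compute_subgraph
```

When two accesses share the same addressing logic, each memory subgraph receives its own copy of the shared address nodes, which corresponds to the duplication of address calculation described above.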

Compiler 920 generates the configuration files with configuration data (e.g., a bit stream) for the placed positions and the routed data and control networks. In one implementation, this includes assigning coordinates and communication resources of the physical CGR units by placing and routing unplaced units onto the array of CGR units while maximizing bandwidth and minimizing latency.

A first example of accelerated deep learning is using a deep learning accelerator implemented in a CGRA to train a neural network. A second example of accelerated deep learning is using the deep learning accelerator to operate a trained neural network to perform inferences. A third example of accelerated deep learning is using the deep learning accelerator to train a neural network and subsequently perform inference with any one or more of the trained neural network, information from the trained neural network, and a variant of the same.

Examples of neural networks include fully connected neural networks (FCNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), convolutional neural networks (CNNs), graph convolutional networks (GCNs), long short-term memory (LSTM) networks, autoencoders, deep belief networks, and generative adversarial networks (GANs).

An example of training a neural network is determining one or more weights associated with the neural network, such as by back-propagation in a deep learning accelerator. An example of making an inference is using a trained neural network to compute results by processing input data using the weights associated with the trained neural network. As used herein, the term ‘weight’ is an example of a ‘parameter’ as used in various forms of neural network processing. For example, some neural network learning is directed to determining parameters (e.g., through back-propagation) that are usable for performing neural network inferences.

A neural network processes data according to a dataflow graph comprising layers of neurons. Example layers of neurons include input layers, hidden layers, and output layers. Stimuli (e.g., input data) are received by an input layer of neurons and the computed results of the dataflow graph (e.g., output data) are provided by an output layer of neurons. Example hidden layers include rectified linear unit (ReLU) layers, fully connected layers, recurrent layers, graphical network layers, long short-term memory layers, convolutional layers, kernel layers, dropout layers, and pooling layers. A neural network may be conditionally and/or selectively trained. After being trained, a neural network may be conditionally and/or selectively used for inference.

Examples of ICs, or parts of ICs, that may be used as deep learning accelerators, are processors such as central processing units (CPUs), CGR processor ICs, graphics processing units (GPUs), FPGAs, ASICs, application-specific instruction-set processors (ASIPs), and digital signal processors (DSPs). The disclosed technology implements efficient distributed computing by allowing an array of accelerators (e.g., reconfigurable processors) attached to separate hosts to directly communicate with each other via buffers.

FIG. 10 shows details of an example array of CGR units (CGR array 1000). CGR array 1000 may be included in a CGR processor. CGR array 1000 comprises multiple physical memory units (for example, MU00 . . . MU33) and may comprise other CGR units, such as physical compute units CU00 . . . CU33. The physical memory units and other CGR units may be arranged in a regular pattern, for example in repeated rows of alternating physical memory units and physical compute units, as drawn. In a common situation, (part of) a graph including logical memory units (or buffers) and logical compute units is limited to a contiguous area 1010, which includes a subset of the physical memory units in CGR array 1000, for example MU00, MU01, MU02, MU10, MU11, MU12, and MU21. The physical compute units may already be mapped to logical compute units in contiguous area 1010. Buffers inserted by the implementation, for example inter-stage buffers, intra-stage buffers, and interface buffers, may still need to be assigned to the physical memory units in contiguous area 1010. Placement (assignment) of the buffers carries a cost. The cost may be dependent on the high-level program and on other factors, and may include parameters related to the weight and/or bandwidth of routing channels connecting the buffers, the traffic density of the routing channels, traffic latency, the efficiency of semiconductor die area usage, energy use, and any other parameters that affect the performance of executing a high-level program with user algorithms and functions in CGR array 1000. In some cases, some buffers may be small enough that they can be combined in a single physical memory unit. In other cases, buffers may be larger than a physical memory unit, but parts of buffers may be combined in a physical memory unit. In some implementations, the contiguous area may be a square or a rectangle, and in some cases, it may have another shape, for example a polygon shape as drawn. The contiguous area may include part or all of CGR array 1000.
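A cost that combines several such parameters could, for instance, take the form of a weighted sum. The sketch below is one possible shape for such a function, not the cost function of the disclosed technology; the metric names and weights are hypothetical, and a lower value is treated as better.

```python
def placement_cost(metrics, weights):
    """Weighted sum of placement metrics; in this sketch a lower value is better.

    metrics: metric name -> measured or estimated value for one placement.
    weights: metric name -> relative importance (hypothetical tuning values).
    """
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    # Only metrics that have a weight contribute to the cost.
    return sum(weights[name] * metrics[name] for name in weights)
```

A caller might weigh latency against traffic density, for example placement_cost({"latency": 4, "traffic_density": 0.5}, {"latency": 2.0, "traffic_density": 10.0}), and compare the resulting values across alternative placements.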

FIG. 11 clarifies the assignment 1100 of multiple buffers or buffer portions to a single physical memory unit. In this case, a dataflow graph may include a first buffer portion 1115 and a second buffer portion 1125, which need to be placed in (assigned to) physical memory units in a CGR array, or in physical memory units in a contiguous area in the CGR array. Combining two or more buffer portions in a single physical memory unit may allow a higher performance operation of the CGR array. Combination may be possible if the sum of the sizes of first buffer portion 1115 and second buffer portion 1125 is less than the size of a physical memory unit.

In an implementation, first buffer portion 1115 and second buffer portion 1125 have been temporarily assigned to first physical memory unit 1110 and second physical memory unit 1120. First physical memory unit 1110 is located near second physical memory unit 1120, i.e., within the contiguous area. In this situation, it may be beneficial to re-assign first buffer portion 1115 and second buffer portion 1125 to a third physical memory unit 1130 (in the contiguous area) to replace first physical memory unit 1110 and second physical memory unit 1120. Third physical memory unit 1130 may be first physical memory unit 1110, or second physical memory unit 1120, or yet another memory unit within the contiguous area.

FIG. 12 illustrates an example 1200 of buffer fusion in case of large buffers. Parts of an array of CGR units are mapped to a first buffer 1210 and a second buffer 1220. Both buffers span multiple memory units. For example, first buffer 1210 ends in memory unit 1212, where it is assigned to partial memory area 1218. Second buffer 1220 ends in memory unit 1222, where it is assigned to partial memory area 1228. The combined size of partial memory area 1218 and partial memory area 1228 is smaller than the size of memory unit 1212 or memory unit 1222, so a memory unit can be freed by moving the tail end of second buffer 1220 from memory unit 1222 into memory unit 1212, where it can occupy partial memory area 1219, as drawn. Alternatively, an implementation may move the tail end of first buffer 1210 from memory unit 1212 into memory unit 1222. In either case, a memory unit can be freed up, which may allow for more efficient use of the CGR array.
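The saving from fusing tail ends can be checked with simple arithmetic. The sketch below counts the memory units needed with and without tail fusion; it assumes, for illustration only, that all memory units have the same capacity and that each buffer fills its units completely except for the last one. The function names are hypothetical.

```python
def tail_size(buffer_size, unit_capacity):
    """Amount of the buffer that sits in its last, possibly partial, memory unit."""
    remainder = buffer_size % unit_capacity
    return remainder if remainder else unit_capacity


def units_needed(buffer_sizes, unit_capacity, fuse_tails):
    """Memory units used by the buffers, with or without fusing their tail ends."""
    # Ceiling division without floating point.
    separate = sum(-(-size // unit_capacity) for size in buffer_sizes)
    if not fuse_tails:
        return separate
    tails = [tail_size(size, unit_capacity) for size in buffer_sizes]
    # Following the text, fuse only when the combined tails are smaller than one unit.
    if sum(tails) < unit_capacity:
        return separate - (len(buffer_sizes) - 1)
    return separate
```

With a capacity of 1000 and buffers of sizes 2300 and 1200, the tails are 300 and 200, so fusing them reduces the count from five memory units to four.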

FIG. 13 illustrates an example method 1300 of buffer optimization. An array of CGR units includes at least a first physical memory unit and a second physical memory unit, both located within a contiguous area. The method determines an optimum placement for (a portion or all of) a first buffer and (a portion or all of) a second buffer, whose cumulative size is less than the size of a physical memory unit. Although this implementation gives an example for (portions of) two buffers, other implementations may optimize for any number of (portions of) buffers. Method 1300 includes the following steps:

Step 1310—temporarily assigning the first buffer portion to the first physical memory unit and the second buffer portion to the second physical memory unit. The first buffer portion includes a part or all of the first buffer. The second buffer portion includes a part or all of the second buffer.

Step 1320—temporarily routing connections for data and dataflow control information between the first physical memory unit, the second physical memory unit, and previously mapped CGR units within the contiguous area, and calculating a first cost CF1-2 related to the temporary assignment of the first buffer to the first physical memory unit and the second buffer to the second physical memory unit. An implementation may calculate first cost CF1-2 within the contiguous area. To calculate the cost, the implementation uses a cost function that can weigh one or more parameters, such as parameters related to the weight and/or bandwidth of edges connecting the buffer nodes, the traffic density of available routing channels between the physical memory units and their neighbors, traffic latency in those channels, the efficiency of semiconductor die area usage, energy use, and/or any other parameters that affect the performance of executing a high-level program with user algorithms and functions in the CGR array.

Step 1330—determining one or more candidates for a third physical memory unit within the contiguous area and selecting an initial candidate third physical memory unit. The size of each candidate third physical memory unit is larger than the accumulated size of the buffer portions (in this example, the size of the first buffer portion plus the size of the second buffer portion). A list of candidates may include the first physical memory unit and the second physical memory unit.

Step 1340—initializing a best cost CFB to equal CF1-2, and initializing a best candidate. Some implementations may initialize the best candidate to “none”. Other implementations may initialize the best candidate to the first physical memory unit. Further implementations may have a first best candidate associated with the first buffer and initialized to the first physical memory unit, and a second best candidate associated with the second buffer and initialized to the second physical memory unit.

Step 1350—temporarily reassigning the first buffer and the second buffer to the candidate third physical memory unit, temporarily routing connections for data and flow control information between the candidate third physical memory unit and previously mapped CGR units within the contiguous area, and calculating a second cost CF3 associated with the CGR units in the contiguous area.

Step 1360—determining if the second cost CF3 is better than the best cost CFB. If yes, proceeding to Step 1370. If no, proceeding to Step 1380. In some implementations, a better cost is a lower cost. In other implementations, a better cost is a higher cost.

Step 1370—updating the best cost CFB to equal the second cost CF3, and updating the best candidate to equal the candidate third physical memory unit.

Step 1380—determining if another candidate third physical memory unit is available. If yes, updating the candidate third physical memory unit and returning to Step 1350. If no, proceeding to Step 1390.

Step 1390—assigning the first buffer and the second buffer to the best candidate (physical memory unit). If no best candidate has been found or if the best cost CFB equals CF1-2, the implementation assigns the first buffer to the first physical memory unit and the second buffer to the second physical memory unit. Method 1300 creates and stores a configuration file that assigns the first buffer and the second buffer to the best candidate physical memory unit.
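The steps of method 1300 can be sketched as a simple search loop. The sketch below is illustrative only and makes several assumptions: the names `optimize_pair`, `route_and_cost`, `positions`, and `toy_cost` are hypothetical, routing and cost calculation are folded into a single caller-supplied function, the best candidate is initialized to "none", and a lower cost is treated as a better cost.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Assignment = Dict[str, str]  # buffer portion name -> physical memory unit name


def optimize_pair(
    first_unit: str,
    second_unit: str,
    candidates: Iterable[str],
    route_and_cost: Callable[[Assignment], float],
) -> Tuple[Assignment, float]:
    # Steps 1310-1320: temporary split assignment and its cost CF1-2.
    baseline = {"buf1": first_unit, "buf2": second_unit}
    cf_1_2 = route_and_cost(baseline)

    # Step 1340: best cost starts at CF1-2, best candidate starts as "none".
    best_cost = cf_1_2
    best_candidate: Optional[str] = None

    # Steps 1350-1380: try each candidate third physical memory unit in turn.
    for unit in candidates:
        cf_3 = route_and_cost({"buf1": unit, "buf2": unit})
        if cf_3 < best_cost:  # Step 1360 (assumes lower is better)
            best_cost, best_candidate = cf_3, unit  # Step 1370

    # Step 1390: fall back to the split assignment if nothing improved on it.
    if best_candidate is None:
        return baseline, cf_1_2
    return {"buf1": best_candidate, "buf2": best_candidate}, best_cost


# Toy cost (hypothetical): Manhattan wire length to one consumer, plus one per unit used.
positions = {"MU_A": (0, 0), "MU_B": (4, 0), "MU_C": (2, 1)}
consumer = (2, 0)


def toy_cost(assignment: Assignment) -> int:
    wire = sum(
        abs(positions[unit][0] - consumer[0]) + abs(positions[unit][1] - consumer[1])
        for unit in assignment.values()
    )
    return wire + len(set(assignment.values()))


fused, fused_cost = optimize_pair("MU_A", "MU_B", ["MU_A", "MU_B", "MU_C"], toy_cost)
```

With the toy cost, the split assignment costs 6, fusing into MU_A or MU_B costs 5, and fusing into MU_C costs 3, so both buffer portions end up in MU_C.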

Some implementations may use variations of method 1300, for example leaving out some steps, or adjusting steps for working with both the first best candidate and the second best candidate. Some implementations may perform method 1300 as a step following the PNR activities in PNR 625, whereas other implementations may perform method 1300 as an integral part of the PNR activities executed in PNR 625.

The costs CF1-2 and CF3 in method 1300 may be calculated using the same cost function used in other functions related to PNR 625.
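One simple form such a cost function could take is a weighted sum over the routing parameters named in Step 1320. The sketch below is an assumption for illustration: the field names, the weight values, and the choice of a linear combination are hypothetical and are not specified by this disclosure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RouteMetrics:
    """Hypothetical per-assignment measurements gathered after temporary routing."""
    bandwidth_demand: float  # weight and/or bandwidth of edges at the buffer nodes
    channel_density: float   # traffic density of the routing channels used
    latency: float           # traffic latency in those channels
    energy: float            # estimated energy use


# Hypothetical weights; a real flow would tune these for its own objectives.
WEIGHTS: Dict[str, float] = {
    "bandwidth_demand": 1.0,
    "channel_density": 2.0,
    "latency": 0.5,
    "energy": 0.25,
}


def cost(metrics: RouteMetrics, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the metrics; a lower value is treated as a better cost."""
    return sum(weight * getattr(metrics, name) for name, weight in weights.items())


example_cost = cost(RouteMetrics(bandwidth_demand=4.0, channel_density=1.0,
                                 latency=8.0, energy=4.0))
```

Using one function for both CF1-2 and CF3 keeps the two costs directly comparable, which is what Step 1360 relies on.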

The disclosed technology includes method 1300 as a computer-implemented method. It also includes a non-transitory computer readable storage medium storing computer program instructions that, when executed on a processor, implement method 1300. It further includes a system with one or more processors coupled to a memory that is loaded with computer program instructions that when executed on the one or more processors, implement actions included in method 1300.

FIG. 14 illustrates an example method 1400 for buffer assignment within a contiguous area in an array of CGR units. CGR units may, for example, include memory units and compute units. The array of CGR units may include multiple compute units and memory units. The buffers are assigned to memory units (or PMUs) in a way that allows for optimization of a cost that is relevant for a high-level application being executed on the array of CGR units. Multiple buffers may be assigned to a single memory unit if the memory unit has sufficient space, and such an assignment is beneficial for the high-level application. Method 1400 includes the following steps.

Step 1410—assigning logical CGR units other than memory units or buffers to physical CGR units in the contiguous area in the CGR array. The logical CGR units and physical CGR units may include compute units, as well as CUs, AGCUs, and other CGR units.

Step 1412—determining a first list, including buffer portions to be placed in the contiguous area, i.e., to be assigned to memory units in the contiguous area. A buffer portion includes at least a part of a buffer. Typically, each buffer portion is small enough to be assigned to a physical memory unit. For example, if a buffer requires 2.5 times the memory available in a physical memory unit, the buffer may be split into three portions. Two portions may be small enough to exactly fit in one physical memory unit, and the third portion may fit in one half of a physical memory unit. Step 1412 further includes determining a second list that includes candidate physical memory units within the contiguous area.

Step 1420—iterating buffer portions from the first list by selecting a first iterated buffer portion and sequentially selecting next iterated buffer portions. For each iterated buffer portion, method 1400 performs Step 1422 and Step 1424.

Step 1422—for the iterated buffer portion, determining a third list, which lists candidate physical memory units from the second list that have sufficient memory for the iterated buffer portion, along with respective costs associated with assigning the iterated buffer portion to the respective candidate physical memory unit. To calculate the associated cost, an implementation may ignore the impact of any other buffer portions in the first list, for example buffer portions that have not yet been assigned to candidate physical memory units. To calculate the associated cost, the implementation temporarily maps the iterated buffer portion to the candidate physical memory unit. It may allocate physical communication channels to connections between the iterated buffer portion and previously mapped CGR units. The implementation uses a cost function that can weigh one or more parameters, such as parameters related to the weight and/or bandwidth of edges connecting the buffer nodes, the traffic density of available routing channels between the physical memory units and their neighbors, traffic latency in those channels, the efficiency of semiconductor die area usage, energy use, and/or any other parameters that affect the performance of executing a high-level program with user algorithms and functions in the CGR array.

Step 1424—ordering the third list based on the associated costs. After ordering, the third list may start with a candidate physical memory unit that has the lowest (or best) cost and end with a candidate physical memory unit that has the highest (or worst) cost. The iteration determines the best cost for the iterated buffer portion, e.g., the lowest cost related to a candidate physical memory unit on the third list.

Step 1430—ordering the first list based on the best cost of buffer portions in the first list. For example, the first list may be ordered based on the costs of memory units with which the respective third lists start (since each buffer portion on the first list has a third list with memory units and costs). Buffer portions that are part of the same buffer may be grouped, and the first list order may be based on their collective best cost. For example, if the first list is based on three buffers B1, B2, and B3, with respective sizes of 1, 2.5, and 1 physical memory unit, then the first list may include 5 buffer portions: B1-1, B2-1, B2-2, B2-3, and B3-1, with respective sizes of 1, 1, 1, 0.5, and 1 physical memory unit. If their respective associated costs are (for example) 18, 6, 6, 2, and 12, then buffer B2 has a cumulative associated cost of 6+6+2=14, which lies between the cost of 12 for B3 and the cost of 18 for B1, and the first list may be reordered to B3-1, B2-1, B2-2, B2-3, and B1-1 with associated costs of 12, 6, 6, 2, and 18.

Step 1440—iterating buffer portions from the first list by selecting a first iterated buffer portion with an associated third list and sequentially selecting a next iterated buffer portion with an associated third list. For each iterated buffer portion, method 1400 performs Step 1450.

Step 1450—iterating candidate memory units from the associated third list by sequentially selecting an iterated candidate memory unit from the associated third list. An iteration includes Step 1452, and Step 1454 or Step 1456.

Step 1452—determining if the iterated candidate physical memory unit is available, i.e., if it has not been mapped before, or if it has sufficient unmapped memory for the iterated buffer portion. Some implementations may assign an iterated buffer portion only to a whole iterated candidate memory unit, whereas other implementations may assign a sufficiently small iterated buffer portion to a partial iterated candidate memory unit, leaving room for other sufficiently small buffer portions to be assigned.

Step 1454—upon determining that the iterated candidate physical memory unit is available, assigning the iterated buffer portion to the iterated candidate physical memory unit. The implementation stops iterating candidate memory units from the associated third list, and proceeds to a next iteration in Step 1440. If no next iteration is available, method 1400 may create a configuration file that includes the assigned buffers in the first list and store the configuration file in a non-transitory computer readable storage medium, or method 1400 may end.

Step 1456—upon determining that the iterated candidate physical memory unit is not available, proceeding with a next iteration in Step 1450. If no next iteration is available, an implementation may dispatch an error, or a warning, and/or may mark the iterated buffer portion as unmapped.
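The buffer-related steps of method 1400 (Step 1412 through Step 1456) can be sketched as a greedy assignment. The sketch is illustrative and rests on assumptions: the function names `split_buffer` and `assign_portions`, the "B2-3" naming of portions, the caller-supplied `cost_of` function, and the policy of allowing partially filled memory units are all hypothetical. The usage example reuses the sizes and costs from the example in Step 1430.

```python
from typing import Callable, Dict, List, Optional, Tuple

Portion = Tuple[str, float]  # (portion name such as "B2-3", size in memory units)


def split_buffer(name: str, size: float, capacity: float = 1.0) -> List[Portion]:
    """Step 1412: split a buffer into portions that each fit one memory unit."""
    portions, remaining, index = [], size, 1
    while remaining > 0:
        chunk = min(remaining, capacity)
        portions.append((f"{name}-{index}", chunk))
        remaining -= chunk
        index += 1
    return portions


def buffer_of(portion_name: str) -> str:
    return portion_name.rsplit("-", 1)[0]


def assign_portions(
    portions: List[Portion],
    units: Dict[str, float],
    cost_of: Callable[[str, str], float],
) -> Tuple[List[str], Dict[str, Optional[str]]]:
    # Steps 1420-1424: one cost-sorted candidate list (the "third list") per portion.
    ranked = {
        name: sorted((cost_of(name, unit), unit)
                     for unit, free in units.items() if free >= size)
        for name, size in portions
    }
    best = {name: (cands[0][0] if cands else float("inf"))
            for name, cands in ranked.items()}

    # Step 1430: group portions of one buffer and order by collective best cost.
    collective: Dict[str, float] = {}
    for name, _ in portions:
        key = buffer_of(name)
        collective[key] = collective.get(key, 0.0) + best[name]
    ordered = sorted(portions, key=lambda p: collective[buffer_of(p[0])])

    # Steps 1440-1456: give each portion its cheapest candidate that still has room.
    free = dict(units)
    placement: Dict[str, Optional[str]] = {}
    for name, size in ordered:
        placement[name] = None  # stays None (unmapped) if every candidate is full
        for _, unit in ranked[name]:
            if free[unit] >= size:      # Step 1452: is the candidate available?
                placement[name] = unit  # Step 1454: assign and stop iterating
                free[unit] -= size
                break
    return [name for name, _ in ordered], placement


# Buffers B1, B2, B3 with sizes 1, 2.5, and 1, and best costs 18, 6, 6, 2, and 12.
portions = split_buffer("B1", 1.0) + split_buffer("B2", 2.5) + split_buffer("B3", 1.0)
units = {f"MU{i}": 1.0 for i in range(5)}
base = {"B1-1": 18, "B2-1": 6, "B2-2": 6, "B2-3": 2, "B3-1": 12}
order, placement = assign_portions(portions, units, lambda p, u: base[p] + int(u[2:]))
```

Because the sort is stable, portions of the same buffer keep their original relative order, so the resulting order is B3-1, B2-1, B2-2, B2-3, B1-1. Portion B2-3 fills only half of its memory unit, leaving room for another sufficiently small buffer portion.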

The disclosed technology includes method 1400 as a computer-implemented method. It also includes a non-transitory computer readable storage medium storing computer program instructions that, when executed on a processor, implement method 1400. It further includes a system with one or more processors coupled to a memory that is loaded with computer program instructions that when executed on the one or more processors, implement actions included in method 1400.

FIG. 15 illustrates another example method 1500 of buffer optimization. Method 1500 may be limited to buffers and/or buffer portions in a contiguous area of an array of CGR units (a CGR array). A buffer portion may be part or all of a buffer. CGR units may, for example, include memory units and compute units. In some implementations, buffer portions have been provisionally assigned to logical memory units (or logical PMUs). That is, the buffer portions may have been assigned to logical memory units, or physically realizable memory units, prior to their assignment to physical memory units. In other implementations, the buffers have been assigned to physical memory units, but optimization of usage is needed. Method 1500 includes the following steps.

Step 1510—finding memory units that are partially filled, i.e., partially mapped to one or more buffer portions. An implementation may list memory units that are planned for or located within a contiguous area of the CGR array.

Step 1520—selecting a set of two or more partially filled memory units.

Step 1530—determining if fusion of buffers in the memory units in the set of partially filled memory units is possible and desirable. An implementation may consider many parameters to determine if fusion is possible and desirable. For example, the implementation may determine if the combined memory allocation is no more than the size of one memory unit; the implementation may determine if an estimated distance within the set of memory units is less than a predetermined or calculated threshold; the implementation may determine if an estimated distance between a moved (reassigned) part of a buffer and the remainder of that buffer is less than a predetermined or calculated threshold; and the implementation may determine if fusion isn't likely to cause network congestion.

In response to determining that fusion of buffers in the memory units in the set of partially filled memory units is not possible or not desirable, method 1500 skips Step 1540.

Step 1540—in response to determining that fusion of buffers in the memory units in the set of partially filled memory units is possible and desirable, fusing the buffers. The implementation may fuse the buffers by changing an allocation of at least a part of a buffer assigned to a first memory unit (e.g., partial memory area 1228 in memory unit 1222 in FIG. 12) in the set of partially filled memory units from the first memory unit (e.g., memory unit 1222) to a second memory unit (e.g., memory unit 1212) in the set of partially filled memory units.

Step 1550—if a next set of partially filled memory units is available, an implementation may repeat Step 1530 and Step 1540 for the next set of partially filled memory units. Otherwise, method 1500 may create a configuration file that includes the changed allocation and store the configuration file in a computer-readable memory.
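Method 1500 can be sketched for sets of two memory units as follows. This is an illustrative sketch under stated assumptions: the name `fuse_partially_filled`, the use of Manhattan distance between grid positions as the estimated distance, and the restriction to pairs are hypothetical. The congestion test mentioned in Step 1530 is omitted.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def fuse_partially_filled(
    used: Dict[str, int],
    positions: Dict[str, Tuple[int, int]],
    capacity: int,
    max_distance: int,
) -> Tuple[List[Tuple[str, str]], Dict[str, int]]:
    """Return (moves, new_usage), where each move is (source unit, target unit)."""
    used = dict(used)  # work on a copy so the caller's mapping is unchanged
    moves: List[Tuple[str, str]] = []

    # Step 1510: memory units that are partially filled.
    partial = [unit for unit, amount in used.items() if 0 < amount < capacity]

    # Steps 1520 and 1550: visit each set of two partially filled units.
    for target, source in combinations(partial, 2):
        if used[target] == 0 or used[source] == 0:
            continue  # one of the two was already emptied by an earlier fusion

        # Step 1530: fusion must fit in one unit and the units must be close.
        fits = used[target] + used[source] <= capacity
        (tx, ty), (sx, sy) = positions[target], positions[source]
        close = abs(tx - sx) + abs(ty - sy) < max_distance

        if fits and close:
            # Step 1540: change the allocation from the source unit to the target.
            used[target] += used[source]
            used[source] = 0
            moves.append((source, target))
    return moves, used


usage = {"MU1": 40, "MU2": 30, "MU3": 100, "MU4": 50}
grid = {"MU1": (0, 0), "MU2": (1, 0), "MU3": (2, 0), "MU4": (5, 5)}
moves, new_usage = fuse_partially_filled(usage, grid, capacity=100, max_distance=3)
```

In the usage example, MU2 is emptied into its neighbor MU1, MU3 is already full and is never considered, and MU4 is left alone because its contents no longer fit alongside those of MU1.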

Considerations

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.

Although the description has been described with respect to particular implementations thereof, these particular implementations are merely illustrative, and not restrictive. The description may reference specific structural implementations and methods, and does not intend to limit the technology to the specifically disclosed implementations and methods. The technology may be practiced using other features, elements, methods and implementations. Implementations are described to illustrate the present technology, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art recognize a variety of equivalent variations on the description above.

All features disclosed in the specification, including the claims, abstract, and drawings, and all the steps in any method or process disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in the specification, including the claims, abstract, and drawings, can be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise.

Although the description has been described with respect to particular implementations thereof, these particular implementations are merely illustrative, and not restrictive. For instance, many of the operations can be implemented on a printed circuit board (PCB) using off-the-shelf devices, in a System-on-Chip (SoC), application-specific integrated circuit (ASIC), programmable processor, or in a programmable logic device such as a field-programmable gate array (FPGA), obviating a need for at least part of the dedicated hardware. Implementations may be as a single chip, or as a multi-chip module (MCM) packaging multiple semiconductor dies in a single package. All such variations and modifications are to be considered within the ambit of the present invention, the nature of which is to be determined from the foregoing description.

Any suitable technology for manufacturing electronic devices can be used to implement the circuits of particular implementations, including CMOS, FinFET, BiCMOS, bipolar, JFET, MOS, NMOS, PMOS, HBT, MESFET, etc. Different semiconductor materials can be employed, such as silicon, germanium, SiGe, GaAs, InP, GaN, SiC, graphene, etc. Circuits may have single-ended or differential inputs, and single-ended or differential outputs. Terminals to circuits may function as inputs, outputs, both, or be in a high-impedance state, or they may function to receive supply power, a ground reference, a reference voltage, a reference current, or other. Although the physical processing of signals may be presented in a specific order, this order may be changed in different particular implementations. In some particular implementations, multiple elements, devices, or circuits shown as sequential in this specification can be operating in parallel.

Any suitable programming language can be used to implement the routines of particular implementations including Spatial, Python, C++, C, Java, JavaScript, compiled languages, interpreted languages and scripts, assembly language, machine language, etc. Different programming techniques can be employed such as procedural or object oriented. Methods embodied in routines can execute on a single processor device or on a multiple processor system. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular implementations. In some particular implementations, multiple steps shown as sequential in this specification can be performed at the same time.

Particular implementations may be implemented in a tangible, non-transitory computer-readable storage medium for use by or in connection with the instruction execution system, apparatus, board, or device. Particular implementations can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular implementations. For example, a tangible non-transitory medium such as a hardware storage device can be used to store the control logic, which can include executable instructions.

Particular implementations may be implemented by using a programmed general-purpose digital computer, application-specific integrated circuits, programmable logic devices, field-programmable gate arrays, optical, chemical, biological, quantum or nanoengineered systems, etc. Other components and mechanisms may be used. In general, the functions of particular implementations can be achieved by any means as is known in the art. Distributed, networked systems, components, and/or circuits can be used. Cloud computing or cloud services can be employed. Communication, or transfer, of data may be wired, wireless, or by any other means.

It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application.

Thus, while particular implementations have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular implementations will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.

Claims

1. A computer-implemented method for optimizing buffer allocation within a contiguous area in an array of reconfigurable units including at least a first physical memory unit and a second physical memory unit, the method comprising:

temporarily assigning a first buffer portion comprising at least a part of a first buffer to the first physical memory unit and a second buffer portion comprising at least a part of a second buffer to the second physical memory unit;
temporarily routing connections between the first physical memory unit, the second physical memory unit, and previously mapped reconfigurable units within the contiguous area;
using a cost function to calculate a first cost related to assigning the first buffer portion and the second buffer portion to the first physical memory unit and the second physical memory unit;
determining one or more candidate third physical memory units within the contiguous area, wherein the size of each candidate third physical memory unit is larger than the size of the first buffer portion plus the size of the second buffer portion;
initializing a best cost and initializing a best candidate physical memory unit;
for each candidate third physical memory unit: temporarily reassigning the first buffer portion and the second buffer portion to the candidate third physical memory unit, temporarily routing connections between the candidate third physical memory unit and previously mapped reconfigurable units within the contiguous area, and using the cost function to calculate a second cost related to assigning the first buffer portion and the second buffer portion to the candidate third physical memory unit; determining if the second cost is better than the best cost; updating, in response to determining that the second cost is better than the best cost, the best cost to equal the second cost, and the best candidate physical memory unit to equal the candidate third physical memory unit; and
creating a configuration file that assigns both the first buffer portion and the second buffer portion to the best candidate physical memory unit and storing the configuration file in a non-transitory computer readable storage medium.

2. The computer-implemented method of claim 1, wherein:

the array of reconfigurable units is an array of coarse-grained reconfigurable (CGR) units.

3. The computer-implemented method of claim 1, wherein:

the candidate third physical memory units include the first physical memory unit and/or the second physical memory unit.

4. The computer-implemented method of claim 1, wherein:

determining if the second cost is better than the best cost comprises determining if the second cost is less than the best cost.

5. The computer-implemented method of claim 1, wherein:

parameters for the cost function include at least one of: a weight of an edge connected to a buffer node, a bandwidth of an edge connected to a buffer node, a traffic density of an available routing channel between a physical memory unit and a neighbor, a traffic latency in an available routing channel between a physical memory unit and a neighbor, a semiconductor die area usage, or an energy usage.

6. The computer-implemented method of claim 1, wherein:

the array of reconfigurable units is included in a CGR architecture (CGRA) processor;
a reconfigurable unit comprises a physical memory unit;
a physical memory unit is a pattern memory unit (PMU); and
the best cost is initialized to equal the first cost.

7. A non-transitory computer readable storage medium storing computer program instructions to optimize buffer allocation in a contiguous area of an array of reconfigurable units including at least a first physical memory unit and a second physical memory unit, wherein the computer program instructions, when executed on a processor, implement a method comprising:

temporarily assigning a first buffer portion comprising at least a part of a first buffer to the first physical memory unit and a second buffer portion comprising at least a part of a second buffer to the second physical memory unit;
temporarily routing connections between the first physical memory unit, the second physical memory unit, and previously mapped reconfigurable units within the contiguous area;
using a cost function to calculate a first cost related to assigning the first buffer portion and the second buffer portion to the first physical memory unit and the second physical memory unit;
determining one or more candidate third physical memory units within the contiguous area, wherein the size of each candidate third physical memory unit is larger than the size of the first buffer portion plus the size of the second buffer portion;
initializing a best cost and initializing a best candidate physical memory unit;
for each candidate third physical memory unit: temporarily reassigning the first buffer portion and the second buffer portion to the candidate third physical memory unit, temporarily routing connections between the candidate third physical memory unit and previously mapped reconfigurable units within the contiguous area, and using the cost function to calculate a second cost related to assigning the first buffer portion and the second buffer portion to the candidate third physical memory unit; determining if the second cost is better than the best cost; updating, in response to determining that the second cost is better than the best cost, the best cost to equal the second cost, and the best candidate physical memory unit to equal the candidate third physical memory unit; and assigning both the first buffer portion and the second buffer portion to the best candidate physical memory unit.

8. The non-transitory computer readable storage medium of claim 7, wherein the array of reconfigurable units is an array of coarse-grained reconfigurable (CGR) units.

9. A system including one or more processors coupled to a memory, the memory loaded with computer program instructions to optimize buffer allocation in a contiguous area of an array of reconfigurable units including at least a first physical memory unit and a second physical memory unit, wherein the computer program instructions, when executed on the one or more processors, implement actions comprising:

temporarily assigning a first buffer portion comprising at least a part of a first buffer to the first physical memory unit and a second buffer portion comprising at least a part of a second buffer to the second physical memory unit;
temporarily routing connections between the first physical memory unit, the second physical memory unit, and previously mapped reconfigurable units within the contiguous area;
using a cost function to calculate a first cost related to assigning the first buffer portion and the second buffer portion to the first physical memory unit and the second physical memory unit;
determining one or more candidate third physical memory units within the contiguous area, wherein the size of each candidate third physical memory unit is larger than the size of the first buffer portion plus the size of the second buffer portion;
initializing a best cost and initializing a best candidate physical memory unit;
for each candidate third physical memory unit: temporarily reassigning the first buffer portion and the second buffer portion to the candidate third physical memory unit, temporarily routing connections between the candidate third physical memory unit and previously mapped reconfigurable units within the contiguous area, and using the cost function to calculate a second cost related to assigning the first buffer portion and the second buffer portion to the candidate third physical memory unit; determining if the second cost is better than the best cost; updating, in response to determining that the second cost is better than the best cost, the best cost to equal the second cost, and the best candidate physical memory unit to equal the candidate third physical memory unit; and assigning both the first buffer portion and the second buffer portion to the best candidate physical memory unit.

10. The system of claim 9, wherein the array of reconfigurable units is an array of coarse-grained reconfigurable (CGR) units.

Patent History
Publication number: 20230409233
Type: Application
Filed: Nov 15, 2022
Publication Date: Dec 21, 2023
Applicant: SambaNova Systems, Inc. (Palo Alto, CA)
Inventors: Nathan Francis SHEELEY (Austin, TX), Raghu PRABHAKAR (San Jose, CA), David Alan KOEPLINGER (Egg Harbor, NJ)
Application Number: 17/987,628
Classifications
International Classification: G06F 3/06 (20060101);