# PROCESSOR AND METHOD FOR OUTER PRODUCT ACCUMULATE OPERATIONS

A processor and method are disclosed for performing outer product and outer product accumulation operations on vector operands requiring large numbers of multiplies and accumulations.

**Description**

**CROSS REFERENCES TO RELATED APPLICATIONS**

This application is a Continuation of U.S. patent application Ser. No. 15/224,176, filed Jul. 29, 2016, the disclosure of which is hereby incorporated by reference in its entirety for all purposes.

**BACKGROUND OF THE INVENTION**

This invention relates to computer technology, and to a processor and methods for performing outer product and outer product accumulate operations.

Communications products require increased computational performance to process digital signals in software on a real time basis. Increases in performance in the past twenty years have come through improvements in transistor technology and processor design. Transistor counts have doubled in accordance with Moore's law about every two years, increasing thousand fold from a few million to a few billion transistors per chip. Processor design has improved peak performance per instruction by architectural innovations that enabled effectively doubling datapath width about every four years, increasing from 32 bits (e.g. Intel's Pentium) to 1024 bits (e.g. Qualcomm's Hexagon HVX) over about the past twenty years.

Digital communications typically rely on linear algorithms that multiply and add data with 32 bits of precision or less. In fact, digital video and radio processing typically operate on 16 bit or even 8 bit data. As datapath width has increased far beyond these data widths, substantially peak usage has been maintained by partitioning operands and datapaths using a variety of methods, treated extensively, for example, in our commonly assigned U.S. Pat. Nos. 5,742,840; 5,794,060; 5,794,061; 5,809,321; and 5,822,603.

These patents describe systems and methods for enhancing the utilization of a processor by adding classes of instructions. These classes of instructions use registers as data path sources, partition the operands into symbols of a specified size, perform operations in parallel, catenate the results and place the catenated results into a register. These patents, as well as other commonly assigned patents, describe processors optimized for processing and transmitting data streams using significant parallelism.

In our earlier U.S. Pat. No. 5,953,241, we describe group multiply and sum operations (column 4 therein) in which each one of four multiplier operands a, b, c, and d is multiplied by a corresponding one of four multiplicand operands e, f, g, and h to produce products a*e, b*f, c*g, and d*h. See, e.g.

Others have developed a processor in which a vector-by-scalar multiply reduction is performed. See, e.g., the Qualcomm HVX architecture with SIMD extensions. This processor allows a group of four vector operands to be multiplied by one scalar operand with the four results being summed. See, e.g.

Emerging applications such as 5G communications, virtual reality, and neural networks, however, create an appetite for digital processing many orders of magnitude faster and more power efficient than these technologies. Moore's law is slowing as gate widths below 10 nm span fewer than 200 silicon lattice spacings. Advances in processor design are becoming more essential to accommodate the power performance needs of these applications.

Existing processor datapaths typically consume a small fraction of total processor power and area, so doubling their width doubles peak performance more efficiently than doubling the number of processor cores. There are practical constraints, however, on the number of doublings of the width of registers. The register complex typically comprises the central traffic interchange of the processor, operating at high clock rates. These registers have many input and output ports tightly coupled through a bypass network to multiple execution units. Wider execution units must avoid bottlenecks and sustain a large fraction of peak performance on targeted applications. These processor designs and methods must be capable of sustaining a large fraction of peak performance for algorithms needed by emerging applications such as 5G communications, virtual reality, and neural networks, yet at the same time be highly efficient in area and power. Thus, there is a need for processor designs and methods that enable orders of magnitude increases in peak performance without greatly complicating the register complex. In particular many practical applications for such processors, e.g. machine learning and image processing, would benefit from a processor capable of performing an outer product. In an outer product each element of one vector is multiplied by each element of another vector. For example, given vectors U and V:

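$$U = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix}, \qquad V = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

The outer product of vectors U and V is:

$$U \otimes V = U V^{T} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}$$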

**SUMMARY OF THE INVENTION**

Our invention provides a processor and method for calculating outer products and serially accumulating each product. In a preferred embodiment of the method of the invention, the processor has a register file with a bit width of r bits, and an array of multipliers arranged in rows and columns. Each multiplier has an accumulator associated with it within the array. To perform an outer product of a vector multiplier operand and a vector multiplicand operand the processor loads a multiplier operand and a multiplicand operand into each of the multipliers resulting in the multiplier at location [i, j] in the array receiving multiplier operand i and multiplicand operand j. At each multiplier a first multiplication of the multiplier operand i with the multiplicand operand j is performed to produce a first multiplication result i*j which is wider than r bits. The first multiplication result is then provided to the accumulator associated with that multiplier, where it is added to any previous multiplication result stored in the accumulator. The result is a multiplication of every element of the vector multiplier operand with every element of the vector multiplicand operand. When all of the desired multiplications and accumulations are complete, the results are copied out of the array.

In a preferred embodiment the processor of the invention has a register file with a bit width of r bits. Each of the plurality n of multiplier operands has a bit width of b bits, providing an aggregate width of r bits where r=n*b, and each of the plurality n of multiplicand operands has a bit width of b bits, also providing an aggregate width of r bits where r=n*b. An array of multipliers is arranged in rows and columns, with each column coupled to receive one multiplier operand, and each row coupled to receive one multiplicand operand. Each multiplier thus receives a multiplier operand and a multiplicand operand. The multipliers in the processor multiply the operands to provide a plurality n^{2 }of multiplication results having an aggregate bit width greater than r bits. The processor also includes a corresponding array of accumulators arranged in rows and columns, each accumulator being coupled to a corresponding multiplier. The processor uses the accumulators to add sequential multiplications from each multiplier. When the desired operations are complete, the results from each accumulator are shifted out of the array.

The invention also includes implementation circuitry for each multiplier and accumulator “tile” in the array, as well as techniques and circuitry for loading data into and shifting data out of the array.

**BRIEF DESCRIPTION OF THE DRAWINGS**

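(Figure descriptions are not reproduced here. The drawings referenced include FIGS. 6*a* to 6*c*, FIGS. 7*a* to 7*d*, and FIGS. 8*a* and 8*b*.)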

**DETAILED DESCRIPTION**

When the size of an execution unit result is constrained, it can limit the amount of computation that reasonably can be performed in response to a single instruction. As a consequence, algorithms are implemented in a series of single instruction steps in which all intermediate results can be represented within the constraints. By eliminating this limitation, however, instruction sets can be developed in which a larger portion of an algorithm is implemented as a single instruction. If at least some of these intermediate results are not required to be retained upon completion of the larger component of an algorithm, a processor will provide improved performance and reduced power consumption by not storing and retrieving intermediate results from the general register file. When the intermediate results are not retained in the general register file, processor instruction sets and implemented algorithms are also not constrained by the size of the general register file.

This invention is particularly related to multiplication and addition operations. For example, in image processing and deep learning applications, large numbers of multiplications and additions are often required. With conventional processors these operations are undesirably slow, restricting the usefulness of the particular applications. The invention described here enables efficiently performing a particular pattern of multiplications and additions known as an outer product accumulation. The result of applying an outer product operation to a multiplier and a multiplicand, each a one-dimensional vector, is a two-dimensional matrix. An outer product has two important properties with respect to the invention described here. First, all possible multiplications between the multiplier and multiplicand operands are performed within an outer product. Second, none of these results are added together. Thus the computation of an outer product can be performed in completely parallel fashion. The outer product results are each serially accumulated to calculate a sum of products. These sums of products form a result, an accumulation of outer products. At this point this final result may be further processed, for example, by rounding, shifting, truncating or performing unary operations on these results as they are read out of the outer product accumulation array.

Array Structure

**10** includes an arbitrarily large array **15** of multipliers **16**. Each multiplier **16** is coupled to receive a multiplier operand and a multiplicand operand from associated registers **11** and **12**. These registers have previously been loaded with the vector operands from a cache memory, external system memory or other source. The registers are illustrated as being divided into byte wide (or other width) segments **11***a*, **11***b*, . . . **11***n* and **12***a*, **12***b*, . . . **12***n* (where n is an arbitrarily large number). Each multiplier **16** multiplies the received operands and provides a result. For example, the multiplier at location **19** will receive operands x[i] and y[j] and multiply them together. Herein we refer to each element in the array that includes a multiplier as a “tile.” As will be described below there are various embodiments for each tile, with the choice of components depending upon the particular application for the processor.

In a preferred embodiment, as also illustrated in the figure, each multiplier **16** has associated with it an accumulator **18**. The accumulator **18** stores a running sum of the sequential multiplication results from multiplier **16**. Each accumulator **18** will thus ultimately store the sum of the individual multiplications computed by its associated multiplier **16**. More generally, multiplier-accumulator **19** will multiply operands x[i] and y[j] together, then add that result to the previous contents in the accumulator. By virtue of the multiplications and running sum, the result stored in the accumulator will be significantly wider than the width of the input operands. This is discussed further below, together with a more detailed description of the multiplier-accumulator interface with surrounding circuitry. The detailed circuit design for multipliers and accumulators is well known and not described further herein.

Typically data processing operations will have been carried out on the operands before they are stored in registers **11** and **12**. In addition, as specified by the instructions controlling the processor, further operations can be performed on the results of the multiplication-accumulation. Herein we describe the typical circumstance in which the outer product is computed with the same number of multipliers and multiplicands, thus resulting in a square array **15**. Using the techniques described herein, however, other shapes of arrays, e.g. rectangular, can also be implemented.

The size of the result of an outer product is larger than the input operands x[i] and y[j]. When multiplying two binary values of size B bits, in registers **11** and **12**, representing either a signed value of range −2^{(B−1)} . . . 2^{(B−1)}−1, or an unsigned value of range 0 . . . 2^{B}−1, 2B bits are generally required to represent the range of values of the product. In a square array, using N multipliers x[0] . . . x[N−1] and N multiplicands y[0] . . . y[N−1] produces an outer product with N^{2} results x[i]*y[j], where i is 0 . . . N−1 and j is 0 . . . N−1.

If B is 16 bits and N is 8, the multiplier and multiplicand are each 128 bits (B*N), and the outer product will be 2048 bits (2B*N^{2}). While the multiplier and multiplicand may fit in a register file that supports operands of 128 bits, the outer product is too large to fit in a typical register file. Even if the size of the register file is extended to, for example, 1024 bits, so that with B=16 bits N can be 64, the outer product then performs 4096 (N^{2}) multiplications. This, however, yields 131072 bits of results (2B*N^{2}). To fit this result into a 1024 bit register file would require 128 registers, a number larger than the largest register files normally employed in a general-purpose processor.
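The arithmetic above can be checked with a short sketch. The function names `outer_product_sizes` and `registers_needed` are illustrative assumptions and do not appear in the original text; only the formulas B*N, N^{2} and 2B*N^{2} come from the description.

```python
def outer_product_sizes(b_bits: int, n_elems: int) -> dict:
    """Return operand and result sizes, in bits, for an N-by-N outer product."""
    operand_bits = b_bits * n_elems            # B*N bits per vector operand
    product_count = n_elems ** 2               # N^2 multiplications
    result_bits = 2 * b_bits * product_count   # 2B bits per product, N^2 products
    return {
        "operand_bits": operand_bits,
        "products": product_count,
        "result_bits": result_bits,
    }


def registers_needed(result_bits: int, register_bits: int) -> int:
    """Number of registers of the given width needed to hold the result (ceiling division)."""
    return -(-result_bits // register_bits)


small = outer_product_sizes(16, 8)     # B=16, N=8
large = outer_product_sizes(16, 64)    # B=16, N=64
print(small)
print(large)
print(registers_needed(large["result_bits"], 1024))
```

With B=16 and N=64 the sketch reproduces the figures in the text: 4096 products, 131072 result bits, and 128 registers of 1024 bits each.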

The outer product result, however, can be stored in the system memory associated with a processor—as opposed to the registers associated with the processor. With this approach, a single instruction, referred to herein as ‘OuterProduct’, can specify an operation code for the instruction, together with the register file (or system memory) addresses for the multiplier and multiplicand operands. The resulting outer product can then be stored back into system memory. In the system memory, the outer product is accessible by a memory address and/or a size specified from a register in a register file.

In alternative embodiments, the vector register file and the general register file may be combined together into a single register file. Further, the specification of the values of B and N may be implicit, as the design may fix these values according to the precision of the multipliers and the size of the register file, or variable as specified by a field in the instruction, or specified by the contents of a register (or a sub-field of the contents of a register) specified by the instruction. Also, the values of B and N may be specified as part of an operand specifier. Additionally, the operand specifier may encode both the address of the result and the size of the result, or alternatively a value from which the values of B and N may be computed or inferred.

In an alternative embodiment, any portion of the multiplier circuit that depends only upon either the multiplier or multiplicand alone can be placed at the periphery of the multiplier array to reduce the number of copies of that portion from N^{2 }to N in the array.

For example, to reduce the number of partial products to be added together, the multiplier may encode the multiplier operand using Booth or other encoding. In such an embodiment, a single Booth encoding circuit per operand may suffice, as the Booth-encoded value may be presented to the transmission wires and/or circuit to reach the multiplier, thus reducing the number of copies of the Booth encoding circuit from N^{2} to N in the array.

While radix-4 Booth encoding combines multiples of the multiplicand that can be computed as shifts and complements of the original operand (−2x, −x, 0, x, 2x), some multiplier circuits in an alternative embodiment may require the computation of a small multiple of the multiplicand, as for example radix-8 Booth encoding, which requires computation of a 3x multiple. As each of the N multiplicands in the outer product are transmitted to N multipliers, computation of a small multiple of each multiplicand can be accomplished in a single circuit per multiplicand and the result transmitted to N multipliers, thus reducing the number of copies of the small multiple circuit from N^{2 }to N in the array.
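As a software illustration of why peripheral encoding saves work, the following sketch recodes each multiplier operand into radix-4 Booth digits once and reuses the recoded form across a whole row of the array. It is a behavioral model under stated assumptions (signed two's-complement operands of even width B), not a description of the patent's circuit, and the function names are hypothetical.

```python
def booth_radix4_digits(value: int, b_bits: int) -> list:
    """Recode a signed b_bits-wide integer into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}; digits are returned least significant first.
    """
    assert b_bits % 2 == 0, "this sketch assumes an even operand width"
    assert -(1 << (b_bits - 1)) <= value < (1 << (b_bits - 1))
    raw = value & ((1 << b_bits) - 1)  # two's-complement bit pattern

    def bit(k: int) -> int:
        return 0 if k < 0 else (raw >> k) & 1

    # Digit k examines bits 2k+1, 2k and 2k-1 of the operand.
    return [bit(2 * k) + bit(2 * k - 1) - 2 * bit(2 * k + 1)
            for k in range(b_bits // 2)]


def multiply_recoded(digits: list, multiplicand: int) -> int:
    """Sum the partial products selected by the Booth digits."""
    return sum(d * multiplicand * 4 ** k for k, d in enumerate(digits))


def outer_product_booth(x: list, y: list, b_bits: int) -> list:
    """Outer product in which each multiplier x[i] is recoded once and reused."""
    recoded = [booth_radix4_digits(xi, b_bits) for xi in x]  # N recodings, not N*N
    return [[multiply_recoded(d, yj) for yj in y] for d in recoded]


print(booth_radix4_digits(-3, 8))
print(outer_product_booth([3, -5], [2, -4], 8))
```

The list comprehension in `outer_product_booth` performs N recodings for N^{2} products, which mirrors the reduction from N^{2} to N encoder copies described above.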

The format of the outer product instruction in a preferred embodiment is shown in the accompanying figure. Field **21** in the instruction specifies an opcode for the operation to be performed, e.g. OuterProduct, OuterProductAdd, OuterProductExtract, etc., as discussed further below. Field **22** specifies the location of the first vector (e.g. the multiplier) and field **23** specifies the location of the second vector (e.g. the multiplicand). Field **24** specifies a location for storage of the result. Field **25** stores other information that may be needed in conjunction with the other fields, e.g. B, F, H, K, L, M, and N as discussed herein. Note that the instruction fields can be addresses of locations in memory, register identification information, pointers to memory, etc., and the other information above may be specified as part of the instruction, part of the source operands, or part of the result operand.

Further flexibility for implementation of the processor is provided by allowing acceptable values of N for a single instruction to be larger than the value enabled by the multiplier-accumulator array **15** in physical hardware H. In this circumstance, the outer product operation can be performed by successive operation of the physical hardware over H-by-H-sized portions of the multiplier and multiplicand values. In such an embodiment, the extracted or processed result may be expeditiously copied from within the array to the memory system or caches thereof, so that the physical storage of results within the array is limited to a single one, or a small number of, H-by-H-sized portions of the accumulated, the extracted, or the processed results.
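The successive H-by-H operation can be sketched in software as a blocked loop. This is only an illustrative model: the function name `outer_product_blocked` and the block traversal order are assumptions, and the inner two loops stand in for what the physical array would do in parallel in a single pass.

```python
def outer_product_blocked(x: list, y: list, h: int) -> list:
    """Compute an outer product using an h-by-h array applied to successive blocks."""
    result = [[0] * len(y) for _ in range(len(x))]
    for i0 in range(0, len(x), h):          # block of multiplier elements
        for j0 in range(0, len(y), h):      # block of multiplicand elements
            # One pass of the (hypothetical) h-by-h physical array.
            for i in range(i0, min(i0 + h, len(x))):
                for j in range(j0, min(j0 + h, len(y))):
                    result[i][j] = x[i] * y[j]
    return result


x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]
blocked = outer_product_blocked(x, y, h=2)
direct = [[xi * yj for yj in y] for xi in x]
print(blocked == direct)
```

Because no product depends on any other, the blocks can be processed in any order, and each finished block can be copied out before the next one is started.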

In another implementation, the source operands for the outer product multiplication operation are specified as a single instruction specifying an instruction opcode and two register-sized operands. In this case one register from an R-bit register file contains N multipliers, and the other register from the R-bit register file contains N multiplicands, each multiplier or multiplicand using B bits, with the individual values catenated together to fill the register.
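A minimal sketch of catenating N values of B bits each into one R-bit register value, with R=N*B, follows. The element ordering (element 0 in the least significant bits), the restriction to unsigned values, and the function names are assumptions made for illustration; the text does not specify them.

```python
def catenate(values: list, b_bits: int) -> int:
    """Pack unsigned b_bits-wide values into one integer, element 0 in the low bits."""
    register = 0
    for k, v in enumerate(values):
        assert 0 <= v < (1 << b_bits), "value does not fit in B bits"
        register |= v << (k * b_bits)
    return register


def split(register: int, b_bits: int, n: int) -> list:
    """Recover the n packed b_bits-wide values from a register value."""
    mask = (1 << b_bits) - 1
    return [(register >> (k * b_bits)) & mask for k in range(n)]


packed = catenate([1, 2, 3], b_bits=8)
print(hex(packed))
print(split(packed, b_bits=8, n=3))
```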

Alternatively, the multiplier and multiplicand values can be specified by larger operands to fill the registers, R=N*B. The value of B can be specified as a component of the instruction, by a register or bit field of a register specified by the instruction, or by a bit field of a specifier block specified by a portion of an operand. In other embodiments of the invention, the instruction contains a bit-field that specifies the format of the multiplier and multiplicand operands, for example, as signed integers, unsigned integers, or floating-point values. Alternatively, the formats can be specified by a bit-field, by bit values in a register, or by reference to a location in memory.

For the arrangement of multipliers shown, in which the multiplier at location **19** receives multiplier x[i] and multiplicand y[j], producing the outer product p[i][j], an example of code to implement the operation is:

Foreach i := [0 . . . N−1], j := [0 . . . N−1]

p[i][j] := x[i] * y[j];

It should be understood that in the above notation, the preferred embodiment has sufficient resources to perform all the indicated multiplications at one time; the computation for all the values of i and j and the values dependent upon i and j are performed independently and in parallel. Alternatively, the parallelism may reflect the physical hardware array size, as described by the H-by-H array above.
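For readers who prefer executable notation, the same operation can be written as a sequential software model. The nested comprehension below visits the (i, j) pairs one at a time, whereas the array described here would compute them in parallel; the function name is an illustrative assumption.

```python
def outer_product(x: list, y: list) -> list:
    """Return p with p[i][j] = x[i] * y[j] for every pair (i, j)."""
    return [[xi * yj for yj in y] for xi in x]


p = outer_product([1, 2, 3], [4, 5, 6])
print(p)
```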

As mentioned above, the multiplication result is typically too large for the register file. By storing the outer product result into memory, the product is retained as memory-mapped state. Thus, were the processor's normal operation to be altered by an interrupt or context switch, the outer product is retained without need for further instructions to copy the value.

In applications such as image processing, it is desirable to compute sums of outer product values. After starting a first OuterProduct instruction, a second instruction can be started using distinct multiplier and/or multiplicand values, then adding that result to the previous outer product result, producing a sum of outer product values. We refer to this instruction as ‘OuterProductAdd’. This instruction is similar to OuterProduct, specifying inputs in the same way, and specifying that the result be used as an input value for the addition operation. Thus this instruction computes the sum of two outer product values. Further use of the OuterProductAdd instruction can sum an arbitrary number of outer product values together, herein designated D for depth of summation. Because the sum of two values may be larger than the values individually, and the sums of D outer product values may be larger than the 2B bits required for each product, or the 2B*N^{2} bits overall, an additional log_{2}D bits may be required for each sum of products, or log_{2}D*N^{2} bits overall. To avoid overflowing the outer product result, such values may be extended by an amount, E bits, which also may be specified by the OuterProduct and OuterProductAdd instructions, either implicitly, for example, to double the size of each result making the result size 4B*N^{2}, or by some other amount fixed by the implementation, explicitly in subfield **25** of the instruction, or in an operand of the instruction, or a subfield of one of the operands. Alternatively, the outer product results may be limited in range to handle the possibility of overflow when E<log_{2}D.
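The relationship between the depth of summation D and the number of extension bits can be sketched as follows. The helper names are hypothetical, and rounding log_{2}D up to a whole number of bits is an assumption for depths that are not powers of two.

```python
def extension_bits(depth: int) -> int:
    """Smallest E such that 2**E >= depth, i.e. log2(D) rounded up."""
    assert depth >= 1
    return (depth - 1).bit_length()


def accumulator_bits(b_bits: int, depth: int) -> int:
    """Bits per accumulator: 2B for one product plus E bits for D-deep summation."""
    return 2 * b_bits + extension_bits(depth)


def total_accumulator_bits(b_bits: int, n: int, depth: int) -> int:
    """Bits of accumulator state for an N-by-N array."""
    return accumulator_bits(b_bits, depth) * n * n


print(accumulator_bits(16, 1024))
print(total_accumulator_bits(16, 8, 1024))
```

For B=16 and D=1024 this gives 42 bits per accumulator, comfortably inside the 4B=64 bits obtained by the doubling option mentioned above.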

An example of code for implementing the OuterProductAdd operation in which the result of each multiplication is added to the previous sum of outer products a[i][j], producing a new value for a[i][j] is:

Foreach i := [0 . . . N−1], j := [0 . . . N−1]

p[i][j] := x[i] * y[j];

a[i][j] := a[i][j] + p[i][j];

Because these sums are formed of successive outer products, the sums can be computed without need for wiring for interconnections among the geographically separate multipliers.

Between uses of the OuterProductAdd operation to compute running sums, the accumulators a[i][j] may be cleared with an operation that performs a[i][j]:=0, or alternatively initialized with an OuterProduct operation that performs a[i][j]:=x[i]*y[j] as the result.
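The three operations just described (clear, OuterProduct, OuterProductAdd) can be modeled together in a few lines. The class and method names below are illustrative assumptions; the model is sequential, whereas each tile of the array would update its own accumulator independently.

```python
class OuterProductAccumulator:
    """Software model of an N-by-N array of accumulators a[i][j]."""

    def __init__(self, n: int):
        self.n = n
        self.a = [[0] * n for _ in range(n)]

    def clear(self) -> None:
        """a[i][j] := 0"""
        self.a = [[0] * self.n for _ in range(self.n)]

    def outer_product(self, x: list, y: list) -> None:
        """a[i][j] := x[i] * y[j], discarding any previous sum."""
        self.a = [[xi * yj for yj in y] for xi in x]

    def outer_product_add(self, x: list, y: list) -> None:
        """a[i][j] := a[i][j] + x[i] * y[j]"""
        for i in range(self.n):
            for j in range(self.n):
                self.a[i][j] += x[i] * y[j]


# Usage: a 2-by-2 matrix product formed as a sum of two outer products,
# pairing column k of the left matrix with row k of the right matrix.
left = [[1, 2], [3, 4]]
right = [[5, 6], [7, 8]]
array = OuterProductAccumulator(2)
array.outer_product([row[0] for row in left], right[0])
array.outer_product_add([row[1] for row in left], right[1])
print(array.a)
```

The usage example relies on the standard identity that a matrix product equals the sum of the outer products of the left matrix's columns with the right matrix's rows; no accumulator ever needs a value from a neighboring tile.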

In a tile **30** of the array, the operands, from register segments **11***i* and **12***j*, are provided over buses to the multiplier **16**. Multiplier **16** multiplies the received operands and the 2B-bit product is provided to accumulator **18**. If an OuterProductAdd instruction is being performed, adder **33** adds the result of the multiplication to the existing sum stored in accumulator **18** and received over bus **37** to adder **33**. The multiplication and addition produce a result having 2B+E bits. Once all operations are complete, the final result ‘Result’ is provided to output register **35** where it is transferred out to the buses interconnecting the array. (Multiplexer **38** is used for loading and unloading data from the array of tiles. This operation is described further below.)

OuterProduct or OuterProductAdd operations such as illustrated in

The K addresses of operands can be tracked, and when an OuterProduct instruction (as distinct from an OuterProductAdd instruction) is performed to an operand address not previously tracked, one of the K accumulator locations is allocated for this operand address, for example, one that has not been previously used, or one least recently used. Further, when an OuterProductAdd instruction is performed on an operand that is not present in the accumulator, one set of accumulators of the K can be allocated. In other alternative embodiments, the choice of accumulator, i.e. the value of K, may be specified by the instruction in subfield **25** of the instruction, a subfield in a register, in memory, or otherwise. In another embodiment of the invention, an instruction can specify at least two associated opcodes, one specifying that an outer product is produced, and a second specifying that the outer product is added to a previous result, forming an accumulation of sum-of-outer products. In another embodiment of the invention, the accumulator may be cleared by a separate OuterProductClear instruction or an instruction that combines this operation with other operations (such as OuterProductExtract detailed below), so that repeated use of the OuterProductAdd instruction alone computes the accumulation of sum-of-outer products.
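One way to picture the allocation policy is the sketch below, which tracks up to K accumulator sets by result address and reuses the least recently used set when all K are occupied. Everything here is an assumption made for illustration: the class name, the use of a dictionary as a stand-in for memory, and in particular the choice to spill an evicted set to that stand-in and reload it on a later OuterProductAdd, which the text above does not specify.

```python
from collections import OrderedDict


class AccumulatorSets:
    """Track up to K sets of N-by-N accumulators, keyed by result operand address."""

    def __init__(self, k: int, n: int):
        self.k, self.n = k, n
        self.resident = OrderedDict()  # address -> accumulators, least recently used first
        self.memory = {}               # hypothetical backing store for evicted sets

    def _allocate(self, address: int, fresh: bool) -> list:
        if address in self.resident:
            self.resident.move_to_end(address)       # mark as most recently used
        else:
            if len(self.resident) == self.k:         # all K sets in use: evict the LRU set
                old_address, old = self.resident.popitem(last=False)
                self.memory[old_address] = old
            saved = self.memory.pop(address, None)
            zeros = [[0] * self.n for _ in range(self.n)]
            self.resident[address] = zeros if (fresh or saved is None) else saved
        return self.resident[address]

    def outer_product(self, address: int, x: list, y: list) -> None:
        acc = self._allocate(address, fresh=True)
        for i in range(self.n):
            for j in range(self.n):
                acc[i][j] = x[i] * y[j]

    def outer_product_add(self, address: int, x: list, y: list) -> None:
        acc = self._allocate(address, fresh=False)
        for i in range(self.n):
            for j in range(self.n):
                acc[i][j] += x[i] * y[j]


sets = AccumulatorSets(k=2, n=2)
sets.outer_product(0x100, [1, 2], [3, 4])
sets.outer_product(0x200, [1, 1], [1, 1])
sets.outer_product_add(0x100, [1, 0], [1, 0])   # touches 0x100, so 0x200 becomes LRU
sets.outer_product(0x300, [2, 2], [2, 2])       # evicts the set for 0x200
print(list(sets.resident))
```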

The additional precision of the accumulated sums of outer products is necessary to compute an accurate sum without concern for overflow or premature rounding. Once the sums of outer products have been computed, however, many algorithms using those results only require a portion of the result, rounded or truncated to a lower precision. In such circumstances, an additional instruction, OuterProductExtract, may be performed to extract the needed portion of the result, or produce the result in a lower precision than the originally accumulated sum. Such operations may be implemented using an optional additional circuit **39**, which can provide extraction of a portion of the results from accumulator **18**, rounding of those results, or other processing as described below. Note that by placing the other circuitry after the multiplexer it can also further process results from a nearby tile provided via switching circuit **38**. If it is desired to only process results from the individual tile in which it is situated, then such other circuitry **39** can be placed between the accumulator **18** and the switching circuit **38**. Also, this other circuitry can be placed along the edges of the array rather than at each tile in the array. By placing circuitry **39** at the edge of the array, data from the accumulators can be rounded, extracted, or otherwise processed as it is shifted out of the array. This enables the use of fewer copies of circuitry **39** and reduces circuit complexity within the array.

The selection of the particular portion of the result or the method used to extract typically will be specified as a field, e.g. field **25**, in the OuterProductExtract instruction. The operation invoked by the OuterProductExtract instruction may also include a unary operation, for example, an arctangent or hyperbolic tangent function, or it may map negative values to a zero value (as in a ReLU—Rectified Linear Unit function) or other fixed value, and/or convert the result to a lower precision floating-point value.
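A fixed-point extraction step of this kind might look like the following sketch. The specific choices here are assumptions for illustration only: a right shift with round-half-up, an optional ReLU, and saturation to a signed F-bit range. The parameter names `shift` and `f_bits` are hypothetical and do not correspond to named instruction fields.

```python
def extract(acc_value: int, shift: int, f_bits: int, relu: bool = False) -> int:
    """Reduce one accumulated sum to a signed f_bits-wide result.

    Steps: round to nearest while shifting right, optionally apply ReLU,
    then saturate to the signed f_bits range.
    """
    if shift > 0:
        rounded = (acc_value + (1 << (shift - 1))) >> shift  # add half, then shift
    else:
        rounded = acc_value
    if relu and rounded < 0:
        rounded = 0                                          # map negatives to zero
    lowest, highest = -(1 << (f_bits - 1)), (1 << (f_bits - 1)) - 1
    return max(lowest, min(highest, rounded))


def extract_all(acc: list, shift: int, f_bits: int, relu: bool = False) -> list:
    """Apply extract() to every accumulator in the array."""
    return [[extract(v, shift, f_bits, relu) for v in row] for row in acc]


print(extract_all([[1000, -1000], [100000, 5]], shift=4, f_bits=8))
print(extract_all([[1000, -1000], [100000, 5]], shift=4, f_bits=8, relu=True))
```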

The OuterProductExtract instruction is normally performed after computing the sums of outer products, e.g. using the OuterProductAdd instruction. In one embodiment, the OuterProductExtract instruction is performed on each value in the accumulators in the array and places the result in the same location as the input, thus overwriting it. In an alternative embodiment, the OuterProductExtract instruction specifies distinct locations for the input and output of the operation, with the size of the result being smaller than the accumulated sums of outer products. In another implementation, the necessary state for the sums of outer products may be divided into two portions, one large enough to contain the final extracted results, and the other making up the remainder of the necessary size. The OuterProductExtract, OuterProduct, and OuterProductAdd instructions can then specify both portions of the operand to access the sums of outer products, with the result being an operand that contains the final extracted results. If the final extracted results are F bits per operand (F*N^{2} bits overall), the additional portion will be at least (2B+E−F)*N^{2} bits. In an alternative embodiment, the additional portion is released from memory allocation upon execution of the OuterProductExtract instruction, eliminating needlessly copying it to the memory system. As mentioned above, the OuterProductExtract operation may, in an alternative embodiment, clear the accumulator value upon extraction, so that subsequent OuterProductAdd instructions can be used to compute a subsequent sum of outer products.

In further embodiments of the invention, successive values of multiplier operands are catenated together into a larger operand, and similarly, successive values of multiplicand operands are catenated together into another larger operand. These catenated operands typically will be stored in the memory system. When so arranged, a single instruction may specify the operand multiplier, operand multiplicand, and outer product result, as well as other parameters as needed, e.g. B, N, F, etc. As discussed above, the instruction may also perform an extraction or further processing of the sum of outer products, specifying the result to be the extracted or processed sum of outer products. The extracted or processed sum of outer products may be smaller than the accumulated sums of outer products. When so specified, the single instruction may operate over multiple clock cycles to perform the entire operation. Alternatively, this operation may be carried out in parallel with other operations or instructions of the processor, and may be synchronized with an operation that utilizes or copies the wide operand result of this operation.

In some applications it is desirable for the outer product to be added to a previously accumulated sum-of-outer product results by one instruction, with a separate instruction clearing the sum-of-outer product results, or alternatively setting them to a fixed value. The result of a single multiplication of a B-bit multiplier and a B-bit multiplicand in fixed-point arithmetic requires 2B bits to represent, and there are N^{2} values in the outer product. Because the outer product result therefore requires 2B*N^{2} bits, these instructions cannot immediately return a result to a register in the register file, nor to a series of registers in the register file. To overcome this limitation, the results can be maintained between instructions as an additional program state, for example, by being stored in memory as specified by the instructions. Later the results can be copied to and from dedicated storage locations near the outer product computation and accumulation circuits.

If the multiplier and multiplicand values exceed the size of the R-bit general-purpose registers, the contents of registers can be catenated together. Then a series of outer product multiplications can be performed, each using one of the series of R-bit multiplicand values and one of the series of R-bit multiplier values. Extracting, limiting, rounding, truncating, normalization, etc. as specified by the instruction can then reduce the size of the results. Thus a single instruction may specify the multiplicand and multiplier operands within memory and return the processed sum-of-outer product result.

In alternative embodiments the outer product instruction may specify these operands individually in bit-fields of the instruction, or may specify other operands that incorporate these operands, as well as other information, such as the size and format of the operands. Alternatively, contiguous or non-contiguous portions of the multiplier and/or multiplicand values may select the successive multiplier and/or multiplicand values. For example, a selection of fields of the multiplier may perform a convolution operation in sequential outer product multiplications.
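One possible reading of the convolution remark is sketched below: a sliding selection of fields from a longer multiplier operand is paired, step by step, with one tap from each of several filters, and the resulting outer products are accumulated. This interpretation, the function name, and the data layout are assumptions; the text gives no further detail. The sketch computes a correlation-style sum (the filter is not reversed), and the accumulator array is rectangular rather than square.

```python
def convolve_by_outer_products(signal: list, filters: list) -> list:
    """Accumulate acc[i][j] = sum over d of signal[i + d] * filters[j][d]."""
    taps = len(filters[0])
    n_out = len(signal) - taps + 1
    acc = [[0] * len(filters) for _ in range(n_out)]
    for d in range(taps):
        x = signal[d:d + n_out]          # selected fields of the longer multiplier operand
        y = [f[d] for f in filters]      # tap d of every filter as the multiplicand
        for i in range(n_out):
            for j in range(len(filters)):
                acc[i][j] += x[i] * y[j]  # one OuterProductAdd step
    return acc


result = convolve_by_outer_products([1, 2, 3, 4], [[1, 1], [1, -1]])
print(result)
```

Each pass of the outer loop corresponds to one outer product accumulation, so a filter with T taps takes T sequential steps regardless of how many outputs or filters are processed in parallel.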

For several of the operations above, operands may be presented to the array in interleaved form, that is, the N-element vector presented as multiplier or multiplicand may be formed using single elements from N vector or matrix values. For these operations, there will be an interleaved operand presented to the outer product array. A transpose circuit shown in ^{2 }values are clocked into a storage array with N-way parallelism along one dimension, for example from register **12**, and then are read out of the array with N-way parallelism along the orthogonal dimension to register **11**, now in interleaved form. The output of this circuit provides N operands in an orthogonal dimension to the input. Thus it can be used with the outer product array, which requires the multiplier and multiplicand to be presented to the array simultaneously in orthogonal dimensions. The data can be transposed before being provided to the array, or the transpose circuit may be embedded within the multiplier array. If embedded within the array the multiplier and multiplicand values can be presented along a single dimension, one of the operands being interleaved or transposed upon entering the combined circuit. Because this interleaving circuit imposes a delay, the operands to be transposed preferably will be entered earlier enabling the multiplier and multiplicand operands to meet simultaneously at the multiplier circuits.
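The write-by-row, read-by-column behavior of the transpose circuit can be modeled in software. This is a behavioral sketch only; the function name and the list-of-lists representation are hypothetical and say nothing about how the storage array is built.

```python
def transpose_through_storage(rows):
    """Clock N rows into an N x N storage array, then read out N columns."""
    n = len(rows)
    storage = [[None] * n for _ in range(n)]
    # Write phase: one row of N values per clock (N-way parallel along a row).
    for clock, row in enumerate(rows):
        storage[clock] = list(row)
    # Read phase: one column of N values per clock (N-way parallel along
    # the orthogonal dimension), giving the interleaved form.
    return [[storage[r][c] for r in range(n)] for c in range(n)]

vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
interleaved = transpose_through_storage(vectors)
print(interleaved)
```

Each output vector contains one element from each of the N input vectors, which is the interleaved form the paragraph describes.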

**15** to illustrate transposing the input data, as well as to transfer out the results of the parallel multiplications and accumulations. In the portion of the array illustrated, an array tile **40** is coupled to the surrounding tiles **41**, **42** and **43** by data buses, represented as arrows in the figure. Data to be used in the multiplications ultimately will be loaded into the multiplier in every tile in the array. This can be performed by clocking the data into every row in parallel so that all tiles in all rows are filled with data, or by loading the data sequentially one row at a time. Whichever approach is used, eventually two operands will be loaded into every tile in the array.

To transpose the input data, or to transfer out the results, each tile, e.g. tile **42**, in the array preferably includes a multiplexer **38** (see **38**, to enable selecting between the data on the two input buses to that tile. In the case of tile **42**, the multiplexer **38**, in response to a control signal, chooses between receiving the data on bus **44** and bus **45**. When the tiles in that row are being loaded with data the mux **38** will be used to select data on bus **44**. When it is desired to shift the data upward, for example, to transfer results to storage **11**, mux **38** will be used to select data on bus **45**. As evident, by use of the multiplexers and the vertical buses between the tiles in each vertical column of the array, ultimately all of the data from the computations in the array can be loaded into storage **11**. Once the top row (with tile **42**) is filled, all data in that row can be clocked out to storage **11**, preferably in a parallel operation so data in all tiles in that row is transferred out simultaneously. In an alternative embodiment, additional storage can be provided at the end of each row, e.g. opposite storage **12**, to allow shifting of results data out of the array to that opposite side.
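The upward shifting of results can be sketched as a simple simulation. This is an illustrative model under the assumption that each tile holds one result value and that every multiplexer selects the bus from the tile below during read-out; the function name is hypothetical.

```python
def drain_array(acc):
    """Shift accumulated results upward, one row per clock, into edge storage."""
    rows = [list(r) for r in acc]            # rows[0] is the top row of tiles
    width = len(rows[0])
    storage = []
    for _ in range(len(rows)):
        storage.append(rows[0])              # top row is clocked out in parallel
        rows = rows[1:] + [[None] * width]   # each tile takes the value from below
    return storage

results = [[1, 2], [3, 4]]
print(drain_array(results))
```

After N clocks every row has passed through the top row, so all N^2 values reach the edge storage with N-way parallelism.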

Physical Layout Considerations

Typically, as shown in ^{2 }parallelism.

In appropriate applications, rounding or other circuitry is provided at the edge of the array, e.g. incorporated within blocks **11** and **12** in ^{2 }array at one time. In this way, only N copies of the rounding or other circuitry are required. If all such circuitry is moved to the edge, the accumulated result, in redundant form, may require 2*(E+2*B) bits for each operand to be communicated to the edge of the array, requiring N*2*(E+2*B) total bits to be communicated, as with separate wires or differential pairs of wires. Depending on the power and area required for such additional wires over the first approach, it may be favorable to use N^{2} or N copies of any or all of the carry-propagate adders, rounding, limiting and unary function circuitry with the N^{2} multiplier-accumulator array. In addition, the number of wires can be reduced, at the cost of requiring more than one cycle to return a set of N operands. For example, if only N*B wires are present to return results, the N^{2} values may still be communicated, but requiring 6 cycles per set of N operands, where E<=B and the values are provided from the accumulator in redundant form. In the case of the convolution computation described below, this does not slow the rate of result production as long as 6N<=FX*FY.
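The relationship between wire count and read-out cycles can be worked through numerically. The sketch below assumes B wires per operand (N*B wires in total) and an accumulator of E+2B bits that doubles in width when kept in redundant form; the function name and the sample values E = B = 8 are illustrative.

```python
import math

def cycles_per_operand_set(e_bits, b_bits, wires_per_operand, redundant=True):
    """Cycles needed to move one set of N accumulator values to the array edge."""
    bits = e_bits + 2 * b_bits          # width of one accumulated sum
    if redundant:
        bits *= 2                       # redundant form carries two words
    return math.ceil(bits / wires_per_operand)

# Illustrative values (assumptions): E = B = 8, with B wires per operand.
print(cycles_per_operand_set(8, 8, 8, redundant=True))
print(cycles_per_operand_set(8, 8, 8, redundant=False))
```

With E<=B and redundant-form values, the count works out to six cycles per set of N operands, which is where the factor of 6 in the condition 6N<=FX*FY comes from.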

Alternatively, a partial carry-propagation may be performed at the output of the multiplier and/or the output of the accumulator. For example, carries may be propagated within bytes, producing 8+e bits of output per byte, where e is a number of additional bits of carries that result from adding two or more bytes together. Adding a byte of carries (shifted one bit to the left) to a byte of sums (not shifted) may require 2 additional bits to represent the sum. Even so, these 10 bits are less than the 16 bits that a fully redundant result would require. If the number of wires per byte of result is 8+e, for the N values to be communicated to the edge of the array, the number of cycles could then be returned to 3 cycles per set of N operands, or more or less, depending on the number of wires and the degree to which carries are propagated. As we can see from this alternative embodiment, there is no requirement that these intermediate values be carried with an integral number of wires per bit of result, and the number of cycles required to communicate a set of N operands may be any useful figure that makes good utilization of the wires available.
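The 10-bit figure can be confirmed by computing the largest value that results from adding a byte of sums to a byte of carries shifted left by one bit. This is a small arithmetic check; the function name is hypothetical.

```python
def bits_after_partial_propagation(byte_bits=8):
    """Bits needed for (sum byte) + (carry byte << 1) after in-byte propagation."""
    max_sum = (1 << byte_bits) - 1             # largest unshifted sum byte
    max_carry = ((1 << byte_bits) - 1) << 1    # largest carry byte, shifted left
    return (max_sum + max_carry).bit_length()

partial_bits = bits_after_partial_propagation(8)
fully_redundant_bits = 2 * 8
print(partial_bits, fully_redundant_bits)
```

For 8-bit bytes the worst case is 255 + 510 = 765, which fits in 10 bits, so e = 2 in this example.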

When the multiplier and multiplicand operands are large in comparison to the size of a general register of the processor, e.g., if the operands are 1024 bits in a processor and the general register size is 128 bits, the invention may take into consideration the delay involved in transmitting these operands across the multiplier array. While the diagrams above nominally assume that the multiplier and multiplicand are shifted across the entire array in a single clock cycle, speed-of-light propagation delay and resistive-capacitive (RC) delay may limit the clock speed. Each row of tiles can be considered to consist of an RC network with each tile resistively connected to neighboring tiles, and each tile imposing a capacitive load on the bus connecting the tiles. With large numbers of tiles in a row, the RC loading will be detrimental to those tiles furthest from the location where data is first supplied.

One solution to this is to provide for sub-groups of tiles, or for every tile, amplification, latching, or other processing of the signals between groups of tiles. This approach is illustrated in **10** divided into a set of N×M “supertiles” **50**. Each supertile **50** is divided into a set of n×m tiles **30** such as described above. (Note that in the case of the supertile **50** including only a single tile **30**, n×m will be 1.) Associated with each supertile **50** are sets of appropriate latches, converters, amplifiers, signal processors, etc. **58**, examples of which are described below. These circuits **58** process the signals on the row and column buses as necessary before providing those signals to the group of tiles **30** within the supertile **50**. Such processing can also be provided to the results signals from each supertile, as necessary.

*a, ***6***b *and **6***c *illustrate typical signal processing that may be included within the circuit **58** illustrated in *a ***51** are provided between the row bus **44** and the supertiles **50**. Alternatively, one or more cycles of delay may be removed to allow some multiplier circuits to begin operations sooner than others. When it comes time to read out the result of the accumulators, adding additional pipeline stages in the path from the accumulators to the output can compensate for these cycles. Alternatively, the effective latency of the entire array can be reduced, by reading out portions of the array sooner. If, as described above, sequential circuits are employed to read the array output with order N parallelism, this effectively reduces the latency of the result, as the remainder of the array can be read in a continuous flow following the earlier results.

Another signal processing approach, shown in *b, *is to provide for each supertile **50** a register **53** to hold the data when it arrives, then in response to later clock signals, transfer that data to the tiles. *c ***50***a, *for example, is separated from the input side of the array by three registers—registers **54**, **57** and **58**. Three registers of delay from the input—registers **56** and **58**, also separate supertile **50***b. *Similarly there are three registers **55** of delay between the first supertile **50***c *in the row and the input.

An alternative approach addressing this issue can be used where it is presumed that only G of the N values can be transmitted in a single clock cycle. This distance G may be enhanced by the fact that only a single receiver loads the transmission wire in this alternative design. To address this, additional clock delays can be inserted for values that would otherwise arrive before their counterpart. This distribution network delays both the multiplier and multiplicand values by equal amounts in reaching the multiplier circuit. The choice of which of the techniques described above to use will depend on the size of the array, its intended uses, and the extent of RC issues.

Alternative Tiles

*a *to 7*d **a ***71** are used to clock the data into and out of the tile **30**, with a multiplexer **72** used to select between the input data path and the output from the MAC. In *b ***71** are again used but with different clock signals C**1** and C**2**, and an output enable circuit **73** controlled by signal OE, used to select the data path. *c **b, *but with the input data being provided to the next tile with a bypass bus **74**. *d **a, *but also employs a bypass bus **74**.

*a *and 8*b ***30**. In *a ***71** stores the result from the accumulator **81** and provides that to the output. In *b ***73** is provided between the flip-flop **71** and the output terminal.

Convolution Operations

The processor described here can also perform convolution operations. The multiplier array **10** shown in ^{2 }multiplications in parallel, consequently performing the convolution in RX*RY*FX*FY/N cycles, plus some additional cycles to copy the result from the accumulators.

An example of code to implement this operation is:

Foreach k:=[0 . . . RX−1], l:=[0 . . . RY−1], m:=[0 . . . N−1], i:=[0 . . . FX−1], j:=[0 . . . FY−1]

R[k,l,m]:=sum[I[k+i,l+j]*F[i,j,m]]
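For reference, the same sum can be written as a direct, unoptimized Python function. This is a sketch under the assumption that the input array I has at least RX+FX−1 by RY+FY−1 elements so that every index k+i, l+j is in range; the function name and the nested-list layout are illustrative.

```python
def convolve_direct(I, F, RX, RY, FX, FY, N):
    """Direct evaluation of R[k][l][m] = sum over i, j of I[k+i][l+j] * F[i][j][m]."""
    R = [[[0] * N for _ in range(RY)] for _ in range(RX)]
    for k in range(RX):
        for l in range(RY):
            for m in range(N):
                R[k][l][m] = sum(
                    I[k + i][l + j] * F[i][j][m]
                    for i in range(FX)
                    for j in range(FY)
                )
    return R

# One output position, a 2x2 filter, and a single filter (N = 1).
I = [[1, 2], [3, 4]]
F = [[[1], [0]], [[0], [1]]]
print(convolve_direct(I, F, 1, 1, 2, 2, 1))
```

This form performs one multiplication at a time and serves only as a baseline for the result values.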

The inner loop of the operation (a single input value and N filter values) presents a single one-dimensional vector as the multiplier, selected from a variably shifted subset of the input values, and a set of N values as the multiplicand operand, selected from the N filter values. By iterating over each of the filter values in sequential cycles, N^{2} sums representing a portion of the entire convolution are computed, using FX*FY cycles. Specifically, on a single pass computing N^{2} sums comprising R[k, l, m] where k ranges from k . . . k+N−1 (assuming that N<=RX), l is a particular value in the range [0 . . . RY−1], and m ranges from 0 . . . N−1: N values from the I array are selected on each cycle and presented to the sum-of-outer product array as the multiplier X. Assuming that N<IX, these may be consecutive values with a common value of l+j in the y-coordinate, and values of k+i . . . k+i+N−1 in the x-coordinate, to match up with the filter values F[i, j, m] for a particular value of i and j, with m ranging from [0 . . . N−1], these filter values being presented to the sum-of-outer product array as the multiplicand Y.

An example of code to implement this operation is:

Foreach k:=[0 . . . RX−1, by N], l:=[0 . . . RY−1]

Foreach i:=[0 . . . FX−1], j:=[0 . . . FY−1]

X[n]:=I[k+i+n,l+j], n:=[0 . . . N−1]

Y[m]:=F[i,j,m], m:=[0 . . . N−1]

a[n,m]:=a[n,m]+X[n]*Y[m], n:=[0 . . . N−1], m:=[0 . . . N−1]

R[k+n,l,m]:=Extract[a[n,m]], n:=[0 . . . N−1], m:=[0 . . . N−1]
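The pseudocode above can be rendered as a runnable Python sketch. Several points are assumptions made for illustration: RX is taken to be a multiple of N, the input I is large enough that k+i+n and l+j stay in range, the accumulators are cleared at the start of each pass, and Extract is modeled as the identity. The cycle counter treats each outer product accumulate as one cycle.

```python
def convolve_outer_product(I, F, RX, RY, FX, FY, N):
    """Convolution computed as a sequence of N x N outer product accumulations."""
    R = [[[0] * N for _ in range(RY)] for _ in range(RX)]
    cycles = 0
    for k in range(0, RX, N):
        for l in range(RY):
            a = [[0] * N for _ in range(N)]      # accumulators cleared (assumed)
            for i in range(FX):
                for j in range(FY):
                    X = [I[k + i + n][l + j] for n in range(N)]   # multiplier
                    Y = [F[i][j][m] for m in range(N)]            # multiplicand
                    for n in range(N):           # N^2 multiply-accumulates,
                        for m in range(N):       # parallel in the hardware array
                            a[n][m] += X[n] * Y[m]
                    cycles += 1
            for n in range(N):
                for m in range(N):
                    R[k + n][l][m] = a[n][m]     # Extract modeled as identity
    return R, cycles

I = [[r * 3 + c for c in range(3)] for r in range(3)]
F = [[[i + 2 * j + m + 1 for m in range(2)] for j in range(2)] for i in range(2)]
R, cycles = convolve_outer_product(I, F, 2, 2, 2, 2, 2)
print(cycles)
```

The cycle count equals RX*RY*FX*FY/N, not counting the cycles needed to copy results out of the accumulators.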

Because the complete sums representing the convolution are computed in the accumulators, they can be copied out of the array using a combination of parallel and sequential transfers. For example, if a data path of width B*N is available, on each cycle B bits from N accumulators can be copied out of the array. As we have described earlier, the entire sums, comprising E+2B bits, or a subfield, computed after extracting (e.g. by rounding, limiting and/or shifting) of the accumulated sums may be the results copied out of the array. When copying the entire values, if E<B, 3 cycles would suffice to copy N values from the array, and 3N cycles for the entire set of N^{2} sums comprising the entire array. The circuitry for copying the result out of the array may operate concurrently with the computation of a successive set of sums of outer products, and would not require additional cycles so long as 3N<=FX*FY. In alternative embodiments, if an extracted result required only 1 cycle to copy N values from the array, no additional cycles would be needed so long as N<=FX*FY.
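The overlap condition can be expressed as a one-line check. The sketch below is illustrative; the function name and the sample array and filter sizes are assumptions, not values from the text.

```python
def copy_out_is_hidden(n, fx, fy, cycles_per_n_values):
    """True if copying N^2 sums overlaps fully with the next FX*FY-cycle pass."""
    copy_cycles = cycles_per_n_values * n   # N groups of N values
    return copy_cycles <= fx * fy

# Illustrative cases (assumed sizes): 5x5 filter, full-width vs extracted results.
print(copy_out_is_hidden(8, 5, 5, 3))    # full sums, 3 cycles per N values
print(copy_out_is_hidden(16, 5, 5, 3))   # larger array, same filter
print(copy_out_is_hidden(16, 5, 5, 1))   # extracted results, 1 cycle per N values
```

Larger filters leave more computation cycles per pass, so they tolerate a narrower read-out path without stalling the array.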

As we have shown that the convolution operations are performed with N^{2} multiplications in parallel for N D-dimensional filter arrays in parallel, it should be apparent that this mechanism can be extended for more than N filter arrays, simply by selecting N filter arrays for a first computation, and another N filter arrays for a second computation and so on. Thus this technique can be extended for a larger number of filter arrays than N. Similarly, this technique can be extended for arrays of dimensions other than 2, such as for a 1-dimensional array, by removing iterations over l and j, or for 3-or-more dimensions, by further sequentially iterating over the third or more dimensions in R and F.

Minor alterations to the code can utilize the full array even if RX<N. Because the computations of the R[k, l, m] values are independent, it is only required that the X-operand selection choose an appropriate modification of the k+i+n and l+j subscripts for values of n where k+i+n>IX, and that the R-value output modify the k+n and l subscripts for values of k+n>RX.

**91**, **92**, **93**, **94**, **95**, and **96** represent successive cycles of the outer product array. The intersection points of the operands (small spheres in the figure) each represent a multiply-accumulate operation. Thus, during a first operation filter value F**5** and input value x**0** are multiplied together. During subsequent operations filter value F**4** and value x**1** are multiplied together, then filter value F**3** and input value x**2** are multiplied together, etc. The lines fy**0**, fy**1**, fy**2**, and fy**3** extending through the planes **91**, **92** . . . **96** represent the summations for filter F.

Other columns of summations are performed in parallel, that is, the next column of summations gy0, gy1, gy2, and gy3 for the G filter are performed at the same time. These summations for the G, H and I filters are not labeled to keep the figure from becoming unreadable. After initial loading of the starting 4 values, only a single new input per cycle is needed along the X input column. Depending on the presence of any edge processors, an optional shift register along that dimension may be added. Furthermore, if desired, the internal multipliers can use any adjacent-element shift fabric to send the values up. Doing it this way requires only a single element by having a broadcast along the bottom.

Matrix Multiplication Operations

The multiplier array can perform N^{2} multiplications as portions of a Matrix-to-Matrix multiplication, with two input operands that are each at least two-dimensional arrays and at least one dimension of the two operands match one-for-one. This operation multiplies a first operand, a D1-dimensional array, with a second operand, a D2-dimensional array, with DC dimensions in common. The result will be an array with DR dimensions, such that DR=D1+D2−DC. For such an operation, the utilization of the array will be 100% if the product of the dimensions of the first operand not in common (D1−DC) is at least N, and the product of the dimensions of the second operand not in common (D2−DC) is at least N. Such an operation proceeds by presenting, for each N-by-N subset of the result, all of the corresponding N values of the first and second operands, over a number of cycles equal to the size of all the DC dimensions in common, producing N^{2} sums of products. For illustrative purposes, we show an example of a first 2-dimensional array of dimensions IX-by-IY multiplied by a second 2-dimensional array of dimensions FX-by-FY, where a single common dimension, denoted by IY and FX, is combined to form the outer product, a 2-dimensional array R described as RX-by-RY, where RX=IX and RY=FY.

An example of code to implement this operation is:

Foreach k:=[0 . . . RX−1, by N], l:=[0 . . . RY−1, by N]

Foreach i:=[0 . . . IY−1]

X[n]:=I[k+n,i], n:=[0 . . . N−1]

Y[m]:=F[i,l+m], m:=[0 . . . N−1]

a[n,m]:=a[n,m]+X[n]*Y[m], n:=[0 . . . N−1], m:=[0 . . . N−1]

R[k+n,l+m]:=Extract[a[n,m]], n:=[0 . . . N−1], m:=[0 . . . N−1]

The inner loop of the operation presents a single one-dimensional vector as the multiplier, selected from the first input matrix I, and a set of N values as the multiplicand, selected from the second input matrix F. By iterating over the common dimension (or dimensions in the general case) in selecting the vector subsets, N^{2 }sums of products are computed, representing a portion of the output matrix R, using IY cycles.
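The matrix multiplication loop can likewise be sketched in Python. The assumptions here are that RX and RY are multiples of N, that the accumulators are cleared before each N-by-N block, and that Extract is the identity; the function name is hypothetical.

```python
def matmul_outer_product(I, F, N):
    """Matrix product R = I x F built from N x N outer product accumulations."""
    RX, IY, RY = len(I), len(F), len(F[0])
    R = [[0] * RY for _ in range(RX)]
    cycles = 0
    for k in range(0, RX, N):
        for l in range(0, RY, N):
            a = [[0] * N for _ in range(N)]          # accumulators cleared (assumed)
            for i in range(IY):                      # iterate the common dimension
                X = [I[k + n][i] for n in range(N)]  # column slice of I
                Y = [F[i][l + m] for m in range(N)]  # row slice of F
                for n in range(N):
                    for m in range(N):
                        a[n][m] += X[n] * Y[m]
                cycles += 1
            for n in range(N):
                for m in range(N):
                    R[k + n][l + m] = a[n][m]        # Extract modeled as identity
    return R, cycles

A = [[1, 2], [3, 4]]
Bm = [[5, 6], [7, 8]]
product, cycles = matmul_outer_product(A, Bm, 2)
print(product, cycles)
```

Each N-by-N block of the result takes IY cycles, one per element of the common dimension.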

**101**, **102**, **103**, and **104** represents one use of the multiplier array. The four such stacked planes **101**, **102**, **103**, and **104** are four successive uses of the unit, one after another in time. The input multiplicand operands a[i][j] ranging from a[**0**][**3**] to a[**3**][**3**] are shown along the left side of the figure, while the multiplier operands b[i][j] ranging from b[**0**][**0**] to b[**3**][**3**] are shown at the back of the figure. These input operands are matrices, and the figure illustrates a 4×4 matrix-multiply using 4 cycles of a 4×4 multiplier array.

Each intersection of the a[i][j] operand and the b[i][j] operand represents one multiply-accumulate operation at that location. The vertical lines **105** represent the summation direction with the results r[i][j] ranging from r[**0**][**0**] to r[**3**][**3**] shown in the lower portion of

Notice that in

This description of the invention has been presented for illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above. The embodiments were chosen and described to explain the principles of the invention and its practical applications. This description will enable others skilled in the art to best utilize and practice the invention in various embodiments and with various modifications as are suited to a particular use. The scope of the invention is defined by the following claims.

## Claims

1. (canceled)

2. A system comprising:

- a memory storing a first plurality of multiplier operands and a second plurality of multiplicand operands;

- a processor coupled to the memory to receive the first plurality of multiplier operands and the second plurality of multiplicand operands and to multiply each one of the first plurality of multiplier operands with every one of the second plurality of multiplicand operands, the processor including an array of circuit tiles arranged in rows and columns, each circuit tile in the array including: a multiplier circuit coupled to receive one multiplier operand and one multiplicand operand and provide a multiplication result; an adder circuit coupled to the multiplier circuit to receive the multiplication result, add it to any previous multiplication result stored in an accumulator circuit and provide a sum; and the accumulator circuit coupled to the adder circuit to store the sum as an accumulation result.

3. A system as in claim 2 wherein:

- the array of circuit tiles has a same number n of rows and columns;

- the processor includes a register file having a bit width of r bits; and

- each one of a plurality of multiplier operands and each one of the plurality of multiplicand operands has a bit width of b bits, and an aggregate width of r bits where r=n*b.

4. A system as in claim 2 wherein the accumulator circuit stores more than one copy of the accumulation result.

5. A system as in claim 2 wherein each circuit tile in the array further includes an output stage circuit coupled to the accumulator circuit for storing the accumulation result before transfer of the accumulation result from the array.

6. A system as in claim 5 wherein each circuit tile in the array further includes a switching circuit coupled between the accumulator circuit and the output stage circuit for controlling data provided to the output stage circuit.

7. A system as in claim 6 wherein the switching circuit is also coupled to another accumulator circuit elsewhere in the array of circuit tiles having data to be provided to the output stage circuit.

8. A system as in claim 7 wherein each circuit tile in the array also includes a data processing circuit for performing further operations on data in the accumulator circuit, the data processing circuit coupled between the accumulator circuit and the output stage circuit.

9. A system as in claim 8 wherein the data processing circuit provides at least one of extraction of a portion of the data in the accumulator circuit, rounding of the data in the accumulator circuit, or application of a function to the data in the accumulator circuit.

10. A system as in claim 9 wherein the data processing circuit is coupled to at least two accumulator circuits.

11. In a system for multiplying each one of a plurality of multiplier operands with every one of a plurality of multiplicand operands:

- a memory system for storing all of the multiplier operands and multiplicand operands;

- a processor including a plurality of tiles, each tile including: a multiplier circuit coupled to receive one of the plurality of multiplier operands from the memory system and one of the plurality of multiplicand operands from the memory system and multiply them together to provide a multiplication result; an adder circuit coupled to the multiplier circuit to receive the multiplication result and add it to a previous multiplication result to provide an addition result; and an accumulator circuit coupled to the adder circuit to store the addition result.

12. A tile as in claim 11 further including an output stage circuit coupled to the accumulator circuit for storing the addition result.

13. A tile as in claim 12 further comprising a switching circuit coupled between the accumulator circuit and the output stage circuit for selecting data to be provided to the output stage circuit.

14. A tile as in claim 13 wherein the switching circuit enables data from another tile to be provided to the output stage circuit.

15. In a system having a memory storing a vector multiplier and a vector multiplicand and a processor that includes an array of circuit tiles, each circuit tile in the array including a multiplier circuit, an adder circuit and an accumulator circuit, a method of performing an outer product accumulation of the vector multiplier and the vector multiplicand comprising:

- retrieving the vector multiplier and the vector multiplicand from the memory;

- transmitting operands from the vector multiplier and operands from the vector multiplicand to each circuit tile in the array, the multiplier circuit at location [i, j] in the array receiving vector multiplier operand i and vector multiplicand operand j;

- at each multiplier circuit in the array performing a multiplication of the vector multiplier operand and the vector multiplicand operand to produce a multiplication result;

- providing the multiplication result to an adder circuit;

- adding the multiplication result to a previous multiplication result stored in the accumulator circuit to provide a new accumulated result; and

- storing the new accumulated result in the accumulator circuit.

16. A method as in claim 15 wherein a single instruction causes the processor to determine the outer product accumulation of the vector multiplier and the vector multiplicand.

17. A method as in claim 16 wherein each circuit tile in the array further includes an output stage circuit coupled to the accumulator circuit, and the method further comprises storing the new accumulation result in the output stage circuit.

18. A method as in claim 17 wherein each circuit tile in the array includes a switching circuit coupled between the accumulator circuit and the output stage circuit and coupled to another accumulator circuit, and the method further comprises operating the switching circuit to select data provided to the output stage circuit.

19. A method as in claim 18 wherein each circuit tile in the array includes a data processing circuit coupled to the accumulator circuit and the method further comprises using the data processing circuit to perform at least one of:

- extraction of a portion of data in the accumulator circuit;

- rounding of data in the accumulator circuit; and

- application of a function to data in the accumulator circuit.

**Patent History**

**Publication number**: 20190065149

**Type**: Application

**Filed**: Oct 2, 2018

**Publication Date**: Feb 28, 2019

**Applicant**: MicroUnity Systems Engineering, Inc. (Santa Clara, CA)

**Inventors**: Craig Hansen (Los Altos, CA), John Moussouris (San Francisco, CA), Alexia Massalin (Sunnyvale, CA)

**Application Number**: 16/150,224

**Classifications**

**International Classification**: G06F 7/523 (20060101); G06F 7/504 (20060101); G06F 7/544 (20060101);