Multiplier-Accumulator Circuitry and Pipeline using Floating Point Data, and Methods of using Same
An integrated circuit including a multiplier-accumulator execution pipeline including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: (i) a multiplier to multiply first input data, having a first floating point data format, by a filter weight data, having the first floating point data format, and generate and output a product data having a second floating point data format, and (ii) an accumulator, coupled to the multiplier of the associated MAC circuit, to add second input data and the product data output by the associated multiplier to generate sum data. The plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline may be connected in series and, in operation, perform a plurality of concatenated multiply and accumulate operations.
This non-provisional application claims priority to and the benefit of U.S. Provisional Application No. 62/865,113, entitled “Processing Pipeline having Floating Point Circuitry and Methods of Operating and Using Same”, filed Jun. 21, 2019. The '113 provisional application is hereby incorporated herein by reference in its entirety.
INTRODUCTION

There are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Importantly, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. All combinations and permutations thereof are intended to fall within the scope of the present inventions.
In one aspect, the present inventions are directed to one or more integrated circuits having multiplier-accumulator circuitry (and methods of operating such circuitry) for data processing (e.g., image filtering) wherein the multiplier circuitry and/or the accumulator circuitry thereof implement the multiplication and/or accumulation operations, respectively, using floating point data and/or based on a floating point data format. In one embodiment, the floating point data format of the multiplier circuitry is the same as the floating point data format of the accumulator circuitry (e.g., such as 16, 24 and 32 bits). In another embodiment, the floating point data format of the multiplier circuitry is different from the floating point data format of the accumulator circuitry. For example, the multiplier circuitry may include a 16 bit floating point multiplier and the accumulator circuitry may include a 24 or 32 bit floating point adder or accumulator.
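By way of brief illustration, the mixed-precision arrangement described above may be modeled in software (a behavioral sketch only, not the disclosed circuitry; the function names are illustrative): the operands are quantized to a 16 bit floating point format while each product is accumulated at a wider precision.

```python
# Behavioral sketch of a mixed-precision multiply-accumulate step:
# FP16 operands, wider (FP32-like) accumulation. Illustrative only.
import struct

def to_fp16(x: float) -> float:
    """Round a value to IEEE-754 half precision (the 16 bit operand format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def mac_fp16_wide(acc: float, d: float, w: float) -> float:
    """Multiply FP16-quantized data and weight; accumulate at wider precision."""
    product = to_fp16(d) * to_fp16(w)   # product carried wider than FP16
    return acc + product                # wide accumulator (e.g., FP24/FP32)

acc = 0.0
for d, w in [(0.5, 0.25), (1.5, 2.0), (0.125, 8.0)]:
    acc = mac_fp16_wide(acc, d, w)      # acc ends at 4.125 (all values exact)
```

Because each product is added at the wider precision, rounding error does not compound at 16 bit granularity across a long accumulation chain.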
Notably, the multiplier-accumulator circuitry of the present inventions may be implemented in an execution or processing pipeline including execution circuitry employing one or more floating point data formats. Here, the multiplier circuitry may be a floating point multiplier and/or the accumulator circuitry may be a floating point accumulator. In one embodiment, the execution or processing pipeline includes a plurality of multiplier-accumulator circuits, each circuit including a floating point multiplier and/or a floating point accumulator. For example, the plurality of multiplier-accumulator circuits (each having floating point processing circuitry) may be interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits. In this pipeline architecture, for example, the plurality of multiplier-accumulator circuits may concatenate the multiply and accumulate operations of the data processing.
The floating point data formats may be user or system defined and/or may be one-time programmable (e.g., at manufacture) or more than one-time programmable (e.g., (i) at or via power-up, start-up or performance/completion of the initialization sequence/process sequence, and/or (ii) in situ or during normal operation). In one embodiment, the execution circuitry (e.g., the multipliers and/or the accumulators) of the data processing pipelines includes adjustable/programmable floating point precision—which is one-time programmable (e.g., at manufacture) or more than one-time programmable.
In addition thereto, or in lieu thereof, the processing circuitry of the execution pipelines may concurrently process data to increase throughput of the pipeline. For example, in one implementation, the present inventions may include a plurality of separate multiplier-accumulator circuits (referred to herein, at times, as “MAC” or “MAC circuits”) and a plurality of registers (including, in one embodiment, a plurality of shadow registers) that facilitate pipelining of the multiply and accumulate operations wherein the circuitry of the execution pipelines concurrently process data to increase throughput of the pipeline.
Notably, the present inventions may employ and/or be implemented in conjunction with the circuitry and techniques described and/or illustrated in U.S. patent application Ser. No. 16/545,345 and U.S. Provisional Patent Application No. 62/725,306. Here, the multiplier-accumulator circuitry described and/or illustrated in the '345 and '306 applications facilitate concatenating the multiply and accumulate operations, and reconfiguring the circuitry thereof and operations performed thereby (see, e.g., the exemplary embodiments illustrated in FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345); in this way, a plurality of multiplier-accumulator circuits may be configured and/or re-configured to process data (e.g., image data) in a manner whereby the processing and operations are performed more rapidly and/or efficiently. The '345 and '306 applications are incorporated by reference herein in their entirety.
Further, the present inventions may also be employed or implemented in conjunction with the circuitry and techniques of multiplier-accumulator execution or processing pipelines (and methods of operating such circuitry) having circuitry to implement Winograd type processes to increase data throughput of the multiplier-accumulator circuitry and processing—for example, as described and/or illustrated in U.S. patent application Ser. No. 16/796,111 and U.S. Provisional Patent Application No. 62/823,161, both of which are hereby incorporated by reference in their entireties.
In addition thereto, or in lieu thereof, the present inventions may also be employed and/or be implemented in conjunction with the circuitry and techniques of multiplier-accumulator execution or processing pipelines (and methods of operating such circuitry) having circuitry and/or architectures to process data, concurrently or in parallel, to increase throughput of the pipeline—for example, as described and/or illustrated in U.S. patent application Ser. No. 16/816,164 and U.S. Provisional Patent Application No. 62/831,413; the '164 and '413 applications are hereby incorporated by reference in their entireties. Here, a plurality of processing or execution pipelines may concurrently process data to increase throughput of the data processing and overall pipeline.
Notably, the integrated circuit(s) may be, for example, a processor, controller, state machine, gate array, system-on-chip (SOC), programmable gate array (PGA) and/or FPGA and/or a processor, controller, state machine and SoC including an embedded FPGA. A field programmable gate array or FPGA means both a discrete FPGA and an embedded FPGA.
The present inventions may be implemented in connection with embodiments illustrated in the drawings hereof. These drawings show different aspects of the present inventions and, where appropriate, reference numerals, nomenclature, or names illustrating like circuits, architectures, structures, components, materials and/or elements in different figures are labeled similarly. It is understood that various combinations of the structures, components, materials and/or elements, other than those specifically shown, are contemplated and are within the scope of the present inventions.
Moreover, there are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, certain permutations and combinations are not discussed and/or illustrated separately herein. Notably, an embodiment or implementation described herein as “exemplary” is not to be construed as preferred or advantageous, for example, over other embodiments or implementations; rather, it is intended to reflect or indicate the embodiment(s) is/are “example” embodiment(s).
Notably, the configurations, block/data width, data path width, bandwidths, data lengths, values, processes, pseudo-code, operations, and/or algorithms described herein and/or illustrated in the FIGURES, and text associated therewith, are exemplary. Indeed, the inventions are not limited to any particular or exemplary circuit, logical, block, functional and/or physical diagrams, number of multiplier-accumulator circuits employed in an execution pipeline, number of execution pipelines employed in a particular processing configuration, organization/allocation of memory, block/data width, data path width, bandwidths, values, processes, pseudo-code, operations, and/or algorithms illustrated and/or described in accordance with, for example, the exemplary circuit, logical, block, functional and/or physical diagrams.
Moreover, although the illustrative/exemplary embodiments include a plurality of memories (e.g., L3 memory, L2 memory, L1 memory, L0 memory) which are assigned, allocated and/or used to store certain data and/or in certain organizations, one or more memories may be added, and/or one or more memories may be omitted and/or combined/consolidated—for example, the L3 memory or L2 memory—and/or the organizations may be changed, supplemented and/or modified. The inventions are not limited to the illustrative/exemplary embodiments of the memory organization and/or allocation set forth in the application. Again, the inventions are not limited to the illustrative/exemplary embodiments set forth herein.
Again, there are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, many of those combinations and permutations are not discussed or illustrated separately herein.
DETAILED DESCRIPTION

In one aspect, the present inventions are directed to one or more integrated circuits having multiplier-accumulator circuitry (and methods of operating such circuitry) for data processing (e.g., image filtering) wherein the multiplier circuitry performs multiplication operations and/or the accumulator circuitry performs accumulation operations using floating point data and/or based on a floating point data format. In one embodiment, the floating point data format of the multiplier circuitry is the same as the floating point data format of the accumulator circuitry (e.g., such as 16, 24 and 32 bits). In another embodiment, the floating point data format of the multiplier circuitry is different from the floating point data format of the accumulator circuitry. For example, the multiplier circuitry may include a 16 bit floating point multiplier and the accumulator circuitry may include a 24 or 32 bit floating point adder or accumulator.
The multiplier-accumulator circuitry may be implemented in an execution or processing pipeline including execution circuitry (i.e., multiplier-accumulator circuits) employing one or more floating point data formats. Here, the multiplier circuitry may be a floating point multiplier and/or the accumulator circuitry may be a floating point accumulator. In one embodiment, the execution or processing pipeline includes a plurality of multiplier-accumulator circuits, each circuit including a floating point multiplier and/or a floating point accumulator. For example, the plurality of multiplier-accumulator circuits (each having floating point processing circuitry) may be interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits. In this pipeline architecture, for example, the plurality of multiplier-accumulator circuits may concatenate the multiply and accumulate operations of the data processing.
The floating point data formats may be user or system defined and/or may be one-time programmable (e.g., at manufacture) or more than one-time programmable (e.g., (i) at or via power-up, start-up or performance/completion of the initialization sequence/process sequence, and/or (ii) in situ or during normal operation). In one embodiment, the execution circuitry (e.g., the multipliers and/or the accumulators) of the data processing pipelines includes adjustable/programmable floating point precision—which is one-time programmable (e.g., at manufacture) or more than one-time programmable.
In one embodiment, the present inventions are implemented in one or more execution or processing pipelines (e.g., for image filtering) having multiplier-accumulator circuitry—for example, circuitry disposed on an integrated circuit. With reference to
Further, during processing, the Yijk MAC values are rotated through all 64 processing elements during the 64 execution cycles after being loaded from the Yijk shifting chain (see YMEM memory), and will be unloaded with the same shifting chain.
Further, in this exemplary embodiment, “r” (e.g., 64 in the illustrative embodiment) MAC processing circuits in the execution pipeline operate concurrently whereby the multiplier-accumulator processing circuits perform r×r (e.g., 64×64) multiply-accumulate operations in each r (e.g., 64) cycle interval (here, a cycle may be nominally 1 ns). Thereafter, a next set of input pixels/data (e.g., 64) is shifted-in and the previous output pixels/data are shifted-out during the same r (e.g., 64) cycle interval. Notably, each r (e.g., 64) cycle interval processes a Dd/Yd (depth) column of input and output pixels/data at a particular (i,j) location (the indexes for the width Dw/Yw and height Dh/Yh dimensions). The r (e.g., 64) cycle execution interval is repeated for each of the Dw*Dh depth columns for this stage. In this exemplary embodiment, the filter weights or weight data are loaded into memory (e.g., the L1/L0 SRAM memories) from, for example, an external memory or processor before the stage processing starts (see, e.g., the '345 and '306 applications). In this particular embodiment, the input stage has Dw=512, Dh=256, and Dd=128, and the output stage has Yw=512, Yh=256, and Yd=64. Note that only 64 of the 128 Dd inputs are processed in each 64×64 MAC execution operation.
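The throughput figures for this exemplary stage follow from simple arithmetic; the short sketch below (values taken from the embodiment described above) tallies the cycle and operation counts.

```python
# Cycle/operation counts for the exemplary stage: r = 64 concurrent MAC
# circuits, input stage Dw = 512, Dh = 256. Illustrative arithmetic only.
r = 64                      # MAC processing circuits operating concurrently
Dw, Dh = 512, 256           # width and height of the input stage
macs_per_interval = r * r   # 64x64 multiply-accumulates per 64-cycle interval
intervals = Dw * Dh         # one r-cycle interval per (i, j) depth column
total_cycles = intervals * r
total_macs = intervals * macs_per_interval
print(total_cycles, total_macs)   # prints "8388608 536870912"
```

At a nominal 1 ns cycle, the stage's roughly 8.4 million cycles correspond to about 8.4 ms, with some 537 million multiply-accumulate operations performed.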
With continued reference to
Indeed, the method illustrated in
Notably, these techniques, which generalize the applicability of the 64×64 MAC execution pipeline, may also be utilized or extended in connection with the additional methods described in later sections of this application. Indeed, this application describes an inventive method or technique to design a floating point execution unit/circuit in a standard description language (e.g., the Verilog language). The design may be scalable through a wide range of precisions (a 6:1 ratio). In this way, the area/cost of the execution unit/circuit may be minimized and/or reduced for the numeric accuracy requirements. In one embodiment, the scaling may be implemented in a way that is compatible with the back-end logic synthesis and place/route software tool suite.
With reference to
For the purposes of illustration, a 24 bit floating point format (FP24) and a 32 bit floating point format (FP32) are employed to describe certain circuitry and/or methods of certain aspects of certain features of the present inventions. Moreover, such FP24 and FP32 formats are often described herein in the context of the addition operation. The inventions, however, are not limited to (i) particular floating point format(s), operations (e.g., addition, subtraction, etc.), block/data width, data path width, bandwidths, values, processes and/or algorithms illustrated, nor (ii) the exemplary logical or physical overview configurations, exemplary module/circuitry configuration and/or exemplary Verilog code.
As mentioned above, the present inventions may be implemented in multiplier-accumulator circuits of one or more multi-bit MAC execution pipelines, wherein the multiplier-accumulator circuits include floating point data processing circuitry (e.g., multiplier circuitry and/or accumulator circuitry that process data in a floating point data format). In one embodiment, the execution or processing pipeline includes a plurality of multiplier-accumulator circuits, each circuit including a floating point multiplier and/or a floating point accumulator. For example, the plurality of multiplier-accumulator circuits (each having floating point processing circuitry) may be interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits. In this pipeline architecture, for example, the plurality of multiplier-accumulator circuits may concatenate the multiply and accumulate operations of the data processing.
In one embodiment, the multiplier-accumulator circuits (employing floating point multiplier circuitry and/or floating point accumulator circuitry) are interconnected into execution or processing pipelines as described and/or illustrated in the '111 application. In one embodiment, the circuitry configures and controls a plurality of separate multiplier-accumulator circuits (which may be referred to, at times, as “MAC” or “MAC circuits”) or rows/banks of interconnected (in series) multiplier-accumulator circuits (referred to, at times, as clusters) to pipeline multiply and accumulate operations. In one embodiment, the plurality of multiplier-accumulator circuits (e.g., having the floating point multiplier and accumulator circuitry described above) may include a plurality of registers (including a plurality of shadow registers) wherein the circuitry also controls such registers to implement or facilitate the pipelining of the multiply and accumulate operations performed by the multiplier-accumulator circuits to increase throughput of the multiplier-accumulator execution or processing pipelines in connection with processing the related data (e.g., image data). (See, e.g., '345 and '306 applications).
In another embodiment, the interconnection of the pipeline or pipelines, (each including a plurality of MAC circuits implementing the floating point accumulator circuitry and/or the floating point multiplier circuitry of the present inventions) are configurable or programmable to provide different forms of pipelining. (See, e.g., the '111 application). Here, the pipelining architecture provided by the interconnection of the plurality of multiplier-accumulator circuits (e.g., having the floating point multiplier and accumulator circuitry) may be controllable or programmable. In this way, a plurality of multiplier-accumulator circuits may be configured and/or re-configured to form or provide the desired processing pipeline(s) to process data (e.g., image data).
For example, with reference to the '111 application, in one embodiment, control/configure circuitry may configure or determine the manner in which the multiplier-accumulator circuits having floating point processing circuitry, or rows/banks of interconnected multiplier-accumulator circuits having floating point processing circuitry, are interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits (or rows/banks of interconnected multiplier-accumulator circuits). Thus, in one embodiment, the control/configure circuitry configures or implements an architecture of the execution or processing pipeline by controlling or providing connection(s) between multiplier-accumulator circuits and/or rows of interconnected multiplier-accumulator circuits—each of which includes one or more floating point multiplier circuitry embodiments and/or one or more floating point accumulator circuitry embodiments described herein.
With reference to
Briefly, with continued reference to
With continued reference to
Notably, the embodiment of
Notably, although the illustrative or exemplary embodiments describe and/or illustrate a plurality of different memories (e.g., L3 memory, L2 memory, L1 memory, L0 memory) which are assigned, allocated and/or used to store certain data and/or in certain organizations, one or more other memories may be added, and/or one or more memories may be omitted and/or combined/consolidated—for example, the L3 memory or L2 memory—and/or the organizations may be changed. All combinations are intended to fall within the scope of the present inventions.
Moreover, in the illustrative embodiments set forth herein (text and drawings), the multiplier-accumulator circuitry and/or multiplier-accumulator pipeline is, at times, labeled “NMAX”, “NMAX pipeline”, “MAC”, or “MAC pipeline”.
With continued reference to
Notably, the X1 component may also include interface circuitry (e.g., PHY and/or GPIO circuitry) to interface with, for example, external memory (e.g., DRAM, MRAM, SRAM and/or Flash memory).
In one embodiment, the MAC execution pipeline may be any size or length (e.g., 16, 32, 64, 96 or 128 multiplier-accumulator circuits). Indeed, the size or length of the pipeline may be configurable or programmable (e.g., one-time or multiple times—such as, in situ (i.e., during operation of the integrated circuit) and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like).
In another embodiment, the one or more integrated circuits include a plurality of components or X1 components (e.g., 2, 4, . . . ), wherein each component includes a plurality of the clusters having a plurality of MAC execution pipelines. For example, in one embodiment, one integrated circuit includes a plurality of components or X1 components, each component including a plurality of clusters (e.g., 4 clusters), wherein each cluster includes a plurality of execution or processing pipelines (e.g., 16, 32 or 64) which may be configured or programmed to function and/or operate concurrently to process related data (e.g., image data). In this way, the related data is processed by each of the execution pipelines of a plurality of the clusters concurrently to, for example, decrease the processing time of the related data and/or increase data throughput of the X1 components.
As discussed in the '164 and '413 applications, both of which are incorporated by reference herein in their entireties, a plurality of execution or processing pipelines of one or more clusters of a plurality of the X1 components may be interconnected to process data (e.g., image data). In one embodiment, such execution or processing pipelines may be interconnected in a ring configuration or architecture to concurrently process related data. Here, a plurality of MAC execution pipelines (each including a plurality of MAC circuits implementing the floating point accumulator circuitry and/or the floating point multiplier circuitry of the present inventions) of one or more (or all) of the clusters of a plurality of X1 components (which may be integrated/manufactured on a single die or multiple dice) may be interconnected in a ring configuration or architecture (wherein a bus interconnects the components) to concurrently process related data. For example, a plurality of MAC execution pipelines of one or more (or all) of the clusters of each X1 component are configured to process one or more stages of an image frame such that circuitry of each X1 component processes one or more stages of each image frame of a plurality of image frames. In another embodiment, a plurality of MAC execution pipelines of one or more (or all) of the clusters of each X1 component are configured to process one or more portions of each stage of each image frame such that circuitry of each X1 component is configured to process a portion of each stage of each image frame of a plurality of image frames. In yet another embodiment, a plurality of MAC execution pipelines of one or more (or all) of the clusters of each X1 component are configured to process all of the stages of at least one entire image frame such that circuitry of each X1 component is configured to process all of the stages of at least one image frame.
Here, each X1 component is configured to process all of the stages of one or more image frames such that the circuitry of each X1 component processes a different image frame.
With reference to
In one embodiment, input data (e.g., image pixel values) are accessed in or read from memory (e.g., an L2 memory). (See, e.g.,
With continued reference to
The input filter weights, in one exemplary embodiment, are accessed in or read from L0 memory. In one embodiment, the filter weights may be previously loaded from L2 memory to L1 memory, and then from L1 memory to L0 memory. (See
Alternatively, in one embodiment, the filter weights are stored in memory (e.g., L2 memory) in an FP16 format (16 bits for sign, exponent, fraction). The filter weight values, in this embodiment, are read from memory (L2—SRAM memory) and directly stored in the L1 and L0 memory levels (i.e., without conversion). Thereafter, the filter weights are loaded into the filter weight register “F” and are available/accessible to the multiplier circuitry to implement the multiplication operation of the execution circuitry/process of the data processing circuitry. In yet another embodiment, the filter weight values are read from memory (e.g., L2 or L1—SRAM memory) and directly loaded into the filter weight register “F” for use by the multiplier circuitry of the execution circuitry/process of the data processing circuitry.
Note that other numerical precisions and/or data formats may be employed for the various values to be processed—the values shown in this exemplary embodiment represent the precision (e.g., minimum precision) that is practical for a floating point format.
With continued reference to
In one embodiment, a plurality of outputs of the accumulator circuitry may be accumulated. That is, after each result “Y” has accumulated a plurality of products, the accumulation totals may be parallel-loaded into the “MAC-SO” registers. Thereafter, the accumulation data may be serially shifted out (i.e., output) during a subsequent or the next execution sequence (e.g., to memory).
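The load-then-shift behavior described above may be sketched as follows (an illustrative software model; the class and method names are not from the disclosure): accumulation totals are parallel-loaded into the shadow "MAC-SO" stage and then drained serially toward memory.

```python
# Behavioral sketch of the MAC-SO output stage: parallel load of the
# accumulation totals, then serial shift-out. Illustrative only.
from collections import deque

class MacOutputChain:
    def __init__(self, totals):
        self.mac_so = deque(totals)     # parallel load into "MAC-SO" registers

    def shift_out(self):
        """Serially shift one value out of the chain (toward memory)."""
        return self.mac_so.popleft() if self.mac_so else None

chain = MacOutputChain([4.125, 2.0, -1.5])
streamed = [chain.shift_out() for _ in range(3)]   # [4.125, 2.0, -1.5]
```

Because the totals sit in the shadow stage, the serial shift-out can overlap the next execution sequence, which is the throughput benefit the shadow registers provide.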
Notably, with reference to
With reference to
With reference to
- [1] comparing the two exponents, and optionally swapping the two operands,
- [2] right-shift (align) the mantissa of the operand with the smaller exponent,
- [3] add (or subtract) the two mantissas,
- [4] normalize sum of the mantissas with priority-encode and left-shift and exponent adjust,
- [5] round the normalized mantissa and exponent adjust, and
- [6] generate constants for exponent and mantissa for special cases.
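The six steps above may be walked through in software for the simple case of two normalized, same-sign operands (an illustrative model only; the toy mantissa width is an assumption, and the rounding and special-case steps [5] and [6] are merely noted):

```python
# Software walk-through of the six floating point addition steps for
# normalized, same-sign operands. Toy format; not the disclosed circuit.
MAN_BITS = 7   # stored mantissa width (toy value); hidden bit sits at bit 7

def fp_add(ea, ma, eb, mb):
    """Add two values given as (exponent, mantissa-with-hidden-bit) pairs."""
    # [1] compare the two exponents, and optionally swap the two operands
    if eb > ea:
        ea, ma, eb, mb = eb, mb, ea, ma
    # [2] right-shift (align) the mantissa of the smaller-exponent operand
    mb >>= (ea - eb)
    # [3] add the two mantissas
    m, e = ma + mb, ea
    # [4] normalize with priority-encode/left-shift and exponent adjust
    while m >= (1 << (MAN_BITS + 1)):   # carry out of the hidden-bit position
        m >>= 1
        e += 1
    while m and m < (1 << MAN_BITS):    # leading zeros (subtraction path)
        m <<= 1
        e -= 1
    # [5] rounding and [6] special-case constants are omitted from this sketch
    return e, m

# 1.5 + 0.5 -> 2.0: exponent 1, mantissa 0b10000000 (hidden bit set)
print(fp_add(0, 0b11000000, -1, 0b10000000))   # prints "(1, 128)"
```

The alignment shift in step [2] and the normalization shifts in step [4] are exactly the shifting structures discussed below; the swap in step [1] corresponds to the exponent comparator and operand multiplexers.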
These processing operations/steps may be performed or implemented, in one embodiment, using an assortment of logical elements (e.g., disposed on one or more integrated circuits). For example, the first element is the 2-to-1 multiplexer, which selects one of two inputs as a function of a third control input. The second element may be logic circuits or gates (e.g., basic logic circuits or gates such as, for example, AND, OR, and/or XOR) which are typically used to implement the control logic. The third element is the shifting structures/circuits—which may be constructed from multiplexers, but also include large amounts of wiring for transporting bits horizontally. The fourth element is the add/subtract blocks; this category also includes increment and decrement blocks—basically any block with horizontal carry propagation. The fifth element is the priority encoder block. Moreover, the shift structures/circuits and priority encoder structures/circuits also transmit/transport control information horizontally.
Note that although the operands and result have a 24 bit width, the internal mantissa paths are 27 bits wide. This is intended to provide guard bits for rounding. As a result, data at a number of bit positions on the right-hand edge of the mantissa path must be extracted and used by the control logic. If it is necessary to support more than one precision size (e.g., the FPADD32 and FPADD24 examples in this analysis), it may be useful to modify certain sections of the description language (e.g., Verilog code) which control, dictate or drive the synthesis and place/route tools.
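A small numeric sketch (toy widths, chosen only for illustration) shows why the wider internal path matters: bits shifted out during alignment are retained as guard bits when the internal path is wider than the stored mantissa, and are lost otherwise.

```python
# Guard-bit illustration: aligning a mantissa inside a datapath that is
# wider than the stored mantissa preserves the shifted-out bits for rounding.
MAN, GUARD = 4, 3             # toy stored-mantissa width and guard-bit count

def align(m, shift, width):
    """Right-shift mantissa m inside an internal datapath of `width` bits."""
    return (m << (width - MAN)) >> shift   # widen first, then align

m = 0b1111                            # mantissa of the smaller-exponent operand
no_guard = align(m, 3, MAN)           # low bits fall off: 0b0001
with_guard = align(m, 3, MAN + GUARD) # shifted-out bits kept: 0b0001111
```

In the FPADD24 case above, the same idea scales to a 24 bit operand mantissa inside a 27 bit internal path, with the retained low-order bits feeding the rounding logic of step [5].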
In one aspect, the present inventions, in one embodiment, are directed to generating a single version of a Verilog description of the floating point module/circuitry. The Verilog description may be employed (with the synthesis and place/route tools) to generate a floating point addition FPADDxx design with a precision that can be selected from a continuous range (e.g., an extensive continuous range). In the examples described and illustrated, the FPxx range is from FP14 to FP39, corresponding to mantissa precision of 6 bits to 31 bits (a 5× range), or a precision of 5 bits to 30 bits if the hidden bit is discounted (a 6× range).
Notably, in one embodiment, a floating point subtraction operation (FPSUB) may be implemented using circuitry corresponding to the logic overview of
In one embodiment, the accumulation circuit may include one or more pipeline registers to facilitate implementation in connection with a plurality of execution paths. (See,
With reference to
With continued reference to
The FPADD24 architecture, module and/or operation, on the other hand, has a set of wNN parameters that are, for example, exactly “8” smaller than the FPADD32 parameters. An example of a parameter declaration and the parameter usage for the FPADD24 example is below:
parameter w26 = 18; // parameter declaration for FPADD24
wire [0:w26] MW = EAgeEB ? MA[0:w26] : MB[0:w26]; // parameter usage
This method may be defined for the module sizes of FP14 to FP39. The mantissa width(s) for these sizes are 6 bit to 31 bit (5 bit to 30 bit not counting the hidden/implicit bit). If one column from this parameter table is inserted into the FPADD module, then it may be adjusted for the corresponding size.
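As a sketch of this parameter-scaling method, a hypothetical generator (the base wNN values other than w26 are assumptions for illustration) can derive the parameter column for any FPxx size from the FPADD32 values, consistent with the "exactly 8 smaller" FPADD24 example above:

```python
# Hypothetical generator for the per-size wNN parameter column: each FPADDxx
# value is the FPADD32 value reduced by (32 - xx). Illustrative only.
def fpadd_params(size, base_params=(26, 27)):   # FPADD32 base values assumed
    assert 14 <= size <= 39, "method is defined for FP14 through FP39"
    delta = 32 - size
    return [f"parameter w{n}={n - delta};" for n in base_params]

print(fpadd_params(24))   # prints "['parameter w26=18;', 'parameter w27=19;']"
```

Emitting this column into a file for the "include" directive discussed below would let a single generated file retarget the FPADD module to any size in the FP14 to FP39 range.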
An alternative to pasting the column of parameter values into the module is to use an “include” directive. This Verilog command causes a file with Verilog code to be inserted at the position of the include directive in the description code of the module. This would allow a new FPADD size to be generated by modifying a single file. Notably, the included code would be identical to the code illustrated in
With reference to
Notably, an alternative to this use of parameters is the use of a “macro” definition. A macro may be defined with a name (label) and a text string value in the description code for the module. When the module is compiled, every instance of the macro name is replaced with the text string value. This provides the same degree of adjustability as the parameter method, and could be used as an alternate method.
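The macro alternative might be sketched as follows; the macro and module names are hypothetical, and the value mirrors the FPADD24 parameter example above:

```verilog
// Hypothetical macro-based alternative: every instance of `W26 in the
// module is text-replaced with "18" at compile time, giving the same
// degree of adjustability as the parameter method.
`define W26 18                        // FPADD24; change to 26 for FPADD32

module fpadd_macro_demo (
  input            EAgeEB,
  input  [0:`W26]  MA, MB,
  output [0:`W26]  MW
);
  assign MW = EAgeEB ? MA[0:`W26] : MB[0:`W26];
endmodule
```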
With reference to
The first example also illustrates the specification of adjustable constants. The repeat operator can accept a static parameter value, so that an operand of the form “{w27{invMSp}}” creates a vector that is “27” bits wide for the FPADD32 precision, and a scaled width for the other precision alternatives. A constant operand (not shown) would take the form “{w27{1'b1}}”: this would specify a vector of 27 logical one values in the case of FPADD32 precision, and a scaled vector width for the other precision alternatives.
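A self-contained sketch of the repeat operator with static parameter widths follows; invMSp is the control bit named in the text, while the module wrapper and output names are illustrative:

```verilog
// Sketch of the repeat (replication) operator scaled by static parameters;
// the resulting vectors are 27 bits wide for the FPADD32 column values.
module repeat_demo #(
  parameter w26 = 26,                 // FPADD32 column (18 for FPADD24)
  parameter w27 = 27                  //                (19 for FPADD24)
) (
  input          invMSp,              // single control bit
  output [0:w26] inv_vec,             // w27 copies of the control bit
  output [0:w26] ones                 // constant vector of w27 ones
);
  assign inv_vec = {w27{invMSp}};
  assign ones    = {w27{1'b1}};
endmodule
```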
Further, the first example (illustrated in
Notably, an alternate method for decomposing a variable-width addition into basic logical operations will be illustrated and discussed in detail below. Such techniques may be employed in connection with this aspect of the inventions. For example, this would allow the logic synthesis to be performed from a scalable high-level (Verilog) design that has a uniform low-level of description.
With reference to
In the context of the area summary for elements of an exemplary FPADD32 embodiment, it should be noted that applications or implementations that utilize floating point execution pipeline circuitry/hardware may have varying precision requirements. In some applications or implementations, many execution blocks will be used and, as such, it may become important to adjust the precision at each place in the silicon design to optimize silicon area, execution power and execution delay.
Notably, these aforementioned examples are estimates for CMOS components at a 16 nm process node. The area values are expressed in units of square microns (µm²). The tables are separated vertically into the various exponent and mantissa sections, and horizontally into the six basic element types. The left section of the table summarizes the number of each element type in each section, and the right section of the table multiplies the number of elements in each section by an (approximate) area parameter to give area sub-totals.
The exponent sections correspond to the blocks depicted in
As noted above, although certain of the exemplary embodiments and features of the inventions are illustrated and/or described in the context of floating point addition (FPADD) operation/module/circuit having 24 and 32 bit precision (i.e., FPADD24 and FPADD32), the embodiments and inventions are applicable to other precisions (e.g., FPxx where: xx is an integer and 14≤xx≤39). For the sake of brevity, those other precisions will not be illustrated/described separately but will be quite clear to one skilled in the art based on, for example, this application.
Upon inspection of
With reference to
Note that the difference in the widths of the two left-shift modules/circuits is 8 bit positions (the difference between the external FP32 and FP24 formats), and that the five bit control bus LS[4:0] is generated in the control logic with information from the priority encode unit. Moreover, note that the FPADD24 embodiment does not include as large a shifting range relative to FPADD32 because the FPADD24 embodiment performs shifts in the range of 0 to 17 bit positions. With that in mind, in one embodiment, the shift stage for the FPADD32 embodiment that is directed to or handles a 0 or 16 bit position shift may be replaced by a smaller unit that shifts 0 or 2 bit positions (i.e., both the LS[1] and LS[4] rows perform a 0 or 2 bit shift).
With continued reference to
With reference to
With reference to
assign result[ ] = select ? operand-true[ ] : operand-false[ ];
Moreover, the logical value of the “select” signal determines which of “operand-true” and “operand-false” is applied to or driven onto the “result” signal line or conductor. The “result”, “operand-true” and “operand-false” may be vectors. The “select” control signal, in one embodiment, is a single signal.
Notably, the five rows of multiplexers use the “w26” parameter to specify the width of the operand and result signal vectors.
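Expressed as a self-contained sketch, one such multiplexer row might look as follows; the module wrapper is hypothetical, the continuous assignment is the one quoted above (with the hyphenated placeholder names rendered as legal Verilog identifiers), and the “w26” parameter sizes the operand and result vectors:

```verilog
// One row of 2:1 multiplexers built from a single continuous assignment;
// the w26 parameter specifies the width of the operand and result vectors.
module mux_row #(
  parameter w26 = 18                  // FPADD24; 26 for FPADD32
) (
  input          select,              // single select control signal
  input  [0:w26] operand_true,
  input  [0:w26] operand_false,
  output [0:w26] result
);
  assign result = select ? operand_true : operand_false;
endmodule
```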
With reference to
The five rows of multiplexers also employ the “w26” parameter to specify the width of the operand and result signal vectors in the exemplary FPADD24 implementation. Notably, the LS4_mux row is at the bottom of the left-shift logic, as was discussed with the schematic diagram of the left-shift block for FPADD24 (see,
With reference to
Notably, in
With continued reference to
Notably, in this embodiment, a four-bit-look-ahead structure is employed to mimic the traditional carry-look-ahead structure that is being utilized by the addition block that is producing the IN[0:26] value. As such, the final PEN[4:0] that is produced on the left will settle shortly after the IN[0:26] signals from the addition block settle. A value of “11111” on the PEN[4:0] signals at the left indicates that no “1” was detected on the IN[0:26] vector. For configurations with 31 or fewer input bits (i.e., IN[0:30] or less), the max PEN[4:0] code indicates no ones were found: NoOne<=(PEN[4:0]=31). For the configuration with 32 input bits (i.e., IN[0:31]), the max PEN[4:0] code indicates either (i) no ones were found, or (ii) IN[31] was the only input bit that was a one. The case of no ones is detected by adding a gate to the control logic: NoOne<=AND (NOT(IN[31]), (PEN[4:0]=31)).
If a different width of priority encode block is employed (i.e., if an IN[0:18] width is employed for an accumulator circuit implementing an FPADD24 format), then the “A” and “B” cells may be removed from the right hand side (e.g., manually removed). Note that two different strides are used for the bit indexes: the “B” cells need bit indexes that change from [i+1] to [i], and the “A” cells need bit indexes that change from [i+4] to [i]. The vector indexes used for the continuous assignment signals may not evaluate an expression the way that the procedural assignment statements evaluate an expression. Instead, an alternate method can be used with static parameter values to create a priority encode module/circuit that will adjust to the required width by changing the parameter value at compile time, as discussed below.
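By way of a hedged sketch, a compile-time-adjustable priority encoder might be written behaviorally as follows. This is functionally equivalent at the module boundary to the look-ahead structure described above, though it does not reproduce that gate-level organization; the module and parameter names are illustrative:

```verilog
// Width-adjustable priority encoder: PEN reports the index of the leftmost
// HI bit of IN, and the all-ones code (31) indicates that no HI bit was
// found (valid for W <= 31, per the discussion above).
module prio_enc #(
  parameter W = 27                    // IN[0:26] for FPADD32; e.g., 19 for FPADD24
) (
  input  [0:W-1]   IN,
  output reg [4:0] PEN,
  output           NoOne
);
  integer i;
  always @* begin
    PEN = 5'd31;                      // default: no "1" detected
    for (i = W - 1; i >= 0; i = i - 1)
      if (IN[i]) PEN = i[4:0];        // later (lower-index) assignments win
  end
  assign NoOne = (PEN == 5'd31);
endmodule
```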
With reference to
With reference to
With reference to
As noted above, although several of the exemplary embodiments and features of the inventions are illustrated in the context of floating point addition (FPADD) operation/module/circuit having 24 and 32 bit precision (i.e., FPADD24 and FPADD32), the embodiments and inventions are applicable to other precisions (e.g., FPxx where: 14≤xx≤39). For the sake of brevity, those precisions will not be illustrated separately but will be quite clear to one skilled in the art based on, for example, this application.
With reference to
With continued reference to
With reference to
Notably, in the FPADD32 implementation, the extra At[27] and Bt[27] signals may be LO/LO because the CCIN[27] is not used (always LO). If an application did not use the At[27] and Bt[27] signals, but did use CCIN[27] (i.e., CCIN[27] may be employed to dynamically insert a carry-in of LO or HI), then the extra At[27] and Bt[27] signals will be LO/HI to allow the global carry-in to propagate to the first bit position with real data.
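The carry behavior of this extra bit position might be sketched as follows; the propagate/generate formulation is the standard carry-look-ahead identity, and the module wrapper and signal names are hypothetical:

```verilog
// Extra trailing bit position of the adder: with At[27]/Bt[27] tied LO/HI
// the position propagates (and does not generate), so a dynamic global
// carry-in can travel to the first bit position with real data.
module carry_tail (
  input  At27, Bt27,                  // LO/HI to propagate; LO/LO if CCIN[27] is unused
  input  CCIN27,                      // global carry-in
  output carry_out27                  // carry presented to the first real bit position
);
  wire p27 = At27 ^ Bt27;             // propagate term
  wire g27 = At27 & Bt27;             // generate term
  assign carry_out27 = g27 | (p27 & CCIN27);
endmodule
```

With At27/Bt27 at LO/HI the output simply follows CCIN27; with LO/LO the output is always LO, matching the two cases described above.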
With reference to
With reference to
As noted above, although several of the exemplary embodiments and features of the inventions are described and/or illustrated in the context of floating point addition operation/module/circuit having 24 and 32 bit precision (i.e., FPADD24 and FPADD32), the embodiments and inventions are applicable to other precisions (e.g., FPxx), including FP20, FP28, FP36 (see, e.g.,
An exemplary right-shift module/circuitry of FPADD24 circuitry, in one embodiment, is cut down from the FPADD32 right-shift logic (like that described above in relation to the left-shift circuitry—see
Notably, a difference in the widths of the two left-shift blocks is 8 bit positions (the difference of the external FP32 and FP24 formats) and the five bit control bus RS[4:0] is generated in the control logic with information from the exponent compare unit/circuit.
In one embodiment, the shifting range for the FPADD24 circuitry may be smaller than the shifting range of the FPADD32 circuitry because the right-shift logic of the FPADD24 implementation performs shifts in the range of 0 to 17 bit positions. Consequently, in one embodiment, the shift stage of the right-shift logic employed in the FPADD32 implementation that handles a 0 or 16 bit position shift may be replaced by a smaller unit that shifts 0 or 2 bit positions (i.e. both the RS[1] and RS[4] rows perform a 0 or 2 bit shift).
In addition, the RS[4] row of the right-shift logic in the FPADD24 circuitry is moved to the bottom of the right-shift block. This allows the shift wires of the largest-shift-row to be at the top (RS[3] for FPADD24, RS[4] for FPADD32) which thereby allows the wire capacitance thereof to be driven by the previous block while the RS[4:0] control signals settle (note—the data on the data lines is valid before the control on the control lines).
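One stage of such a logarithmic right shifter might be sketched as follows; the module and parameter names are hypothetical, while the 0-or-16 versus 0-or-2 shift distances follow the text:

```verilog
// Single right-shift row: shifts DIST positions (zero fill at the left,
// bit 0 being the most significant in the [0:w26] ordering) when rs is HI.
module rs_stage #(
  parameter w26  = 26,                // FPADD32 column value; 18 for FPADD24
  parameter DIST = 16                 // RS[4] row for FPADD32; 2 for FPADD24
) (
  input          rs,                  // one bit of the RS[4:0] control bus
  input  [0:w26] in,
  output [0:w26] out
);
  assign out = rs ? {{DIST{1'b0}}, in[0:w26-DIST]} : in;
endmodule
```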
With reference to
With reference to
assign result[ ] = select ? operand-true[ ] : operand-false[ ];
Moreover, the logical value of the “select” signal determines which of “operand-true” and “operand-false” is applied to or driven onto the “result” signal line or conductor. The “result”, “operand-true” and “operand-false” may be vectors. The “select” control signal, in one embodiment, is a single signal.
Notably, the five rows of multiplexers use the “w26”, “w25”, “w24”, “w22”, “w18”, and “w10” parameters to define or specify the width of the operand and result signal vectors. The sticky logic uses the “w26”, “w25”, “w23”, “w19”, and “w11” parameters to define or specify the width of the operand vectors.
With reference to
The five rows of multiplexers also employ the “w26”, “w25”, “w24”, “w22”, “w18”, and “w10” parameters to define or specify the width of the operand and result signal vectors. The sticky logic uses the “w26”, “w25”, “w23”, “w19”, and “w11” parameters to define or specify the width of the operand vectors. Also note that the RS4_mux row is at the bottom of the right-shift logic, as was previously discussed (see,
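The sticky logic for one such right-shift row might be sketched as follows; the module wrapper is hypothetical, and the OR-reduction of the discarded bits is the standard sticky-bit technique used so that rounding still sees whether any shifted-out bit was HI:

```verilog
// Sticky collection for one right-shift row: the DIST bits that fall off
// the right end are OR-reduced into a running sticky signal.
module rs_sticky #(
  parameter w26  = 26,                // operand width parameter (FPADD32 column)
  parameter DIST = 16                 // shift distance of this row
) (
  input          rs,                  // one bit of the RS[4:0] control bus
  input  [0:w26] in,
  input          sticky_in,           // sticky accumulated by earlier rows
  output         sticky_out
);
  wire any_lost = |in[w26-DIST+1:w26];            // OR of the bits to be discarded
  assign sticky_out = sticky_in | (rs & any_lost);
endmodule
```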
With reference to
With continued reference to
In the actual Verilog code for the FPADD32 circuit/unit, the Verilog code specific to the FPADD24 circuit/unit would be commented out (not shown); for the FPADD24 circuit/unit, the commenting would be switched (also not shown). As mentioned earlier, this switching may be handled automatically with the use of “include” statements (the additional code would be inserted from an external file). The two alternatives are functionally equivalent.
There are many inventions described and illustrated herein. While certain embodiments, features, attributes and advantages of the inventions have been described and illustrated, it should be understood that many others, as well as different and/or similar embodiments, features, attributes and advantages of the present inventions, are apparent from the description and illustrations. As such, the embodiments, features, attributes and advantages of the inventions described and illustrated herein are not exhaustive and it should be understood that such other, similar, as well as different, embodiments, features, attributes and advantages of the present inventions are within the scope of the present inventions.
Indeed, the present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof.
As noted herein, although several of the exemplary embodiments and features of the inventions are described and/or illustrated in the context of a processing pipeline (including multiplier circuitry) as well as floating point addition (FPADD) operation/module/circuit having 24 and 32 bit precision (i.e., FPADD24 and FPADD32), the embodiments and inventions are applicable in other contexts as well as to other precisions (e.g., FPxx where: xx is an integer and is greater than or equal to 14 and less than or equal to 39). For the sake of brevity, those other contexts and precisions will not be illustrated separately but will be quite clear to one skilled in the art based on, for example, this application. For example, such inventive circuitry/processes and data formats (e.g., FP24 and FP32) are often described herein in the context of the addition operation preceded by a multiplication operation. The inventions, however, are not limited to (i) particular floating point format(s), operations (e.g., addition, subtraction, etc.), block/data width, data path width, bandwidths, values, processes and/or algorithms illustrated, nor (ii) the exemplary logical or physical overview configurations of the particular circuitry and/or overall pipeline, and/or exemplary module/circuitry configuration, overall pipeline and/or exemplary Verilog code.
In addition, although the conversion circuitry, in the illustrative exemplary embodiments, increases the bit width of the floating point format of the input data and filter weights (see, e.g.
The conversion circuitry, in one embodiment, includes an adder circuit (e.g., a floating point adder) to implement or assist in connection with conversion of the data format of the data applied to the conversion circuitry (e.g., filter weight data and/or input data such as image data). The data format (e.g., the precision) of the adder circuit implemented in the conversion circuitry may be the same as or different from the accumulator or adder implemented in the multiplier-accumulator circuits of, for example, the execution pipeline (see, e.g.,
In one embodiment, the conversion circuitry, including the adder, may be disposed in the NLINK or NLINK circuit. (See, e.g.,
Aspects, features and embodiments of the NLINK and NLINK circuits are discussed in detail in '111 application and, for the sake of brevity, are not set forth again here. Moreover, the NLINK and NLINK circuits are also discussed in detail in the '345 and '306 applications (i.e., U.S. patent application Ser. No. 16/545,345 and U.S. Provisional Patent Application No. 62/725,306)—which, as mentioned above, are also incorporated by reference herein in their entirety. As indicated above, the inventions described and/or illustrated herein may be employed in conjunction with the aspects, features and embodiments of the NLINK and NLINK circuits in the '345 and '306 applications (which is referred to as NLINX therein). For example, the floating point multiplier-accumulator circuits of the present inventions may be employed in connection with the function and layout of the NLINKS (or NLINX) as described and/or illustrated in the '345 and '306 applications.
Notably, the design or architecture of the adder in the conversion circuitry may be the same as or different from the accumulator or adder implemented in the multiplier-accumulator circuits. In one embodiment, both circuits are or include parameterized architectures and may employ parameters and design/configuration techniques outlined or set forth in
As noted above, the present inventions are not limited to (i) particular floating point format(s), particular fixed point format(s), operations (e.g., addition, subtraction, etc.), block/data width or length, data path width, bandwidths, values, processes and/or algorithms illustrated, nor (ii) the exemplary logical or physical overview configurations, exemplary module/circuitry configuration and/or exemplary Verilog code.
Notably, various circuits, circuitry and techniques disclosed herein may be described using computer aided design tools and expressed (or represented), as data and/or instructions embodied in various computer-readable media, in terms of their behavioral, register transfer, logic component, transistor, layout geometries, and/or other characteristics. Formats of files and other objects in which such circuit, circuitry, layout and routing expressions may be implemented include, but are not limited to, formats supporting behavioral languages such as C, Verilog, and HLDL, formats supporting register level description languages like RTL, and formats supporting geometry description languages such as GDSII, GDSIII, GDSIV, CIF, MEBES and any other formats and/or languages now known or later developed. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the Internet and/or other computer networks via one or more data transfer protocols (e.g., HTTP, FTP, SMTP, etc.).
Indeed, when received within a computer system via one or more computer-readable media, such data and/or instruction-based expressions of the above described circuits may be processed by a processing entity (e.g., one or more processors) within the computer system in conjunction with execution of one or more other computer programs including, without limitation, net-list generation programs, place and route programs and the like, to generate a representation or image of a physical manifestation of such circuits. Such representation or image may thereafter be used in device fabrication, for example, by enabling generation of one or more masks that are used to form various components of the circuits in a device fabrication process.
Moreover, the various circuits, circuitry and techniques disclosed herein may be represented via simulations using computer aided design and/or testing tools. The simulation of the circuits, circuitry, layout and routing, and/or techniques implemented thereby, may be implemented by a computer system wherein characteristics and operations of such circuits, circuitry, layout and techniques implemented thereby, are imitated, replicated and/or predicted via a computer system. The present inventions are also directed to such simulations of the inventive circuits, circuitry and/or techniques implemented thereby, and, as such, are intended to fall within the scope of the present inventions. The computer-readable media corresponding to such simulations and/or testing tools are also intended to fall within the scope of the present inventions.
Notably, reference herein to “one embodiment” or “an embodiment” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment may be included, employed and/or incorporated in one, some or all of the embodiments of the present inventions. The usages or appearances of the phrase “in one embodiment” or “in another embodiment” (or the like) in the specification are not referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of one or more other embodiments, nor limited to a single exclusive embodiment. The same applies to the term “implementation.” The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, certain permutations and combinations are not discussed and/or illustrated separately herein.
Further, an embodiment or implementation described herein as “exemplary” is not to be construed as ideal, preferred or advantageous, for example, over other embodiments or implementations; rather, it is intended to convey or indicate the embodiment or embodiments are example embodiment(s).
Although the present inventions have been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. It is therefore to be understood that the present inventions may be practiced otherwise than specifically described without departing from the scope and spirit of the present inventions. Thus, embodiments of the present inventions should be considered in all respects as illustrative/exemplary and not restrictive.
The terms “comprises,” “comprising,” “includes,” “including,” “have,” and “having” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, circuit, article, or apparatus that comprises a list of parts or elements does not include only those parts or elements but may include other parts or elements not expressly listed or inherent to such process, method, article, or apparatus. Further, use of the terms “connect”, “connected”, “connecting” or “connection” herein should be broadly interpreted to include direct or indirect (e.g., via one or more conductors and/or intermediate devices/elements (active or passive) and/or via inductive or capacitive coupling) unless intended otherwise (e.g., use of the terms “directly connect” or “directly connected”).
The terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item. Further, the terms “first,” “second,” and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element/circuit/feature from another.
In addition, the term “integrated circuit” means, among other things, any integrated circuit including, for example, a generic or non-specific integrated circuit, processor, controller, state machine, gate array, SoC, PGA and/or FPGA. The term “integrated circuit” also means any integrated circuit (e.g., processor, controller, state machine and SoC)—including an embedded FPGA.
Further, the term “circuitry”, means, among other things, a circuit (whether integrated or otherwise), a group of such circuits, one or more processors, one or more state machines, one or more processors implementing software, one or more gate arrays, programmable gate arrays and/or field programmable gate arrays, or a combination of one or more circuits (whether integrated or otherwise), one or more state machines, one or more processors, one or more processors implementing software, one or more gate arrays, programmable gate arrays and/or field programmable gate arrays. The term “data” means, among other things, a current or voltage signal(s) (plural or singular) whether in an analog or a digital form, which may be a single bit (or the like) or multiple bits (or the like).
In the claims, the term “MAC circuit” means a multiplier-accumulator circuit having a multiplier circuit coupled to an accumulator circuit. For example, a multiplier-accumulator circuit is described and illustrated in the exemplary embodiment of FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345, and the text associated therewith. Notably, however, the term “MAC circuit” is not limited to the particular circuit, logical, block, functional and/or physical diagrams, block/data width, data path width, bandwidths, and processes illustrated and/or described in accordance with, for example, the exemplary embodiment of FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345, which, as indicated above, is incorporated by reference.
Notably, the limitations of the claims are not written in means-plus-function format or step-plus-function format. It is applicant's intention that none of the limitations be interpreted pursuant to 35 USC § 112, ¶6 or § 112(f), unless such claim limitations expressly use the phrase “means for” or “step for” followed by a statement of function and void of any specific structure.
Claims
1. An integrated circuit comprising:
- a multiplier-accumulator execution pipeline, including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: a multiplier to multiply first input data, having a first floating point data format, by a filter weight data, having the first floating point data format, and to generate and output product data having a second floating point data format, and an accumulator, coupled to the multiplier of the associated multiplier-accumulator circuit, to add second input data and the product data output by the associated multiplier to generate sum data; and
- wherein, the plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline are connected in series and, in operation, perform a plurality of concatenated multiply and accumulate operations.
2. The integrated circuit of claim 1 wherein:
- the first floating point data format is a floating point having a first precision and the second floating point data format is a floating point having a second precision.
3. The integrated circuit of claim 1 wherein:
- the accumulator of each multiplier-accumulator circuit of the plurality of multiplier-accumulator circuits adds the second input data and the product data in the second floating point data format.
4. The integrated circuit of claim 1 wherein:
- the sum data includes the second floating point data format.
5. The integrated circuit of claim 1 further including:
- first conversion circuitry, coupled to the plurality of multiplier-accumulator circuits of the multiply-accumulator execution pipeline, to convert filter weight data having a first format to the filter weight data having the first floating point data format, and
- a memory, coupled to the first conversion circuitry, to store the filter weight data having the first format.
6. The integrated circuit of claim 1 wherein:
- first conversion circuitry, coupled to the plurality of multiplier-accumulator circuits of the multiply-accumulator execution pipeline, to convert first input data having a first format to the first input data having the first floating point data format.
7. The integrated circuit of claim 1 wherein:
- the first conversion circuitry includes an adder.
8. The integrated circuit of claim 1 wherein:
- (i) the first floating point data format is 16 bit and the second floating point format is 24 or 32 bit or (ii) the first floating point data format is 16 bit or 24 bit and the second floating point format is 32 bit.
9. An integrated circuit comprising:
- a multiplier-accumulator execution pipeline, coupled to first memory, including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: a multiplier, coupled to the first memory, to multiply first input data, having a first floating point data format, by filter weight data, having the first floating point data format, and to generate and output product data having a second floating point data format, and an accumulator, coupled to the multiplier of the associated multiplier-accumulator circuit, to add second input data and the product data output by the associated multiplier to generate sum data; and
- wherein, the plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline are connected in series to form a ring architecture and, in operation, perform a plurality of concatenated multiply and accumulate operations.
10. The integrated circuit of claim 9 wherein:
- each multiplier-accumulator circuit of the plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline is connected to two multiplier-accumulator circuits of the plurality of multiplier-accumulator circuits including: a connected first multiplier-accumulator circuit to receive the second input data therefrom, and a connected second multiplier-accumulator circuit to output the sum data thereto.
11. The integrated circuit of claim 10 wherein:
- the second input data is the sum data output by the accumulator of the connected first multiplier-accumulator circuit.
12. The integrated circuit of claim 11 wherein:
- the first floating point data format is a floating point having a first precision and the second floating point data format is a floating point having a second precision.
13. The integrated circuit of claim 11 wherein:
- the accumulator of each multiplier-accumulator circuit of the plurality of multiplier-accumulator circuits adds the second input data and the product data in the second floating point data format.
14. The integrated circuit of claim 11 wherein:
- the sum data includes the second floating point data format.
15. An integrated circuit comprising:
- first memory to store data;
- first conversion circuitry, coupled to the first memory, to receive and convert the data from the first memory to first input data having a first floating point data format;
- second memory to store filter weight data having a second data format;
- second conversion circuitry, coupled to the second memory, to receive and convert the filter weight data from the second memory to filter weight data having the first floating point data format;
- a multiplier-accumulator execution pipeline, coupled to the first conversion circuitry and the second conversion circuitry, wherein the multiplier-accumulator execution pipeline includes a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: a multiplier, coupled to the first memory, to multiply the first input data, having the first floating point data format, by the filter weight data, having the first floating point data format, and to generate and output product data having a second floating point data format, and an accumulator, coupled to the multiplier of the associated multiplier-accumulator circuit, to add second input data and the product data output by the associated multiplier to generate sum data; and
- wherein, the plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline are connected in series to form a ring architecture and, in operation, perform a plurality of concatenated multiply and accumulate operations.
16. The integrated circuit of claim 15 wherein:
- each multiplier-accumulator circuit of the plurality of multiplier-accumulator circuits is connected to two multiplier-accumulator circuits of the plurality of multiplier-accumulator circuits including: a connected first multiplier-accumulator circuit to receive the second input data therefrom, and a connected second multiplier-accumulator circuit to output the sum data thereto.
17. The integrated circuit of claim 16 wherein:
- the second input data is the sum data output by the accumulator of the connected first multiplier-accumulator circuit.
18. The integrated circuit of claim 17 wherein:
- the first floating point data format is a floating point having a first precision and the second floating point data format is a floating point having a second precision.
19. The integrated circuit of claim 15 wherein:
- the sum data includes the second floating point data format, and
- the first floating point data format is a floating point having a first precision and the second floating point data format is a floating point having a second precision.
20. The integrated circuit of claim 19 wherein:
- the first floating point data format is 16 bit and the second floating point format is 24 bit or 32 bit.
21. The integrated circuit of claim 19 wherein:
- the first conversion circuitry and the second conversion circuitry each includes an adder.
Type: Application
Filed: Jun 12, 2020
Publication Date: Dec 24, 2020
Applicant: Flex Logix Technologies, Inc. (Mountain View, CA)
Inventors: Frederick A. Ware (Los Altos Hills, CA), Cheng C. Wang (San Jose, CA), Fang-Li Yuan (San Jose, CA), Nitish U. Natu (Santa Clara, CA)
Application Number: 16/900,319