Library for computer-based tool and related system and method
A library includes one or more circuit templates and an interface template. The one or more circuit templates each define a respective circuit operable to execute a respective algorithm or portion thereof. And the interface template defines a hardware layer operable to interface one of the circuits to pins of a programmable logic circuit when the layer and the one circuit are instantiated on the programmable logic circuit. Such a library may shorten the time and reduce the effort that an engineer expends designing a circuit for instantiation on a PLIC or ASIC by allowing the engineer to build the circuit from templates of previously designed and debugged circuits.
This application claims priority to U.S. Provisional Application Ser. Nos. 60/615,192, 60/615,157, 60/615,170, 60/615,158, 60/615,193, and 60/615,050, filed on Oct. 1, 2004, which are incorporated by reference.
CROSS REFERENCE TO RELATED APPLICATIONS
This application is related to U.S. patent application Ser. Nos. ______ (Attorney Docket Nos. 1934-21-3, 1934-23-3, 1934-24-3, 1934-25-3, 1934-26-3, 1934-31-3, and 1934-36-3), which have a common filing date and assignee and which are incorporated by reference.
BACKGROUND
Electronics engineers often instantiate circuits, such as logic circuits, on programmable logic integrated circuits (PLICs) such as field-programmable gate arrays (FPGAs), and on application-specific integrated circuits (ASICs). Because an engineer typically configures with firmware the circuit components and interconnections inside of a PLIC, he can modify a circuit instantiated on the PLIC merely by modifying and reloading the firmware. An example of a computer architecture that exploits the ability to configure and reconfigure circuitry within a PLIC with firmware is described in U.S. Patent Publication No. 2004/0133763, which is incorporated herein by reference.
Unfortunately, designing a circuit for instantiation on a PLIC is often difficult and time consuming, and the design difficulty and the time required to complete the design often increase with the routing resources, component density, and component variety on the PLIC.
Comparatively, when a software programmer writes source code for a software application, he can often save time by incorporating into the application previously written and debugged software objects from a software-object library. Suppose the programmer wishes to write a software application that solves for y in the following equation:
$y = x^{2} + z^{3}$ (1)
Further suppose that a software-object library includes a first software object for squaring a value (here x), a second software object for cubing a value (here z), and a third software object for summing two values (here x² and z³). When the programmer incorporates pointers to these three objects in the source code, a compiler effectively merges these objects into the software application while compiling the source code. Therefore, the object library allows the programmer to write the software application in a shorter time and with less effort because the programmer does not have to “reinvent the wheel” by writing and debugging pieces of source code that respectively square x, cube z, and sum x² and z³. Furthermore, if the programmer needs to modify the software application, he can do so without modifying and re-debugging the first, second, and third software objects.
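As a rough illustration of the software-object reuse described above, the following Python sketch builds equation (1) from three small library functions. The function names are hypothetical and stand in for the first, second, and third software objects; they are not taken from any actual library.

```python
# Hypothetical "software-object library": three small, previously debugged objects.
def square(value: float) -> float:
    """First library object: squares a value."""
    return value * value


def cube(value: float) -> float:
    """Second library object: cubes a value."""
    return value * value * value


def add(a: float, b: float) -> float:
    """Third library object: sums two values."""
    return a + b


def solve_equation_1(x: float, z: float) -> float:
    """Application code: y = x^2 + z^3, built only from the library objects."""
    return add(square(x), cube(z))


if __name__ == "__main__":
    print(solve_equation_1(2.0, 3.0))  # 4.0 + 27.0
```

Changing the application, for example to compute x² minus z³, would require a new combining function but no change to `square` or `cube`.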
In contrast, there are typically no time- or effort-saving equivalents of software objects available to a hardware engineer who wishes to design a circuit for instantiation on a PLIC; consequently, when a hardware engineer designs a circuit for instantiation on a PLIC, he typically must write the source code (e.g., Very-High-Speed-Integrated-Circuit Hardware Description Language (VHDL)) “from scratch.” Suppose that an engineer wishes to design a logic circuit that solves for y in equation (1). Because there are typically no hardware equivalents of the first, second, and third software objects described in the preceding paragraph, the engineer may write source code that describes first and second portions of a circuit for solving equation (1). The first circuit portion squares x, cubes z, and sums x² and z³, and the second circuit portion interfaces the first circuit portion to the external pins of the PLIC. The engineer then compiles the source code with a PLIC design tool (typically provided by the PLIC manufacturer), which synthesizes and routes the circuit and then generates the configuration firmware that, when loaded into the PLIC, instantiates the circuit. Next, the engineer loads the firmware into the PLIC and debugs the instantiated circuit. Unfortunately, the synthesizing and routing steps are often not trivial, and may take a number of hours or even days depending upon the size and complexity of the circuit. And even if the engineer makes only a minor modification to a small portion of the circuit, he typically must repeat the synthesizing, routing, and debugging steps for the entire circuit.
Another factor that may add to the time and effort that an engineer expends while designing a circuit for instantiation on a PLIC is that a PLIC design tool typically recognizes only hardware-specific source code. Suppose that a mathematician, who writes an equation using mathematical symbols (e.g., “+,” “−,” “≦,” “Σ,” “δ,” “σ,” “x²,” “z³,” and “√”), wishes to instantiate on a PLIC a circuit that solves for a variable in a complex equation that includes, e.g., partial derivatives and integrations. Because a PLIC design tool typically recognizes few, if any, mathematical symbols, the mathematician often must explain the equation and the desired operating parameters (e.g., latency and precision) of the circuit to a hardware engineer, who then translates the equation and operating parameters into source code that the design tool recognizes. These explanation and translation steps are often time consuming and difficult for the engineer, particularly where the equation is mathematically complex or the circuit has stringent operating parameters (e.g., high speed, high precision).
Therefore, a need has arisen for a new methodology and for a new tool for designing a circuit for instantiation on a PLIC.
SUMMARY
According to an embodiment of the invention, a library includes one or more circuit templates and an interface template. The one or more circuit templates each define a respective circuit operable to execute a respective algorithm or portion thereof. And the interface template defines a hardware layer operable to interface one of the circuits to pins of a programmable logic circuit when the layer and the one circuit are instantiated on the programmable logic circuit.
Such a library may shorten the time and reduce the effort that an engineer expends designing a circuit for instantiation on a PLIC or ASIC by allowing the engineer to build the circuit from templates of previously designed and debugged circuits.
BRIEF DESCRIPTION OF THE DRAWINGS
Introduction
A computer-based circuit design tool according to an embodiment of the invention is discussed below in conjunction with
But first is presented in conjunction with
Overview Of Concepts Related To Design Tool
Still referring to
The host processor 12 includes a processing unit 32 and a message handler 34, and the processor memory 16 includes a processing-unit memory 36 and a handler memory 38, which respectively serve as both program and working memories for the processing unit and the message handler. The processor memory 16 also includes an accelerator-configuration registry 40 and a message-configuration registry 42, which store respective configuration data that allow the host processor 12 to configure the functioning of the accelerator 14 and the structure of the messages that the message handler 34 sends and receives.
The pipelined accelerator 14 includes at least one PLIC (
Generally, in one mode of operation of the peer-vector computing machine 10, the pipelined accelerator 14 receives data from one or more software applications running on the host processor 12, processes this data in a pipelined fashion with one or more logic circuits that execute one or more mathematical algorithms, and then returns the resulting data to the application(s). As stated above, because the logic circuits execute few if any software instructions, they often process data one or more orders of magnitude faster than the host processor 12. Furthermore, because the logic circuits are instantiated on one or more PLICs, one can modify these circuits merely by modifying the firmware stored in the firmware memory 22; that is, one need not modify the hardware components of the accelerator 14 or the interconnections between these components. The operation of the peer-vector machine 10 is further discussed in previously incorporated U.S. Patent Publication No. 2004/0133763, the functional topology and operation of the host processor 12 is further discussed in previously incorporated U.S. Patent Publication No. 2004/0181621, and the topology and operation of the accelerator 14 is further discussed in previously incorporated U.S. Patent Publication No. 2004/0136241.
The unit 50 includes a circuit board 52 on which are disposed the firmware memory 22, a platform-identification memory 54, a bus connector 56, a data memory 58, and a PLIC 60.
As discussed above in conjunction with
The platform memory 54 stores a value that identifies the one or more platforms with which the pipeline unit 50 is compatible. Generally, a platform specifies a unique set of physical attributes that a pipeline unit may possess. Examples of these attributes include the number of external pins (not shown) on the PLIC 60, the width of the bus connector 56, the size of the PLIC, and the size of the data memory. Consequently, a pipeline unit 50 is compatible with a platform if the unit possesses all of the attributes that the platform specifies. So a pipeline unit 50 having a bus connector 56 with thirty-two bits is incompatible with a platform that specifies a bus connector with sixty-four bits. Some platforms may be compatible with the peer vector machine 10 (
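The compatibility rule stated above (a pipeline unit is compatible with a platform if the unit possesses all of the attributes that the platform specifies) can be sketched in Python as follows. The attribute names and numeric values are hypothetical and chosen only for illustration.

```python
from typing import Dict


def is_compatible(unit: Dict[str, int], platform: Dict[str, int]) -> bool:
    """Return True if the unit has every attribute value the platform specifies."""
    return all(unit.get(name) == value for name, value in platform.items())


# Hypothetical attribute sets; the names and numbers are illustrative only.
pipeline_unit_50 = {"plic_pins": 1020, "bus_width_bits": 32, "data_memory_mbytes": 64}
platform_32_bit = {"bus_width_bits": 32}
platform_64_bit = {"bus_width_bits": 64}

if __name__ == "__main__":
    print(is_compatible(pipeline_unit_50, platform_32_bit))
    print(is_compatible(pipeline_unit_50, platform_64_bit))
```

Under these assumptions, the unit with a thirty-two-bit bus connector matches the first platform and fails the second, which mirrors the thirty-two-bit versus sixty-four-bit example in the text.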
The bus connector 56 is a physical connector that interfaces the PLIC 60, and perhaps other components of the pipeline unit 50, with the pipeline bus 20 of
The data memory 58 acts as a buffer for storing data that the pipeline unit 50 receives from the host processor 12 (
Instantiated on the PLIC 60 are logic circuits that compose the hardwired pipeline(s) 44 and a hardware interface layer 62, which interfaces the hardwired pipelines to the external pins (not shown) of the PLIC 60, and which thus interfaces the pipelines to the pipeline bus 20 (via the connector 56), the firmware and platform-identification memories 22 and 54, and the data memory 58. Because the topology of the interface layer 62 is primarily dependent upon the attributes specified by the platform(s) with which the pipeline unit 50 is compatible, one can often modify the pipeline(s) 44 without modifying the interface layer. For example, if a platform with which the pipeline unit 50 is compatible specifies a thirty-two-bit bus, then the interface layer 62 provides a thirty-two-bit bus connection to the bus connector 56 regardless of the topology or other attributes of the pipeline(s) 44. Consequently, as discussed below in conjunction with
Still referring to
A pipeline unit similar to the unit 50 is discussed in previously incorporated U.S. Patent Publication No. 2004/0136241.
Still referring to
A communication interface 80 and an optional industry-standard bus interface 82 compose the interface-adapter layer 70, and a controller 84, exception manager 86, and configuration manager 88 compose the framework-services layer 72.
The communication interface 80 transfers data between a peer, such as the host processor 12 (
The controller 84 synchronizes the hardwired pipelines 44₁-44ₙ and monitors and controls the sequence in which they perform the respective data operations in response to communications, i.e., “events,” from other peers. For example, a peer such as the host processor 12 may send an event to the pipeline unit 50 via the pipeline bus 20 to indicate that the peer has finished sending a block of data to the pipeline unit and to cause the hardwired pipelines 44₁-44ₙ to begin processing this data. An event that includes data is typically called a “message,” and an event that does not include data is typically called a “door bell.”
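The distinction between the two kinds of events can be modeled with a small Python data class. This is only a sketch: the class, its field names, and the use of `None` to mean "no data" are assumptions made for illustration, not part of the described controller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical model of an event sent by a peer over the pipeline bus."""

    sender: str
    data: Optional[bytes] = None  # None means the event carries no data

    def kind(self) -> str:
        """An event with data is a message; an event without data is a door bell."""
        return "message" if self.data is not None else "door bell"


if __name__ == "__main__":
    print(Event(sender="host processor 12", data=b"\x01\x02").kind())
    print(Event(sender="host processor 12").kind())
```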
The exception manager 86 monitors the status of the hardwired pipelines 44₁-44ₙ, the communication interface 80, the communication shell 74, the controller 84, and the bus interface 82 (if present), and reports exceptions to the host processor 12 (
The configuration manager 88 sets the “soft” configuration of the hardwired pipelines 44₁-44ₙ, the communication interface 80, the communication shell 74, the controller 84, the exception manager 86, and the interface 82 (if present) in response to soft-configuration data from the host processor 12 (
The communication interface 80, optional industry-standard bus interface 82, controller 84, exception manager 86, and configuration manager 88 are further discussed in previously incorporated U.S. Patent Publication No. 2004/0136241.
Referring again to
The hardware-description file 100 includes a top-level template 101, which includes respective top-level definitions 102, 104, and 106 of the interface-adapter layer 70, the framework-services layer 72, and the communication shell 74 (collectively the hardware-interface layer 62) of the PLIC 60 (
The top-level definition 102 of the interface-adapter layer 70 (
Similarly, the top-level definition 104 of the framework-services layer 72 (
Likewise, the top-level definition 106 of the communication shell 74 (
The top-level definition 113 of the hardwired pipeline(s) 44 (
Moreover, the communication-shell template 112 may incorporate a hierarchy of one or more lower-level templates 116 and even lower-level templates (not shown) such that all portions of the communication shell 74 other than the hardwired pipeline(s) 44 are, at some level of the hierarchy, defined in terms of circuit components (e.g., flip-flops, logic gates) that the PLIC synthesizing and routing tool recognizes.
Still referring to
One or more of the templates 101, 108, 110, 112, 114 and the lower-level templates (not shown) incorporate the parameters defined in the configuration template 118. The PLIC synthesizer and router tool (not shown) configures the interface-adapter layer 70, the framework-services layer 72, the communication shell 74, and the hardwired pipeline(s) 44 (
Alternate embodiments of the hardware-description file 100 are contemplated. For example, although described as defining circuitry for instantiation on a PLIC, the file 100 may define circuitry for instantiation on an ASIC.
The library 120 has m+1 sections: m sections 1221-122m for the respective m platforms that the library supports, and a section 124 for the hardwired-pipelines 44 (
For example purposes, the library section 122₁ is discussed in detail, it being understood that the other library sections 122₂-122ₘ are similar.
The library section 1221 includes a top-level template 1011, which is similar in structure to the template 101 of
In this embodiment, we assume that there is only one version of the interface-adapter layer 70 and one version of the framework-services layer 72 available for each platform m, and, therefore, that the library section 122₁ includes only one interface-adapter-layer template 108₁ and only one framework-services-layer template 110₁. But in an embodiment that includes multiple versions of the interface-adapter layer 70 and multiple versions of the framework-services layer 72 for each platform m, the library section 122₁ would include multiple interface-adapter- and framework-services-layer templates 108 and 110.
The library section 122₁ also includes n communication-shell templates 112₁,₁-112₁,ₙ, which respectively correspond to the hardwired-pipeline templates 114₁-114ₙ in the library section 124. As stated above in conjunction with
In addition, the library section 1221 includes a configuration template 1181, which defines configuration constants having designer-selectable values as discussed above in conjunction with the configuration template 118 of
Furthermore, each template within the library section 1221 includes, or is associated with, a respective description 1261-1341. The descriptions 1261-1321,n describe the operational and other parameters of the circuitry that the respective templates 1011, 1081, 1101, and 1121,1-1121,n define. Similarly, the description 1341 describes the settable parameters in the configuration template 1181, the values that these parameters can have, and the meanings of these values. The design tool discussed below in conjunction with
Each of the descriptions 126₁-134₁ may be embedded within the respective template 101₁, 108₁, 110₁, 112₁,₁-112₁,ₙ, and 118₁ to which it corresponds. For example, the description 128₁ may be embedded within the template 108₁ as extensible markup language (XML) tags or comments that are readable by both a human and the tool discussed below in conjunction with
Alternatively, each description 126₁-134₁ may be disposed in a separate file that is linked to the template to which the description corresponds, and this file may be written in a language other than XML. For example, the description 126₁ may be disposed in a file that is linked to the top-level template 101₁.
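To make the embedded-description option more concrete, the following Python sketch extracts an XML description that is carried in the comment lines of a template. The tag names, attribute names, parameter values, and the use of "--" comment lines are all hypothetical; the source does not specify a schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical template text: hardware-description source whose "--" comment
# lines carry an XML description. The schema and values are illustrative only.
TEMPLATE_TEXT = """\
-- <description template="sqrt_pipeline">
--   <parameter name="latency_cycles" value="24"/>
--   <parameter name="precision_bits" value="32"/>
-- </description>
entity sqrt_pipeline is
end entity;
"""


def extract_description(template_text: str) -> Dict[str, str]:
    """Collect the XML held in comment lines and return its parameters."""
    xml_lines = [
        line[2:].strip()
        for line in template_text.splitlines()
        if line.startswith("--")
    ]
    root = ET.fromstring("\n".join(xml_lines))
    return {p.get("name"): p.get("value") for p in root.iter("parameter")}


if __name__ == "__main__":
    print(extract_description(TEMPLATE_TEXT))
```

Because the description lives in comments, a synthesis tool would ignore it while a design tool or a human reader could still read it.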
The section 1221 of the library 120 also includes a description 1361, which describes the parameters of the platform m=1. The design tool discussed below in conjunction with
Still referring to
Furthermore, each hardwired-pipeline template 114 includes, or is associated with, a respective description 1381-138n, which describes the parameters of the hardwired-pipeline 44 that the template defines. Like the descriptions 1261-1341 discussed above, the design tool discussed below in conjunction with
Referring again to the library section 1221, this section also includes a description 140 of the one or more available pipeline accelerators 14 (
Still referring to
The system 150 includes a processor (not shown) for executing the software code that composes the tool 152. Consequently, in response to the code, the processor performs the functions that are attributed to the tool 152 in the discussion below. But for clarity of explanation, the tool 152, not the processor, is described as performing the actions.
In addition to the processor, the system 150 includes an input device 154, a display device 155, and the library 120 of
The tool 152 includes a symbolic-math front end 156, an interpreter 158, a generator 160 for generating a file 162 that defines a circuit, and a simulator 164.
The front end 156 receives from the input device 154 the mathematical expression that defines the algorithm that the circuit is to execute and other design information, and converts this information into a form that is readable by the interpreter 158. To allow one to define a circuit in terms of the mathematical expression that defines the algorithm that the circuit is to execute, in one embodiment the front end 156 includes a web browser that accepts XML with a schema for Mathematical Markup Language (MathML). MathML is a software standard that allows one to enter expressions using conventional mathematical symbols. The schema of MathML is a conventional plug-in that imparts to a web browser this same ability, i.e., the ability to enter expressions using mathematical symbols. Alternatively, the front end 156 may utilize another technique for allowing one to define a circuit using a mathematical expression. Examples of such another technique include the technique used by the conventional software mathematical-expression solver MathCAD. Furthermore, as discussed below, one may enter the identity of a platform or pipeline accelerator 14 (
The interpreter 158 parses the information from the front end 156 and determines: 1) whether the library 120 includes templates 114 (
If the interpreter 158 determines that the library 120 includes a sufficient number of hardwired-pipeline templates 114 (
The file generator 160 combines the hardwired pipelines 44 (
The generator 160 then generates the file 162, which defines the circuit for executing the algorithm in terms of the hardwired pipelines 44 (
Next, the host processor 12 (
The simulator 164 receives the file 162 from the generator 160 and receives from the front end 156 designer-entered test data, such as a test vector, designer-entered constraint data, and a designer-entered exception-handling protocol, and then simulates operation of the circuit defined by the file 162. The simulator 164 also gathers parameter information (e.g., precision, latency) from the description files 138 (
Referring to
Suppose that one wishes to design a circuit that solves for a value y, which equals a mathematical expression according to the following equation:
$y = \sqrt{x^{4}\cos(z) + z^{3}\sin(x)}$ (2)
Also suppose that x, y, and z are thirty-two-bit floating-point values.
Using the input device 154, the designer enters equation (2) into the front end 156 of the tool 152 by entering the following sequence of mathematical symbols: “√”, “x⁴”, “·”, “cos(z)”, “+”, “z³”, “·”, and “sin(x)”. The designer also enters information specifying the input and output message specifications, for example indicating that x, y, and z are thirty-two-bit floating-point values. The designer may also enter information indicating desired operating parameters, such as the desired latency, in clock cycles, from inputs x and z to output y, and the desired types and precision of any intermediate values, such as cos(z) and sin(x), generated during the calculation of y. Furthermore, the designer may enter information that identifies a desired platform or pipeline accelerator 14 (
The front end 156 converts these mathematical symbols and the other information into a format compatible with the interpreter 158 if this information is not already in a compatible format.
Next, the interpreter 158 determines whether any of the hardwired-pipeline templates 114 in the library 120 defines a hardwired pipeline 44 that can solve for y in equation (2) within the specified behavior and operating parameters and that can be instantiated within the desired platform and on the desired pipeline accelerator 14 (
If the library 120 does include such a template 114, then the interpreter 158 informs the designer, via the display device 155, that a conventional FPGA synthesizing and routing tool can generate firmware for instantiating this hardwired pipeline 44 from the identified template 114, the corresponding communication-shell template 112, and the corresponding top-level template 101.
If, however, the library 120 includes no template 114 that defines a hardwired pipeline 44 that can solve for y in equation (2), then the interpreter 158 parses the equation (2) into portions, and determines whether the library includes templates 114 that define hardwired pipelines 44 for executing these portions within the specified behavior, operating parameters, and platform and on the specified pipeline accelerator 14 (
To identify a circuit that can solve for y in equation (2) but that includes the fewest number of hardwired pipelines 44, the interpreter 158 parses the equation (2) according to a top-down parsing sequence as discussed below. Typically, this top-down parsing sequence corresponds to the known algebraic laws for the order of operations.
First, the interpreter 158 parses the equation (2) into the following two portions: “√”, which is portion 170 in
If the interpreter 158 determines that the library 120 includes at least two hardwired-pipeline templates 114 that define hardwired pipelines 44 for respectively executing the portions 170 and 172 of equation (2), then the interpreter passes the identity of these templates to the file generator 160.
In this example, however, the interpreter 158 determines that although the library 120 includes a hardwired-pipeline template 114 that defines a pipeline 44 for executing the square-root operation 170 of equation (2), the library includes no hardwired-pipeline template that defines a pipeline for executing the portion 172.
Next, the interpreter 158 parses the portion 172 of equation (2). Specifically, the interpreter 158 parses the portion 172 into the following three respective portions 174, 176, and 178: “x⁴ cos(z)”, “+”, and “z³ sin(x)”.
If the interpreter 158 determines that the library 120 includes at least three hardwired-pipeline templates 114 that define hardwired pipelines 44 for respectively executing the portions 174, 176, and 178 of equation (2), then the interpreter passes the identity of these templates to the file generator 160.
In this example, however, the interpreter 158 determines that although the library 120 includes a hardwired-pipeline template 114 that defines a hardwired pipeline 44 for executing the summing operation 176 of equation (2), the library includes no templates 114 that define hardwired pipelines for executing the portions 174 or 178.
Next, the interpreter 158 parses the portions 174 and 178 of equation (2). Specifically, the interpreter 158 parses the portion 174 into three portions 180 (“x⁴”), 182 (“·”), and 184 (“cos(z)”), and parses the portion 178 into three portions 186 (“z³”), 188 (“·”), and 190 (“sin(x)”).
If the interpreter 158 determines that the library 120 does not include hardwired-pipeline templates 114 that define hardwired pipelines 44 for respectively executing each of the portions 180, 182, 184, 186, 188, and 190, then the interpreter displays via the device 155 an error message indicating that the library does not support a circuit that can solve for y in equation (2). In one embodiment of the invention, however, the library 120 includes hardwired-pipeline templates 114 that provide the primitive operations for multiplication and for raising variables to a power (e.g., cubing a value by using two multipliers in sequence) for single- or double-precision floating-point data types, and for data-type conversion. Also in this embodiment, the tool 152 recognizes common factors, for example that x is a factor of x³ if sin(x³) were needed instead of sin(x), and generates circuitry to provide these common factors from chained multipliers.
In this example, however, the interpreter 158 determines that the library 120 includes hardwired-pipeline templates 114 that define hardwired pipelines 44 for respectively executing each portion 180, 182, 184, 186, 188, and 190 of equation (2).
Then, the interpreter 158 provides to the file generator 160 the identities of all the hardwired-pipeline templates 114 that define the hardwired pipelines 44 for executing the following eight portions of equation (2): 170 (“√”), 176 (“+”), 180 (“x⁴”), 182 (“·”), 184 (“cos(z)”), 186 (“z³”), 188 (“·”), and 190 (“sin(x)”).
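The top-down parsing sequence walked through above can be approximated by a short recursive routine. The Python sketch below is an assumption-laden illustration, not the interpreter 158 itself: the tuple-based expression tree, the operation names (`pow4`, `pow3`, and so on), and the string-keyed library set are all hypothetical. At each level it first checks whether one template covers the whole portion and only then splits the portion into an operator and its operands.

```python
from typing import List, Set, Tuple, Union

# A portion is either an input variable (str) or (operator, operand, ...).
Expr = Union[str, Tuple]

# Equation (2): y = sqrt(x^4 * cos(z) + z^3 * sin(x)), as a hypothetical tree.
EQUATION_2: Expr = (
    "sqrt",
    ("+",
     ("*", ("pow4", "x"), ("cos", "z")),
     ("*", ("pow3", "z"), ("sin", "x"))),
)

# Hypothetical library contents: operations for which a template exists.
LIBRARY: Set[str] = {"sqrt", "+", "*", "pow4", "pow3", "cos", "sin"}


def show(expr: Expr) -> str:
    """Render a portion as text so that it can be looked up in the library."""
    if isinstance(expr, str):
        return expr
    op, *operands = expr
    return f"{op}({', '.join(show(o) for o in operands)})"


def select_templates(expr: Expr, library: Set[str]) -> List[str]:
    """Top-down search that prefers the fewest pipelines for a portion."""
    if isinstance(expr, str):
        return []  # a bare input variable needs no pipeline
    whole = show(expr)
    if whole in library:
        return [whole]  # one template executes the entire portion
    op, *operands = expr
    if op not in library:
        raise LookupError(f"library does not support portion: {whole}")
    selected = [op]
    for operand in operands:
        selected += select_templates(operand, library)
    return selected


if __name__ == "__main__":
    print(select_templates(EQUATION_2, LIBRARY))
```

With the hypothetical library shown, the routine selects eight operations, one for each of the eight portions listed above. If the library also held a template for the whole radicand or the whole equation, the search would stop at that level and return fewer pipelines.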
Referring to
Next, using the table 192, the file generator 160 selects the pipelines 44 from which to build a circuit that solves for y in equation (2). The generator 160 selects these pipelines 44 based on the behavior(s), operating parameter(s), platform(s), and pipeline accelerator(s) 14 (
Then, the file generator 160 interconnects the selected hardwired pipelines 44 to form a circuit 200 (
To form the circuit 200, the file generator 160 first determines how the selected hardwired pipelines 441-447 can “fit” into the resources of a specified accelerator 14 (
In this example, the generator 160 determines that each PLIC 60 (
Next, based on the platform that the designer specifies, the generator 160 “inserts” into each of the PLICs 60₁-60₈ of the pipeline units 50₁-50₈ a respective hardware-interface layer 62₁-62₈. Assuming that the designer specifies platform m=1, the generator 160 generates the layers 62₁-62₈ from the following templates in section 122₁ of the library 120: the interface-adapter-layer template 108₁, the framework-services-layer template 110₁, and the communication-shell templates 112₁,₁-112₁,₇, which respectively correspond to the pipeline templates 114₁-114₇, and thus to the pipelines 44₁-44₇. More specifically, the generator 160 generates the hardware-interface layer 62₁ from the interface-adapter-layer template 108₁, the framework-services-layer template 110₁, and the communication-shell template 112₁,₁. Similarly, the generator 160 generates the hardware-interface layer 62₂ from the templates 108₁, 110₁, and 112₁,₂, the hardware-interface layer 62₃ from the templates 108₁, 110₁, and 112₁,₃, and so on. Furthermore, because the PLICs 60₅ and 60₆ both will include the multiplier pipeline 44₅, the generator 160 generates both of the hardware-interface layers 62₅ and 62₆ from the interface-adapter and framework-services templates 108₁ and 110₁ and from the communication-shell template 112₁,₅; consequently, the hardware-interface layers 62₅ and 62₆ are identical but are instantiated on respective PLICs 60₅ and 60₆. Moreover, the generator 160 generates the hardware-interface layer 62₇ from the templates 108₁, 110₁, and 112₁,₆, and the hardware-interface layer 62₈ from the templates 108₁, 110₁, and 112₁,₇.
Then, the generator 160 “inserts” into each hardware-interface layer 62₁-62₈ a respective hardwired pipeline 44₁-44₇ (the generator 160 inserts the pipeline 44₅ into both of the hardware-interface layers 62₅ and 62₆, the pipeline 44₆ into the hardware-interface layer 62₇, and the pipeline 44₇ into the hardware-interface layer 62₈). More specifically, the generator 160 inserts the pipelines 44₁-44₇ into the hardware-interface layers 62₁-62₈ by respectively inserting the hardwired-pipeline templates 114₁-114₇ into the communication-shell templates 112₁,₁-112₁,₇.
Next, the generator 160 interconnects the pipeline units 50₁-50₈ to form the circuit 200, which generates the value y from equation (2) at its output (i.e., the output of the pipeline unit 50₈).
Referring to
The first intermediate stage 208 of the circuit 200 includes two instantiations of the pipeline 44₅ and operates as follows. The pipeline 44₅ in the PLIC 60₅ receives the streams of values sin(x) and z³ from the input stage 206 via an input portion of the hardware-interface layer 62₅ and generates, in a pipelined fashion, a corresponding stream of values z³ sin(x) via an output portion of the layer 62₅. Similarly, the pipeline 44₅ in the PLIC 60₆ receives the streams of values x⁴ and cos(z) from the input stage 206 via an input portion of the hardware-interface layer 62₆ and generates, in a pipelined fashion, a corresponding stream of values x⁴ cos(z) via an output portion of the layer 62₆.
The second intermediate stage 210 of the circuit 200 includes the hardwired pipeline 44₆, which receives the streams of values z³ sin(x) and x⁴ cos(z) from the first intermediate stage 208 via an input portion of the hardware-interface layer 62₇, and generates, in a pipelined fashion, a corresponding stream of values z³ sin(x) + x⁴ cos(z) via an output portion of the layer 62₇.
And the output stage 212 of the circuit 200 includes the hardwired pipeline 44₇, which receives the stream of values z³ sin(x) + x⁴ cos(z) from the second intermediate stage 210 via an input portion of the hardware-interface layer 62₈, and generates, in a pipelined fashion, a corresponding stream of values y = √(z³ sin(x) + x⁴ cos(z)) via an output portion of the layer 62₈.
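The staged data flow described above can be mimicked in software with chained Python generators, one per stage. This is only a behavioral sketch under stated assumptions: the function names are hypothetical, the arithmetic uses Python double-precision floats rather than thirty-two-bit values, and hardware concerns such as latency and the hardware-interface layers are not modeled.

```python
import math
from typing import Iterable, Iterator, Tuple


def input_stage(samples: Iterable[Tuple[float, float]]) -> Iterator[Tuple[float, float, float, float]]:
    """Input stage: produce sin(x), z^3, x^4, and cos(z) for each (x, z) pair."""
    for x, z in samples:
        yield math.sin(x), z ** 3, x ** 4, math.cos(z)


def multiplier_stage(values: Iterable[Tuple[float, float, float, float]]) -> Iterator[Tuple[float, float]]:
    """First intermediate stage: two multipliers working side by side."""
    for sin_x, z3, x4, cos_z in values:
        yield z3 * sin_x, x4 * cos_z


def adder_stage(products: Iterable[Tuple[float, float]]) -> Iterator[float]:
    """Second intermediate stage: sum the two products."""
    for a, b in products:
        yield a + b


def sqrt_stage(sums: Iterable[float]) -> Iterator[float]:
    """Output stage: take the square root (the sum must be non-negative)."""
    for s in sums:
        yield math.sqrt(s)


def circuit_200(samples: Iterable[Tuple[float, float]]) -> Iterator[float]:
    """Chain the four stages so that values stream through them in order."""
    return sqrt_stage(adder_stage(multiplier_stage(input_stage(samples))))


if __name__ == "__main__":
    for y in circuit_200([(1.0, 0.0), (2.0, 1.0)]):
        print(y)
```

Each generator consumes a stream and emits a stream, which loosely parallels the way each pipeline receives a stream of values from the previous stage and passes a corresponding stream to the next.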
Referring to
For example, the designer may swap out one or more of the pipelines 44₁-44₇ with one or more other pipelines from the table 192. Suppose the square-root pipeline 44₇ has a high precision but a relatively long latency per the default rules that the generator 160 follows as discussed above. If the table 192 includes another square-root pipeline having a shorter latency, then the designer may replace the pipeline 44₇ with the other square-root pipeline, for example by using the input device 154 to “drag” the other pipeline from the table into the schematic representation of the PLIC 60₈.
In addition, the designer may swap out one or more of the hardwired pipelines 441-447 with a symbolically defined polynomial series (i.e., a Taylor Series equivalent) that approximates one of the pipelined operations. Suppose the available square-root pipeline 447 has insufficient mathematical accuracy per the designer's specification and the default rules that the generator 160 follows as discussed above. If the designer then specifies a new square-root function as a series summation of related monomials, then the front end 156, interpreter 158, and file generator 160 concatenate a series of parameterized monomial circuit templates into a circuit that solves for square roots. In this way the designer replaces the default pipeline 447 with the higher-precision square-root circuit using symbolic design. This example illustrates the symbolic use of polynomials to define new mathematical functions as established by Taylor's Theorem. A more detailed example is discussed below in conjunction with
The designer may also change the topology of the circuit 200. Suppose that according to the default rules discussed above, the generator 160 places each instantiation of the hardwired pipelines 441-447 into a separate PLIC 60. But also suppose that each PLIC 60 has sufficient resources to hold multiple pipelines 44. Consequently, to reduce the number of pipeline units 50 that the circuit 200 occupies, the designer may, using the input device 154, move some of the pipelines 44 into the same PLIC. For example, the designer may move both instantiations of the multiplier pipeline 445 out of the PLICs 605 and 606 and into the PLIC 607 with the adder pipeline 446, thus reducing by two the number of PLICs that the circuit 200 occupies. The designer then manually interconnects the two instantiations of the pipeline 445 to the pipeline 446 within the PLIC 607, or may instruct the generator 160 to perform this interconnection. Although the library 120 may not include a communication-shell template 112 that defines a communication shell 74 for this combination of multiple pipelines 445 and 446, the designer or another may write such a template and debug the communication shell that the template defines without having to rewrite the interface-adapter-layer and framework-services templates 1081 and 1101 and, therefore, without having to re-debug the layers that these templates define. This rearranging of pipelines 44 within the PLICs 60 is also called “refactoring” the circuit 200.
Moreover, the designer may decide to break down one or more of the pipelines 441-447 into multiple, less complex pipelines 44. For example, to equalize the latencies in the stage 206 of the circuit 200, the designer may decide to break down the x^4 pipeline 443 into two x^2 pipelines (not shown) and a multiplier pipeline 445. Or, the designer may decide to replace the sin(x) pipeline 441 with a combination of pipelines (not shown) that represents sin(x) in a series-expansion form (e.g., Taylor series, MacLaurin series).
Referring to
Referring to
Referring to
Following the same steps described above in conjunction with the formation of the circuit 200 of
Although the library 120 includes no communication-shell templates 112 for this combination of the hardwired pipelines 441-447, for simulation purposes the tool 152 derives the operational parameters and message specifications of the hardware-interface layer 62 from the description files 1281, 1301, 1321,1-1321,4, and 1321,7. Because the PLIC 60 incorporates the interface-adapter layer 70 and framework-services layer 72 defined by the templates 1081 and 1101, the tool 152 estimates the input and output operational parameters, e.g., input and output latencies, and the message specifications of the layers 70 and 72 directly from the description files 1281 and 1301. Then, referring to
Next, the generator 160 generates the file 162, which defines the circuit 200 of
In this embodiment, the designer wants to design a circuit to solve for y in the following equation:
$$y = \sqrt{a x^4 \cos(z) + b z^3 \sin(x)} \qquad (3)$$
The only difference between equation (3) and equation (2) is the presence of the constant coefficients a and b.
Referring to
Although such a modified circuit 200 is contemplated to accommodate the constant coefficients a and b, this circuit would require two additional pipeline units 50.
Referring to
Referring to
The generator 160 then generates the file 162 to include the entered values for the coefficients a and b. These values may be contained within one or more XML tags or be present in some other form.
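As an illustration only, the following sketch shows how coefficient values carried in XML tags might be read and applied to equation (3). The tag and attribute names (`circuit`, `coefficient`, `name`, `value`) and the function names are assumptions made for this example; the actual format of the file 162 is not specified here.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the real layout of the file 162 is not specified.
CONFIG = """<circuit name="circuit_200">
  <coefficient name="a" value="2.0"/>
  <coefficient name="b" value="0.5"/>
</circuit>"""

def load_coefficients(xml_text):
    """Return a {name: value} mapping for every <coefficient> element."""
    root = ET.fromstring(xml_text)
    return {c.get("name"): float(c.get("value")) for c in root.iter("coefficient")}

def equation_3(x, z, a, b):
    """Evaluate y = sqrt(a*x^4*cos(z) + b*z^3*sin(x))."""
    return math.sqrt(a * x ** 4 * math.cos(z) + b * z ** 3 * math.sin(x))

coefficients = load_coefficients(CONFIG)
y = equation_3(1.0, 1.0, coefficients["a"], coefficients["b"])
```

Because the coefficients are data rather than circuit structure, changing a or b in such a sketch requires no change to the stages that compute the equation.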
In another variation, the values of a and b may be provided to the configuration managers 88 (
Still referring to
Referring to
For example, one or more of the functions of the tool 152 may be performed by a functional block (e.g., front end 156, interpreter 158) other than the block to which the function is attributed in the above discussion.
Furthermore, the tool 152 may be described using more or fewer functional blocks. In addition, although the tool 152 is described as either fitting the eight instantiations of the hardwired pipelines 441-447 in eight PLICs 601-608 (
Moreover, although described as allowing a designer to define a circuit using conventional mathematical symbols, alternate embodiments of the front end 156 of the tool 152 may lack this ability, or may allow one to define a circuit using other formats or languages such as C++ or VHDL.
Furthermore, although the tool 152 is described as allowing one to design a circuit for instantiation on a PLIC, the tool 152 may also allow one to design a circuit for instantiation on an ASIC.
In addition, although the tool 152 is described as generating a file 162 that defines an algorithm-implementing circuit, such as the circuit 200 (
Moreover, the tool 152 may generate, and the library 120 (
Furthermore, the tool 152 may generate, and the library 120 (
In addition, the tool 152 may generate, and the library 120 (
Moreover, the library 120 (
Referring to
Consequently, a combination of summing and multiplying hardwired pipelines 44 interconnected to generate ax+bx^2+cx^3+ . . . +vx^n can implement any function f(x) that one can expand into a MacLaurin series, where the only differences in this combination of pipelines from function to function are the values of the constant coefficients a, b, c, . . . , v. Therefore, if the tool 152 is programmed with, or otherwise has access to, the coefficients for a number of functions f(x), then the tool can implement any of these functions as a series expansion. Furthermore, because the accuracy of the implementation of a function f(x) improves with the number of expansion terms calculated and summed together, the tool 152 may set the number of expansion terms that the interconnected pipelines 44 generate based on the level of accuracy for f(x) that the circuit designer (not shown) enters into the tool. Alternatively, a designer may directly enter a function f(x) into the front end 156 (
f(x)=cos(x) is represented by the following MacLaurin series:
$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \frac{x^8}{8!} - \cdots$$
The circuit 240 includes a term-generating section 242 and a term-summing section 244. For clarity, only the parts of these sections that respectively generate and sum the first four power-of-x terms of the cos(x) series expansion are shown, it being understood that any remaining portions of these sections for respectively generating and summing the fifth and higher power-of-x terms are similar.
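The relationship between the number of expansion terms and the accuracy of the result may be illustrated numerically. The following is a minimal sketch, not the circuit 240 itself; the function name `maclaurin_cos` and the parameter `power_terms` are hypothetical.

```python
import math

def maclaurin_cos(x, power_terms=4):
    """Approximate cos(x) as 1 plus the first `power_terms` power-of-x terms.

    With power_terms=4 the result is 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!.
    """
    total = 1.0
    for k in range(1, power_terms + 1):
        total += (-1) ** k * x ** (2 * k) / math.factorial(2 * k)
    return total

# The truncation error shrinks as more terms are generated and summed.
error_two_terms = abs(maclaurin_cos(1.0, 2) - math.cos(1.0))
error_four_terms = abs(maclaurin_cos(1.0, 4) - math.cos(1.0))
```

A tool could, under this kind of model, increase the number of terms until the estimated error falls below a designer-specified tolerance.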
The term-generating section 242 includes a chain of multipliers 2461-246p (only multipliers 2461-2468 are shown) and delay blocks 2481-248q (only delay blocks 2481-2483 are shown) that generate the power-of-x terms of the cos(x) series expansion. The delay blocks 248 ensure that the multipliers 246 only multiply powers of x from the same sample time.
The term-summing section 244 includes two summing paths: a path 250 for positive numbers, and a path 252 for negative numbers. The path 250 includes a chain of adders 2541-254r (only adders 2541-2542 are shown) and delay blocks 2561-256s (only blocks 2561 and 2562 are shown). Similarly, the path 252 includes a chain of adders 2581-258t (only adder 2581 is shown) and delay blocks 2601-260u (only blocks 2601 and 2602 are shown). A final adder 262 sums the cumulative positive and negative sums from the paths 250 and 252 to provide the value for cos(x). Although the adder 262 is shown as summing the first five terms of the expansion (1 and the first four power-of-x terms), it is understood that the final adder 262 may be disposed further down the paths 250 and 252 if the circuit 240 generates additional terms of the cos(x) expansion. Where the numbers being summed are floating-point numbers, exceptions, such as a mantissa-register underflow, may occur when a positive number is summed with a negative number that is almost equal to the positive number. But by providing separate summing paths 250 and 252 for positive and negative numbers, respectively, the circuit 240 limits the number of possible locations where such exceptions can occur to a single adder 262. Consequently, providing the separate paths 250 and 252 may significantly reduce the frequency of such floating-point exceptions, and thus may reduce the time that the peer-vector machine 10 (
Still referring to
At a start time, a value x_1 is present at the input of the multiplier 2461, where the subscript “1” denotes the time or position of x_1 relative to the other values of x.
In response to a first clock edge, a value x_2 is present at the input of the multiplier 2461, and x_1^2 is present at the output of this multiplier. For brevity, this example follows only the propagation of x_1, it being understood that the propagation of x_2 and subsequent values of x is similar but delayed relative to the propagation of x_1. Furthermore, for clarity, x_1 is hereinafter referred to as “x” in this example.
In response to a second clock edge, −x^2/2! is present at the output of the multiplier 2462, x^4 is present at the output of the multiplier 2463, and x^2 is available at the output of the block 2481.
In response to a third clock edge, “1” is present at the output of the block 2561, x^4/4! is present at the output of the multiplier 2464, x^6 is present at the output of the multiplier 2465, and x^2 is available at the output of the block 2482.
In response to a fourth clock edge, −x^6/6! is present at the output of the multiplier 2466, x^8 is present at the output of the multiplier 2467, x^2 is available at the output of the block 2483, and “1+x^4/4!” is available at the output of the adder 2541.
In response to a fifth clock edge, x^8/8! is present at the output of the multiplier 2468, “1+x^4/4!” is available at the output of the block 2562, and “−x^2/2!−x^6/6!” is available at the output of the adder 2581.
In response to a sixth clock edge, “1+x^4/4!+x^8/8!” is available at the output of the adder 2542, and “−x^2/2!−x^6/6!” is available at the output of the block 2602.
And in response to a seventh clock edge, “cos(x)=1−x^2/2!+x^4/4!−x^6/6!+x^8/8!” (cos(x) approximated to the first four power-of-x terms of the MacLaurin series expansion) is available at the output of the adder 262. Therefore, in this example the latency of the circuit 240 (i.e., the number of clock cycles from when x is available at the inputs of the multiplier 2461 to when cos(x) is available at the output of the adder 262) is seven clock cycles. Furthermore, if the adder 262, while summing a positive number and a negative floating-point number, generates an exception, the exception manager 86 (
Alternatively, if the circuit 240 calculates one or more higher power-of-x terms, then the adder 262 is located after (to the right in
Still referring to
The circuit 270 includes a term-generating section 272 and a term-summing section 274. For clarity, only the parts of these sections that respectively generate and sum the first four power-of-x terms of the cos(x) series expansion are shown, it being understood that any remaining portions of these sections for respectively generating and summing the fifth and higher power-of-x terms are similar.
The term-generating section 272 includes a hierarchy of multipliers 2761-276p (only multipliers 2761-2768 are shown) and delay blocks 2781-278q (only delay blocks 2781-2782 are shown) that generate the power-of-x terms of the cos(x) series expansion. The delay blocks 278 ensure that the multipliers 276 only multiply powers of x from the same sample time.
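The benefit of a hierarchy of multipliers over a chain can be illustrated by counting the clock edge at which each even power of x first becomes available. The sketch below is a simplified model that assumes every multiplier has a latency of one clock edge; the function names are hypothetical, and the hierarchy model simply picks, for each power, the pair of lower powers that is ready earliest, which may differ from the arrangement of any particular circuit for powers above x^8.

```python
def chain_ready_edges(max_power):
    """Chain model: each even power is the previous even power times x^2."""
    return {p: p // 2 for p in range(2, max_power + 1, 2)}

def hierarchy_ready_edges(max_power):
    """Hierarchy model: each even power is the product of two lower even
    powers, chosen so that the product is available as early as possible."""
    ready = {2: 1}  # x^2 is available after the first clock edge
    for p in range(4, max_power + 1, 2):
        ready[p] = 1 + min(
            max(ready[a], ready[p - a]) for a in range(2, p // 2 + 1, 2)
        )
    return ready

chain = chain_ready_edges(8)          # x^8 needs four multiplier stages
hierarchy = hierarchy_ready_edges(8)  # x^8 = x^4 * x^4 needs only three
```

Under this model the hierarchy makes x^8 available one clock edge earlier than the chain does.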
The term-summing section 274 includes two summing paths: a path 280 for positive numbers, and a path 282 for negative numbers. The path 280 includes a chain of adders 2841-284r (only adders 2841-2842 are shown) and delay blocks 2861-286s (only block 2861 is shown). Similarly, the path 282 includes a chain of adders 2881-288t (only adder 2881 is shown) and delay blocks 2901-290u (only block 2901 is shown). A final adder 292 sums the cumulative positive and negative sums from the paths 280 and 282 to provide the value for cos(x). Although the adder 292 is shown as summing the first five terms of the expansion (1 and the first four power-of-x terms), it is understood that the final adder 292 may be disposed further down the paths 280 and 282 if the circuit 270 generates additional terms of the cos(x) expansion.
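The two summing paths described above may be sketched in software as follows. This is an illustrative model only, with a hypothetical function name; it accumulates positive and negative terms separately so that the only addition involving operands of opposite sign is the final one, which corresponds to the role of the final adder.

```python
import math

def sum_with_separate_paths(terms):
    """Return (positive_sum, negative_sum, total) for a list of series terms."""
    positive_path = 0.0
    negative_path = 0.0
    for term in terms:
        if term >= 0:
            positive_path += term   # same-sign addition only
        else:
            negative_path += term   # same-sign addition only
    # The single mixed-sign addition, analogous to the final adder.
    return positive_path, negative_path, positive_path + negative_path

# First five terms of the cos(x) expansion evaluated at x = 1.
example_terms = [
    1.0,
    -1 / math.factorial(2),
    1 / math.factorial(4),
    -1 / math.factorial(6),
    1 / math.factorial(8),
]
positive_sum, negative_sum, approx_cos_1 = sum_with_separate_paths(example_terms)
```

Confining the mixed-sign addition to one place means that any handling for near-cancellation needs to be provided at only that one location.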
Still referring to
At a start time, a value x is present at the input of the multiplier 2761.
In response to a first clock edge, x^2 is present at the output of the multiplier 2761.
In response to a second clock edge, x^4 is present at the output of the multiplier 2762, and x^2 is available at the output of the block 2781.
In response to a third clock edge, “1” is present at the output of the block 2861, x^4/4! is present at the output of the multiplier 2766, x^6 is present at the output of the multiplier 2764, −x^2/2! is available at the output of the multiplier 2765, and x^8 is available at the output of the multiplier 2763.
In response to a fourth clock edge, −x^6/6! is present at the output of the multiplier 2767, x^8/8! is present at the output of the multiplier 2768, −x^2/2! is available at the output of the block 2901, and “1+x^4/4!” is available at the output of the adder 2841.
In response to a fifth clock edge, “1+x^4/4!+x^8/8!” is available at the output of the adder 2842, and “−x^2/2!−x^6/6!” is available at the output of the adder 2881.
And in response to a sixth clock edge, “cos(x)=1−x^2/2!+x^4/4!−x^6/6!+x^8/8!” (cos(x) approximated to the first four power-of-x terms of the MacLaurin series expansion) is available at the output of the adder 292. Therefore, in this example the latency of the circuit 270 is six clock cycles, which is one fewer clock cycle than the latency of the circuit 240 of
Alternatively, if the circuit 270 calculates one or more higher power-of-x terms, then the adder 292 is located after (to the right in
Still referring to
The term generator 300 includes a register 302 for storing x, a multiplier 304, a multiplexer 306, and term-storage registers 3081-308p (only registers 3081-3084 are shown). For clarity, only the parts of the generator 300 that generate the first four power-of-x terms of the cos(x) series expansion are shown, it being understood that any remaining portions of the generator for generating the fifth and higher power-of-x terms are similar.
Still referring to
At a start time, a value x is present at the input of the register 302.
In response to a first clock edge, the current value of x is loaded into, and thus is present at the output of, the register 302, and is present at the output of the multiplexer 306, which couples its input 312 to its output. The register 302 is then disabled. Alternatively, the register 302 is not disabled but the value of x at the input of this register does not change.
In response to a second clock edge, x^2 is present at the output of the multiplier 304, and the multiplexer changes state and couples its input 314 to its output such that x^2 is also present at the output of the multiplexer 306.
In response to a third clock edge, x^2 is loaded into, and thus is available at the output of, the register 3101, and x^3 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
In response to a fourth clock edge, x^4 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
In response to a fifth clock edge, x^4 is loaded into, and thus is available at the output of, the register 3102, and x^5 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
In response to a sixth clock edge, x^6 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
In response to a seventh clock edge, x^6 is loaded into, and thus is available at the output of, the register 3103, and x^7 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
In response to an eighth clock edge, x^8 is available at the output of the multiplier 304 and at the output of the multiplexer 306.
And in response to a ninth clock edge, x^8 is loaded into, and thus is available at the output of, the register 3104, and the next value of x is loaded into the register 302. But if the generator 300 generates powers of x higher than x^8, the generator continues operating in the described manner before loading the next value of x into the register 302.
After the generator 300 generates all of the specified powers of the current value of x, the register 302, multiplier 304, multiplexer 306, and registers 310 repeat the above procedure for each subsequent value of x.
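The serial operation described above may be modeled with a short simulation. The sketch below is an assumption-laden illustration with a hypothetical function name: it treats the multiplier as producing one new power of x per clock edge by multiplying the stored value of x by the fed-back product, and it records each even power as being latched into a term-storage register on the clock edge after the power appears at the multiplier output.

```python
def generate_even_powers(x, highest_power=8):
    """Return {power: (value, clock_edge_at_which_it_is_stored)}."""
    stored = {}
    product = x   # first clock edge: x is held and passed through the multiplexer
    power = 1
    edge = 1
    while power < highest_power:
        edge += 1
        product *= x          # multiplier output: stored x times fed-back product
        power += 1
        if power % 2 == 0:
            # An even power is latched on the following clock edge.
            stored[power] = (product, edge + 1)
    return stored

powers = generate_even_powers(2.0)
```

The model reproduces the timing of the example: x^2, x^4, x^6, and x^8 are stored on the third, fifth, seventh, and ninth clock edges, respectively, so a single multiplier is traded for a longer time per input value.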
Alternative embodiments of the generator 300 are contemplated. For example, to generate the odd powers of x for a function other than cos(x), one can merely add additional registers 310 to store these values, because the multiplier 304 inherently generates these odd powers. Alternatively, the generator 300 may be modified to load x^2 into the register 302 so that the multiplier 304 thereafter generates only even powers of x. Moreover, one or more of the registers 308 may be eliminated, and the multiplexer 306 may feed the respective powers of x directly to the term multipliers, e.g., the term multipliers 2462, 2464, 2466, 2468, . . . of
f(x)=e^x is represented by the following MacLaurin series:
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$$
The circuit 320 includes a term-generating section 322 and a term-summing section 324, which includes positive- and negative-value summing paths 326 and 328. For clarity, only the parts of these sections that respectively generate and sum the first five power-of-x terms of the e^x series expansion are shown, it being understood that any remaining portions of these sections for respectively generating and summing the sixth and higher power-of-x terms are similar.
The term-generating section 322 includes a chain of multipliers 3301-330p (only multipliers 3301-3308 are shown) and delay blocks 3321-332q (only delay blocks 3321-3324 are shown) that generate the power-of-x terms of the e^x series expansion. The section 322 also includes, for each odd-power-of-x term (e.g., x, x^3, x^5, . . . ), a respective sign determiner 3341-334v (only determiners 3341-3343 are shown) that directs positive values of the odd-power-of-x term to the positive summing path 326 of the term-summing section 324, and that directs negative values of the odd-power-of-x term to the negative summing path 328.
The positive-value path 326 of the term-summing section 324 includes a chain of adders 3361-336r (only adders 3361-3365 are shown) and delay blocks 3381-338s (only blocks 3381-3383 are shown). Similarly, the negative-value path 328 includes a chain of adders 3401-340t (only adders 3401-3402 are shown) and delay blocks 3421-342u (only blocks 3421-3422 are shown). A final adder 344 sums the cumulative positive and negative sums from the paths 326 and 328 to provide the value for e^x. Although the final adder 344 is shown as summing the first six terms of the e^x expansion (“1” and the first five power-of-x terms), it is understood that the final adder may be disposed further down the paths 326 and 328 if the circuit 320 generates additional terms of the expansion.
Still referring to
At a start time, a value x is present at both inputs of the multiplier 3301, at the input of the delay block 3321, and at the input of the sign determiner 3341.
In response to a first clock edge, x^2 is available at the output of the multiplier 3301, x is available at the output of the delay block 3321, and “1” is available at the output of the delay block 3381. Furthermore, if x is positive, x and logic “0” are respectively available at the (+) and (−) outputs of the sign determiner 3341; conversely, if x is negative, logic “0” and x are respectively available at the (+) and (−) outputs of the determiner 3341.
In response to a second clock edge, x^2/2! is available at the output of the multiplier 3302, x^3 is present at the output of the multiplier 3303, and x is available at the output of the delay block 3322. Furthermore, if x is positive, “1+x” is available at the output of the adder 3361; conversely, if x is negative, “1+0=1” is present at the output of the adder 3361.
In response to a third clock edge, x^3/3! is available at the output of the multiplier 3304, x^4 is available at the output of the multiplier 3305, x is available at the output of the delay block 3323, and “1+x+x^2/2!” (x positive) or “1+x^2/2!” (x negative) is available at the output of the adder 3362.
In response to a fourth clock edge, x^4/4! is present at the output of the multiplier 3306, x^5 is present at the output of the multiplier 3307, x is available at the output of the block 3324, and “1+x+x^2/2!” (x positive) or “1+x^2/2!” (x negative) is available at the output of the delay block 3382. Furthermore, if x^3/3!, and thus x, is positive, x^3/3! and logic “0” are respectively present at the (+) and (−) outputs of the sign determiner 3342; conversely, if x^3/3!, and thus x, is negative, logic “0” and x^3/3! are respectively present at the (+) and (−) outputs of the determiner 3342. Moreover, if x is negative, then x is available at the output of the delay block 3421; conversely, if x is positive, then logic “0” is available at the output of the delay block 3421.
In response to a fifth clock edge, x^5/5! is available at the output of the multiplier 3308, “1+x+x^2/2!+x^3/3!” (x positive) or “1+x^2/2!” (x negative) is available at the output of the adder 3363, x^4/4! is available at the output of the delay block 3383, and “0” (x positive) or “−x−x^3/3!” (x negative) is available at the output of the adder 3401.
In response to a sixth clock edge, if x^5/5!, and thus x, is positive, x^5/5! and logic “0” are respectively available at the (+) and (−) outputs of the sign determiner 3343; conversely, if x^5/5!, and thus x, is negative, logic “0” and x^5/5! are respectively available at the (+) and (−) outputs of the determiner 3343. Furthermore, “1+x+x^2/2!+x^3/3!+x^4/4!” (x positive) or “1+x^2/2!+x^4/4!” (x negative) is available at the output of the adder 3364, and “0” (x positive) or “−x−x^3/3!” (x negative) is available at the output of the delay block 3422.
In response to a seventh clock edge, “1+x+x^2/2!+x^3/3!+x^4/4!+x^5/5!” (x positive) or “1+x^2/2!+x^4/4!” (x negative) is available at the output of the adder 3365, and “0” (x positive) or “−x−x^3/3!−x^5/5!” (x negative) is available at the output of the adder 3402.
And in response to an eighth clock edge, “e^x=1+x+x^2/2!+x^3/3!+x^4/4!+x^5/5!” (x positive) or “e^x=1−x+x^2/2!−x^3/3!+x^4/4!−x^5/5!” (x negative) is available at the output of the adder 344.
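The routing of terms in the example above may be summarized with a brief software sketch. The function name and the parameter `power_terms` are hypothetical, and the sketch ignores clock timing: it sends every even-power term to the positive path, routes each odd-power term by its sign, and performs one final mixed-sign addition.

```python
import math

def exp_series_split(x, power_terms=5):
    """Approximate e^x with separate positive and negative summing paths."""
    positive_path = 1.0   # the leading "1" of the expansion
    negative_path = 0.0
    for n in range(1, power_terms + 1):
        term = x ** n / math.factorial(n)
        if n % 2 == 0 or term >= 0:
            positive_path += term   # even powers are never negative
        else:
            negative_path += term   # odd power of a negative x
    return positive_path + negative_path   # final adder

approx_e = exp_series_split(1.0)
approx_inverse_e = exp_series_split(-1.0)
```

For a negative x, the positive path accumulates 1+x^2/2!+x^4/4! while the negative path accumulates the odd-power terms, matching the two cases traced in the example.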
Therefore, in this example, the latency of the circuit 320 is eight clock cycles. Furthermore, if the adder 344, while summing a positive number and a negative floating-point number, generates an exception, the exception manager 86 (
Alternatively, if the circuit 320 calculates one or more power-of-x terms higher than the fifth power, then the adder 344 is located after (to the right in
Still referring to
The sign determiner 3341 includes an input node 350, a (−) output node 352, a (+) output node 354, a register 356 that stores a logic “0”, and demultiplexers 358 and 360.
The demultiplexer 358 includes a control node 362 coupled to receive a sign bit of the value at the input node 350, a (−) input node 364 coupled to the input node 350, a (+) input node 366 coupled to the register 356, and an output node 368 coupled to the (−) output node 352.
Similarly, the demultiplexer 360 includes a control node 370 coupled to receive the sign bit of the value at the input node 350, a (−) input node 372 coupled to the register 356, a (+) input node 374 coupled to the input node 350, and an output node 376 coupled to the (+) output node 354.
Still referring to
In one operating mode, the sign determiner 3341 receives at its input node 350 a positive (+) value v, which, therefore, includes a positive sign bit. This sign bit is typically the most-significant bit of v, although the sign bit may be any other bit of v. In response to the positive sign bit, the demultiplexer 360 couples v (including the sign bit) from its (+) input node 374 to its output node 376, and thus to the (+) output node 354 of the sign determiner 3341. Furthermore, the demultiplexer 358 couples the logic “0” stored in the register 356 from the (+) input node 366 to the output node 368, and thus to the (−) output node 352 of the sign determiner 3341.
In the other operating mode, the sign determiner 3341 receives at its input node 350 a negative (−) value v, which, therefore, includes a negative sign bit. In response to the negative sign bit, the demultiplexer 358 couples v (including the sign bit) from its (−) input node 364 to its output node 368, and thus to the (−) output node 352 of the sign determiner 3341. Furthermore, the demultiplexer 360 couples the logic “0” stored in the register 356 from the (−) input node 372 to the output node 376, and thus to the (+) output node 354 of the sign determiner 3341.
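The two operating modes described above may be expressed as a small function. This is a behavioral sketch only, with a hypothetical function name; it tests the sign with a comparison rather than by inspecting a sign bit, and it uses a constant in place of the register that stores logic “0”.

```python
def sign_determiner(v):
    """Return (plus_output, minus_output) for an input value v."""
    stored_zero = 0.0      # stands in for the register holding logic "0"
    is_negative = v < 0    # stands in for reading the sign bit of v
    plus_output = stored_zero if is_negative else v
    minus_output = v if is_negative else stored_zero
    return plus_output, minus_output

positive_case = sign_determiner(3.5)    # value appears on the (+) output
negative_case = sign_determiner(-2.0)   # value appears on the (-) output
```

Exactly one output carries the input value in each mode, so downstream adders in each summing path only ever see operands of a single sign or zero.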
Still referring to
Referring to
The preceding discussion is presented to enable a person skilled in the art to make and use the invention. Various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Claims
1. A library, comprising:
- one or more circuit templates that each define a respective circuit operable to execute a respective algorithm; and
- an interface template that defines a hardware layer operable to interface one of the circuits to pins of a programmable logic circuit when the layer and the one circuit are instantiated on the programmable logic circuit.
2. The library of claim 1 wherein each circuit template includes extensible markup language that describes the respective algorithm.
3. The library of claim 1 wherein the interface template includes extensible markup language that describes the hardware layer.
4. The library of claim 1 wherein the programmable logic circuit comprises a field-programmable gate array.
5. The library of claim 1, further comprising a file that describes a platform with which the programmable logic circuit is compatible.
6. The library of claim 1 wherein the library comprises multiple circuit templates that define circuits that can be interconnected to form a resulting circuit that can be instantiated on a programmable logic circuit to execute an algorithm.
Type: Application
Filed: Oct 3, 2005
Publication Date: Apr 20, 2006
Applicant:
Inventors: John Rapp (Manassas, VA), Scott Hellenbach (Amissville, VA), T. Kurian (Manassas, VA), D. Schooley (Manassas, VA)
Application Number: 11/243,506
International Classification: G06F 17/50 (20060101); H03K 19/00 (20060101);