Programmable processor architecture hierarchical compilation
One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
This application claims the benefit of U.S. Provisional Patent Application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture” and filed on Aug. 2, 2004 and is a continuation-in-part of U.S. patent application Ser. No. 11/180,068, filed on Jul. 12, 2005 and entitled “PROGRAMMABLE PROCESSOR ARCHITECTURE”, the disclosures of both of which are incorporated herein by reference as though set forth in full.
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance and low die area that are flexibly and scalably employed in multimedia and communications applications.
2. Description of the Prior Art
With the advent of the popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPods and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. However, an emerging problem is the use of different standards dictating communications of and between different gadgets, which requires tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor currently available in the marketplace is easily programmable for use by all digital devices while conforming to the various mandated standards. It is only a matter of time before this problem grows as new trends in consumer electronics warrant even more standards adopted by the industry in the future.
One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to be under sub-hundreds of milliwatts for executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
To provide specific examples of current processor problems, the problems associated with RISC processors, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs) and some other well-known processors, each of which is used in various consumer products and each of which exhibits a unique problem, are briefly described below. These problems, along with the advantages of using each, are outlined below in a “Cons” section discussing the disadvantages thereof and a “Pros” section discussing the benefits thereof.
A. RISC/Super Scalar Processors
RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
Pros:
- Industry wide acceptance has led to a more mature tool chain and wide software choices
- A robust programming model has resulted from a very efficient automatic code generator used to generate binaries from high level languages like C.
- Processors in the category are very good general purpose solutions.
- Moore's Law can be effectively used for increasing performance.
Cons:
- The general purpose nature of the architecture does not leverage common/specific characteristics of a set or sub-set of applications for better price, power and performance.
- They consume moderate to high amounts of power with respect to the amount of computation provided.
- Performance increase is mostly achieved at the expense of pipeline latency which adversely affects several multimedia and communication algorithms.
- Complicated hardware scheduler, sophisticated control mechanisms and significantly reduced restrictions for more efficient automatic code generation for general algorithms have made this category of solutions less area efficient.
B. Very Long Instruction Word (VLIW) and DSPs
VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
Pros:
- Restricting the solution to the signal processing space improved 3P (power, price and performance) in comparison with RISC and Super Scalar architectures.
- VLIW architectures provide higher level of parallelism relative to RISC and superscalar architectures.
- An efficient tool chain and industry wide acceptance were generated fairly rapidly.
- Automatic code generation and programmability are showing significant improvements as more processors designed for signal processing fall into this category.
Cons:
- Although problem solving capability is reduced to the digital signal processing space, it is too broad for a general solution like a VLIW machine to have efficient 3P.
- Control is both expensive and power consuming especially for primitive control code in many multimedia and communication applications.
- Several power and area inefficient techniques were used to make automatic code generation easy. Strong reliance on these techniques by the software community is carrying forward this inefficiency from generation to generation.
- VLIW architectures are not well suited for processing serial code.
C. Reconfigurable Computing
Several efforts in industry and academia over the last 10 years were focused towards making a flexible solution with ASIC like price, power and performance characteristics. Many have challenged existing and matured laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser grain FPGA like architectures.
Pros:
- Some designs restricted to a specific application while providing needed flexibility within that application proved to be price, power, performance competitive
- Research showed that such restricted yet flexible solutions can be created to address many application hotspots.
Cons:
- Several designs in this space did not provide an efficient and easy programming solution and therefore were not widely accepted by a community adept in programming DSPs.
- Automatic code generation from higher level languages like C was either virtually impossible or highly inefficient for many of the designs.
- 3P advantage was lost when an attempt was made to combine heterogeneous applications using one type of interconnect and one level of granularity. Degree of utilization of the provided parallelism suffered heavily.
- Reconfiguration overhead was significant in 3P for most designs.
- In many cases, the external interface was complicated because the proprietary reconfigurable fabric did not match industry standard system design methodologies.
- Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.
D. Array of Processors
Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
Pros:
- Different processors optimized for different sets of applications when connected together using an efficient fabric can help solve a wide range of problems.
- Uniform scaling model allows a number of processors to be connected together as performance requirements increase.
- Complex algorithms can be efficiently partitioned.
Cons:
- Although performance requirements may be adequately answered, power and price inefficiencies are too high.
- The programming model varies from processor to processor. This makes the job of the application developer much harder.
- Uniform scaling of multiple processors is a very expensive and power consuming resource. This has been shown to display some non-determinism that may be detrimental to the performance of the entire system.
- The programming model at the system level suffers from complexity of communicating data, code and control information without any shared memory resources—since shared memory is not uniformly scalable.
- Extensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds to the area inefficiencies, increases power and adds to the latency.
In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.
SUMMARY OF THE INVENTION
Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
IN THE DRAWINGS
A sub-processor (“CoolProcessor”) is provided employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines and also replacing the identical processing element used in homogeneous multiprocessors (MSs).
As shown and described below with reference to
One embodiment of the present invention employs four sub-processors (referred to as “black boxes” or “processor” in the provisional application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture”). In this patent document, a processor 22 comprises a plurality of sub-processors. The four sub-processors are split into two categories. The letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths. The CoolW sub-processor, however, will support a wider range of data bits. The sub-processor is also capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz. The floating-point instruction set includes addition, subtraction, and multiplication.
The letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as required for average-quality imaging and finite-field operations in communications. Each sub-processor comprises a heterogeneous software programmable datapath connection compute engines (in the CoolW sub-processor type) or compute engines (in the CoolN sub-processor type). The internal compute engines are referred to as MFU. Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
A control circuit within each sub-processor operates as an engine and is a high-level-language programmable controller for the sub-processor. The control circuit is aided by a core sequencer underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job. A rather large instruction memory, per sub-processor, holds code for the control circuit, internal interconnects, I/O, and MFUs requiring it. Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand life.
A general purpose processor (referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications). The GPP includes its own instruction and data memory or cache.
The interconnect is based on the Sonics “smart” SoC bus. An SoC architecture can include any number of sub-processors but the number of sub-processors defines the number of threads, as will be apparent shortly.
Referring now to
Accordingly, the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously yet utilizing less power.
The product 12 is typically battery-operated and therefore consumes little power even when executing several of the applications executed by the devices 14-20. It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security.
The interface circuit 26 shown coupled to the bus 30 and interface circuit 28, shown coupled to the bus 31, include the blocks 40-66, which are generally known to those of ordinary skill in the art and used by current processors.
The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has associated therewith an instruction memory, for example, the CoolW block 74 has associated therewith an instruction memory 82, the CoolW block 76 has associated therewith an instruction memory 84, the CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88. Similarly, each of the blocks 74-80 has associated therewith a control block. The block 74 has associated therewith a control block 90, the block 76 has associated therewith a control block 92, the block 78 has associated therewith a control block 94 and the block 80 has associated therewith a control circuit 96. The blocks 74 and 76 are designed to generally operate efficiently for 16, 24, 32 and 64-bit operations or applications, whereas the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
The blocks 74-80 are essentially sub-processors and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, which gives the processor 22 its heterogeneous characteristic. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80, resulting in the lowest latency path through the sub-processor to which it is coupled. In
It should be noted that while four blocks 74-80 are shown, other numbers of blocks may be utilized; however, utilizing additional blocks clearly results in additional die space and higher manufacturing costs.
Complicated applications requiring great processing power are not scattered in the circuit 20, rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
The circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-Processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type. The W type or Wide type processor is designed for high Power, Price, Performance efficiency in applications requiring 16, 24, 32 and 64 bits of processing. The N type or Narrow type processor is designed for high efficiency in applications requiring 8, 4 and 1 bit of processing. While these bit numbers are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
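By way of a non-limiting illustration, the mapping of operand widths to sub-processor types described above may be sketched as follows. The function name `select_subprocessor_type` and the two width sets are assumptions introduced only for this sketch; they simply restate the bit widths given in the description and are not part of any disclosed tool.

```python
# Hypothetical sketch: pick a sub-processor type from an operand width.
# The width sets restate the figures in the description; the names are assumed.
W_WIDTHS = {16, 24, 32, 64}  # widths the description assigns to W-type blocks
N_WIDTHS = {1, 4, 8}         # widths the description assigns to N-type blocks


def select_subprocessor_type(bits: int) -> str:
    """Return "W" or "N" for an operand width named in the description."""
    if bits in W_WIDTHS:
        return "W"
    if bits in N_WIDTHS:
        return "N"
    # Other widths "may be readily employed", but this sketch does not guess.
    raise ValueError(f"width {bits} is not covered by this sketch")


if __name__ == "__main__":
    for width in (1, 8, 16, 64):
        print(width, select_subprocessor_type(width))
```

Widths outside the two sets raise an error in this sketch rather than being assigned arbitrarily, since the description leaves other widths open.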
Different applications require different performance or processing capabilities and are thus executed by a different type of block or sub-processor. Take, for instance, applications that are typically executed by DSPs; they would generally be processed by W type sub-processors, such as the blocks 74 or 76 of
Other commonly occurring DSP kernels can be executed by N type sub-processors, such as blocks 78 and 80 and include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
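To illustrate why kernels of this kind suit a narrow datapath, the following is a minimal sketch of one of them, a Cyclic Redundancy Check, computed one bit at a time. The polynomial 0x07, the zero initial value and the function name `crc8_bit_serial` are assumptions chosen for the example; the description does not specify any particular CRC.

```python
def crc8_bit_serial(data: bytes, poly: int = 0x07) -> int:
    """Bit-serial CRC-8 (MSB first, zero initial value, no final XOR).

    Each step consumes a single input bit, which is the kind of 1-bit
    operation the description associates with N-type sub-processors.
    """
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            in_bit = (byte >> i) & 1
            # Feedback is the register's top bit XORed with the incoming bit.
            feedback = ((crc >> 7) & 1) ^ in_bit
            crc = (crc << 1) & 0xFF
            if feedback:
                crc ^= poly
    return crc


if __name__ == "__main__":
    message = b"123456789"
    checksum = crc8_bit_serial(message)
    # Appending the checksum leaves a zero remainder for this CRC variant.
    print(hex(checksum), crc8_bit_serial(message + bytes([checksum])))
```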
Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches. The sub-processor architecture of the processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture. Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22, as presented below.
“Quasi-Adiabatic Programmable” Processors are also referred to as Concurrent Applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In terms of thermodynamics, Adiabatic Processes do not waste heat and transfer all the used energy to performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make an Adiabatic Processor. However, among the different possible processor architectures, some may be closer to Adiabatic. The various embodiments of the present invention show a class of processor architectures which are significantly closer to Adiabatic as compared to the architectures of prior art, while they are, nevertheless, programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
The integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors. Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include but are not limited to downloading an application from a wireless device while decoding a movie that has been received, thus, a movie can be downloaded and decoded simultaneously. Due to achieving simultaneous application execution on the integrated circuit 20, which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of
Each of the blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A sequence of programs refers to a function associated with a particular application. For example, FFT is a type of sequence. However, different sequences may be dependent on one another. For example, an FFT program, once completed, may store its results in the memory 70 and the next sequence may then use the stored result. Different sequences sharing information in this manner or being dependent upon each other in this manner is referred to as “stream flow”.
In
The instruction memories 82, 84, 86 and 88 are used to store instructions for execution by the blocks 74-80, respectively.
Included within the software architecture 302, a hardware abstraction layer or low level drivers 306 and an operating systems driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302. The software architecture 302 is further shown to include a CoolBios (basic input output system) 310 coupled to the hardware components 304 and to a scenario 312, which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally-intense functions, such as fast Fourier transforms (FFTs), DCTs, Finite Impulse Response (FIR) filtering and others known in the industry. The software architecture 302 is further shown to include a system level software changes scenarios 318, which is shown to communicate with an operating systems interface (OSI) 322 and an operating system 320. The operating system 320 is further shown to communicate with the scenario 312, applications 314, and kernels 316. The kernels 316 are engines for execution of computationally intensive code, generally in assembly, or low level code.
Each of the applications 314 includes many kernels, such as the kernels 316 for conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application. The scenario-level software 310 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario. The scenario 318 causes scenarios to be changed while running on the hardware 304. From a software perspective, each of the kernels 316 is written in assembly code for executing an FFT or other computationally-intensive functions while the scenario 312 and each of the applications 314 are in a higher level language, such as “C”, for reasons that will become apparent shortly. For now, suffice it to say that the combination of assembly and a higher level language being executed on a sub-processor CoolW or CoolN and a control block included therein, as the hardware architecture of
The CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22.
The hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario. A “scenario”, as referred to herein, is a set of applications, such as the applications 314, executing concurrently. Some examples of each of the applications 314, as shown in
The software architecture 302 and the hardware components 304 of
The scenario 312 includes information, in its header, overhead information, to cause turning on or off each of the different applications 314. For example, the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off. Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application with some limitations, as are now discussed.
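The on/off control carried in the scenario header may be pictured with the following minimal sketch. The `Scenario` class, its `enabled` field and its method names are hypothetical and introduced only for illustration; the actual header layout is not given in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical model of a scenario whose header holds per-application flags."""

    # Maps application name -> True (on) or False (off); the layout is assumed.
    enabled: dict = field(default_factory=dict)

    def turn_on(self, app: str) -> None:
        self.enabled[app] = True

    def turn_off(self, app: str) -> None:
        if app not in self.enabled:
            raise KeyError(f"{app} is not part of this scenario")
        self.enabled[app] = False

    def active(self) -> list:
        """Applications currently turned on, in the order they were added."""
        return [app for app, on in self.enabled.items() if on]


if __name__ == "__main__":
    scenario = Scenario()
    for app in ("JPEG", "MP3", "H.264", "802.11g"):
        scenario.turn_on(app)
    scenario.turn_off("JPEG")  # the unused application no longer draws power
    print(scenario.active())
```

The example mirrors the case in the text: JPEG is turned off while MP3, H.264 and 802.11g remain on.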
Essentially, there are three modes of operation within the software architecture 302. One is real-time mode, an example of which is 802.11g, which has hardware time constraints. In this case, it is not feasible to add another application because a scenario 312 that includes an 802.11g application has compiled the latter, and in the presence of a pre-compiled application, a new application cannot be added. Generally, in the presence of applications having a timing constraint, a new application is not readily added, nor are scenarios readily changed dynamically, because doing so disturbs the processing balance; however, this is not an issue in mobile applications because scenarios are not readily changed in such applications.
The scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off. The pre-compiled and scheduled scenario 312, which is in binary form is then stored in one of the sub-processors, such as the sub-processor 74. Turning off an application prevents “choking” of the system, that is, bandwidth is improved.
The system level software changes scenarios 318 causes changing of the scenario 312, which, as previously stated, may be done dynamically. The code in the latter is in “C” or a high level code. The scenario 312 is written in scenario description language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
On the right-hand side of
The drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340 while, in the next level of the hierarchical tools, a scenario compiler 348 is used, by an application programmer, to allocate resources for execution on one or more particular sub-processors. The kernels 316 are then advantageously partitioned. An application is divided into smaller portions or threads, switching from one kernel to another.
The number of threads is limited to the number of sub-processors. An application is handed from one kernel to another as follows: the kernel 316 that is currently operating finishes a particular function, saves the result of the function in shared memory and signals completion of its function, and then another kernel 316 utilizes the stored information in shared memory to perform another function. A synchronization code is used for this hand-off, which is done by the scenario 312; the particular tool is the scenario compiler 348, and the hand-off is automated. Thus, synchronization and control code are generated automatically due to the presence of the thread.
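The hand-off between kernels through shared memory may be sketched as follows. This is only a software analogy using Python threads and an event flag; the function names, the dictionary standing in for shared memory and the placeholder arithmetic are all assumptions, and the actual synchronization code is generated by the scenario compiler 348 rather than written by hand.

```python
import threading


def run_handoff(samples):
    """Run two hypothetical kernels that hand off a result via shared memory."""
    shared_memory = {}          # stands in for a shared data memory
    done = threading.Event()    # stands in for the completion signal
    results = []

    def first_kernel():
        # Finish the function, save the result, then signal completion.
        shared_memory["stage1"] = [2 * s for s in samples]  # placeholder work
        done.set()

    def second_kernel():
        # Wait for the signal before reading the stored result.
        done.wait()
        results.append(sum(shared_memory["stage1"]))

    consumer = threading.Thread(target=second_kernel)
    producer = threading.Thread(target=first_kernel)
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results[0]


if __name__ == "__main__":
    print(run_handoff([1, 2, 3]))
```

Starting the consumer first shows that ordering is enforced by the completion signal rather than by the order in which the kernels are launched.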
In the next level of the tool hierarchy, as shown in the column 340, a controller/compiler 350 is used to compile a high level language being employed, such as “C”, which includes two parts, an optimizing assembler 352 and a low level assembler 354. The goal is to allow the programmer to write mostly C or high level code, rather than assembly, as the former is easier. This is easily allowed for given the sub-processor and hierarchical architecture of the present invention. The compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code, written by a user or programmer, is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
By changing scenarios, multiple applications can be performed; for example, the functions of a digital camera and a PDA can be performed in a single device simultaneously. The ability to do so results in forgoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable as a device that is to be used with a certain scenario need not normally be quickly programmed to include another scenario in mobile handheld device applications.
By way of example, if a manufacturer introduces a product, such as a PDA, this is compiled along with other applications, such as a digital camera or MP3, etc., and a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention. Such a pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario, but the time to switch to another scenario is far from urgent.
In
In the low level assembler 354, scheduling is done with all of the hardware components available, whereas the optimizing assembler 352 includes more restrictions because it operates at a higher level but is able to schedule more. Area and power are saved by less scheduling. The hierarchical flow of column 340 and the hardware architecture of the processor of
With continued reference to
Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory. The scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions, based upon which synchronization code is generated to evaluate these conditions at run-time. The compiler 350 targets the subset of a sub-processor or the control block located therein (such as the control block 90) that executes application control code and the scenario control and synchronization code.
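As a rough illustration of resolving a coarse grain dependency graph into an execution order and trigger conditions, the following sketch uses a plain topological sort from the Python standard library. It is not the scheduling algorithm of the scenario compiler 348, which is not detailed here; the function `build_schedule` and the kernel names are assumptions made for the example.

```python
from graphlib import TopologicalSorter


def build_schedule(dependencies):
    """Order kernels so that each runs after the kernels it depends on.

    `dependencies` maps a kernel name to the set of kernels whose results it
    needs. Returns the order plus, per kernel, the completions that must be
    observed before it may start (its hypothetical "trigger condition").
    """
    order = list(TopologicalSorter(dependencies).static_order())
    triggers = {kernel: sorted(dependencies.get(kernel, ())) for kernel in order}
    return order, triggers


if __name__ == "__main__":
    # Hypothetical chain: motion estimation feeds the DCT, which feeds VLC.
    deps = {"dct": {"motion_estimation"}, "vlc": {"dct"}}
    order, triggers = build_schedule(deps)
    print(order)
    print(triggers)
```

A cyclic graph makes `TopologicalSorter` raise `graphlib.CycleError`, which corresponds to a set of dependencies that cannot be resolved into any schedule.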
The optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
The scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources. The scenario compiler uses scheduling algorithms from the existing art to create the schedule. The scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler. The scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device. The scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors. The scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348.
The SDL allows for a collection of functionality used in the present invention. The Scenario Description Language (SDL) is a language created for the purpose of creating high-level, abstract descriptions of scenarios and the applications contained within. SDL is compact, human-readable, and scalable. SDL provides language syntax and semantics to describe: the flow of data into and out of the sub-processors and between functions executing on the sub-processor; the amount of storage required to stream data through the applications executing on the sub-processor; the priority of each application to facilitate the creation of a functionally correct schedule that satisfies latency requirements; the amount of data (and its type) produced and consumed by each function; the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and the placement of each function onto W- or N-type sub-processors.
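The kinds of information that an SDL description carries, as listed above, may be pictured with the following sketch. It does not show SDL syntax, which is not reproduced in this description; the `FunctionDescription` record, its field names, the microsecond unit and the sample values are all hypothetical and serve only to show how such attributes could feed a schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionDescription:
    """Hypothetical record of the per-function attributes the text lists."""

    name: str
    application: str
    priority: int            # priority of the owning application
    bytes_consumed: int      # amount of data consumed per invocation
    bytes_produced: int      # amount of data produced per invocation
    worst_case_time_us: int  # maximum execution time (unit assumed)
    placement: str           # "W" or "N" sub-processor type


def worst_case_load(functions):
    """Sum the worst-case execution time placed on each sub-processor type."""
    load = {"W": 0, "N": 0}
    for fn in functions:
        if fn.placement not in load:
            raise ValueError(f"unknown placement {fn.placement!r}")
        load[fn.placement] += fn.worst_case_time_us
    return load


if __name__ == "__main__":
    functions = [
        FunctionDescription("fft", "802.11g", 1, 512, 512, 40, "W"),
        FunctionDescription("viterbi", "802.11g", 1, 256, 128, 25, "N"),
        FunctionDescription("dct", "JPEG", 2, 128, 128, 10, "W"),
    ]
    print(worst_case_load(functions))
```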
Generally,
The block 416 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level. The block 416 serves as a street map. The adjust partitioning and kernels of
The scenario description block 416 serves as input to the scenario compiler block 418, as does the block 422. The output of the block 418 serves as input to the block 420 and the block 408 serves as input to the block 416. The block 416 describes inter-dependencies between the kernels 316 and applications 314 of
The block 418, once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule. The schedule, for example, provides information regarding the inter-dependencies of the sub-processors, execution of which requires synchronization code for the control circuit of a sub-processor. The schedule information and synchronization information are provided by the block 420, which receives input from the block 418. The output of the block 420 is provided as input to the block 424. Having the block 420 receive its input from the block 418 is generally not done in prior art techniques due to their design/hardware limitations. That is, the hardware architecture, based on sub-processors, as shown in previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit and each sub-processor to be the same as the other and for the code to be transportable.
The non-native compilation and simulation block 428 is for compiling in the absence of a processor. That is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in a native environment. The block 428 allows for both assembly and high level code compilation while a native compiler, or the actual compiler to be ultimately employed, is not yet ready. Thus, an off-the-shelf compiler, i.e. non-native, may be employed and combined with assembly code for simulation. This is sub-processor specific. The kernels 316 and the time consumed for executing control code compete.
In
Optimization is done on a per-partition basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432, high level code is optimized by the block 410 and SDL is optimized by the block 416. This is a divide and conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code. The block 424 receives high level code and compiles the same but outputs assembly code to the block 430, which is optimized by the block 432. The output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor. The assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
The block 434 performs various functions, shown in
At 506, a database of rules is used to determine what the actual latencies are. At 510, this determination is made because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one where its previous instruction is not necessarily known. At 512, worst case possibilities are determined. At 514, latency rules are checked against the register value latencies. A latency is basically a delayed or previous instruction. That is, the programmer's annotation is compared to the rules for latency and if there is a mismatch, an error is reported at 516. An example of the programmer's annotation is discussed hereinbelow.
A computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register. To achieve strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay. The advantages of implementing strict read-after-write behavior for all registers are:
- (1) The same sequence of instructions can execute correctly on a wider range of processor implementations, and
- (2) Assembly language programming is made easier.
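The stalling described above can be sketched with a simple in-order model. This is a simplified illustration, not a description of any particular processor: one instruction issues per cycle, and a written value becomes readable a fixed number of cycles after its instruction issues. The function name `count_stalls` and the single `latency` parameter are assumptions made for illustration.

```python
def count_stalls(program, latency):
    """Count stall cycles needed for strict read-after-write behavior.

    program: list of (dest, sources) tuples, one per instruction.
    latency: cycles after issue before a written value can be read.
    """
    ready_at = {}  # register -> cycle at which its newest value is readable
    cycle = 0
    stalls = 0
    for dest, sources in program:
        # The instruction cannot issue before all of its inputs are readable.
        earliest = max([cycle] + [ready_at.get(r, 0) for r in sources])
        stalls += earliest - cycle
        cycle = earliest
        ready_at[dest] = cycle + latency
        cycle += 1
    return stalls
```

With a latency of one cycle a dependent instruction never stalls in this model; with a longer latency, a read that immediately follows the write is delayed until the value is available.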
For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register. Although the specific behavior is always deterministic and can be documented as a set of latency rules, for some processors these rules are quite complex taken together. For processors of this kind, unfortunately, assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
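The behavior of a register without strict read-after-write semantics can be modeled in a few lines. The sketch below assumes a single fixed write latency per register, which is far simpler than the latency rules of a real processor; the class name `DelayedRegister` and its methods are hypothetical.

```python
class DelayedRegister:
    """Register whose writes become visible only after a fixed latency."""

    def __init__(self, initial, latency):
        self.value = initial
        self.latency = latency
        self.pending = []  # (cycle at which the write is visible, value)

    def write(self, cycle, value):
        # Writes are assumed to be issued in increasing cycle order.
        self.pending.append((cycle + self.latency, value))

    def read(self, cycle):
        # Commit every write that has become visible by this cycle.
        while self.pending and self.pending[0][0] <= cycle:
            self.value = self.pending.pop(0)[1]
        return self.value
```

A read issued before the latency has elapsed returns the older value without stalling, which is the case that the programmer must account for by hand on such a processor.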
Latency Verification:
In
For each register read by each instruction, a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register. The lack of an annotation is either an error or indicates a default assumption. For example, the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior). Whenever the programmer expects a value different from the default assumption, an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
- add r1, $$r2, $r3
would indicate that the programmer expects the value obtained for register r3 to be the value written by the second previous instruction to write r3, and the value obtained for register r2 to be the value written by the third previous instruction to write r2. In the above example, the current value of register r1, the value of register r2 from two values ago, and the previous value of register r3 are being added. The assembler or block 434 checks to ensure that all of these values are available by performing the process of FIG. 5. It should be noted that the annotation need not be a dollar sign; rather, it can be any notation.
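The dollar-sign convention can be decoded mechanically. The following Python sketch is a hypothetical parser, not part of the assembler described here: it maps each operand to a pair (register, n), where n = 1 is the default assumption and each dollar sign adds one. It does not distinguish destination operands from source operands.

```python
import re

# Zero or more dollar signs followed by a register name.
_OPERAND = re.compile(r"^(\$*)([A-Za-z_]\w*)$")


def parse_operand(text):
    """Return (register, n): the value written by the n-th previous write."""
    m = _OPERAND.match(text.strip())
    if not m:
        raise ValueError(f"bad operand: {text!r}")
    return m.group(2), len(m.group(1)) + 1


def parse_instruction(line):
    """Split 'opcode op1, op2, ...' into (opcode, [(register, n), ...])."""
    opcode, _, rest = line.strip().partition(" ")
    operands = [parse_operand(tok) for tok in rest.split(",") if tok.strip()]
    return opcode, operands
```

For the instruction `add r1, $$r2, $r3`, this yields n = 1 for r1, n = 3 for r2 and n = 2 for r3, matching the reading given in the text.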
Given these annotations, for each instruction, the assembler or other programming tool automatically determines whether the programmer's expectations are correct, by examining the sequences of instructions that can execute previous to the given instruction along all paths leading to the given instruction, and applying the documented latency rules to these sequences.
Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n. In block 508, the earlier instructions that contribute to the inputs of instruction n are identified. Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516).
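The check described above can be sketched for a single straight-line path. The sketch below makes strong simplifying assumptions that are not taken from the source: the latency rules are reduced to one number per opcode (how many instructions later a write becomes visible), and a read with no known visible write is simply reported rather than resolved by the worst-case analysis of blocks 510 and 512. The function name `verify` and the `delay` table are hypothetical.

```python
def verify(program, delay):
    """Check latency annotations along one straight-line path.

    program: list of (opcode, dest, [(src, k), ...]) where k means the
             programmer expects the k-th most recent previous write.
    delay:   opcode -> number of instructions before its write is visible.
    Returns a list of error strings.
    """
    errors = []
    writes = {}  # register -> list of (index, opcode) in program order
    for n, (opcode, dest, sources) in enumerate(program):
        for reg, expected in sources:
            history = writes.get(reg, [])
            # Positions of earlier writes already visible to instruction n.
            visible = [pos for pos, (i, op) in enumerate(history)
                       if i + delay[op] <= n]
            if not visible:
                errors.append(
                    f"instr {n}: read of {reg} cannot be verified from this sequence")
                continue
            # 1 means the newest write is seen, 2 the one before it, etc.
            actual = len(history) - visible[-1]
            if actual != expected:
                errors.append(
                    f"instr {n}: {reg} annotated {expected}, rules give {actual}")
        if dest is not None:
            writes.setdefault(dest, []).append((n, opcode))
    return errors
```

A complete tool would repeat such a check along every path leading to each instruction and would substitute worst-case assumptions where the preceding instructions are unknown.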
Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.
Claims
1. A software architecture for execution on a heterogenous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the software architecture comprising:
- a scenario compiler for pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code, the scenario compiler including a plurality of applications, each application including one or more kernels, the scenario compiler pre-compiling the scenario for efficient execution thereof by a plurality of sub-processors, each sub-processor including a control circuit including high level code for execution thereof, the control circuit is a high language programmable controller for the sub-processor,
- wherein a hierarchical compilation of different types of programming codes allows for efficient binary code creation while reducing power consumption when the binary code is executed by the sub-processors.
2. A software architecture, as recited in claim 1, further including a schedule and synchronization block communicating with the scenario compiler and for generating code, based on scenario description language (SDL) to operate with one or more of the sub-processors.
3. A software architecture, as recited in claim 2, further including a high level language compiler block receiving input from the synchronization block for compiling high level code.
4. A software architecture, as recited in claim 3, further including an assembler block coupled to receive information from the high level language compiler block and from an assembly code block, which provides assembly code written by a user, the assembler block for assembling the assembly code and the information received from the high level language compiler block.
5. A software architecture, as recited in claim 4, further including a binary code block for generating binary code based on assembly code, high level code and SDL.
6. A software architecture, as recited in claim 5, further including a scenario description and optional optimization block coupled to the scenario description block and upon the generation of binary code, a user's design goals are verified and if the design goals are not met, the scenario description and optional optimization block modifies the scenario.
7. A software architecture, as recited in claim 6, wherein the sub-processors each include applications having kernels, the kernels being engines for execution of computationally intensive code.
8. A software architecture, as recited in claim 7, further including a scenario description block coupled to the scenario compiler block for generating SDL for describing inter-dependencies between the kernels.
9. A software architecture, as recited in claim 8, further including a low-level assembler and linker block coupled to the optimizing assembler block for assembling the lowest-level code.
10. A software architecture, as recited in claim 9, wherein the low-level assembler and linker block further includes a latency verification block responsive to an N number of previous instructions and a current instruction for verifying the presence of N number of previous instructions used by a user for instructions requiring previous instructions.
11. A software architecture, as recited in claim 10, wherein the latency verification block is for verifying the user's instruction, which includes use of previous instructions, against latency rules.
12. A software architecture, as recited in claim 11, further including shared memory coupled to the sub-processors wherein the kernel of one of the sub-processors hands off to another sub-processor by placing, in the shared memory, information to be used by the another sub-processor.
13. A method of generating and executing code on a heterogenous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the software architecture comprising:
- pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code;
- generating efficient binary code to be executed by the sub-processors based on applications including kernels, the kernels for executing computationally intensive code, the execution of the binary code by the sub-processors causing reduction of power consumption and flexible coding options to a user.
14. A method of generating and executing code, as recited in claim 13, further including performing latency verification to prevent a user from using erroneous previous instructions.
Type: Application
Filed: Aug 2, 2005
Publication Date: Feb 2, 2006
Inventors: Amit Ramchandran (San Jose, CA), John Hauser (Berkeley, CA)
Application Number: 11/195,429
International Classification: G06F 9/45 (20060101);