Method of and apparatus and architecture for real time signal processing by switch-controlled programmable processor configuring and flexible pipeline and parallel processing
A new signal processor technique and apparatus combining microprocessor technology with switch fabric telecommunication technology to achieve a programmable processor architecture wherein the processor and the connections among its functional blocks are configured by software for each specific application by communication through a switch fabric in a dynamic, parallel and flexible fashion to achieve a reconfigurable pipeline, wherein the length of the pipeline stages and the order of the stages varies from time to time and from application to application, admirably handling the explosion of varieties of diverse signal processing needs in single devices such as handsets, set-top boxes and the like with unprecedented performance, cost and power savings, and with full application flexibility.
The invention of the present application relates generally to the field of real-time signal processing, being more particularly concerned with the problems of increasing signal processing demand being driven by the convergence of more and more varieties of different data communication features desired to be presented in a single device—such as a handset or set-top boxes or a single device package or the like—this application being a continuation of co-pending U.S. application Ser. No. 11/508,768 (Aug. 23, 2006) about to be abandoned in favor of the present application.
BACKGROUND OF INVENTION
Present-day processors for digital signal processing (DSP) software algorithm computations in handsets, set-top boxes and other single device packages are struggling with the problem of accommodating the convergence of a wide variety of different real-time signal processing needs and control processing capabilities required to be handled in a single device. Such convergence of more and more features in a single device compounded with ever-evolving technology standards has led to exponentially increasing signal processing demand, creating new technology challenges. This is particularly true for mobile devices and for home network standards and different services, among other applications.
Existing technology choices for current System On Chip (SoC) design in emerging markets include the above mentioned DSP, General Purpose Processor (GPP) and Application Specific Integrated Circuit (ASIC) Block. Unfortunately, however, each of these falls short of fully solving the problem. While DSP is programmable for different applications and provides good real-time performance for DSP-centric algorithms, such as voice and data communications, DSP has limited control and general purpose processing capability. With GPP, again different applications are programmable, but with poor real-time performance and with the requirement for quite extensive control processing capability. As for the ASIC Block approach, while this may be optimized for specific application algorithms in terms of processing performance, this technique has very limited programmability and is usually not reusable for new applications, technologies and standards. To try to combine these three technological approaches, moreover, presents a near-impossible trade-off (e.g., the Qualcomm 3GMM Baseband, which attempts to meet the requirement in a single SoC with 2 DSPs + 2 GPPs + 13 ASIC accelerator blocks). Such an approach, moreover, requires dedicated hardware for many possible features; this hardware is not all exercised simultaneously in any one usage mode, yet it always takes up die area and consumes power.
The problems with current technologies as “solutions” reside in the fact that the systems become ever more complex, inflexible and costly, requiring more specialized cores that result in highly complex systems, with component and system scalability becoming an ever-pressing issue. New features, applications and standards, moreover, become harder to incorporate. More complex systems additionally mean longer development cycles and higher cost/performance ratios.
The present invention, indeed, as later fully explained, addresses the solution by providing a novel programmable core that can meet all the processing needs of the current device applications, which current processor architectures cannot accomplish, though the art is struggling with improvement proposals.
The advent of the pipeline processor, however, did significantly increase execution speed from CISC (Complex Instruction Set Computer) to RISC (Reduced Instruction Set Computer). For an example of five instructions, CISC required 31 cycles to execute them in series; whereas the pipelined RISC provided a 350% improvement in throughput. Current deep-pipelined multi-issue DSP architecture followed, with hardware added for pipelined implementation, and functional units were created to increase parallelism of data flow with faster buses and increased clock rates. This has resulted, however, in increased complexity, larger die size and higher power consumption. But more importantly, as the emerging applications require more diverse signal processing algorithms (voice, audio, video image processing, data communication, etc.), many are beyond what conventional DSP technology accommodates. While the pipelined architecture improves the performance of a CPU, the pipeline solution loses its advantage when the order of calculations is different from the order of the functional blocks aligned in a pipeline. In that case, calculation takes much longer. The pipeline solution is not always very efficient in operation, either. For instance, load and store instructions never use the stage for mathematical calculation. A specific pipeline, moreover, just cannot serve the needs of all algorithms, given the exploding variety of real-time signal processing now desired in mobile and consumer devices, with current DSP and GPP techniques unable adequately to meet such emerging signal processing needs.
The present invention is believed to have provided a break-through solution through a programmable core and reconfigurable pipeline that admirably meets the processing needs of today's diverse applications through a novel combining of microprocessor-based technology developed for optimizing control programs based on fixed pipeline architectures, and switch fabric technology for the different field of telecommunication equipment, including internet routers/switches and embedded processors. The invention, indeed, combines the strengths of both CISC and RISC architectures, but surpasses the performance of current high-performance DSP cores, providing the programmability and flexibility of a general purpose processor and an architecture well-suited to a wide variety of processing needs, including communications algorithms, multimedia processing (audio, video, imaging), networking protocols, control functions and the like—in short, an application “agnostic” architecture for a “converged” world.
OBJECTS OF INVENTION
A primary object of the invention, accordingly, is to provide a new and improved method of and architecture apparatus or system for processing software computational instructions of a wide variety of different real-time signal processing applications, including for convergence in single devices, that shall not be subject to any of the above-described limitations and/or disadvantages of prior art approaches but that, to the contrary, provides for meeting the processing needs of today's devices and expanding applications.
A further object is to provide such an improvement through a novel combination of microprocessor-based technology, and switch fabric technology from the different field of switching telecommunications in which applicant has been consulting and inventing for several decades.
Still another object is to provide a novel combination of a programmable embedded processor with reconfigurable pipeline stages wherein the configuring of the processor functional components permits of a flexible and specific application-tailored pipeline as distinguished from prior fixed single pipeline data streams.
Another object is to provide such flexibility through a cross-connect switch fabric in a dynamic, parallel and flexible fashion wherein the switch is configured through each application set of instructions during operation and in real-time.
Still another object is to provide such a novel technique wherein, after application software instruction decoding, the length of the pipeline stages and the order of the stages varies from time to time and from application to application.
An additional object is to provide such a new programmable embedded processor and reconfigurable pipeline wherein the architecture is scalable and wherein the processor is configured for performing parallel processing utilizing fully the calculation capability of the internal processor functional components.
Another object is to allow software programmers to create new user-defined assembly instructions which correspond to specific internal processor configurations that are tailored to the newly defined function.
Other and further objects will be pointed out hereinafter and are more fully delineated in the appended claims.
SUMMARY
In summary, however, from its novel methodology aspect, the invention embraces a method of processing computer software computational instructions fed to a processor, that comprises, compiling and analyzing inputted user software applications to determine the specific computational tasks that need to be performed for each software application; generating a set of instructions in real time for each application configuration of the processor and the connections among its functional components required for that specific application; connecting the processor through switching to a data pipeline of variably configurable length and order of its stages; and communicating the processor components configured for each specific application through the switching in a dynamic, parallel and flexible fashion, correspondingly to configure the appropriate length and order of the pipeline stages for each specific application.
For apparatus implementation for the practice of the invention, it contemplates a flexible data pipeline architecture for accommodating substantially all types of software computational instruction sets for varying applications having, in combination, a programmable processor with reconfigurable pipeline stages the order and lengths of which vary in response to varying application instruction sets that establish corresponding configurations of the processor and of the connections amongst its functional components specifically to suit the application.
The novel processor architecture of the invention enables greater scalability and flexibility than the prior DSP and GPP techniques previously mentioned and, importantly, is application agnostic and requires shorter application development cycles, and lower cost/performance ratios.
Preferred and best mode embodiments are hereinafter described in detail.
The invention will now be described with reference to the accompanying drawings,
Turning first to the basic and generic pipeline structure and methodology diagram of this invention shown in
While the processor P of the invention may have the same type of functional components as those used in current RISC processors, shown as mathematical execution units EX1-EXn (multipliers, adders, shifters or a pipelined multiplier, for example) and memory units such as data memory banks at MU1-MUn, these components in the programmable processor of the invention communicate with one another in a fundamentally different manner from the RISC processor. In today's fixed staged pipeline RISC processors, instructions are executed in fixed order. As a result, functional units in such a processor are not efficiently utilized and they become increasingly more complex and costly.
Instead of lining the similar functional units up into a pipeline, the invention uses the switch matrix 5 to provide the flexibility of connecting them to adapt or configure them for the specific tasks required of the particular software application instruction set. The cross-connect switch 5, moreover, connects execution units EX1-EXn on one side and memory blocks MU1-MUn on the other side, configuring them into different structures in accordance with the different software algorithms of the different applications, and at different times.
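By way of illustration only, the following Python sketch models how one set of units might be wired into two different structures through a cross-connect. The route tables, the `trace` helper and the restriction that each output feeds a single input are assumptions made for this sketch and are not taken from the disclosure.

```python
def trace(routes, start):
    """Follow switch connections from `start` and return the units visited.

    `routes` is a hypothetical routing table mapping a source unit's output
    to the single destination unit it feeds (a simplification).
    """
    path = [start]
    seen = {start}
    while path[-1] in routes:
        nxt = routes[path[-1]]
        if nxt in seen:
            raise ValueError(f"routing loop at {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path


# Configuration A: all units chained into one long pipeline.
PIPELINE = {"MU1": "EX1", "EX1": "EX2", "EX2": "MU2"}

# Configuration B: the same execution units split into two parallel paths.
PARALLEL = {"MU1": "EX1", "EX1": "MU3", "MU2": "EX2", "EX2": "MU4"}

if __name__ == "__main__":
    print(trace(PIPELINE, "MU1"))  # ['MU1', 'EX1', 'EX2', 'MU2']
    print(trace(PARALLEL, "MU1"))  # ['MU1', 'EX1', 'MU3']
    print(trace(PARALLEL, "MU2"))  # ['MU2', 'EX2', 'MU4']
```

Only the routing table differs between the two configurations; the units themselves are unchanged, which is the property the switch-based structure is described as providing.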
The programming of the program memory 1 is shown in
While in the RISC type operation, the instructions are executed with fixed pipeline stages all within the same fixed clock cycles, as before noted, the cross-connect switch 5 of the invention connects execution units EX and memory blocks MU, configuring them into different structures for different algorithms at different times. The connections of the cross-connect switch 5 are determined at program compilation time. The compiler analyzes each task in a program. Based on the resources that are available at the time, it decides either to configure the available resources for the current task or to hold off the instruction execution. Execution units EX in the diagram and memory banks are routed into a network allowing execution of multiple tasks in parallel; alternatively, the compiler configures all resources into one large pipeline. The functionality of each execution unit EX, however, may take multiple cycles to achieve. Each EX may have unique functionality and can be configured for rather complicated functions that are very difficult to realize with fixed pipeline solutions. All data passing from one EX to another travels through the switch, instead of over a prior art bus, and then to the memory or elsewhere. This greatly reduces the bus bandwidth required.
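The configure-or-hold-off decision described above may be pictured with the following minimal Python sketch. It is an assumed greedy policy written for illustration, not the compiler of the invention: the task tuples, the `schedule` function and its rule of starting any task whose units are free are all hypothetical, and data dependencies between tasks are not modelled.

```python
def schedule(tasks, total_ex, total_mu):
    """Return {task name: start cycle} under a simple greedy policy.

    `tasks` is a list of (name, ex_needed, mu_needed, duration) tuples.
    A task is configured as soon as enough execution units (EX) and memory
    units (MU) are free; otherwise it is held off until a later cycle.
    """
    for name, ex, mu, _ in tasks:
        if ex > total_ex or mu > total_mu:
            raise ValueError(f"task {name} can never fit the available units")

    pending = list(tasks)
    running = []  # (end_cycle, ex, mu) for tasks currently holding units
    start = {}
    free_ex, free_mu = total_ex, total_mu
    cycle = 0
    while pending:
        # Release the units of every task that has finished by this cycle.
        still_running = []
        for end, ex, mu in running:
            if end <= cycle:
                free_ex += ex
                free_mu += mu
            else:
                still_running.append((end, ex, mu))
        running = still_running

        # Configure what fits now; hold off the rest.
        held = []
        for name, ex, mu, duration in pending:
            if ex <= free_ex and mu <= free_mu:
                free_ex -= ex
                free_mu -= mu
                running.append((cycle + duration, ex, mu))
                start[name] = cycle
            else:
                held.append((name, ex, mu, duration))
        pending = held
        cycle += 1
    return start


if __name__ == "__main__":
    tasks = [("t1", 2, 2, 3), ("t2", 2, 2, 2), ("t3", 4, 4, 1)]
    print(schedule(tasks, total_ex=4, total_mu=4))
```

With four EX and four MU, tasks `t1` and `t2` share the units in parallel from cycle 0, while `t3`, which needs every unit as one large pipeline, is held off until both have finished.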
As before explained, the symbol EX denotes a mathematical unit: a multiplier, an adder, a shifter, a pipelined multiplier, etc. A multiplier can be constructed, for example, from many adders. For instance, a 16×16 multiplier can be constructed from 16 adders. Furthermore, those adders can be pipelined, meaning that one multiplication finishes after a certain number of cycles. For example, A+B+C+D usually has to be executed in four cycles. If adders are used as part of a multiplier, four additions may be executed in one cycle. EX can thus be very flexible and constructed by operational code at runtime. The multiple memory banks MU connected to the switch 5 provide data for the parallel processing. Each EX may require its own data, unlike a general-purpose CPU design, which has only one memory and can fetch only one piece of data per cycle. The EX unit, moreover, can be configured for an equation. With the approach of the invention, multiple EXs can be handled, not just one as in a general-purpose processor or a DSP. This provides efficiency in solving more complicated problems.
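The statement that a 16×16 multiplier can be built from 16 adders corresponds to the familiar shift-and-add scheme, sketched below in Python for unsigned operands. The function name and the sequential loop are assumptions for illustration; in hardware the sixteen additions would be separate adder stages, possibly pipelined, rather than loop iterations.

```python
def multiply_16x16(a, b):
    """Unsigned 16x16 multiply using one addition per bit of `b`.

    Each loop iteration stands in for one adder stage: it adds the partial
    product (a shifted left by i) when bit i of b is set, and zero otherwise.
    """
    if not (0 <= a < (1 << 16) and 0 <= b < (1 << 16)):
        raise ValueError("operands must fit in 16 unsigned bits")
    acc = 0
    for i in range(16):
        partial = (a << i) if (b >> i) & 1 else 0
        acc += partial  # adder stage i
    return acc


if __name__ == "__main__":
    print(multiply_16x16(1234, 5678) == 1234 * 5678)  # True
```

Because every stage is an ordinary adder, the same stages could in principle be rewired through the switch to sum independent operands, which is the reuse the paragraph above alludes to.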
The architecture of the invention may process multiple data at the same time and write them back as well, making it possible more efficiently to fully utilize the hardware inside a device, unlike conventional general-purpose processors and DSPs which can only use one resource at any time and leave the rest idle. The configurability during compile time not only minimizes the complexity of logic design but can also support more applications.
To address the need for ever-increasing computational power, today's pipeline DSP and general-purpose processor designs take two commonly used approaches—to increase clock rate and to integrate more and more accelerators. What is needed, however, is not a faster instruction decoder or faster instruction fetch. When more computing power is needed, the invention simply uses more execution units during the design phase and keeps the rest of the processor design unchanged. There will be very limited die increase. Consequently, there will be less power consumption and smaller die size compared to the current staged fixed pipeline approach. The architecture of the invention covers both general-purpose processor and DSP.
Returning to
If desired, the instruction decode 3 may configure multiple processor memory units with one instruction, as shown in
An example of the flexible reconfigurable pipeline and parallel processing capability of the invention with multiplier units MULT1 and 2 and units ALU1 and 2, (Arithmetic Logic Unit), and memory units MU1-4, is presented in
Further architectural advantages that the invention provides over the prior signal processing techniques also include the minimizing of memory and register access through storing intermediate data in an ALU, relieving the burden of interconnect buses, removing the bottleneck of parallel instruction execution by making true parallel processing possible, and allowing processor hardware resources to be more fully and efficiently utilized. For most signal processing algorithms, moreover, it reduces the total necessary cycles and is able to handle the same application at lower clock rates than current signal processing architectures. The invention lowers power consumption, requires fewer pipeline stages, less logic complexity and smaller die sizes.
Consider, for example, the case of the Discrete Cosine Transform Calculations commonly used in video compression. With a general-purpose five stage pipeline processor to calculate this equation, it takes 22 cycles. For a DSP using multiplier accumulation, it takes 10 clock cycles. The architecture of the present invention using one execution unit takes 6 clock cycles; 3 cycles with two execution units plus 1 more cycle for latency; 2 cycles with three execution units plus 2 cycles for latency; 1 cycle with 4 cycles of latency when the number of execution units is increased to 5.
As another example, a typical pipeline in a RISC system requires three instructions. If the equation needs to be executed N times, the total number of cycles required by a typical RISC system is 3*N, whereas the approach of the invention requires only N+1 cycles, saving a large number of cycles when N is large.
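The two cycle counts stated above can be compared numerically with the short Python sketch below. The function names and the default parameters are assumptions chosen to match the figures in the text (three instructions per evaluation, one cycle of fill latency); no measured data is implied.

```python
def risc_cycles(n, instructions_per_eval=3):
    """Cycles for n evaluations when each one issues three instructions."""
    return instructions_per_eval * n


def configured_cycles(n, fill_latency=1):
    """Cycles for n evaluations on a configured pipeline: one result per
    cycle after an assumed one-cycle fill latency."""
    return n + fill_latency


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, risc_cycles(n), configured_cycles(n))
```

The saving is 3N - (N + 1) = 2N - 1 cycles, so the ratio of the two counts approaches 3 as N grows.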
In summary, the invention provides a novel software solution that is capable of supporting multiple protocols and multiple functionalities in a single device. A configurable hardware structure makes it suitable for many applications. Multiple execution units and memory blocks are configured to make the equation under processing a configurable pipeline, which minimizes the before-mentioned memory and register access rate and relieves the burden on the interconnecting buses. Multiple memory banks increase memory access bandwidth, which enables true parallel processing and significantly reduces the total cycles required. More efficient utilization of hardware allows a lower clock rate and results in lower power consumption. The lower clock rate also leads to fewer pipeline stages, thus reducing both die size and logic complexity.
Today's computers are built with a piece of fixed hardware (processor). What the invention suggests is a piece of dynamically configurable hardware, which is not only more efficient in utilizing hardware resources but also capable of handling multiple threads of applications in parallel.
This architecture, as earlier mentioned, relieves part of the burden on the internal buses of prior art signal processors by reducing the frequency of data access. In addition, since memory is divided into several banks and connected to the switch, it increases memory access bandwidth under certain conditions, making full utilization of multiple mathematical units possible. These features allow the device to be configured for performing true parallel processing, thus fully utilizing the calculation capability of the internal functional blocks.
Overall the new processor architecture using the switch technology to connect functional blocks inside a processor instead of putting them into fixed pipeline stages, is dynamically configurable because its internal structure can vary with time. It is readily scalable, as before stated, because the number of functional blocks is only determined by the target application. It is also programmable because it is a true processor and can be applied to many different applications.
Lastly, in order pictorially to demonstrate the universality of the present invention in its flexible adaptation of signal processing, and its achievement in overcoming prior art limitations, achieving, rather, unprecedented cost and power savings, reference is made to
Further modifications will also occur to those skilled in the art, and such are considered to fall within the spirit and scope of the invention as defined in the appended claims.
Claims
1. A flexible data pipeline architecture for accommodating substantially all types of software computational instruction sets for varying applications having, in combination, a programmable embedded processor with reconfigurable pipeline stages the order and lengths of which vary in response to varying application instruction sets that establish corresponding configurations of the processor and of the connections amongst its functional components specifically to suit the application.
2. The data pipeline architecture of claim 1 wherein the functional components communicate through a switch in a dynamic, parallel and flexible fashion.
3. The data pipeline architecture of claim 2 wherein the switch is configured through each set of instructions during operation in real-time.
4. The data pipeline architecture of claim 3 wherein the instruction sets are generated by a software compiler receiving the application software instructions and analyzing the same to determine which computational tasks need to be performed in each application and how to configure the processor and the connections amongst the functional components to accommodate the same.
5. The data pipeline architecture of claim 4 wherein, after instruction decoding, the length of pipeline stages and the order of the stages vary from time to time and from application to application.
6. The data pipeline architecture of claim 5 wherein the configuring of the functional components permits of a flexible structure as contrasted with fixed single pipeline data streams.
7. The data pipeline architecture of claim 4 where both simple and complicated software instruction sets of varying sizes are enabled efficiently to use the same pipeline concurrently.
8. The data pipeline architecture of claim 4 wherein the switch is of a cross-connect switch matrix type and the processor functional components include mathematical execution units of adders and multipliers, and memory units dynamically and parallelly interconnectable through the switch.
9. The data pipeline architecture of claim 8 wherein the architecture is scalable, with computation intensive applications requiring more mathematical execution units than less complicated applications.
10. The data pipeline architecture of claim 9 wherein the amount of mathematical and/or memory units is determined during design cycle for a particular application without requiring modification to the compiler and with little impact on development time.
11. The data pipeline architecture of claim 8 wherein the memory units are divided into several banks, connected to the switch, thereby increasing memory access bandwidth and making full utilization of multiple mathematical units possible.
12. The data pipeline architecture of claim 11 wherein the processor is configured for performing parallel processing, efficiently fully utilizing the calculation capability of the internal processor functional components.
13. The data pipeline architecture of claim 12 wherein the efficiency enables reduction in the total clock rate cycles required for each application.
14. The data pipeline architecture of claim 13 wherein the lower clock rate reduces power consumption and allows for more logic between two pipeline stages, leading to fewer pipeline stages.
15. The data pipeline architecture of claim 13 wherein the lower clock rate provides for more computing power, allowing the handling of more complicated calculations and applications.
16. The data pipeline architecture of claim 2 wherein the switch connects the processor functional components inside the processor as distinguished from putting them into fixed pipeline stages.
17. The data pipeline architecture of claim 2 wherein the processor is (1) dynamically configurable because its internal structure can vary with time, (2) scalable because the number of functional components is only determined by the specific application, and (3) is programmable because it is a true processor applicable to many different applications.
18. A flexible data pipeline structure for accommodating software computational instructions for varying program applications, having, in combination, a programmable embedded processor with reconfigurable pipeline stages the order and length of which varies in response to varying program application instructions; the processor including program memory for storing application instructions from a compiler; instruction fetch and decode units connected to the program memory; a switch matrix selectively interconnecting pluralities of mathematical execution units and memory units and controlled by a switch control unit fed by the instruction decode unit; the switch matrix providing full access switching with any allowable connections between two units, and with the switch matrix connecting to a DMA.
19. The data pipeline structure of claim 18 wherein the mathematical execution units are selected from the group consisting of integer multipliers, integer ALU, floating-point multipliers, and floating-point ALU.
20. The data pipeline structure of claim 18 wherein the memory units are one of data memory banks and L2 memory banks.
21. The data pipeline structure of claim 18 wherein the processor is provided with a C library including special computational functions, to be directly fed to the compiler and converting the program to the desired processor machine code instructions for setting the mathematical execution units operation, the switch control instructions for connecting the different execution units, and instructions for setting the parameter of the memory unit operations.
22. The data pipeline structure of claim 21 wherein the compiler exploits parallelism for each program based on its instruction sequence and task-required execution units, producing machine instructions in the appropriate time sequence to configure the execution units and memory units and DMA and the connections amongst them.
23. The data pipeline structure of claim 21 wherein special memory unit configuration instructions are provided for each memory unit providing the start address for memory access, auto memory address increment after each access, and memory access clock cycle frequency.
24. The data pipeline structure of claim 23 wherein one instruction fed through the switch matrix configures multiple memory units, providing address and mode information.
25. The data pipeline structure of claim 18 wherein the switch control unit is operated by switch control vector to set the connections from the output of one mathematic execution unit to the input of another, the connections from any memory unit to an execution unit input, the connection of the DMA to any memory unit, and the connections from the instruction decoder to any execution unit, memory unit and/or DMA.
26. The data pipeline structure of claim 1 provided in a single package adapted to accommodate the convergence of a variety of differing signal-processing application demands with parallelism.
27. A method of processing computer software computational instructions fed to a processor, that comprises, compiling and analyzing inputted user software applications to determine the specific computational tasks that need to be performed for each software application; generating a set of instructions in real time for each application configuration of the processor and the connections among its functional components required for that specific application; connecting the processor through switching to a data pipeline of variably configurable length and order of its stages; and communicating amongst the processor components configured for each application through the switching in a dynamic, parallel and flexible fashion, correspondingly to configure the appropriate length and order of the pipeline stages for each specific application.
28. The method of claim 27 wherein said functional components include pluralities of mathematical execution units and pluralities of memory units or banks.
29. The method of claim 28 wherein said switching is cross-connection switching between the execution units and the memory units to configure them into different structures for different application algorithms at different times and corresponding to the different specific software applications.
30. A method of signal processing combining microprocessor technology with switch fabric telecommunication technology to achieve a programmable processor architecture wherein the processor and the connections among its functional blocks are configured by software to each specific application by communication through a switch fabric in a dynamic, parallel and flexible fashion to achieve a reconfigurable pipeline wherein the length of the pipeline stages and the order of the stages are varied from time to time and from application to application, handling the adapting to the explosion of varieties of diverse signal processing needs in single devices such as handsets, set-top boxes and the like.
31. A method as claimed in claim 27 wherein new user-defined assembly instructions are created that are tailored to one or more of specific functions, computational equations, or tasks, and which correspond to specific sets of internal processor configurations, including the execution of unit configurations, switch control configurations, and memory unit configurations.
Type: Application
Filed: Oct 6, 2007
Publication Date: Dec 4, 2008
Patent Grant number: 8099583
Inventor: Xiaolin Wang (Concord, MA)
Application Number: 11/973,184
International Classification: G06F 9/30 (20060101);