PARALLEL ARITHMETIC DEVICE, DATA PROCESSING SYSTEM WITH PARALLEL ARITHMETIC DEVICE, AND DATA PROCESSING PROGRAM
A parallel arithmetic device including a plurality of data wirings disposed in a first direction and a second direction; a plurality of flag wirings corresponding to the data wirings; a plurality of wiring coupling switches, each disposed at a respective intersection of the data wirings; and a plurality of processor elements surrounded by the data wirings. A processor element from among the plurality of the processor elements is configured to: perform an arithmetic process on data supplied from a first processor element based on a first flag supplied from the first processor element, the data being supplied on data wiring and the first flag being supplied on flag wiring; output a computation result to a second processor element on data wiring; and output a second flag based on the computation result to the second processor element on flag wiring.
This Application is a Continuation Application of U.S. application Ser. No. 13/935,790 filed Jul. 5, 2013, which claims priority from Japanese Patent Application No. 2012-154903 filed on Jul. 10, 2012, the disclosures of which are incorporated herein by reference in their entirety.
BACKGROUND
The present invention relates to a parallel arithmetic device, a data processing system with the parallel arithmetic device, and a computer-readable medium storing a data processing program.
A coarse-grained dynamically reconfigurable processor (parallel arithmetic device or array-type processor) forms a circuit by dynamically changing the processing of each of a plurality of processor elements and the relation of coupling therebetween in accordance with an externally input object code. Hence, the dynamically reconfigurable processor can reuse circuit resources to perform a complicated arithmetic process with a small-scale circuit. An exemplary configuration of the dynamically reconfigurable processor is disclosed, for instance, in Japanese Patents Nos. 3921367 and 3861898.
A single-instruction multiple-data (SIMD) processor, which performs a plurality of arithmetic processes in parallel upon receipt of a single-operation instruction, is disclosed in Japanese Patents Nos. 4292197 and 4699002 and in Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2010-539582.
SUMMARY
When processing a plurality of sets of data in a parallel manner, a related-art dynamically reconfigurable processor (parallel arithmetic device or array-type processor) has to not only issue an operation instruction about processing to each of a plurality of processor elements, which correspond to the sets of data, but also issue an operation instruction about the relation of coupling to each of a plurality of switch elements that determine the relation of coupling between the processor elements. This also holds true for the case where a circuit corresponding to an unrolled part of a loop description is dynamically configured. Hence, the related-art dynamically reconfigurable processor has to store in memory an enormous number of instructions about the processing of each of the processor elements and about the relation of coupling therebetween. Consequently, the related-art dynamically reconfigurable processor cannot efficiently use circuit resources.
Other problems and novel features will become apparent from the following description and from the accompanying drawings.
According to an aspect of the present invention, there is provided a parallel arithmetic device including a status management section, a plurality of processor elements, and a plurality of switch elements that determine the relation of coupling of each of the processor elements. The processor elements each include an instruction memory and a plurality of arithmetic units. The instruction memory memorizes a plurality of operation instructions corresponding respectively to a plurality of contexts so that an operation instruction corresponding to a context selected by the status management section is read out. The arithmetic units perform parallel arithmetic processes on a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory.
According to another aspect of the present invention, there is provided a computer-readable medium storing a data processing program for supplying circuit data to a parallel arithmetic device that includes a status management section, a plurality of processor elements, and a plurality of switch elements that determine the relation of coupling of each of the processor elements. The processor elements each include an instruction memory and a plurality of arithmetic units. The instruction memory memorizes a plurality of operation instructions corresponding to each of a plurality of contexts so that an operation instruction corresponding to a context selected by the status management section is read out. The arithmetic units perform parallel arithmetic processing on each of a plurality of sets of input data in a manner compliant with the operation instruction read out from the instruction memory. The data processing program causes a computer to perform a behavioral synthesis process and a layout process. The behavioral synthesis process is performed to generate a structural description by unrolling, for behavioral synthesis purposes, a loop description that is included in an operation description and with no data dependency between iterations. The layout process is performed to generate the circuit data by subjecting the structural description to logic synthesis and performing a place and route.
The above aspects of the present invention make it possible to provide a parallel arithmetic device capable of efficiently using circuit resources.
Embodiments of the present invention will be described in detail based on the following figures, in which:
Embodiments of the present invention will now be described with reference to the accompanying drawings. As the drawings are simplified, they should not be used to narrowly interpret the technical scope of each embodiment. Like elements are designated by like reference numerals and will not be redundantly described.
In the following description, the present invention is described, where convenient, in a plurality of sections or embodiments. Unless specifically stated otherwise, these sections or embodiments are not unrelated to each other; one is a modification, an application, a detailed explanation, or a supplementary explanation of a part or the whole of another. Also, in the embodiments described below, when the number of elements (including the number of pieces, numerical values, amounts, ranges, and the like) is mentioned, it is not limited to the specific number unless specifically stated otherwise or unless the number is clearly limited to a specific number in principle. A number larger or smaller than the specified number is also applicable.
Further, in the embodiments described below, their components (including operating steps and the like) are not always indispensable unless specifically stated or except the case where the components are apparently indispensable in principle. Similarly, in the embodiments described below, when the shape of the components, the positional relationship therebetween, and the like are mentioned, the substantially approximate and similar shapes and the like are included therein unless specifically stated or except the case where it is conceivable that they are apparently excluded in principle. The same goes for the aforementioned number of elements (including the number of pieces, numerical values, amounts, ranges, and the like).
First Embodiment
The array-type processor 20 shown in
An object code (circuit data) 15 is supplied from the outside to the interface section 201. The code memory 202 includes an information storage medium such as RAM to memorize the object code 15 supplied to the interface section 201.
The object code 15 includes a plurality of contexts (corresponding to a plurality of later-described data paths) and a state transition condition (corresponding to a later-described state transition machine). As each context, operation instructions for the processor elements 207 and for the switch elements 208 are set. As the state transition condition, an operation instruction for the status management section 203, which selects one of the contexts depending on the situation, is set.
The status management section 203 selects one of the contexts in accordance with the status of the state transition machine and outputs a plurality of instruction pointers (IPs) corresponding to the selected context to the processor elements 207.
The present embodiment will be described on the assumption that the processor element 207 includes eight arithmetic units 212 for performing an arithmetic process on 16-bit data and eight registers 213 for retaining 16-bit data. Accordingly, each data wiring is 128 bits wide (=16 bits wide×8 pieces) and each flag wiring is 8 bits wide (=1 bit wide×8 pieces).
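The wiring widths stated above follow directly from the lane count and the per-lane bit widths. The sketch below checks that arithmetic and shows one way eight 16-bit values could share a 128-bit data wiring. The names `pack_lanes` and `unpack_lanes`, and the choice of placing lane 0 in the lowest bits, are assumptions made for illustration; the source does not specify a lane ordering.

```python
LANES = 8       # arithmetic units per processor element (from the text)
DATA_BITS = 16  # data bits per lane (from the text)
FLAG_BITS = 1   # flag bits per lane (from the text)

DATA_WIRING_BITS = LANES * DATA_BITS  # 16 bits x 8 lanes
FLAG_WIRING_BITS = LANES * FLAG_BITS  # 1 bit x 8 lanes


def pack_lanes(values, bits=DATA_BITS):
    """Pack per-lane values into one integer (assumed order: lane 0 lowest)."""
    mask = (1 << bits) - 1
    word = 0
    for lane, value in enumerate(values):
        word |= (value & mask) << (lane * bits)
    return word


def unpack_lanes(word, lanes=LANES, bits=DATA_BITS):
    """Split a packed word back into its per-lane values."""
    mask = (1 << bits) - 1
    return [(word >> (lane * bits)) & mask for lane in range(lanes)]
```

Packing eight values and unpacking the result returns the original list, and a word whose lanes are all 0xFFFF occupies exactly 128 bits.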
The processor element 207 performs an arithmetic process on data that is supplied from another processor element 207 through data wiring, and outputs a computation result (data) to another processor element 207 through data wiring. Further, a flag is supplied to the processor element 207 from another processor element 207 through flag wiring, and the processor element 207 outputs a flag to another processor element 207 through flag wiring. For example, the processor element 207 determines in accordance with a flag supplied from another processor element 207 whether or not to start an arithmetic process, and outputs a flag corresponding to the result of the arithmetic process to another processor element 207.
The instruction memory 211 stores an operation instruction corresponding to each of the contexts. One of the operation instructions is read out from the instruction memory 211 in accordance with an instruction pointer (IP) from the status management section 203. The processor element 207 and the switch element 208 perform operations in compliance with the operation instruction read out from the instruction memory 211.
Each arithmetic unit 212 performs an arithmetic process on input data in compliance with the operation instruction read out from the instruction memory 211. In this instance, each of the eight arithmetic units 212 performs an arithmetic process in parallel on each of a plurality of sets of input data in compliance with one operation instruction (SIMD instruction) read out from the instruction memory 211.
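As a rough software model of the behavior described above, the sketch below lets an instruction pointer select one operation from a per-element instruction memory and applies that single operation to all eight lanes. The class name `ProcessorElement`, the use of Python callables as "operation instructions", and the 16-bit wraparound are assumptions for illustration only; the actual instruction encoding is not given in the source.

```python
import operator

LANES = 8
MASK16 = 0xFFFF  # assumed: results wrap to 16 bits


class ProcessorElement:
    """Toy model: one instruction memory shared by eight lanes."""

    def __init__(self, instruction_memory):
        # One entry per context; here each entry is simply a binary function.
        self.instruction_memory = list(instruction_memory)

    def execute(self, ip, a, b):
        """Read the instruction at `ip` and apply it to every lane pair."""
        if len(a) != LANES or len(b) != LANES:
            raise ValueError("expected one operand per lane")
        op = self.instruction_memory[ip]
        return [op(x, y) & MASK16 for x, y in zip(a, b)]


# Three contexts: add, subtract, multiply (illustrative choice).
pe = ProcessorElement([operator.add, operator.sub, operator.mul])
```

Calling `pe.execute(0, a, b)` performs eight additions for a single instruction read, which is the point of the SIMD arrangement: the instruction memory holds one entry per context rather than one entry per lane.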
Each register 213 temporarily stores, for example, the data input into a corresponding arithmetic unit 212, the result of computation performed by the corresponding arithmetic unit 212, and intermediate data derived from the arithmetic process performed by the corresponding arithmetic unit 212. The result of computation performed by each arithmetic unit 212 may be directly output to the outside of the processor element 207 while bypassing the register 213.
In compliance with an operation instruction read out from the instruction memory 211, the wiring coupling switches 214 to 216 couple a data wiring between the corresponding processor element 207 (a processor element 207 having the instruction memory 211 storing the operation instruction) and another processor element 207 (e.g., a neighboring processor element 207).
In compliance with an operation instruction read out from the instruction memory 211, the wiring coupling switches 216 to 218 couple a flag wiring between the corresponding processor element 207 (a processor element 207 having the instruction memory 211 storing the operation instruction) and another processor element 207 (e.g., a neighboring processor element 207).
The wiring coupling switches 214 to 216 couple wirings in compliance with an operation instruction read out from the instruction memory 211. The wiring coupling switch 216 is disposed at the intersection of a data wiring and a flag wiring.
As described above, each processor element 207 can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction (SIMD computation). In other words, unlike in the past, the array-type processor 20 according to the present embodiment can perform two or more arithmetic processes in parallel by using one processor element.
This being the case, the array-type processor 20 according to the present embodiment permits the dynamic configuration of a circuit (parallel arithmetic processing circuit) for performing arithmetic processes in parallel upon receipt of a smaller number of instructions than in the past. This makes it possible to efficiently use circuit resources.
Obviously, each processor element 207 can individually perform an arithmetic process upon receipt of one operation instruction (scalar computation) in the same manner as in the past when one of the arithmetic units 212 is activated.
Processor elements developed in the past can perform only one arithmetic process upon receipt of one operation instruction. In other words, related-art array-type processors can perform a plurality of arithmetic processes in parallel only when they use a plurality of processor elements. Thus, the related-art array-type processors need to dynamically configure a parallel arithmetic processing circuit in response to many operation instructions. Therefore, the related-art array-type processors cannot efficiently use circuit resources.
Second Embodiment
A second embodiment of the present invention will now be described in relation to a data processing device 10 that generates the object code 15 to be supplied to the array-type processor 20.
The data processing device 10 shown in
As shown in a conceptual diagram of
The loop processing section 108 analyzes the syntax of the source code 11 and unrolls a predefined loop description of a plurality of loop descriptions included in the source code 11. In the present embodiment, the loop processing section 108 unrolls a user-specified predefined loop description. The loop processing section 108 may alternatively be configured so that a predefined loop description is automatically selected from the loop descriptions and unrolled.
For example, the loop processing section 108 unrolls a loop description with no data dependency between iterations. In other words, the loop processing section 108 unrolls a loop description with no data dependency between a plurality of loop processes. The unrolled part is synthesized by a behavioral synthesizer as a circuit to be eventually subjected to parallel arithmetic processing (parallel arithmetic processing circuit). This parallel arithmetic processing circuit is dynamically configured by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction.
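To illustrate what unrolling a loop with no data dependency between iterations means, the sketch below shows a rolled loop and a hand-unrolled equivalent. The function names and the unroll factor of four are illustrative assumptions; the behavioral synthesizer described here operates on an operation description, not on Python.

```python
def scale_rolled(src, k):
    """Each iteration reads src[i] and writes out[i]; no iteration depends on another."""
    out = [0] * len(src)
    for i in range(len(src)):
        out[i] = src[i] * k
    return out


def scale_unrolled_by_4(src, k):
    """Same computation unrolled by 4; assumes len(src) is a multiple of 4."""
    if len(src) % 4 != 0:
        raise ValueError("length must be a multiple of the unroll factor")
    out = [0] * len(src)
    for i in range(0, len(src), 4):
        # The four statements are independent, so they could run in parallel.
        out[i] = src[i] * k
        out[i + 1] = src[i + 1] * k
        out[i + 2] = src[i + 2] * k
        out[i + 3] = src[i + 3] * k
    return out
```

Because the four statements in the unrolled body touch disjoint elements, they are candidates for execution by separate arithmetic units under one operation instruction.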
The DFG generation section 101 creates a DFG (data flow graph) in accordance with the result of analysis of the source code 11 and the result of processing by the loop processing section 108. The DFG includes nodes, which represent various processing functions such as a computational function, and branches, which represent the flows of data.
The scheduling section 102 performs scheduling in accordance with a synthesis constraint 12 and circuit information 13 to determine when to execute a plurality of nodes, and outputs the result of scheduling as a CDFG (control data flow graph). The allocation section 103 determines, in accordance with the synthesis constraint 12 and circuit information 13, a register and a memory unit that are to be used to temporarily store data represented by branches in the CDFG. The allocation section 103 also determines which arithmetic unit is to be used to perform computations represented by nodes in the CDFG.
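A data flow graph and a scheduling pass can be sketched in a few lines. The example below represents a DFG as a mapping from each node to its predecessor nodes and assigns every node the earliest step at which all of its inputs are available (as-soon-as-possible scheduling). ASAP is only one simple policy chosen for illustration; the source does not state which scheduling algorithm the scheduling section 102 uses, and the sketch ignores the synthesis constraint 12 and circuit information 13.

```python
def asap_schedule(dfg):
    """Map each node to its earliest 0-based step.

    `dfg` maps node -> list of predecessor nodes (the incoming branches).
    Nodes with no predecessors are placed at step 0.
    """
    steps = {}

    def step_of(node):
        if node not in steps:
            preds = dfg[node]
            steps[node] = 0 if not preds else 1 + max(step_of(p) for p in preds)
        return steps[node]

    for node in dfg:
        step_of(node)
    return steps


# DFG for y = (a + b) * (c - d); node names are illustrative.
dfg = {
    "a": [], "b": [], "c": [], "d": [],
    "add": ["a", "b"],
    "sub": ["c", "d"],
    "mul": ["add", "sub"],
}
schedule = asap_schedule(dfg)
```

In this schedule the addition and subtraction share a step because neither depends on the other, while the multiplication waits one further step for both results.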
The synthesis constraint 12 includes preselected information such as a circuit scale, a resource amount, a delay constraint (timing constraint; clock frequency), and a pipelining target. The circuit information 13 includes preselected information such as the scale and delay of each of later-described resources (arithmetic unit 212, register 213, memory unit 210, etc.) included in the array-type processor 20.
For a loop description to be pipelined, which is selected from the loop descriptions other than those to be unrolled, the scheduling section 102 and the allocation section 103 perform scheduling and allocation, respectively, either by applying a dedicated synthesis constraint and circuit information to it or by performing a forwarding (bypassing) process (described later) on it. In the present embodiment, the loop description to be pipelined is user-specified. Alternatively, the loop description to be pipelined may be specified automatically by the data processing device 10.
Loop description pipelining will now be briefly described with reference to
When, as shown in
When, as shown in
When, as shown in
When a loop description is pipelined as described above, the number of execution cycles is reduced. This provides an increased throughput (processing capacity).
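The reduction in execution cycles can be estimated with the standard pipeline formula. The sketch below assumes each stage takes one step, that there are no stalls, and that a new loop process starts every `interval` steps; the figures of four stages and ten iterations in the note are illustrative and are not taken from the drawings.

```python
def cycles_sequential(iterations, stages):
    """Each loop process runs to completion before the next one starts."""
    return iterations * stages


def cycles_pipelined(iterations, stages, interval=1):
    """A new loop process starts every `interval` steps (assumes no stalls)."""
    if iterations == 0:
        return 0
    return stages + (iterations - 1) * interval
```

For ten iterations of a four-stage loop, the sequential form needs 40 steps, while the pipelined form with an interval of one needs 13.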
Details of loop description pipelining are also disclosed in “High-level Synthesis Challenges for Mapping a Complete Program on a Dynamically Reconfigurable Processor” (Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato, Toru Awashima, Kazutoshi Wakabayashi, IPSJ Transactions on System LSI Design Methodology, February 2010, vol. 3, pp. 91-104), published by the inventors of the present invention.
However, when a loop description is pipelined, a data hazard may occur. Therefore, it is necessary to avoid such a data hazard. The data hazard will now be described briefly with reference to
First, the four stages A1 (read), B1 (read), C1 (write), D1 (read) of the first loop process are sequentially executed. Further, the four stages A2 (read), B2 (read), C2 (write), D2 (read) of the second loop process are sequentially executed with a delay of one step from the start of the first loop process. In this instance, a data read process at the stage A2 is performed earlier than a data write process at the stage C1. Therefore, the stage A2 may read out data that has not yet been updated by the stage C1. This type of problem is referred to as a data hazard.
Such a data hazard can be avoided, for example, by performing a forwarding (bypassing) process during scheduling for behavioral synthesis. This ensures that the data read process at the stage A2 is not executed earlier than the data write process at the stage C1.
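The timing relationship behind the hazard can be checked mechanically. The sketch below assumes one step per stage, that every read and write touches the same storage location, and that a read counts as hazardous only when it runs strictly earlier than the preceding loop process's write, following the wording above. It detects the hazard only; it does not model the forwarding (bypassing) process itself. All names are illustrative.

```python
STAGES = ["A", "B", "C", "D"]  # stage order within one loop process
ACCESS = {"A": "read", "B": "read", "C": "write", "D": "read"}


def stage_step(iteration, stage, interval=1):
    """0-based step at which `stage` of the 0-based `iteration` executes."""
    return iteration * interval + STAGES.index(stage)


def hazardous_reads(interval=1):
    """Read stages of the second loop process that precede the first one's write."""
    write_step = stage_step(0, "C", interval)
    return [
        stage
        for stage in STAGES
        if ACCESS[stage] == "read" and stage_step(1, stage, interval) < write_step
    ]
```

With a start delay of one step, `hazardous_reads()` reports stage A, matching the A2-before-C1 case described above. Whether a read in the same step as the write (B2 and C1 here) is also unsafe depends on the hardware and is not decided by this sketch.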
Returning to
The RTL description generation section 107 outputs the above-mentioned state transition machine and the data paths corresponding respectively to the states included in the state transition machine as the RTL description 14.
Subsequently, the object code generation section 109 reads the RTL description 14, generates a net list by performing, for example, technology mapping and a place and route, subjects the net list to binary conversion, and outputs the result of binary conversion as the object code 15.
As described above, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. The array-type processor 20 then dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
An operation of the behavioral synthesis section 100 in the data processing device 10 will now be described with reference to
First of all, the behavioral synthesis section 100 performs syntax analysis (step S101) upon receipt of the source code 11, and then optimizes an operation description language level (step S102).
In this instance, the behavioral synthesis section 100 selects a predefined loop description from a plurality of loop descriptions included in the source code 11 and unrolls the selected loop description (step S103).
For example, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations. In other words, the behavioral synthesis section 100 unrolls a loop description with no data dependency between a plurality of loop processes. The unrolled part is synthesized by a behavioral synthesizer as a circuit to be eventually subjected to parallel arithmetic processing (parallel arithmetic processing circuit). This parallel arithmetic processing circuit is dynamically configured by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction.
Subsequently, the behavioral synthesis section 100 assigns nodes, which represent various processing functions, and branches, which represent the flows of data (step S104), and prepares a DFG (step S105).
Next, the behavioral synthesis section 100 performs scheduling (step S106) and allocation (step S107) in accordance with the synthesis constraint 12 and with the circuit information 13.
Next, in accordance with the results of scheduling and allocation, the behavioral synthesis section 100 generates a state transition machine and a plurality of data paths corresponding respectively to the states included in the state transition machine (steps S108 and S109). Further, the behavioral synthesis section 100 achieves pipelining by collapsing a plurality of states included in the loop description to be pipelined (step S110). Subsequently, the behavioral synthesis section 100 optimizes an RT level and logic level with respect to the state transition machines and data paths (step S111), and outputs the result of optimization as the RTL description 14 (step S112).
As described above, the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. Subsequently, the array-type processor 20 dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
(Exemplary Hardware Configuration of Data Processing Device 10)
The behavioral synthesis section 100 according to the present embodiment and the data processing device 10 having the behavioral synthesis section 100 can be implemented, for instance, with a general-purpose computer system. A brief description is given below with reference to
The HDD 115 stores an OS (operating system) (not shown), operation description information 116, circuit information 117, and a data processing program 118. The operation description information 116 relates to a circuit operation and corresponds, for instance, to the source code (operation description) 11 shown in
The CPU 111 controls, for example, various processes performed in the computer 110 and access to the RAM 112, ROM 113, interface 114, and HDD 115. The computer 110 is configured so that the CPU 111 reads and executes the OS and the data processing program 118, which are stored on the HDD 115. This enables the computer 110 to implement the behavioral synthesis section 100 according to the present embodiment and the data processing device 10 having the behavioral synthesis section 100.
(Data Processing System 1)
In the data processing system 1 shown in
A third embodiment of the present invention will now be described in relation to a concrete example of a loop description that is to be unrolled.
The behavioral synthesis section 100 not only pipelines the whole or part of a loop description with data dependency between iterations and performs behavioral synthesis, but also unrolls the whole or part of a loop description with no data dependency between iterations and performs behavioral synthesis.
In the example shown in
As a result, the array-type processor 20 can not only perform SIMD computations on a circuit corresponding to the outer loop for parallel processing purposes, but also pipeline a circuit corresponding to the inner loop for parallel processing purposes. Consequently, the array-type processor 20 can perform a wider range of parallel processing than a related-art SIMD processor.
As regards a multiple loop, the inner loop is suitable for being pipelined because it has data dependency between iterations, whereas the outer loop is suitable for being unrolled and processed independently because it has no data dependency between iterations. Concrete examples are given below.
FIRST CONCRETE EXAMPLE
As for JPEG and MPEG, the loop processes for the outer and inner loops are, for example, as indicated below.
- Outer loop: The loop process is performed on each macro-block of 8-row×8-column pixels or 16-row×16-column pixels.
- Inner loop: The loop process is a DCT conversion process that is performed on each macro-block of a plurality of pixels.
As for voice signal FFT conversion, the loop processes for the outer and inner loops are, for example, as indicated below.
- Outer loop: The loop process is performed on each block of 1024 point signals.
- Inner loop: The loop process is an FFT process that is performed on each block of 1024 point signals.
As for an FIR image filter, the loop processes for the outer and inner loops are, for example, as indicated below.
- Outer loop: The loop process is performed on each block that is obtained when an image frame or an image is divided into regions.
- Inner loop: The loop process is a filter process that is performed on each block of pixels.
When the moving average of a plurality of stock prices is to be calculated, the loop processes for the outer and inner loops are, for example, as indicated below.
- Outer loop: Stock name.
- Inner loop: The loop process is a process of calculating the moving average of each stock.
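The stock-price case above can be written out as a double loop to show where the dependency lies. In the sketch below, the inner loop carries a running sum from one iteration to the next (data dependency between iterations, hence a pipelining candidate), while the outer loop over stock names treats each stock independently (no dependency, hence an unrolling candidate). The function names and the running-sum formulation are assumptions chosen for illustration.

```python
def moving_average(prices, window):
    """Inner loop: `running` is carried from one iteration to the next."""
    out = []
    running = 0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the value that left the window
        if i >= window - 1:
            out.append(running / window)
    return out


def moving_averages(stocks, window):
    """Outer loop: each stock is processed independently of the others."""
    return {name: moving_average(prices, window) for name, prices in stocks.items()}
```

For example, `moving_average([1, 2, 3, 4, 5], 3)` yields the averages of the three overlapping windows, and `moving_averages` applies the same computation to every stock without any value flowing between stocks.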
A fourth embodiment of the present invention will now be described in relation to a detailed method that the behavioral synthesis section 100 uses to automatically assign a SIMD instruction to an unrolled part of a loop described with a scalar variable (a detailed method of automatically rewriting the unrolled part with a vector variable). The fourth embodiment will be described on the assumption that the behavioral synthesis section 100 unrolls a part of the outer loop of a multiple loop shown in
For example, the behavioral synthesis section 100 divides the outer loop having four hundred iterations into a loop description (first loop description) A having eight iterations and a loop description (second loop description) B having fifty iterations in accordance with the number of arithmetic units (eight units) in each processor element 207.
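The division of four hundred iterations into eight by fifty can be sketched as an index mapping. The sketch below assumes that loop description B becomes the outer loop of fifty iterations and loop description A the inner group of eight, with original index `i = b * 8 + a`; the source gives only the two iteration counts, so this ordering is an assumption, as is the function name.

```python
TOTAL_ITERATIONS = 400  # outer loop iterations (from the text)
LANES = 8               # arithmetic units per processor element (from the text)


def split_iterations(total=TOTAL_ITERATIONS, lanes=LANES):
    """Return groups of `lanes` original indices; one group per B iteration."""
    if total % lanes != 0:
        raise ValueError("total must be a multiple of the lane count")
    groups = total // lanes
    return [[b * lanes + a for a in range(lanes)] for b in range(groups)]


groups = split_iterations()
```

Each of the fifty groups holds eight original iteration indices, so unrolling loop description A gives one group per pass of loop description B, and every original index appears exactly once.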
Next, the behavioral synthesis section 100 unrolls the loop description A having eight iterations as shown in
Next, as shown in
As described above, the behavioral synthesis section 100 unrolls a loop described with a scalar variable and then assigns a SIMD instruction to the unrolled part.
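One way to picture the assignment of a SIMD instruction to an unrolled part is as a grouping step over the unrolled scalar operations. The sketch below uses a hypothetical tuple representation `(opcode, dst, src1, src2)` for each scalar operation and merges eight copies with the same opcode into a single vector operation. The representation, the `simd_` prefix, and the function name are all invented for illustration and do not reflect the synthesizer's internal data structures.

```python
def assign_simd(scalar_ops, lanes=8):
    """Merge `lanes` scalar operations with one shared opcode into a vector op.

    Each scalar op is a hypothetical tuple (opcode, dst, src1, src2).
    Returns (simd_opcode, dst_vector, src1_vector, src2_vector).
    """
    if len(scalar_ops) != lanes:
        raise ValueError("expected one scalar operation per lane")
    opcodes = {op[0] for op in scalar_ops}
    if len(opcodes) != 1:
        raise ValueError("unrolled copies must share one opcode")
    opcode = opcodes.pop()
    dst = [op[1] for op in scalar_ops]
    src1 = [op[2] for op in scalar_ops]
    src2 = [op[3] for op in scalar_ops]
    return ("simd_" + opcode, dst, src1, src2)


# Eight unrolled copies of c = a + b, one per iteration (illustrative names).
unrolled = [("add", f"c{i}", f"a{i}", f"b{i}") for i in range(8)]
vector_op = assign_simd(unrolled)
```

The eight scalar variables `c0` to `c7` end up as the lanes of one vector destination, which corresponds to rewriting the unrolled part with a vector variable.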
Exemplary methods that the behavioral synthesis section 100 uses to assign the SIMD instruction will now be described with reference to
Referring to the example shown in
The other operations of the behavioral synthesis section 100 shown in
The first SIMD instruction assignment method inhibits a scheduling process from becoming complicated and is therefore advantageous in that the total processing time for behavioral synthesis is reduced.
(Second SIMD Instruction Assignment Method)
Referring to the example shown in
In the subsequent processes ranging from DFG generation (step S105) to pipelining (step S110), data is entirely processed as a scalar amount.
Subsequently, the behavioral synthesis section 100 assigns the above-mentioned SIMD instruction to the unrolled part of the loop description (step S1101), optimizes the RTL (step S111), and outputs the RTL description (step S112).
The second SIMD instruction assignment method is advantageous in that it is practically unnecessary to change an existing behavioral synthesis flow.
As described above, the behavioral synthesis section 100 according to the present embodiment unrolls a predefined loop description included in an operation description and automatically assigns a SIMD instruction for the array-type processor 20 to the unrolled part. This eliminates the necessity of learning a dedicated language including, for example, the SIMD instruction. Consequently, the length of design time can be reduced.
As described above, the array-type processor (parallel arithmetic device) 20 according to the foregoing embodiments includes a plurality of processor elements 207 that are capable of performing a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Therefore, the array-type processor 20 according to the foregoing embodiments can dynamically configure a parallel arithmetic processing circuit that performs arithmetic processes in parallel upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
Further, the behavioral synthesis section (behavioral synthesizer) 100 according to the foregoing embodiments unrolls a loop description with no data dependency between iterations and performs behavioral synthesis to generate a parallel arithmetic processing circuit. The array-type processor 20 then dynamically configures the parallel arithmetic processing circuit by combining one or more switch elements 208 with one or more processor elements 207 that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction. Consequently, the array-type processor 20 according to the foregoing embodiments can dynamically configure the parallel arithmetic processing circuit upon receipt of a smaller number of operation instructions than in the past. This makes it possible to efficiently use circuit resources.
Furthermore, the behavioral synthesis section (behavioral synthesizer) 100 according to the foregoing embodiments unrolls a loop description included in an operation description and automatically assigns a SIMD instruction to the unrolled part. This eliminates the necessity of learning a dedicated language including, for example, the SIMD instruction. Consequently, the length of design time can be reduced.
Moreover, the array-type processor 20 according to the foregoing embodiments can not only perform SIMD computations for parallel processing purposes, but also perform pipelining for parallel processing purposes. Consequently, the array-type processor 20 can perform a wider range of parallel processing than a related-art SIMD processor. As the amount of computation processible by a single operation instruction increases with an increase in the degree of parallelism, the performance per unit area increases. Further, as the same amount of computation can be provided at a lower frequency, the power consumption per unit performance is suppressed.
The foregoing embodiments have been described on the assumption that each processor element 207 includes eight arithmetic units capable of performing an arithmetic process on 16-bit data. However, the present invention is not limited to such a configuration. The employed configuration can be changed as needed so that each processor element 207 includes two or more arithmetic units. Further, the employed configuration can also be changed as needed so that each processor element 207 includes arithmetic units capable of performing an arithmetic process on data with a bit width other than 16 bits. In this instance, however, the bit widths of data wiring, flag wiring, and the like need to be changed as well.
The foregoing embodiments have been described on the assumption that the behavioral synthesis section 100 unrolls an outer loop of a multiple loop having an inner loop as well as the outer loop. However, the present invention is not limited to the use of such an unrolling scheme. The employed configuration can be changed as needed so that the behavioral synthesis section 100 unrolls a loop description with no data dependency between iterations.
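The outer-loop unrolling mentioned above can be pictured with the following Python sketch, which is an illustration under stated assumptions and not the synthesizer's actual procedure. The matrix-vector product, the names `nested_scalar` and `outer_unrolled`, and the value of `LANES` are all hypothetical. The outer loop has independent iterations and is spread across lanes, while the inner loop, whose accumulator carries a dependency, remains sequential.

```python
LANES = 4  # assumed number of arithmetic units used for this illustration


def nested_scalar(matrix, vec):
    """Reference multiple loop: outer loop over rows, inner loop accumulates."""
    out = []
    for row in matrix:              # outer loop: iterations are independent
        acc = 0
        for j in range(len(vec)):   # inner loop: acc carries a dependency
            acc += row[j] * vec[j]
        out.append(acc)
    return out


def outer_unrolled(matrix, vec):
    """Outer loop unrolled across lanes; the inner loop stays sequential."""
    out = []
    for base in range(0, len(matrix), LANES):
        rows = matrix[base:base + LANES]
        accs = [0] * len(rows)      # one accumulator per lane
        for j in range(len(vec)):
            # One vector multiply-accumulate step covering all lanes.
            accs = [acc + row[j] * vec[j] for acc, row in zip(accs, rows)]
        out.extend(accs)
    return out
```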
Providing each processor element 207 with two or more arithmetic units is also useful in terms of functional safety. In this case, when assigning a SIMD instruction, the behavioral synthesis section 100 copies a scalar instruction and assigns the identical copies to multiple arithmetic units in the respective processor elements 207. A failure in some of the arithmetic units in a processor element can then be handled either by detecting whether the operation results match between the arithmetic units to which the identical instructions are allocated, or by providing an additional circuit for correction. When used together with the foregoing embodiments, for instance, each copied scalar instruction is assigned to two arithmetic units in the above-described processor element provided with eight arithmetic units, so that the processor element can be used for SIMD computation that executes four instructions in parallel. For correction, three or more arithmetic units are made to execute identical instructions so that the correct value can be determined by majority. In addition, the behavioral synthesis section 100 according to the foregoing embodiments and the data processing device having the behavioral synthesis section 100 can implement an arbitrary process by having a CPU (central processing unit) execute a computer program.
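The duplicated-instruction scheme for functional safety can be sketched in Python as follows. This is a simplified software model, not the circuit of the embodiments: the helper names `run_duplicated`, `results_match`, and `majority` are hypothetical, and the arithmetic units are simulated by repeated calls to the same function. Two copies allow a mismatch to be detected, and three or more copies allow a single faulty result to be outvoted.

```python
from collections import Counter


def run_duplicated(op, x, y, copies=2):
    """Run the same scalar operation on `copies` simulated arithmetic units."""
    return [op(x, y) for _ in range(copies)]


def results_match(results):
    """Detection: True when every copy produced the same value."""
    return all(r == results[0] for r in results)


def majority(results):
    """Correction: value produced by a strict majority of copies, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None
```

For example, `majority([5, 5, 4])` yields 5 because two of the three copies agree, whereas `results_match([5, 4])` only reports that a fault occurred without indicating which copy is correct.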
The computer program executed by the CPU can be stored on various types of non-transitory computer-readable media and supplied to the computer. The non-transitory computer-readable media include various types of tangible storage media. More specifically, the non-transitory computer-readable media include a magnetic recording medium (e.g., flexible disk, magnetic tape, or hard disk drive), a magneto-optical recording medium (e.g., magneto-optical disk), a CD-ROM (read-only memory), a CD-R, a CD-R/W, a DVD (digital versatile disc), a BD (Blu-ray (registered trademark) disc), and a semiconductor memory (e.g., mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, or RAM (random-access memory)). The program may also be supplied to the computer by using various types of transitory computer-readable media. The transitory computer-readable media include an electrical signal, an optical signal, and an electromagnetic wave. The transitory computer-readable media can supply the program to the computer through a wired communication path, such as an electric wire or optical fiber, or through a wireless communication path.
(Differences from Related-Art Technologies)
The SIMD processor disclosed in Japanese Patents Nos. 4292197 and 4699002 and in Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2010-539582 successively interprets operation instructions in the same manner as a CPU and the like and performs a plurality of arithmetic processes in parallel. In the SIMD processor, a conditional branch is implemented as a jump instruction.
Consequently, even when, for instance, a conditional branch depends on only one of a plurality of sets of data to be processed in parallel, the SIMD processor cannot execute the conditional branch with respect to only that one set of data. Instead, the SIMD processor has to execute the conditional branch with respect to all the sets of data to be processed in parallel. This results in an increase in the number of execution cycles.
Meanwhile, the array-type processor according to the foregoing embodiments includes a plurality of processor elements that can perform a plurality of arithmetic processes in parallel upon receipt of one operation instruction (SIMD computations). Therefore, the array-type processor according to the foregoing embodiments can dynamically configure a circuit for performing arithmetic processes in parallel upon receipt of a small number of operation instructions. In other words, the array-type processor according to the foregoing embodiments can perform SIMD computations as is the case with the SIMD processor.
Here, the conditional branch is not implemented as a jump instruction, but is synthesized as a data path containing a multiplexer during behavioral synthesis. Therefore, unlike the SIMD processor, the array-type processor according to the foregoing embodiments suppresses an increase in the number of execution cycles even when a plurality of conditional branches exist. In other words, the degradation in computational performance is suppressed.
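The idea of synthesizing a conditional branch as a data path containing a multiplexer can be illustrated with the following Python sketch. It is a software analogy under stated assumptions, not the synthesized circuit: the function names `mux` and `branch_as_datapath`, the comparison against `threshold`, and the two arithmetic arms are hypothetical. Both arms are computed for every lane, and a per-lane flag selects the result, so no jump is needed even when only one lane takes the branch.

```python
def mux(flag, if_true, if_false):
    """Two-input multiplexer: the flag selects which input is passed on."""
    return if_true if flag else if_false


def branch_as_datapath(xs, threshold):
    """Per lane: y = x - threshold if x > threshold else x + 1, with no jump."""
    flags = [x > threshold for x in xs]      # comparison result sent as a flag
    then_vals = [x - threshold for x in xs]  # "then" arm computed for every lane
    else_vals = [x + 1 for x in xs]          # "else" arm computed for every lane
    return [mux(f, t, e) for f, t, e in zip(flags, then_vals, else_vals)]
```

Because the selection happens lane by lane, the work done is the same whether zero, one, or all lanes satisfy the condition.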
Further, the related-art SIMD processor is often designed by using an operation description given in a dedicated language extended to handle a vector as a variable instead of using an operation description given, for instance, in C language.
Meanwhile, the behavioral synthesis section (behavioral synthesizer) according to the foregoing embodiments unrolls a predefined loop description included in an operation description given in C language or the like and automatically assigns a SIMD instruction to the unrolled part. This eliminates the necessity of learning a dedicated language, thereby making it possible to reduce the length of design time.
The dynamically reconfigurable processor disclosed in Japanese Patents Nos. 3921367 and 3861898 forms a circuit by dynamically changing the processing of each of a plurality of processor elements and the relation of coupling therebetween as described earlier. Hence, the dynamically reconfigurable processor can reuse circuit resources to perform a complicated arithmetic process with a small-scale circuit. Here, as shown in
However, as mentioned earlier, the related-art dynamically reconfigurable processor has to store in memory an extremely large number of instructions about the processing of each of a plurality of processor elements and the relation of coupling therebetween. Hence, the related-art dynamically reconfigurable processor cannot efficiently use circuit resources.
Meanwhile, the array-type processor according to the foregoing embodiments includes a plurality of processor elements that are capable of performing a plurality of arithmetic processes in parallel (SIMD computations) upon receipt of one operation instruction. Hence, the array-type processor according to the foregoing embodiments can dynamically configure a circuit for performing arithmetic processes in parallel upon receipt of a small number of instructions. Consequently, circuit resources can be efficiently used.
While the present invention contemplated by its inventors has been described in detail in terms of preferred embodiments, it is to be understood that the present invention is not limited to those preferred embodiments, but extends to various modifications that nevertheless fall within the spirit and scope of the appended claims.
Claims
1. A parallel arithmetic device comprising:
- a plurality of data wirings disposed in a first direction and a second direction;
- a plurality of flag wirings corresponding to the data wirings;
- a plurality of wiring coupling switches each being disposed at respective intersections of the data wirings; and
- a plurality of processor elements surrounded by the data wirings;
- wherein a processor element from among the plurality of the processor elements is configured to: perform an arithmetic process on data supplied from a first processor element based on a first flag supplied from the first processor element, the data being supplied on data wiring and the first flag being supplied on flag wiring; output a computation result to a second processor element on data wiring; and output a second flag based on the computation result to the second processor on flag wiring.
2. The parallel arithmetic device according to claim 1,
- wherein each of the plurality of the processor elements comprises an instruction memory configured to store a plurality of operation instructions;
- wherein each of the plurality of the processor elements is configured to perform an operation based on a selected operation instruction among the plurality of operation instructions.
3. The parallel arithmetic device according to claim 2,
- wherein each of the plurality of wiring coupling switches is configured to control data wiring and vertical wiring disposed in the first direction based on an operation instruction read from an instruction memory, and to couple the processor elements using data wiring.
4. The parallel arithmetic device according to claim 2,
- wherein each of the plurality of the processor elements comprises a plurality of arithmetic units each configured to perform an arithmetic process based on operation instructions stored in that processor element;
- wherein each arithmetic element is configured to perform an arithmetic process on input data in accordance with an operation instruction read out from the instruction memory of that processor element.
5. The parallel arithmetic device according to claim 2,
- wherein each of the plurality of the processor elements comprises a plurality of arithmetic units configured to perform an arithmetic process based on operation instructions stored in the processor element;
- wherein each arithmetic element is configured to perform an arithmetic process in parallel on each of a plurality of sets of input data in accordance with one operation instruction read out from the instruction memory of that processor element.
6. A parallel arithmetic device comprising:
- data wiring and flag wiring;
- wiring coupling switches disposed at respective intersections of the data wiring;
- a first processor element configured to output first data and a first flag;
- a second processor element configured to receive the first data supplied on data wiring connecting the first processor element to the second processor element, to receive the first flag supplied on flag wiring connecting the first processor element and the second processor element, to perform an arithmetic process on the first data based on the first flag, and to output a computation result and a second flag corresponding to the computation result; and
- a third processor element configured to receive the computation result supplied on data wiring connecting the second processor element to the third processor element, and to receive the second flag supplied on flag wiring connecting the second processor element to the third processor element.
7. The parallel arithmetic device according to claim 6, wherein the second processor element includes a first arithmetic unit configured to perform a first arithmetic process on the first data, and a second arithmetic unit configured to perform a second arithmetic process on the first data in parallel with the first arithmetic process.
8. The parallel arithmetic device according to claim 7, wherein the first arithmetic process is different from the second arithmetic process.
9. The parallel arithmetic device according to claim 8, wherein the second processor element further includes an instruction memory configured to store operation instructions, the first arithmetic unit and the second arithmetic unit being configured according to the operation instructions stored on the instruction memory.
10. The parallel arithmetic device according to claim 6, wherein the second processor element is configured to perform at least two arithmetic processes in parallel upon receipt of one operation instruction.
11. A data processing system comprising:
- a data processing device comprising: a behavioral synthesis section configured to generate a structural description by unrolling, for behavioral synthesis purposes, a loop description that is included in an operation description and has no data dependency between iterations; and a layout section configured to subject the structural description to logic synthesis and to perform place and route; and
- a parallel arithmetic device comprising: data wiring and flag wiring; wiring coupling switches disposed at respective intersections of the data wiring; a first processor element configured to output first data and a first flag; a second processor element configured to receive the first data supplied on data wiring connecting the first processor element to the second processor element, to receive the first flag supplied on flag wiring connecting the first processor element and the second processor element, to perform an arithmetic process on the first data based on the first flag, and to output a computation result and a second flag corresponding to the computation result; and a third processor element configured to receive the computation result supplied on data wiring connecting the second processor element to the third processor element, and to receive the second flag supplied on flag wiring connecting the second processor element to the third processor element,
- wherein the second processor element is dynamically configurable according to a state output by the data processing device.
12. The data processing system according to claim 11, wherein the second processor element is dynamically configured as a circuit corresponding to an unrolled part of the loop description by using arithmetic units included in the second processor element.
13. The data processing system according to claim 11, wherein the behavioral synthesis section divides the loop description with no data dependency between iterations into:
- a first loop description having a number of iterations according to a number of arithmetic units included in each of the first processor element, second processor element, and third processor element; and
- a second loop description adapted to perform a loop process on the first loop description, and unrolls the first loop description.
14. The data processing system according to claim 11, wherein the behavioral synthesis section performs behavioral synthesis by unrolling an outer loop of a multiple loop having an inner loop as well as the outer loop.
15. The data processing system according to claim 11, wherein the behavioral synthesis section replaces the unrolled part of the loop description with a vector variable and outputs the vector variable as the structural description.
Type: Application
Filed: Feb 12, 2016
Publication Date: Jun 9, 2016
Applicant: Renesas Electronics Corporation (Tokyo)
Inventors: Takao TOI (Tokyo), Taro FUJII (Tokyo), Yoshinosuke KATO (Tokyo), Toshiro KITAOKA (Tokyo)
Application Number: 15/042,527