APPARATUS AND METHOD FOR CONTROLLING EXECUTION OF A SINGLE THREAD BY MULTIPLE PROCESSORS
An apparatus includes a plurality of processors and a holder unit. The plurality of processors execute a task as a unit of processing by dividing the task into multiple threads including single and parallel threads, where the single thread is executed by only one of the plurality of processors whose respective pieces of processing have reached the thread, and the parallel thread is executed in parallel with another parallel thread by the plurality of processors. The holder unit is configured to hold information to be shared by the plurality of processors. Each processor executes one of the multiple threads at a time, and causes the holder unit to hold reaching-state information indicating an extent to which the multiple threads executed by the plurality of processors have reached the single thread. Each processor determines whether to execute the single thread, based on the reaching-state information held in the holder unit.
This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2014-165172, filed on Aug. 14, 2014, the entire contents of which are incorporated herein by reference.
FIELD
The embodiment discussed herein is related to an apparatus and method for controlling execution of a single thread by multiple processors.
BACKGROUND
A parallel computer including multiple processors operable in parallel enhances processing efficiency by dividing a task as a unit of processing into multiple threads and then causing the multiple processors to execute the threads. A processor device, such as a central processing unit (CPU) including multiple cores, is one example of such a parallel computer.
For a parallel computer of this type, there has been proposed a technique in which a storage area is first allocated to a thread that remains active from the start to the end of a program, and variables used in the other threads are then stored in that storage area (for example, see Japanese Laid-open Patent Publication No. 2002-99426). This technique ensures that even when another thread executed in parallel ends, a variable used in that thread is held in the storage area without being lost during the execution of the program.
Another proposed technique is one in which, based on a value set to a flag allocated in a main memory, a thread waits to execute synchronous processing until another thread completes execution of an instruction code, and executes the synchronous processing after that execution is completed (for example, see Japanese Laid-open Patent Publication No. 2011-134145).
SUMMARY
According to an aspect of the invention, an apparatus includes a plurality of processors and a holder unit. The plurality of processors execute a task as a unit of processing by dividing the task into multiple threads including a single thread and a parallel thread, where the single thread is a thread to be executed by only one of the plurality of processors whose respective pieces of processing have reached the thread, and the parallel thread is a thread to be executed in parallel with another parallel thread by the plurality of processors. The holder unit is configured to hold information to be shared by the plurality of processors. Each of the plurality of processors executes one of the multiple threads at a time, and causes the holder unit to hold reaching-state information indicating an extent to which the multiple threads executed by the plurality of processors have reached the single thread. Each processor determines whether to execute the single thread, based on the reaching-state information held in the holder unit.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
A storage area used by multiple threads is allocated to an external storage device, such as a main memory, in order to enable access from the multiple threads. For this reason, the number of cycles for access to the storage area is larger than the number of cycles for access to a register provided in a processor device, and access efficiency is accordingly low. As a result, processing efficiency during execution of multiple threads in parallel may be lowered.
Hereinafter, embodiments are described with reference to the accompanying drawings. A signal line transmitting a signal is denoted by the same reference numeral as the signal name.
The execution units 12, 22 may execute multiple threads in parallel, or a single thread alone. The holder unit 30 is shared by both of the processors 10, 20, and is configured to hold reaching-state information indicating an extent to which processing executed by the execution units 12, 22 has reached a single thread STH (STH0 or STH1). The single thread STH is a thread exclusively executed by only one of the execution units 12, 22. For example, when the execution unit 12 of the processor 10 executes a single thread STH, the execution unit 22 of the other processor 20 skips the single thread STH without executing it. In the example illustrated in
Each of the control units 14, 24 stores reaching-state information into the holder unit 30 when processing of the execution unit 12 or 22 reaches an entrance of the single thread STH. Each of the determination units 16, 26 determines, based on the reaching-state information held by the holder unit 30, whether to cause the execution unit 12 or 22 to execute the single thread.
Lower part of
Reference numerals T0, T1, T2, T3, T4, and T5 represent time, indicating that processing by the execution unit 22 is faster than processing by the execution unit 12.
At times T0, T1, processing of both the execution units 12, 22 has not reached an entrance of a single thread STH0. Therefore, the holder unit 30 holds reaching-state information indicating “no execution unit whose processing has reached the single thread STH0”.
At a time T2, processing executed by the execution unit 22 reaches the entrance of the single thread STH0, and the control unit 24 stores reaching-state information indicating “processing of the execution unit 22 has reached the single thread STH0” into the holder unit 30. Since the holder unit 30 is provided in the processor device, time for storing the reaching-state information is shorter than time for storing reaching-state information into an external storage device of the processor device. The determination unit 26 of the processor 20 including the execution unit 22 whose processing has reached the entrance of the single thread STH0 causes the execution unit 22, based on the reaching-state information for the single thread STH0 held by the holder unit 30, to execute the single thread STH0. Next, at a time T3, the execution unit 12 executes a thread PTH0, and the execution unit 22 executes a thread PTH1.
At a time T4, processing executed by the execution unit 22 reaches an entrance of a single thread STH1. However, reaching-state information held by the holder unit 30 indicates "processing of the execution unit 22 has reached the single thread STH0" (that is, processing of the execution unit 12 has not yet reached an entrance of the single thread STH0). Since the holder unit 30 is able to hold reaching-state information corresponding to only one single thread STH, the control unit 24 of the processor 20 does not store reaching-state information indicating "processing of the execution unit 22 has reached the single thread STH1" into the holder unit 30. Since reaching-state information for the single thread STH1 is not held by the holder unit 30, the determination unit 26 of the processor 20 determines to suspend execution of the single thread STH1 by the execution unit 22.
That is, the determination unit 26 detects that before processing executed by the execution unit 12 reaches the entrance of the single thread STH0, processing executed by the execution unit 22 has reached the entrance of the single thread STH1 executed after the single thread STH0. Then, when the holder unit 30 has no area to store reaching-state information corresponding to the single thread STH1, the execution unit 22 suspends execution of the single thread STH1.
Next, at a time T5, processing executed by the execution unit 12 reaches the entrance of the single thread STH0. The determination unit 16 of the processor 10 determines, by referring to reaching-state information held by the holder unit 30, that processing of both the execution units 12, 22 has reached the entrance of the single thread STH0. The determination unit 16 detects, based on the reaching-state information held by the holder unit 30, that the execution unit 22 of the other processor 20 has executed the single thread STH0, and causes processing executed by the execution unit 12 to jump from the entrance of the single thread STH0 to the exit thereof. Thus, execution of the single thread STH0 by the execution unit 12 is skipped.
Time for referring to the reaching-state information is shorter than time for referring to reaching-state information held by an external storage device of the processor device. Then, the control unit 14 of the processor 10 initializes reaching-state information held by the holder unit 30 to “no execution unit whose processing has reached the single thread STH1”.
Then, referring to reaching-state information held by the holder unit 30, the control unit 24 of the processor 20 stores reaching-state information indicating "processing of the execution unit 22 has reached the single thread STH1" into the holder unit 30, since an area for holding reaching-state information for the single thread STH1 is now available therein. Then, the determination unit 26 of the processor 20 causes the execution unit 22, based on the reaching-state information for the single thread STH1 held by the holder unit 30, to execute the single thread STH1.
In the embodiment illustrated in
When the holder unit 30 does not hold reaching-state information indicating "processing of the processor 20 has reached the single thread STH1", the determination unit 26 of the processor 20 determines to suspend execution of the single thread STH1 by the execution unit 22. When no area for storing new reaching-state information is available in the holder unit 30, execution of the single thread STH1 is suspended; in this way, whether a single thread STH may be executed is controlled according to the storage capacity of the holder unit 30.
The core C0 includes an operation unit OPU, a data register unit DREG, an address register unit AREG, a program counter PC, an incrementer INC, an instruction register unit IREG, a decoder unit DEC, and selectors S1, S2. The operation unit OPU includes a register file REG, an arithmetic unit EX, and flag registers SF, ZF. The operation unit OPU is an example of an execution unit for executing a thread.
The program counter PC outputs an address received from the selector S1 to the incrementer INC, and the selector S2. The incrementer INC increments an address received from the program counter PC, and outputs the incremented address to the selector S1.
The selector S1 selects an address from the incrementer INC when sequentially fetching instruction codes, and selects an address from the operation unit OPU when a branch instruction, a jump instruction, or the like is executed. The selector S1 outputs a selected address to the program counter PC. The selector S2 selects an address outputted from the program counter PC when fetching an instruction code, and selects an address outputted from the address register unit AREG when executing a load instruction or a store instruction. The selector S2 outputs the selected address to the cache memory CM via the address bus AD0.
When the core C0 fetches an instruction, an instruction code is read from the cache memory CM according to the address bus AD0, and a read instruction code is stored into the instruction register unit IREG via the data bus DIN. When the instruction code is not held in the cache memory, the cache memory CM outputs an address to the main memory MM via the address bus AD1, and receives the instruction code from the main memory MM via the data bus DT. For example, the address AD1 is a high-order address of the address AD0, and the instruction code (program) corresponding to one cache line of the cache memory CM is read from the main memory MM. Then, the cache memory CM holds the instruction code read from the main memory MM, and outputs the read target instruction code out of held instruction codes to the instruction register unit IREG via the data bus DIN.
When the core C0 executes a load instruction, data is read from the cache memory CM according to the address bus AD0, and the read data is stored into the register file REG via the data bus DIN. When target data of the load instruction is not held in the cache memory CM, the cache memory CM reads data corresponding to one cache line from the main memory MM in a manner similar to the reading of the instruction code. Then, the cache memory CM holds the data read from the main memory MM, and outputs the load target data out of the held data to the register file REG via the data bus DIN.
When the core C0 executes a store instruction, data outputted from the data register unit DREG to the data bus DOUT is written into the cache memory CM according to an address outputted to the address bus AD0.
The instruction register unit IREG has multiple areas for holding instruction codes received from the cache memory CM, and outputs the held instruction codes sequentially to the decoder unit DEC. The decoder unit DEC decodes the instruction codes received from the instruction register unit IREG, and, based on the decoding results, generates control signals for controlling operations of the operation unit OPU, selectors S1, S2, and so on.
The data register unit DREG includes multiple areas for holding data outputted from the operation unit OPU during execution of the store instruction. The address register unit AREG includes multiple areas for holding addresses outputted from the operation unit OPU during execution of the load instruction or store instruction.
The register file REG includes multiple registers for holding data read from the cache memory CM, or data outputted from the arithmetic unit EX. Based on a control signal from the decoder unit DEC, the register file REG outputs data held in at least one of the multiple registers of the register file REG to the arithmetic unit EX.
The arithmetic unit EX executes operation in accordance with an instruction code decoded by the decoder unit DEC, and outputs operation results to the register file REG, data register unit DREG, address register unit AREG, or selector S1. The arithmetic unit EX sets or resets flag registers SF, ZF based on the operation results, and refers to values of the flag registers SF, ZF when executing the logical operation instruction or branch instruction. The operation unit OPU may include a flag register other than flag registers SF, ZF.
The register unit REGU includes multiple registers REGi (i represents any one of 0, 1, 2, 3, and 4), and a register REGj. Here, I, the number of storage areas of the registers REGi, is not limited to "5", but may be any number greater than or equal to "1". However, as illustrated in
In
Registers REGi, REGj are accessed when each of cores C0 to C3 executes the instruction code TEST&IDA (TEST & Increment, Decrement and Assignment) which will be illustrated in
The register REGj stores a total passing count j which represents the total number of single processing blocks SIBs through which all the threads THs have passed. The total passing count j is an example of total passing count information indicating the number of single processing blocks SIBs through which processing of all the cores C0 to C3 has passed. The register REGj is an example of a total passing count area for holding the total passing count information. Usage of the registers REGi, REGj is described with reference to
The instruction code TEST&IDA is processed when the microprogram is executed by the arithmetic unit EX, in a manner similar to the addition instruction, multiplication instruction, load instruction, and store instruction. Operation of the arithmetic unit EX executing the instruction code TEST&IDA may instead be implemented by wired logic. However, by employing microcode, the instruction code TEST&IDA may be added more easily than with wired logic, and a hardware function (instruction set architecture) may be modified easily.
The cache memory CM operates as an instruction cache and a data cache. The cache memory CM may be provided for each of cores C, and may include a primary cache and a secondary cache. The main memory MM is a memory module, such as a synchronous dynamic random access memory (SDRAM) or a flash memory, and stores a program executed by the CPU and data handled by the CPU. The main memory MM includes a storage area for holding a core number n indicating the number of cores C, and a storage area for holding passing counts m (m0, m1, m2, m3) indicating the number of single processing blocks SIBs, illustrated in
The single processing block SIB is a processing block that is executed by one thread at a time. Except when there is no free space in the register REGi illustrated in
Upon reaching the entrance of the single processing block SIB, each thread TH executes the instruction code TEST&IDA. “n” and “m” of the instruction code TEST&IDA are operands (variables), respectively representing the core number n and the passing count m held in the main memory MM or the cache memory CM.
Based on values of flag registers SF, ZF that are set by execution of the instruction code TEST&IDA, each thread TH determines whether to execute the single processing block SIB or pass the single processing block SIB without executing the same. An example of determination processing executed by each thread TH is illustrated in
Upon reaching the entrance of the single processing block SIB, cores C execute, in the step S202, a load instruction to load the core number n and the passing count m from the main memory MM. When the cache memory CM holds the core number n and the passing count m, the core number n and the passing count m are read from the cache memory CM.
Next, in the step S100, cores C execute the instruction code TEST&IDA with the core number n and the passing count m loaded from the main memory MM as variables. An example of the processing executed by the instruction code TEST&IDA is illustrated in
Next, in the step S204, when the value of the flag register SF after execution of the instruction code TEST&IDA is "1", the core C determines that there is an available register among the registers REGi, and causes the processing to shift to the step S208. When the value of the flag register SF after execution of the instruction code TEST&IDA is not "1" (that is, "0"), the core C determines that there is no available register among the registers REGi, and causes the processing to shift to the step S206.
In the step S206, the core C returns the processing to the step S100 after waiting for a predetermined period of time. While waiting for the predetermined period of time in the step S206, the core C may execute other processing.
In the step S208, when the value of the flag register ZF after execution of the instruction code TEST&IDA is “1”, the core C determines that the core C has first reached the entrance of the single processing block SIB, and causes the processing to shift to the step S210. When the value of the flag register ZF after execution of the instruction code TEST&IDA is not “1” (that is, “0”), the core C determines that the other thread has reached the entrance of the single processing block SIB earlier, and causes the processing to shift to the step S212.
In the step S210, processing of the core C jumps to the single processing block SIB, and the core C executes the single processing block SIB. In the step S212, processing of the core C jumps to the exit of the single processing block SIB, and the core C starts next processing without executing the single processing block SIB. That is, the core C determines not to execute the single processing block SIB, and jumps the processing to the exit of the single processing block SIB. This inhibits the single processing block SIB from being executed by multiple cores C, and also suppresses malfunction of the CPU. After execution of steps S210 and S212, the processing is shifted to the step S214.
In the step S214, the core C increments the passing count m loaded from the main memory MM in the step S202 by “1”. Next, in the step S216, the core C executes the store instruction to store the passing count m incremented in the step S214 into the main memory MM. When the cache memory CM holds a passing count m, the passing count m incremented in the step S214 is stored into the cache memory CM and thereafter stored into the main memory MM. Then, the processing executed by the core C ends.
In the step S102, when the difference between the passing count m and the total passing count j is smaller than I indicating the number of registers REGi ("5" in this example), the arithmetic unit EX determines that there is an available register REGi and causes the processing to shift to the step S104; otherwise, it causes the processing to shift to the step S114.
In the step S104, the arithmetic unit EX sets the flag register SF at “1” to indicate that the processing has reached the entrance of the single processing block SIB, and then causes the processing to shift to the step S106. In the step S106, the arithmetic unit EX calculates a remainder i (“m % I”) by dividing the passing count m by I indicating the number of registers REGi, as the number i that is assigned to a register REGi to be used, and causes the processing to shift to the step S108.
In the step S108, when the unreached-thread count Xi stored in the register REGi, whose number i is obtained in the step S106, is “0”, the arithmetic unit EX determines that the processing has first reached the entrance of the single processing block SIB, and causes the processing to shift to the step S110. On the other hand, when the unreached-thread count Xi is not “0”, the arithmetic unit EX determines that processing of the other core C has reached the entrance of the single processing block SIB, and causes the processing to shift to the step S116.
In the step S110, the arithmetic unit EX stores a value obtained by subtracting "1" from the core number n ("4" in this example) into the register REGi as the unreached-thread count Xi, and causes the processing to shift to the step S112. In the step S112, the arithmetic unit EX sets the flag register ZF at "1" to indicate that the processing has first reached the entrance of the single processing block SIB, and ends the processing.
When processing of the other core C has reached the entrance of the single processing block SIB, the arithmetic unit EX reduces the unreached-thread count Xi by “1” in the step S116, and causes the processing to shift to the step S118. In the step S118, the arithmetic unit EX resets the flag register ZF at “0” to indicate that the processing has failed to first reach the entrance of the single processing block SIB, and causes the processing to shift to the step S120.
In the step S120, when the unreached-thread count Xi is "0", the arithmetic unit EX determines that the processing has last reached the entrance of the single processing block SIB, and causes the processing to shift to the step S122. When the unreached-thread count Xi is not "0", the arithmetic unit EX determines that there is a core C whose processing has not yet reached the entrance of the single processing block SIB, and ends the processing. In the step S122, since the processing of all cores C has reached the entrance of the single processing block SIB, the arithmetic unit EX increments the total passing count j by "1", and ends the processing.
On the other hand, when there is no available register REGi, the arithmetic unit EX sets the flag register SF at "0" in the step S114 to artificially indicate that the processing has not reached the entrance of the single processing block SIB (although it actually has), and ends the processing.
Thus, the processing of steps S110 and S112 is performed by a core C whose processing has first reached the entrance of the single processing block SIB. The processing of steps S116 to S122 is performed by a core C whose processing has reached the entrance of the single processing block SIB secondly or later. Further, the processing of the step S122 is performed by a core C whose processing has last reached the entrance of the single processing block SIB. The step S114 is processing executed by a core C when there is no free space in the register REGi.
The mark “*” of flag registers SF, ZF represents “0” or “1”. A broken line pointed to by an arrow represents a single processing block (SIB0 to SIB6), and a section above or below the single processing block represents a parallel processing block PAB (PAB0 to PAB6). A number i of the register REGi illustrated along with single processing blocks SIB0 to SIB6 is calculated in the step S106 illustrated in
Processing executed by each of cores C proceeds from above downward in
First, at a time T0, each of cores C0 to C3 starts the parallel processing block PAB0. In the initialized state, registers REGi, REGj, and passing counts m0 to m3 are initialized.
At a time T10, the core C3 completes execution of the parallel processing block PAB0, first reaches the entrance of the single processing block SIB0, and executes the instruction code TEST&IDA ((a) in
Since the unreached-thread count X0 of the register REGi is initialized to “0” before the processing reaches the entrance of the single processing block SIB0, the processing of the core C3 is determined to have first reached the entrance of the single processing block SIB. Thus, in the step S110 illustrated in
After executing the instruction code TEST&IDA, the core C3 causes the processing, in the step S210 illustrated in
Next, at a time T20, the core C0 completes execution of the parallel processing block PAB0, reaches the entrance of the single processing block SIB0 in second place, and executes the instruction code TEST&IDA ((f) of
After executing the instruction code TEST&IDA, the core C0 causes the processing, in the step S212 illustrated in
Next, at a time T30, the core C2 completes execution of the parallel processing block PAB0, reaches the entrance of the single processing block SIB0 in the third place, and executes the instruction code TEST&IDA ((k) of
Next, at a time T40, the core C1 completes execution of the parallel processing block PAB0, reaches the entrance of the single processing block SIB0 in the last place, and executes the instruction code TEST&IDA ((p) of
Next, at a time T50 of
Since the unreached-thread count X1 of the register REGi is initialized to “0” before the processing reaches the entrance of the single processing block SIB1, the processing of the core C0 is determined to have reached the entrance of the single processing block SIB in the first place. Thus, similarly to the operation at the time T10, the core C0 sets the unreached-thread count X1 at “3” (core number−1), and sets the flag register ZF at “1” ((c) and (d) of
Next, at a time T60, the core C3 completes execution of the parallel processing block PAB1, reaches the entrance of the single processing block SIB1 ((f) of
Next, at a time T80, before the processing of the core C1 reaches the entrance of the single processing block SIB1, the processing of the core C2 reaches the entrance of the single processing block SIB2 ((k) of
Next, at a time T100 of
Next, at a time T110, before the processing of the core C1 reaches the entrance of the single processing block SIB1, the processing of the core C0 reaches the entrance of the single processing block SIB3 ((b) of
Next, at a time T120, the processing of the core C3 reaches the entrance of the single processing block SIB3 ((c) of
Next, at a time T140, before the processing of the core C1 reaches the entrance of the single processing block SIB1, the processing of the core C3 reaches the entrance of the single processing block SIB4 ((e) of
Next, at a time T150 of
Next, at a time T160, before the processing of the core C1 reaches the entrance of the single processing block SIB1, the processing of the core C2 reaches the entrance of the single processing block SIB5 ((b) of
Next, at a time T170, before the processing of the core C1 reaches the entrance of the single processing block SIB1, the processing of the core C2 reaches the entrance of the single processing block SIB6 ((d) of
In the step S204 of
Next, at a time T181, the core C1 completes execution of the parallel processing block PAB1, and reaches the entrance of the single processing block SIB1 ((f) of
Before the processing of the core C1 reaches the entrance of the single processing block SIB1, the unreached-thread count X1 is “1”. Thus, in steps S116 and S118 of
After executing the instruction code TEST&IDA, the core C1 causes the processing, in the step S212 illustrated in
Thereafter, at a time T182, the core C2 waits for a predetermined period of time, and then executes the instruction code TEST&IDA. Although times T181 and T182 are provided separately for the purpose of illustration, operations indicated at times T181 and T182 are executed consecutively.
Before reaching the time T182, the passing count m2 is “6”, the total passing count j is “2”, and “m2−j (=4)” is smaller than the number I (“5”) of the register REGi. Thus, the core C2 sets the flag register SF at “1” ((l) of
After executing the instruction code TEST&IDA, the core C2 causes the processing, in the step S210 illustrated in
Next, at a time T190 of
After executing the instruction code TEST&IDA, since the flag register SF is “1”, and the flag register ZF is “0”, the core C1 causes the processing to jump to the exit of the single processing block SIB2, increments the passing count m1, and stores the incremented passing count m1 into the main memory MM ((f) of
Next, at a time T200, the core C1 completes execution of the parallel processing block PAB3, and reaches the entrance of the single processing block SIB3 ((g) of
Operations at times T0 and T10 are the same as operations at times T0 and T10 illustrated in
Next, at a time T20, the processing of the core C0 reaches the entrance of the single processing block SIB, and at a time T30, the processing of the core C2 reaches the entrance of the single processing block SIB ((b) and (c) of
Next, at a time T40, the processing of the core C1 reaches the entrance of the single processing block SIB, and the total passing count j is set at “1” ((d) and (e) of
Next, at a time T50 of
Thereafter, at times T60, T70, and T80, processing of core C0, C3, and C2 reach the entrance of the single processing block SIB1 sequentially ((b), (c), and (d) of
A statement “#pragma omp parallel” indicates that blocks enclosed in “{ }” are executed in parallel with each other. A statement “#pragma omp single” indicates that a block enclosed in “{ }” is executed by a single thread. A clause “(nowait)” indicates that a thread, which has completed a single processing block SIB directed by the statement “#pragma omp single”, shifts to a next processing regardless of the status of other threads. Operations illustrated in
On the other hand, the exit of a statement “#pragma omp single” without the clause “nowait” includes an implicit barrier that makes each thread wait before the next processing until processing of all threads is completed. Thus, when the statement “#pragma omp single (nowait)” illustrated in
A statement “#pragma omp parallel for” indicates that the “for” statement in the next line is executed in parallel. In a program illustrated in
A statement “#pragma omp sections” indicates that blocks enclosed in “{ }” are executed in parallel by allocating a thread to each statement “#pragma omp section”. The exit of a block specified by the statement “#pragma omp sections” includes an implicit barrier that defers the next processing until all the threads have completed their processing. Thus, when rewriting the statement “#pragma omp sections” into the statement “#pragma omp single (nowait)”, a statement “#pragma omp barrier” is added to the end of the block of the “for” statement. The statement “#pragma omp barrier” is used for synchronization.
As described above, even in this embodiment, the CPU includes a register REGU which holds reaching-state information including the unreached-thread count Xi and the total passing count j, similarly to the embodiment illustrated in
Further, a core C, whose processing has reached the entrance of the single processing block SIB in the last place, initializes areas corresponding to a register REGi to a state ready for holding a new unreached-thread count Xi. This enables execution of a single processing block SIB whose execution has been suspended, and also enables control of the propriety of executing a new single processing block SIB by using the initialized area. That is, the propriety of executing the single processing block SIB may be controlled by cyclically using multiple areas for storing the unreached-thread count Xi in the register REGi.
The passing count m managed for each of cores C0 to C3 is incremented every time the core passes through a single processing block SIB, and the total passing count j of cores C0 to C3 is incremented every time processing of one of cores C0 to C3 reaches the entrance of a single processing block SIB in the last place. Thus, execution or suspension of a single processing block SIB corresponding to the passing count m may be determined by comparing the difference between the passing count m and the total passing count j with the number I of areas for storing the unreached-thread count Xi in the register REGi.
A core C, whose processing has reached the entrance of the single processing block SIB in the second place or later, determines not to execute the single processing block SIB, and causes the processing to jump to the exit of the single processing block SIB to suppress execution of the single processing block SIB by multiple cores C.
Each of cores C, whose processing has reached the entrance of the single processing block SIB, fetches an instruction code TEST&IDA. The instruction code TEST&IDA is executed by a microprogram. Thus, a hardware function (architecture of instruction set) may be easily altered.
The CPU illustrated in
Processing executed by the CPU illustrated in
At a time T30, before the processing of cores C1 and C2 reaches the entrance of the single processing block SIB0, the core C3 completes the parallel processing block PAB1, and the processing thereof reaches the entrance of the single processing block SIB1 ((a) and (b) of
In the step S204 of
Next, at a time T40, the core C1 completes execution of the parallel processing block PAB0, the processing thereof reaches the entrance of the single processing block SIB0, and the unreached-thread count X0 of the register REGi is changed to “1” ((d) and (e) of
Next, at a time T51 of
The core C3, which has waited for the predetermined period of time in the step S206 illustrated in
Thereafter, at a time T52, since the unreached-thread count X0 is “0”, the core C3 determines that the storage area of the unreached-thread count X0 is empty, sets “3” at the unreached-thread count X0, and sets the flag register ZF at “1” ((e) and (f) of
Next, at a time T60, processing of the core C0 reaches the entrance of the single processing block SIB1, the unreached-thread count X0 is changed to “2”, and the flag register ZF is reset to “0” ((h) and (i) of
Next, at a time T70, before processing of cores C1 and C2 reaches the entrance of the single processing block SIB1, the core C0 completes the parallel processing block PAB2, and processing thereof reaches the entrance of the single processing block SIB2 ((k) and (l) of
Next, at a time T80, processing of the core C1 reaches the entrance of the single processing block SIB1, and processing of the core C3 reaches the entrance of the single processing block SIB2 ((m) and (n) of
Next, at a time T91 of
Next, at a time T92, the core C0 calculates a remainder i (=0) by dividing the passing count m0 (=2) by the number I (=1) indicating the number of the registers REGi, and determines to use a storage area of the unreached-thread count X0. Then, the core C0 sets “3” to the unreached-thread count X0, and sets the flag register ZF at “1” ((e) and (f) of
Thereafter, the core C3, which has executed the instruction code TEST&IDA when the passing count m3 is “2”, sets the flag register SF at “1” since “m3−j (=0)” is smaller than the number I (“1”) indicating the number of the registers REGi ((i) of
After executing the instruction code TEST&IDA, since the flag register SF is “1”, and the flag register ZF is “0”, the core C3 causes the processing to jump to the exit of the single processing block SIB1, increments the passing count m3, and stores the incremented passing count m3 into the main memory MM ((l) of
Then, at a time T100, cores C0 and C3 execute the parallel processing block PAB3, and cores C1 and C2 execute the parallel processing block PAB2.
In the embodiment illustrated in
The embodiments illustrated in
Each of cores C may execute multiple threads in parallel. In this case, when the core C0 illustrated in
Characteristics and advantages of the embodiments will be apparent from the above detailed description. It is intended that the appended claims cover the characteristics and advantages of the above embodiments within a scope not departing from the spirit and scope thereof. Furthermore, any modifications and variations will be readily conceivable to those of ordinary skill in the art. Therefore, it is not intended to limit the scope of the inventive embodiments to the foregoing, and appropriate modifications and equivalents within the scope disclosed in the embodiments may be covered.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Claims
1. An apparatus comprising:
- a plurality of processors configured to execute a task as a unit of processing by dividing the task into multiple threads including a single thread and a parallel thread, the single thread being a thread to be executed by only one of the plurality of processors whose respective pieces of processing have reached the thread, the parallel thread being a thread to be executed in parallel with another parallel thread by the plurality of processors; and
- a holder unit configured to hold information to be shared by the plurality of processors, wherein
- each of the plurality of processors is configured: to execute one of the multiple threads at a time; to cause the holder unit to hold reaching-state information indicating an extent to which the multiple threads executed by the plurality of processors have reached the single thread; and to determine whether to execute the single thread, based on the reaching-state information held in the holder unit.
2. The apparatus of claim 1, wherein
- the multiple threads include multiple single threads to be sequentially executed, the multiple single threads including a first single thread, and a second single thread to be executed after the first single thread;
- when a first processor in the plurality of processors detects that processing executed by the first processor reaches the second single thread before processing executed by one of the plurality of processors other than the first processor reaches the first single thread, and when there is no area left in the holder unit for holding the reaching-state information for the second single thread, the first processor determines to suspend execution of the second single thread.
3. The apparatus of claim 1, wherein
- the multiple threads include multiple single threads to be sequentially executed;
- the holder unit includes reached-processor count areas each holding, as the reaching-state information, a reached-processor count in association with each of the multiple single threads, the reached-processor count indicating a number of processors whose respective pieces of processing have reached the each single thread; and
- a first processor in the plurality of processors, whose processing has lastly reached the each single thread, initializes the reached-processor count area associated with the each single thread so as to enable the reached-processor count area to hold new reaching-state information.
4. The apparatus of claim 3, wherein
- the holder unit includes a total passing count area configured to hold, as the reaching-state information, a total passing count indicating a number of single threads that have been passed through by processing of all the plurality of processors; and
- when a difference between the total passing count and a passing count of a second processor in the plurality of processors is equal to or greater than a number of the reached-processor count areas, the second processor suspends execution of a first single thread associated with the second processor, the passing count indicating a number of single threads that have been passed through by processing of each of the plurality of processors.
5. The apparatus of claim 4, wherein
- a third processor that has lastly reached a single thread among the plurality of processors increments the total passing count.
6. The apparatus of claim 1, wherein
- a first processor that has determined not to execute the single thread causes processing thereof to jump to an exit of the single thread.
7. The apparatus of claim 1, wherein
- each of the plurality of processors includes:
- a decoding unit configured to decode an instruction code included in a program; and
- an operation unit configured to operate based on the instruction code decoded by the decoding unit, and
- operation of the each processor is implemented by the operation unit thereof that operates based on a first instruction code included in the program executed by the each processor, the first instruction code being fetched when processing of the each processor reaches the single thread.
8. A method of controlling an apparatus including a plurality of processors and a holder unit, the holder unit being configured to hold information shared by the plurality of processors, the plurality of processors being configured to execute a task as a unit of processing by dividing the task into multiple threads including a single thread and a parallel thread, the single thread being a thread to be executed by only one of the plurality of processors whose respective pieces of processing have reached the thread, the parallel thread being a thread to be executed in parallel with another parallel thread by the plurality of processors, the method comprising:
- executing, by each of the plurality of processors, one of the multiple threads at a time;
- causing, by the each processor, the holder unit to hold reaching-state information indicating an extent to which the multiple threads executed by the plurality of processors have reached the single thread; and
- determining, by the each processor, whether to execute the single thread, based on the reaching-state information held in the holder unit.
Type: Application
Filed: Jun 17, 2015
Publication Date: Feb 18, 2016
Patent Grant number: 9569273
Applicant: FUJITSU LIMITED (Kawasaki)
Inventor: Yoshihisa NAKASHIMA (Kawasaki)
Application Number: 14/741,790