BALANCING WORKLOAD IN A MULTIPROCESSOR SYSTEM RESPONSIVE TO PROGRAMMABLE ADJUSTMENTS IN A SYNCHRONIZATION INSTRUCTION
In a multiprocessor system with threads running in parallel, workload balancing is facilitated by recognizing a plurality of levels of sub-tasks of a memory synchronization instruction and selectively choosing, for at least one thread, to perform fewer than all levels of these sub-tasks in response to the memory synchronization instruction. This choice can affect which thread waits to synchronize. The programmer can cause a thread expected to be a bottleneck to wait less than other threads. Where one thread is a producer and another thread is a consumer, types of memory synchronization can be adapted to these roles.
The present application claims priority of U.S. provisional application Ser. No. 61/293,266, filed Jan. 8, 2010, which is incorporated herein by reference. The present application is filed concurrently with “GENERATION-BASED MEMORY SYNCHRONIZATION IN A MULTIPROCESSOR SYSTEM WITH WEAKLY CONSISTENT MEMORY ACCESSES” (24878).
Benefit is also claimed of the following, which are also incorporated by reference:
- U.S. Patent Application Ser. Nos. 61/261,269, filed Nov. 13, 2009 for “LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS”;
- 61/293,611, filed Jan. 8, 2010 for “A MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER”;
- 61/295,669, filed Jan. 15, 2010 for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOK UP AND PARTIAL CACHE LINE SPECULATION SUPPORT”,
- U.S. patent application Ser. No. 12/684,367, filed Jan. 8, 2010, for “USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY”;
- U.S. patent application Ser. No. 12/684,172, filed Jan. 8, 2010 for “HARDWARE SUPPORT FOR COLLECTING PERFORMANCE COUNTERS DIRECTLY TO MEMORY”;
- U.S. patent application Ser. No. 12/684,190, filed Jan. 8, 2010 for “HARDWARE ENABLED PERFORMANCE COUNTERS WITH SUPPORT FOR OPERATING SYSTEM CONTEXT SWITCHING”;
- U.S. patent application Ser. No. 12/684,496, filed Jan. 8, 2010 for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST RECONFIGURATION OF PERFORMANCE COUNTERS”;
- U.S. patent application Ser. No. 12/684,429, filed Jan. 8, 2010, for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST MULTIPLEXING OF PERFORMANCE COUNTERS”;
- U.S. patent application Ser. No. 12/697,799 filed Feb. 1, 2010, for “CONDITIONAL LOAD AND STORE IN A SHARED CACHE”;
- U.S. patent application Ser. No. 12/684,738, filed Jan. 8, 2010, for “DISTRIBUTED PERFORMANCE COUNTERS”;
- U.S. patent application Ser. No. 12/684,860, filed Jan. 8, 2010, for “PAUSE PROCESSOR HARDWARE THREAD ON PIN”;
- U.S. patent application Ser. No. 12/684,174, filed Jan. 8, 2010, for “PRECAST THERMAL INTERFACE ADHESIVE FOR EASY AND REPEATED, SEPARATION AND REMATING”;
- U.S. patent application Ser. No. 12/684,184, filed Jan. 8, 2010, for “ZONE ROUTING IN A TORUS NETWORK”;
- U.S. patent application Ser. No. 12/684,852, filed Jan. 8, 2010, for “PROCESSOR RESUME UNIT”;
- U.S. patent application Ser. No. 12/684,642, filed Jan. 8, 2010, for “TLB EXCLUSION RANGE”;
- U.S. patent application Ser. No. 12/684,804, filed Jan. 8, 2010, for “DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY”;
- U.S. patent application Ser. No. 61/293,237, filed Jan. 8, 2010, for “ORDERING OF GUARDED AND UNGUARDED STORES FOR NO-SYNC I/O”;
- U.S. patent application Ser. No. 12/693,972, filed Jan. 26, 2010, for “DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS”;
- U.S. patent application Ser. No. 12/688,747, filed Jan. 15, 2010, for “Support for non-locking parallel reception of packets belonging to the same reception FIFO”;
- U.S. patent application Ser. No. 12/688,773, filed Jan. 15, 2010, for “OPCODE COUNTING FOR PERFORMANCE MEASUREMENT”;
- U.S. patent application Ser. No. 12/684,776, filed Jan. 8, 2010, for “MULTI-INPUT AND BINARY REPRODUCIBLE, HIGH BANDWIDTH FLOATING POINT ADDER IN A COLLECTIVE NETWORK”;
- U.S. patent application Ser. No. 61/293,552, filed Jan. 8, 2010, for “LIST BASED PREFETCH”;
- U.S. patent application Ser. No. 12/684,693, filed Jan. 8, 2010, for “PROGRAMMABLE STREAM PREFETCH WITH RESOURCE OPTIMIZATION”;
- U.S. patent application Ser. No. 61/293,494, filed Jan. 8, 2010, for “NON-VOLATILE MEMORY FOR CHECKPOINT STORAGE”;
- U.S. patent application Ser. No. 61/293,476, filed Jan. 8, 2010, for “NETWORK SUPPORT FOR SYSTEM INITIATED CHECKPOINTS”;
- U.S. patent application Ser. No. 61/293,554, filed Jan. 8, 2010, for “TWO DIFFERENT PREFETCHING COMPLEMENTARY ENGINES OPERATING SIMULTANEOUSLY”;
- U.S. patent application Ser. No. 12/697,015 filed Jan. 29, 2010, for “DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK”;
- U.S. patent application Ser. No. 61/293,559, filed Jan. 8, 2010, for “IMPROVING RELIABILITY AND PERFORMANCE OF A SYSTEM-ON-A-CHIP BY PREDICTIVE WEAR-OUT BASED ACTIVATION OF FUNCTIONAL COMPONENTS”;
- U.S. patent application Ser. No. 61/293,569, filed Jan. 8, 2010, for “IMPROVING THE EFFICIENCY OF STATIC CORE TURNOFF IN A SYSTEM-ON-A-CHIP WITH VARIATION”;
- U.S. patent application Ser. No. 12/697,043 filed Jan. 29, 2010 for “IMPLEMENTING ASYNCHRONOUS COLLECTIVE OPERATIONS IN A MULTI-NODE PROCESSING SYSTEM”;
- U.S. patent application Ser. No. 12/697,175, filed Jan. 29, 2010, for “I/O ROUTING IN A MULTIDIMENSIONAL TORUS NETWORK”;
- U.S. patent application Ser. No. 12/684,287, filed Jan. 8, 2010 for “ARBITRATION IN CROSSBAR INTERCONNECT FOR LOW LATENCY”;
- U.S. patent application Ser. No. 12/684,630, filed Jan. 8, 2010 for “EAGER PROTOCOL ON A CACHE PIPELINE DATAFLOW”;
- U.S. patent application Ser. No. 12/723,277 filed Mar. 12, 2010 for “EMBEDDING GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK”;
- U.S. patent application Ser. No. 61/293,499, filed Jan. 8, 2010 for “GLOBAL SYNCHRONIZATION OF PARALLEL PROCESSORS USING CLOCK PULSE WIDTH MODULATION”;
- U.S. patent application Ser. No. 12/696,817 filed Jan. 29, 2010 for “HEAP/STACK GUARD PAGES USING A WAKEUP UNIT”; and
- U.S. patent application Ser. No. 61/293,603, filed Jan. 8, 2010 for “MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR”.
This invention was made with government support under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.
BACKGROUND
The invention relates to the field of synchronizing threads carrying out memory access requests in parallel in a multiprocessor system.
The PowerPC architecture is defined in IBM® Power ISA™ Version 2.06 Jan. 30, 2009, which is incorporated herein by reference. This document will be referred to as “PowerPC Architecture” or “PPC” herein. The PowerPC architecture defines three levels of synchronization:
- heavy-weight sync, also called hwsync, or msync,
- lwsync (lightweight sync) and
- eieio (also called mbar, memory barrier).
More about the msync instruction can be found in the article: Janice M. Stone, Robert P. Fitzgerald, “Storage in the PowerPC,” IEEE Micro, pp. 50-58, April 1995.
SUMMARY
It has been found that programmers tend to overuse the msync instruction, resulting in excessive delays in a multiprocessor system. It has also been found that workload balancing requires greater flexibility in determining which processes need to wait during synchronization. To balance workload, it is desirable to loosen the waiting requirements associated with synchronization requests.
Advantageously, a computer method for use in a multiprocessor system might include
- processing a plurality of software threads in parallel;
- responsive to a first thread, decoding a first memory synchronization instruction, the first instruction corresponding to a first synchronization level;
- responsive to the first synchronization level, implementing a first partial synchronization task;
- responsive to a second thread, decoding a second memory synchronization instruction, the second instruction corresponding to a second synchronization level different from, but compatible with, the first synchronization level;
- responsive to the second thread, implementing a second partial synchronization task responsive to the second synchronization level, the second partial synchronization task being complementary with the first partial synchronization task, so that the first and second synchronization tasks cooperate to achieve full synchronization.
Further advantageously, a multiprocessor system might include
- facilities adapted to run a plurality of threads in parallel;
- a central generation indication module adapted to associate generations with memory synchronization instructions; and
- facilities adapted to decode at least one memory synchronization instruction in at least one of the threads, in accordance with a memory synchronization protocol that implements a plurality of levels of memory synchronization each level having a respective distinct mode of operation responsive to the central generation indication module.
Still further advantageously, a method for use in a multiprocessor system might include
- responsive to a given thread running on the system, recognizing a memory synchronization instruction, the instruction implicating a plurality of memory synchronization sub-tasks;
- responsive to the instruction, invoking at least one memory synchronization facility in accordance with a synchronization scheme including a plurality of synchronization levels; and
- distributing the sub-tasks responsive to the levels so as to offload sub-tasks from, or allocate sub-tasks to, the given thread.
Objects and advantages will be apparent throughout.
Embodiments will now be described by way of non-limiting example with reference to the following figures.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The present document mentions a number of instruction and function names such as “msync,” “hwsync,” “lwsync,” “eieio,” “TLBsync,” “mbar,” “full sync,” “non-cumulative barrier,” “producer sync,” “generation change sync,” “producer generation change sync,” “consumer sync,” and “local barrier.” Some of these names come from the PowerPC architecture and others are new to this document, but all are nevertheless arbitrary and used for convenience of understanding. An instruction might equally well be given any other name as a matter of preference without altering the nature of the instruction, and without taking the instruction, its function, or the hardware supporting it outside the scope of the claims. Moreover, the claimed invention is not limited to a particular instruction set.
Generally, implementing an instruction will involve creating specific computer hardware that causes the instruction to run when computer code requests it. The field of Application Specific Integrated Circuits (“ASICs”) is a well-developed field that allows implementation of computer functions responsive to a formal specification. Accordingly, not all specific implementation details will be discussed here. Instead, the functions of instructions and units will be discussed.
As described herein, the use of the letter “B” typically represents a Byte quantity, e.g., 2B, 8.0B, 32B, and 64B represent Byte units. Recitations “GB” represent Gigabyte quantities. Throughout this disclosure a particular embodiment of a multi-processor system will be discussed. This embodiment includes various numerical values for numbers of components, bandwidths of interfaces, memory sizes and the like. These numerical values are not intended to be limiting, but only examples. One of ordinary skill in the art might devise other examples as a matter of design choice.
The term “thread” will be used herein. A thread can be a hardware thread, meaning processing circuitry within a processor. A thread can also be a software thread, meaning a segment of computer program code that is run in parallel with others, for instance on hardware threads.
General Features of a Multiprocessor System in which the Invention May be Implemented
The compute node 50 is a single chip (“nodechip”) based on low-power A2 PowerPC cores, though any compatible core might be used. While the commercial embodiment is built around the PowerPC architecture, the invention is not limited to that architecture. In the embodiment depicted, the node includes 17 cores 52, each core being 4-way hardware threaded. There is a shared L2 cache 70 accessible via a full crossbar switch 60, the L2 including 16 slices 72. There is further provided external memory 80, in communication with the L2 via DDR-3 controllers 78—DDR being an acronym for Double Data Rate.
A messaging unit (“MU”) 100 includes a direct memory access (“DMA”) engine 21, a network interface 22, and a Peripheral Component Interconnect Express (“PCIe”) unit 32. The MU is coupled to interprocessor links 90 and the I/O link 92.
Each FPU 53 associated with a core 52 has a data path to the L1-data cache 55. Each core 52 is directly connected to a supplementary processing agglomeration 58, which includes a private prefetch unit. For convenience, this agglomeration 58 will be referred to herein as “L1P”—meaning level 1 prefetch—or “prefetch unit;” but many additional functions are lumped together in this so-called prefetch unit, such as write combining. These additional functions could be illustrated as separate modules, but as a matter of drawing and nomenclature convenience the additional functions and the prefetch unit will be illustrated herein as being part of the agglomeration labeled “L1P.” This is a matter of drawing organization, not of substance. Some of the additional processing power of this L1P group is shown in
In this embodiment, the L2 Cache units provide the bulk of the memory system caching. Main memory may be accessed through two on-chip DDR-3 SDRAM memory controllers 78, each of which services eight L2 slices.
To reduce main memory accesses, the L2 advantageously serves as the point of coherence for all processors within a nodechip. This function includes generating L1 invalidations when necessary. Because the L2 cache is inclusive of the L1s, it can remember which processors could possibly have a valid copy of every line, and can multicast selective invalidations to such processors. In the current embodiment the prefetch units and data caches can be considered part of a memory access pathway.
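The inclusive-directory behavior described above—remembering which processors could hold a copy of a line and multicasting selective invalidations—can be sketched with a per-line sharer mask. This is a minimal illustrative model only; the names l2_line_t, l2_load, and l2_store are hypothetical and not taken from the actual hardware.

```c
#include <stdint.h>

#define NCORES 17   /* per the nodechip embodiment described in the text */

typedef struct {
    uint32_t sharer_mask;   /* bit i set => core i may hold the line in its L1 */
} l2_line_t;

/* Record that a core has brought the line into its L1. */
static void l2_load(l2_line_t *line, int core)
{
    line->sharer_mask |= (1u << core);
}

/* A store from `core`: return the set of other possible sharers that
 * must receive an L1 invalidation, and leave only the writer recorded. */
static uint32_t l2_store(l2_line_t *line, int core)
{
    uint32_t invalidate = line->sharer_mask & ~(1u << core);
    line->sharer_mask = 1u << core;   /* only the writer may cache it now */
    return invalidate;
}
```

Because the L2 is inclusive of the L1s, the mask can only over-approximate the true sharers, which is safe: an invalidation sent to a core that no longer holds the line is harmless.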
The units 301 and 302 have outputs relevant to memory synchronization, as will be discussed further below with reference to
The L2, as point of coherence, detects that the copy of the data resident in the L1D for thread β is invalid. Slice 1801 therefore queues an invalidation signal to the queue 1809 and then, via the crossbar switch, to the queue 1807 of core/L1 group 1805.
When α writes the flag, this again passes through queue 1806 to the crossbar switch 1803, but this time the write is hashed to the queue 1810 of a second slice 1802 of the L2. This flag is then stored in the slice and queued at 1811 to go through the crossbar 1803 to queue 1807 and then to the core/L1 group 1805. In parallel, thread β is repeatedly scanning the flag in its own L1D.
Traditionally, multiprocessor systems have used consistency models called “sequential consistency” or “strong consistency”; see, e.g., the article entitled “Sequential Consistency” in Wikipedia. Pursuant to this type of model, if unit 1804 first writes the data and then writes the flag, then a changed flag implies that the data has also changed. It is not possible for the flag to change before the data. The data change must be visible to the other threads before the flag changes. This sequential model has the disadvantage that threads are kept waiting, sometimes unnecessarily, slowing processing.
To speed processing, the PowerPC architecture uses a “weakly consistent” memory model. In that model, there is no guarantee as to which memory access request will first result in a change visible to all threads. It is possible that β will see the flag change and still not have received the invalidation message from slice 1801, so β may still have old data in its L1D.
To prevent this unfortunate result, the PowerPC programmer can insert msync instructions 1708 and 1709 as shown in
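The producer/consumer pattern described above can be sketched in portable C. As a stand-in for the PowerPC msync instruction (which would appear at positions 1708 and 1709), this sketch uses the GCC full-barrier builtin __sync_synchronize(); on PowerPC the compiler or programmer would emit the actual sync instruction instead. All function names here are illustrative.

```c
#include <pthread.h>

static int data = 0;               /* the payload written by the producer */
static volatile int flag = 0;      /* the guard ("ready") location */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                 /* 1. write the data */
    __sync_synchronize();      /* 2. stands in for msync 1708 */
    flag = 1;                  /* 3. only then publish the flag */
    return 0;
}

static void *consumer(void *out)
{
    while (!flag)              /* 1. spin until the flag changes */
        ;
    __sync_synchronize();      /* 2. stands in for msync 1709 */
    *(int *)out = data;        /* 3. the data write is now visible */
    return 0;
}
```

Without the two barriers, a weakly consistent machine would be free to let the consumer observe flag == 1 while still holding a stale copy of data in its L1D, exactly the hazard the text describes.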
In accordance with the embodiment disclosed herein, to support concurrent memory synchronization instructions, requests are tagged with a global “generation” number. The generation number is provided by a central generation counter. A core executing a memory synchronization instruction requests that the central unit increment the generation counter and then waits until all memory operations of the previously current generation and all earlier generations have completed.
A core's memory synchronization request is complete when all requests that were in flight when the request began have completed. In order to determine this, the L1P monitors a reclaim pointer that will be discussed further below. Once it sees the reclaim pointer moving past the generation that was active at the point of the start of the memory synchronization request, then the memory synchronization request is complete.
A number of units within the nodechip queue memory access requests. These include:
- the L1P;
- the L2;
- the DMA engine; and
- the PCIe unit.
Every such unit can contain some aspect of a memory access request in flight that might be impacted by a memory synchronization request.
The global OR tree 502 per
Because the memory subsystem has paths—especially the crossbar—through which requests pass without contributing to the global OR reduce tree of
Memory access requests tagged with a generation number may be of many types, including:
- A store request, including compound operations and “atomic” operations such as store-add requests;
- A load request, including compound and “atomic” operations such as load-and-increment requests;
- An L1 data cache (“L1D”) invalidate request created in response to any request above;
- An Instruction Cache Block Invalidate instruction from a core 52 (“ICBI”, a PowerPC instruction);
- An L1 instruction cache (“L1I”) invalidate request created in response to an ICBI request;
- A Data Cache Block Invalidate instruction from a core 52 (“DCBI”, a PowerPC instruction); and
- An L1I cache invalidate request created in response to a DCBI request.
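The tagging of such requests can be sketched as a simple data structure; the type names and fields here are illustrative, not taken from the actual hardware.

```c
#include <stdint.h>

/* Hypothetical request descriptor: every memory access request in
 * flight carries the 3-bit generation current when it was issued. */
typedef enum {
    REQ_STORE,      /* includes store-add and other "atomic" stores */
    REQ_LOAD,       /* includes load-and-increment */
    REQ_L1D_INVAL,  /* L1D invalidate generated by a store */
    REQ_ICBI,       /* Instruction Cache Block Invalidate */
    REQ_L1I_INVAL,  /* L1I invalidate generated by an ICBI or DCBI */
    REQ_DCBI        /* Data Cache Block Invalidate */
} req_type_t;

typedef struct {
    req_type_t type;
    uint64_t   addr;
    uint8_t    generation;  /* tag from the central generation broadcast */
} mem_req_t;

/* Tag a new request with the broadcast current generation (3 bits). */
static mem_req_t tag_request(req_type_t type, uint64_t addr, uint8_t gen_cnt)
{
    mem_req_t r = { type, addr, (uint8_t)(gen_cnt & 7u) };
    return r;
}
```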
The memory synchronization unit 905 shown in
- A 3-bit counter 601 that defines the current generation for memory accesses;
- A 3-bit reclaim pointer 602 that points to the oldest generation in flight;
- Privileged DCR access 603 to all registers defining the current status of the generation counter unit. The DCR bus is a maintenance bus that allows the cores to monitor status of other units. In the current embodiment, the cores do not access the broadcast bus 604. Instead they monitor the counter 601 and the pointer 602 via the DCR bus;
- A broadcast interface 604 that provides the value of the current generation counter and the reclaim pointer to all memory request generating units. This allows threads to tag all memory accesses with a current generation, whether or not a memory synchronization instruction appears in the code of that thread;
- A request interface 605 for all synchronization operation requesting units;
- A track and control unit 606, for controlling increments to 601 and 602.
In the current embodiment, the generation counter is used to determine whether a requested generation change is complete, while the reclaim pointer is used to infer what generation has completed.
The module 905 of
For a synchronization operation, a unit can request an increment of the current generation and wait for previous generations to complete.
The central generation counter uses a single counter 601 to determine the next generation. As this counter is narrow, for instance 3 bits wide, it wraps frequently, causing the reuse of generation numbers. To prevent using a number that is still in flight, there is a second, reclaiming counter 602 of identical width that points to the oldest generation in flight. This counter is controlled by a track and control unit 606 implemented within the memory synchronization unit. Signals from the msync interface unit, discussed with reference to
The generation counter can only advance if doing so would not cause it to point to the same generation as the reclaim pointer in the next cycle. If the generation counter is stalled by this condition, it can still receive incoming memory synchronization requests from other cores and process them all at once by broadcasting the identical grant to all of them, causing them all to wait for the same generations to clear. For instance, all requests for generation change from the hardware threads can be OR'd together to create a single generation change request.
The generation counter (gen_cnt) 601 and the reclaim pointer (rcl_ptr) 602 both start at zero after reset. When a unit requests to advance to a new generation, it indicates the desired generation. No explicit acknowledgment is sent back to the requestor; the requestor unit determines whether its request has been processed based on the global current generation 601, 602. As the requested generation can be at most gen_cnt+1, requests for any other generation are assumed to have already been completed.
If the requested generation is equal to gen_cnt+1 and equal to rcl_ptr, the increment is requested but deferred because the next generation value is still in use. The gen_cnt will be incremented as soon as the rcl_ptr increments.
If the requested generation is not equal to gen_cnt+1, it is assumed completed and is ignored.
If the requested generation is equal to gen_cnt+1 and not equal to rcl_ptr, gen_cnt is incremented; but gen_cnt is incremented at most every 2 cycles, allowing units tracking the broadcast to see increments even in the presence of single-cycle upset events.
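The three rules above can be sketched as follows. This is a simplified model—the at-most-every-2-cycles rate limit is omitted—and the names are illustrative, not taken from the hardware.

```c
#include <stdint.h>

#define GEN_MASK 7u   /* 3-bit counters wrap modulo 8 */

typedef struct {
    uint8_t gen_cnt;  /* current generation (601) */
    uint8_t rcl_ptr;  /* oldest generation in flight (602) */
} gen_unit_t;

/* Handle a request to advance to generation `req`.  Returns 1 if
 * gen_cnt was incremented, 0 if the request was ignored (assumed
 * already completed) or stalled because the next value is still in
 * use by the reclaim pointer. */
static int request_generation(gen_unit_t *u, uint8_t req)
{
    uint8_t next = (u->gen_cnt + 1) & GEN_MASK;
    if (req != next)
        return 0;              /* any other generation: assumed complete */
    if (next == u->rcl_ptr)
        return 0;              /* next generation still in use: stall */
    u->gen_cnt = next;
    return 1;
}
```

A stalled request simply retries once the reclaim pointer advances, which matches the text: gen_cnt is incremented as soon as rcl_ptr increments.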
The reclaim pointer 602 increments when all of the following conditions are met:
- Per 804, it is not identical to the generation counter;
- Per 801, the gen_cnt has pointed to its current location for at least n cycles. The variable n is defined by the generation counter broadcast and OR-reduction turn-around latency, plus 2 cycles to remove the influence of transient errors on this path; and
- Per 803, the OR reduce tree has indicated for at least 2 cycles that no memory access requests are in flight for the generation rcl_ptr points to. In other words, in the present embodiment, the incrementation of the reclaim pointer is an indication to the other units that the requested generation has completed. Normally, this is a requirement for a “full sync” as described below and also a requirement for the PPC msync.
The PowerPC architecture defines three levels of synchronization:
heavy-weight sync, also called hwsync, or msync,
lwsync (lightweight sync) and
eieio (also called mbar, memory barrier).
Generally, it has been found that programmers overuse the heavyweight sync in their zeal to prevent memory inconsistencies. This results in unnecessary slowing of processing. For instance, if a program contains one data producer and many data consumers, the producer is the bottleneck. Having the producer wait to synchronize aggravates this. Analogously, if a program contains many producers and only one consumer, then the consumer can be the bottleneck, and forcing it to wait should be avoided where possible.
In implementing memory synchronization, it has been found advantageous to offer several levels of synchronization programmable by memory mapped I/O. These levels can be chosen by the programmer in accordance with anticipated work distribution. Generally, these levels will be most commonly used by the operating system to distribute workload. It will be up to the programmer choosing the level of synchronization to verify that different threads using the same data have compatible synchronization levels.
Seven levels or “flavors” of synchronization operations are discussed herein. These flavors can be implemented as alternatives to the msync/hwsync, lwsync, and mbar/eieio instructions of the PowerPC architecture. In this case, program instances of these categories of PowerPC instruction can all be mapped to the strongest sync, the msync, with the alternative levels then being available by memory-mapped I/O. The scope of restrictions imposed by these different flavors is illustrated conceptually in the Venn diagram of
The seven flavors disclosed herein are:
Full Sync 1711
The full sync provides sufficient synchronization to satisfy the requirements of all PowerPC msync/hwsync, lwsync, and mbar/eieio instructions. It causes the generation counter to be incremented regardless of the generation of the requestor's last access. The requestor waits until all requests complete that were issued before its generation increment request. This sync has sufficient strength to implement the PowerPC synchronizing instructions.
Non-Cumulative Barrier 1712
This sync ensures that the generation of the last access of the requestor has completed before the requestor can proceed. This sync is not strong enough to provide cumulative ordering as required by the PowerPC synchronizing instructions. The last load issued by this processor may have received a value written by a store request of another core from the subsequent generation. Thus this sync does not guarantee that the value it saw prior to the store is visible to all cores after this sync operation. More about the distinction between non-cumulative barrier and full sync is illustrated by
- a) the store 1626 to precede the load 1623 per arrow 1629;
- b) the store 1625 to precede the load 1627 per arrow 1630, and
- c) the store 1626 to precede the load 1628 per arrow 1631.
The full sync, which corresponds to the PowerPC msync instruction, will guarantee the correct order of all three arrows 1629, 1630, and 1631. The non-cumulative barrier will only guarantee the correctness of arrows 1629 and 1630. If, on the other hand, the program does not require the order shown by arrow 1631, then the non-cumulative barrier will speed processing without compromising data integrity.
Producer Sync 1713
This sync ensures that the generation of the requestor's last store access before the sync instruction has completed before the requestor can proceed. This sync is sufficient to separate the data location updates from the guard location update for the producer in a producer/consumer queue. This type of sync is useful where the consumer is the bottleneck and where there are instructions that can be carried out between the memory access and the msync that do not require synchronization. It is also not strong enough to provide cumulative ordering as required by the PowerPC synchronizing instructions.
Generation Change Sync 1714
This sync ensures only that the requests following the sync are in a different generation than the last request issued by the requestor. This type of sync is normally requested by the consumer and puts the burden of synchronization on the producer. This guarantees that loads and stores are completed. It might be particularly useful in the case of atomic operations as defined in co-pending application 61/299,911 filed Jan. 29, 2010, which is incorporated herein by reference, and where it is desired to verify that all data is consumed.
Producer Generation Change Sync 1715
This sync is designed to slow the producer the least. It ensures only that the requests following the sync are in a different generation from the last store request issued by the requestor. This can be used to separate the data location updates from the guard location update for the producer in a producer/consumer queue. However, the consumer has to ensure that the data location updates have completed after it sees the guard location change. This type does not require the producer to wait until all the invalidations are finished. The term "guard location" here refers to the type of data shown in the flag of
Consumer Sync 1716
This request is run by the consumer thread. This sync ensures that all requests belonging to the current generation minus one have completed before the requestor can proceed. This sync can be used by the consumer in conjunction with a producer generation change sync by the producer in a producer/consumer queue.
Local Barrier 1717
This sync is local to a core/L1 group and ensures only that all its preceding memory accesses have been sent to the switch.
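The synchronization levels just described can be summarized as an enumeration. The names and numeric encodings below are illustrative only; the text does not specify the actual hardware encoding.

```c
/* Illustrative catalog of the synchronization levels described above.
 * Reference numerals follow the text; encodings are not the hardware's. */
enum sync_level {
    SYNC_FULL = 0,              /* 1711: strongest; implements msync/hwsync */
    SYNC_NONCUM_BARRIER,        /* 1712: last access complete, not cumulative */
    SYNC_PRODUCER,              /* 1713: last store complete */
    SYNC_GEN_CHANGE,            /* 1714: new generation after last access */
    SYNC_PRODUCER_GEN_CHANGE,   /* 1715: new generation after last store */
    SYNC_CONSUMER,              /* 1716: current generation minus one complete */
    SYNC_LOCAL_BARRIER          /* 1717: local to the core/L1 group */
};

/* Only the full sync provides the cumulative ordering required by the
 * PowerPC synchronizing instructions. */
static inline int sync_is_cumulative(enum sync_level l)
{
    return l == SYNC_FULL;
}
```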
At 1105 thread β—the consumer—tests whether the ready flag is set. At 1106, thread β also tests, in accordance with a consumer sync, whether the reclaim pointer has reached the generation of the current synchronization request. When both conditions are met at 1107, then thread β can use the data at 1108.
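The consumer-side test at steps 1105 through 1107 amounts to a conjunction of two conditions. The sketch below models it with monotonic counters for clarity; the hardware uses 3-bit wrapping counters, and the function name is an illustration, not an interface defined in the text.

```c
#include <stdbool.h>

/* Thread β may use the data (step 1108) only when the ready flag is
 * set (step 1105) AND the reclaim pointer has reached the generation
 * of its synchronization request (step 1106). */
static bool consumer_may_proceed(bool ready_flag,
                                 unsigned rcl_ptr,
                                 unsigned request_gen)
{
    return ready_flag && rcl_ptr >= request_gen;
}
```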
In addition to the standard addressing and data functions 454, 455, when the L1P 58—shown in FIG. 14—sees any of these synchronization requests at the interface from the core 52, it immediately stops write combining—responsive to the decode function 457 and the control unit 452—for all currently open write combining buffers 450 and enqueues the request in its request queue 451. During the lookup phase of the request, synchronizing requests will advantageously request an increment of the generation counter and wait until the last generation completes, executing a Full Sync. The L1P will then resume the lookup and notify the core 52 of its completion.
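The L1P behavior just described, closing all open write-combining buffers and enqueuing the synchronization request behind the resulting writes, can be sketched as a toy model. The structure fields, queue encoding, and sizes below are illustrative assumptions, not the actual hardware interface.

```c
#define MAX_WC 4   /* assumed number of write-combining buffers 450 */
#define QLEN  16   /* assumed depth of request queue 451 */

enum { REQ_WRITE = 1, REQ_SYNC = 2 };

struct l1p_model {
    int wc_open[MAX_WC];  /* 1 = buffer currently combining writes */
    int queue[QLEN];      /* request queue 451 (tags only) */
    int qlen;
};

/* On seeing a synchronization request from the core, the L1P stops
 * write combining for all open buffers (flushing each as a write) and
 * then enqueues the sync request itself. */
static void l1p_sync_request(struct l1p_model *p)
{
    for (int i = 0; i < MAX_WC; i++) {
        if (p->wc_open[i]) {
            p->queue[p->qlen++] = REQ_WRITE;  /* flush open buffer */
            p->wc_open[i] = 0;                /* combining stopped */
        }
    }
    p->queue[p->qlen++] = REQ_SYNC;           /* sync ordered last */
}
```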
To invoke the synchronizing behavior of synchronization types other than full sync, at least two implementation options are possible:
1. synchronization caused by load and store operations to predefined addresses
-
- Synchronization levels are controlled by memory-mapped I/O accesses. Because store operations can bypass load operations, synchronization operations that require preceding loads to have completed are implemented as load operations to memory-mapped I/O space, followed by a conditional branch that depends on the load return value. Simple use of the load return value may be sufficient. If the sync does not depend on the completion of preceding loads, it can be implemented as a store to memory-mapped I/O space. Some implementation issues of one embodiment are as follows. A write access to this location is mapped to a sync request which is sent to the memory synchronization unit. The write request stalls the further processing of requests until the sync completes. A load request to the location causes the same type of request, but only the full and the consumer requests stall. All other load requests return the completion status as a value: a 0 for sync not yet complete, a 1 for sync complete. This implementation does not take advantage of all of the built-in PowerPC constraints of a core implementing the PowerPC architecture. Accordingly, more programmer attention to the order of memory access requests is needed.
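The load-and-branch polling of option 1 can be sketched as follows. Since the memory-mapped address is not specified in the text, the "register" is simulated here by a counter that returns 0 for a few reads before the sync completes; a real implementation would issue loads to a hardware-defined I/O address.

```c
/* Simulated memory-mapped sync status register: returns 0 (sync not
 * yet complete) for a fixed number of reads, then 1 (sync complete). */
struct mmio_sync_reg {
    int polls_until_done;
};

static int mmio_read(struct mmio_sync_reg *r)
{
    if (r->polls_until_done > 0) {
        r->polls_until_done--;
        return 0;   /* sync not yet complete */
    }
    return 1;       /* sync complete */
}

/* Issue the sync as a load, then branch back until it completes,
 * mirroring the load + conditional-branch sequence described above.
 * Returns the number of retries, for illustration. */
static int wait_for_sync(struct mmio_sync_reg *r)
{
    int polls = 0;
    while (mmio_read(r) == 0)   /* conditional branch on load return */
        polls++;
    return polls;
}
```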
2. configuring the semantics of the next synchronization instruction, e.g. the PowerPC msync, via a store to a memory-mapped configuration register - In this implementation, before every memory synchronization instruction, a store is executed that deposits a value selecting a synchronization behavior into a memory-mapped register. The next executed memory synchronization instruction invokes the selected behavior and restores the configuration back to the Full Sync behavior. This reactivation of the strongest synchronization type guarantees correct execution if applications or subroutines that do not program the configuration register are executed.
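The configure-then-sync semantics of option 2 can be modeled in a few lines. The encodings below are illustrative; the essential behavior, taken from the text, is that the msync consumes the selection and the configuration reverts to the full sync.

```c
/* Illustrative level encodings; 0 is the default (strongest). */
enum { CFG_FULL_SYNC = 0, CFG_PRODUCER_GEN_CHANGE = 5 };

struct msync_cfg {
    int next_level;   /* memory-mapped configuration register (model) */
};

/* Store to the configuration register, selecting the behavior of the
 * next memory synchronization instruction. */
static void cfg_store(struct msync_cfg *c, int level)
{
    c->next_level = level;
}

/* The next msync invokes the selected behavior and restores the
 * configuration to Full Sync, so unaware code still runs correctly. */
static int do_msync(struct msync_cfg *c)
{
    int level = c->next_level;
    c->next_level = CFG_FULL_SYNC;
    return level;
}
```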
A control unit 906 tracks:
-
- the configuration for a current memory synchronization instruction issued by a core 52,
- when the currently operating memory synchronization instruction started,
- whether data has been sent to the central unit, and
- whether a generation change has been received.
The register storing the configuration will sometimes be referred to herein as the "configuration register." This control unit 906 notifies the core 52 via 908 when the msync is completed. The core issuing the msync drains all loads and stores, stops taking loads and stores, and stalls the issuing thread until the msync completion indication is received.
This control unit also exchanges information with the global generation counter module 905. This information includes a generation count. In the present embodiment, there is only one input per L1P to the generation counter, so the L1P aggregates requests for increment from all hardware threads of the processor 52. Also, in the present embodiment, the OR reduce tree is coupled to the reclaim pointer, so the memory synchronization interface unit gets information from the OR reduce tree indirectly via the reclaim pointer.
The control unit also tracks the changes of the global generation (gen_cnt) and determines whether a request of a client has completed. Generation completion is detected by using the reclaim pointer that is fed to observer latches in the L1P. The core waits for the L1P to handle the msyncs. Each hardware thread may be waiting for a different generation to complete; therefore each one records the generation of its current memory synchronization instruction and waits individually for its respective generation to complete.
For each client 901, the unit implements a group 903 of three generation completion detectors shown at 1001, 1002, 1003, per
For each store request generated by a client, the first 1001 of the three detectors sets its ginfl_flag 1005 and updates the last_gen latch 1004 with the current generation. This detector is updated for every store, and therefore reflects whether the last store has completed or not. This is sufficient, since prior stores will have generations less than or equal to the generation of the current store. Also, since the core is waiting for memory synchronization, it will not be making more stores until the completion indication is received.
For each memory access request, regardless whether load or store, the second detector 1002 is set correspondingly. This detector is updated for every load and every store, and therefore its flag indicates whether the last memory access request has completed.
If a client requests a full sync, the third detector 1003 is primed with the current generation, and for a consumer sync the third detector is primed with the current generation-1. Again, this detector is updated for every full or consumer sync.
Since the reclaim pointer cannot advance until everything in a generation has completed, and since the reclaim pointer cannot pass the generation counter, the reclaim pointer is an indication of whether a generation has completed. If the rcl_ptr 602 moves past the generation stored in last_gen, no requests for that generation are in flight anymore and the ginfl_flag is cleared.
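The three detectors and the reclaim-pointer clearing rule described above can be sketched as follows. The 3-bit wrap (every 8 increments) follows the text; the structure layout and the single-step advance model of the reclaim pointer are simplifying assumptions.

```c
/* One generation completion detector (cf. 1001-1003). */
struct detector {
    unsigned last_gen;   /* 3-bit generation of the tracked request */
    int ginfl_flag;      /* 1 = that generation may still be in flight */
};

struct thread_dets {
    struct detector store_det;   /* 1001: last store */
    struct detector access_det;  /* 1002: last load or store */
    struct detector sync_det;    /* 1003: primed by full/consumer sync */
};

static void on_store(struct thread_dets *d, unsigned gen_cnt)
{
    d->store_det.last_gen  = gen_cnt & 7u;  d->store_det.ginfl_flag  = 1;
    d->access_det.last_gen = gen_cnt & 7u;  d->access_det.ginfl_flag = 1;
}

static void on_load(struct thread_dets *d, unsigned gen_cnt)
{
    d->access_det.last_gen = gen_cnt & 7u;  d->access_det.ginfl_flag = 1;
}

static void on_full_sync(struct thread_dets *d, unsigned gen_cnt)
{
    d->sync_det.last_gen = gen_cnt & 7u;        d->sync_det.ginfl_flag = 1;
}

static void on_consumer_sync(struct thread_dets *d, unsigned gen_cnt)
{
    d->sync_det.last_gen = (gen_cnt - 1) & 7u;  d->sync_det.ginfl_flag = 1;
}

/* When the reclaim pointer advances one step past a detector's
 * generation, no request of that generation remains in flight. */
static void on_reclaim_advance(struct detector *det, unsigned new_rcl_ptr)
{
    if (((new_rcl_ptr - 1) & 7u) == det->last_gen)
        det->ginfl_flag = 0;
}
```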
Full Sync
This sync completes if the ginfl_flag 1009 of the third detector 1003 is cleared. Until completion, it requests a generation change to the value stored in the third detector plus one.
Non-Cumulative Barrier
This sync completes if the ginfl_flag 1007 of the second detector 1002 is cleared. Until completion, it requests a generation change to the value held in the second detector plus one.
Producer Sync
This sync completes if the ginfl_flag 1005 of the first detector 1001 is cleared. Until completion, it requests a generation change to the value held in the first detector plus one.
Generation Change Sync
This sync completes if either the ginfl_flag 1007 of the second detector 1002 is cleared or if the last_gen 1006 of the second detector is different from gen_cnt 601. If it does not complete immediately, it requests a generation change to the value stored in the second detector plus one. The purpose of the operation is to advance the current generation (the value of gen_cnt) to at least one higher than the generation of the last load or store, which is stored in the last_gen register of the second detector.
-
- 1) If the current generation equals the one of the last load/store, the current generation is advanced (exception is 3) below).
- 2) If the current generation is not equal to the one of the last load/store, it must have incremented at least once since the last load/store and that is sufficient;
- 3) There is a case where the generation counter has wrapped and now points again at the generation value of the last load/store. This case is distinguished from 1) by the cleared ginfl_flag (once the counter has wrapped, the original generation is no longer in flight). In this case, we are done as well, since the counter has incremented at least 8 times since the last load/store (it wraps every 8 increments).
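The three cases above reduce to a simple completion predicate. The sketch below models it with a minimal detector structure; field names follow the text, and the 3-bit wrap follows the stated 8-increment period.

```c
struct det {
    unsigned last_gen;   /* generation of the last load/store (3-bit) */
    int ginfl_flag;      /* 1 = that generation may still be in flight */
};

/* Generation change sync completion: complete if the flag is cleared
 * (covers case 3, the wrap) or if the stored generation differs from
 * gen_cnt (case 2).  Otherwise (case 1) a generation change must be
 * requested. */
static int gen_change_sync_complete(const struct det *d, unsigned gen_cnt)
{
    return !d->ginfl_flag || (d->last_gen != (gen_cnt & 7u));
}

/* The generation requested until completion: last_gen plus one,
 * wrapping every 8 increments. */
static unsigned requested_generation(const struct det *d)
{
    return (d->last_gen + 1) & 7u;
}
```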
Producer Generation Change Sync
This sync completes if either the ginfl_flag 1005 of the first detector 1001 is cleared or if the last_gen 1004 of the first detector is different from gen_cnt 601. If it does not complete immediately, it requests a generation change to the value stored in the first detector plus one. This operates similarly to the generation change sync except that it uses the generation of the last store, rather than that of the last load/store.
Consumer Sync
This sync completes if the ginfl_flag 1009 of the third detector 1003 is cleared. Until completion, it requests a generation change to the value stored in the third detector plus one.
Local Barrier
This sync is executed by the L1P; it does not involve generation tracking.
From the above discussion, it can be seen that a memory synchronization instruction actually implicates a set of sub-tasks. For a comprehensive memory synchronization scheme, those sub-tasks might include one or more of the following:
-
- Requesting a generation change between memory access requests;
- Checking a given one of a group of possible generation indications in accordance with a desired level of synchronization strength;
- Waiting for a change in the given one before allowing a next memory access request; and
- Waiting for some other event.
In implementing the various levels of synchronization herein, sub-sets of this set of sub-tasks can be viewed as partial synchronization tasks to be allocated between threads in an effort to improve throughput of the system. Therefore address formats of instructions specifying a synchronization level effectively act as parameters to offload sub-tasks from or to the thread containing the synchronization instruction. If a particular sub-task implicated by the memory synchronization instruction is not performed by the thread containing the memory synchronization instruction, then the implication is that some other thread will pick up that part of the memory synchronization function. While particular levels of synchronization are specified herein, the general concept of distributing synchronization sub-tasks between threads is not limited to any particular instruction type or set of levels.
Physical Design
The Global OR tree needs attention to layout and pipelining, as its latency affects the performance of the sync operations.
In the current embodiment, the cycle time is 1.25 ns. In that time, a signal will travel 2 mm through a wire. Where a wire is longer than 2 mm, the delay will exceed one clock cycle, potentially causing unpredictable behavior in the transmission of signals. To prevent this, a latch should be placed at each position on each wire that corresponds to 1.25 ns of delay, in other words approximately every 2 mm. This means that a transmission delay of 4 ns will be increased to 5 ns, but the circuit behavior will be more predictable. In the case of the msync unit, some of the wires are expected to be on the order of 10 mm, meaning that they should have on the order of five latches.
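The arithmetic above can be checked with a small sketch. The constants come directly from the text (1.25 ns cycle, roughly 2 mm of wire per cycle); the rounding model, one latch per full 2 mm segment and delays rounded up to whole cycles, is an illustrative assumption.

```c
#define CYCLE_NS     1.25   /* cycle time from the text */
#define MM_PER_CYCLE 2.0    /* wire distance covered in one cycle */

/* One latch per full 2 mm segment: a 10 mm wire needs about 5. */
static int latches_for_wire(double wire_mm)
{
    return (int)(wire_mm / MM_PER_CYCLE);
}

/* Latching quantizes a path's delay to a whole number of cycles:
 * a raw 4 ns delay (3.2 cycles) becomes 4 cycles, i.e. 5 ns. */
static double latched_delay_ns(double raw_delay_ns)
{
    int cycles = (int)(raw_delay_ns / CYCLE_NS);
    if (cycles * CYCLE_NS < raw_delay_ns)
        cycles++;   /* round up to the next cycle boundary */
    return cycles * CYCLE_NS;
}
```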
Due to quantum mechanical effects, it is advisable to protect latches holding generation information with Error Correcting Codes (“ECC”) (4b per 3b counter data). All operations may include ECC correction and ECC regeneration logic.
The global broadcast and generation change interfaces may be protected by parity. In the case of a single cycle upset, the request or counter value transmitted is ignored, which does not affect correctness of the logic.
Software Interface
The Msync unit will implement the ordering semantics of the PPC hwsync, lwsync and mbar instructions by mapping these operations to the full sync.
Although the embodiments of the present invention have been described in detail, it should be understood that various changes and substitutions can be made therein without departing from spirit and scope of the inventions as defined by the appended claims. Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The word “comprising”, “comprise”, or “comprises” as used herein should not be viewed as excluding additional elements. The singular article “a” or “an” as used herein should not be viewed as excluding a plurality of elements. Unless the word “or” is expressly limited to mean only a single item exclusive from other items in reference to a list of at least two items, then the use of “or” in such a list is to be interpreted as including (a) any single item in the list, (b) all of the items in the list, or (c) any combination of the items in the list. Ordinal terms in the claims, such as “first” and “second” are used for distinguishing elements and do not necessarily imply order of operation.
Claims
1. A computer method comprising carrying out operations in a multiprocessor system, the operations comprising:
- processing a plurality of software threads in parallel;
- responsive to a first thread, decoding a first memory synchronization instruction, the first instruction corresponding to a first synchronization level;
- responsive to the first synchronization level, implementing a first partial synchronization task;
- responsive to a second thread, decoding a second memory synchronization instruction, the second instruction corresponding to a second synchronization level different from, but compatible with, the first synchronization level;
- responsive to the second thread, implementing a second partial synchronization task responsive to the second synchronization level, the second partial synchronization task being complementary with the first partial synchronization task, so that the first and second synchronization tasks cooperate to achieve full synchronization.
2. The method of claim 1, wherein the first and second levels are chosen to reduce waiting time for one of the first and second threads, while increasing wait time for the other of the first and second threads.
3. The method of claim 2, wherein
- the first thread writes data;
- the second thread reads the data;
- the first synchronization level causes a generation change within the system related to the write; and
- the second synchronization level causes the second thread to wait for the generation change to complete.
4. The method of claim 3, wherein the central generation indication is derived responsive to a generation counter.
5. The method of claim 3, wherein the central generation indication is derived responsive to a reclaim pointer.
6. The method of claim 1, wherein the first and second instructions are memory synchronization instructions in accordance with a given instruction architecture modified by parameters conveyed in accordance with memory-mapped i/o.
7. The method of claim 1, wherein the first thread is a producer thread and the second thread is a consumer thread and the method comprises:
- in the producer thread, requesting to write data;
- in the producer thread, requesting a generation increment;
- in the producer thread, waiting for the generation increment, without any other data ready indication;
- in the producer thread, setting a data ready flag;
- in the consumer thread, waiting for the data ready flag;
- in the consumer thread, waiting for a value in a reclaim pointer to reach a desired generation; and
- in the consumer thread, using the data responsive to the data ready flag and the reclaim pointer value.
8. A multiprocessor system comprising:
- facilities adapted to run a plurality of threads in parallel;
- a central generation indication module adapted to associate generations with memory synchronization instructions; and
- facilities adapted to decode at least one memory synchronization instruction in at least one of the threads, in accordance with a memory synchronization protocol that implements a plurality of levels of memory synchronization each level having a respective distinct mode of operation responsive to the central generation indication module.
9. The system of claim 8, wherein each level is invoked responsive to respective parameters communicable in association with the memory synchronization instruction.
10. The system of claim 8, comprising a plurality of generation detectors, each adapted to detect a generation associated with a respective type of instruction, such that each level of memory synchronization instruction is associated with a respective distinctive use of the generation detectors.
11. The system of claim 10, wherein each thread is associated with three generation detectors,
- a first detector detecting a generation of a last store;
- a second detector detecting a generation of a last load or store; and
- a third detector detecting a generation of a last memory synchronization instruction.
12. The system of claim 11, wherein each detector has an associated flag indicating whether a respective generation of an instruction detected by that detector has completed.
13. The system of claim 11, wherein
- the third detector is primed with the current generation;
- the memory synchronization instruction is a full sync that completes when the third detector indicates completion; and
- until completion, the memory synchronization instruction requests a generation change to one more than the generation detected by the third detector.
14. The system of claim 11, wherein
- the memory synchronization instruction is a non-cumulative barrier that completes when the second detector indicates completion; and
- until completion, the memory synchronization instruction requests a generation change to one more than the generation detected by the second detector.
15. The system of claim 11, wherein
- the memory synchronization instruction is a producer sync that completes when the first detector indicates completion; and
- until completion, the memory synchronization instruction requests a generation change to one more than the generation detected by the first detector.
16. The system of claim 11, wherein
- the memory synchronization instruction is a generation change sync that completes if either the second detector indicates completion or if the generation stored in the second detector differs from a central generation indication; and
- the system is adapted such that, if the memory synchronization instruction does not complete immediately, a generation change is requested for one more than the generation stored in the second detector.
17. The system of claim 11, wherein
- the memory synchronization instruction is a producer generation change sync that completes if the first detector indicates completion; or the generation detected by the first detector is different from a central generation indication; and
- the system is adapted such that, if the memory synchronization instruction does not complete immediately, a generation change is requested for one more than the generation detected by the first detector.
18. The system of claim 11, wherein
- the third detector is primed with the current generation minus one;
- the memory synchronization instruction is a consumer sync that completes if the third detector indicates completion; and
- the system is adapted such that, until completion, a generation change is requested for one more than the generation detected by the third detector.
19. A computer method comprising carrying out operations in a multiprocessor system, the operations comprising
- responsive to a given thread running on the system, recognizing a memory synchronization instruction, the instruction implicating a plurality of memory synchronization sub-tasks;
- responsive to the instruction, invoking at least one memory synchronization facility in accordance with a synchronization scheme including a plurality of synchronization levels; and
- distributing the sub-tasks responsive to the levels so as to offload sub-tasks from or allocate subtasks to the given thread.
20. The method of claim 19, wherein at least one of the sub-tasks comprises requesting a change of generation with respect to a central generation indication of the system.
21. The method of claim 19, wherein at least one of the sub-tasks comprises checking at least one generation detector associated with the given thread and indicating completion responsive to such checking.
22. The method of claim 19, wherein the given thread is one of a group of threads working together and one of the group is considered a bottleneck, so the distributing offloads sub-tasks from the bottleneck.
23. A computer program product for carrying out tasks within a multiprocessor system, the computer program product comprising: a storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method comprising:
- implementing the tasks in accordance with a plurality of threads adapted to run in parallel;
- specifying first and second memory synchronization instructions in accordance with a memory synchronization protocol that implicates a plurality of memory synchronization sub-tasks, respective sub-sets of the sub-tasks corresponding to respective levels of synchronization, wherein the first and second memory synchronization instructions are adapted to offload given sub-tasks from a thread expected to be a bottleneck to a thread expected not to be a bottleneck.
24. The product of claim 23, wherein at least one of the sub-tasks comprises requesting a generation change from a central generation indication device between a memory access request and a guard location in at least one of the threads.
25. The product of claim 22, wherein at least one of the sub-tasks comprises monitoring completion of a generation associated with a particular type of instruction associated with a respective level of respective memory synchronization instruction.
Type: Application
Filed: Jun 8, 2010
Publication Date: May 19, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventor: Martin Ohmacht (Yorktown Heights, NY)
Application Number: 12/796,389
International Classification: G06F 9/38 (20060101);