ATOMICITY: A MULTI-PRONGED APPROACH

In a multiprocessor system with speculative execution, atomicity can be approached in several ways. One approach is to provide atomic instructions that perform multiple functions and are guaranteed to complete. Another is to group blocks of code so that they succeed or fail together. A system can incorporate more than one such approach, and when it does, it may prioritize one approach over another. When conflict detection is done through a directory lookup in a cache memory, atomic instructions and atomicity-related operations may be implemented in a cache data array access pipeline in that cache memory. This implementation may include feedback into the pipeline, both for implementing multiple functions within a single atomic instruction and for cascading atomic instructions.

Description
CROSS REFERENCE TO RELATED APPLICATIONS

Benefit is claimed of the following applications, in particular:

61/295,669, filed Jan. 15, 2010, and

61/299,911, filed Jan. 29, 2010,

both for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOK UP AND PARTIAL CACHE LINE SPECULATION SUPPORT”,

And of the following applications in general:

Benefit of the following applications is claimed and they are also incorporated by reference: U.S. patent application Ser. No. 12/796,411 filed Jun. 8, 2010; U.S. patent application Ser. No. 12/696,780, filed Jan. 29, 2010; U.S. provisional patent application Ser. No. 61/293,611, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,367, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,172, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,190, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,496, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,429, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/697,799 filed Feb. 1, 2010; U.S. patent application Ser. No. 12/684,738, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,860, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,174, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,184, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,852, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,642, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,804, filed Jan. 8, 2010; U.S. provisional patent application Ser. No. 61/293,237, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/693,972, filed Jan. 26, 2010; U.S. patent application Ser. No. 12/688,747, filed Jan. 15, 2010; U.S. patent application Ser. No. 12/688,773, filed Jan. 15, 2010; U.S. patent application Ser. No. 12/684,776, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/696,825, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/684,693, filed Jan. 8, 2010; U.S. provisional patent application Ser. No. 61/293,494, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/731,796, filed Mar. 25, 2010; U.S. patent application Ser. No. 12/696,746, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,015 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/727,967, filed Mar. 19, 2010; U.S. patent application Ser. No. 12/727,984, filed Mar. 19, 2010; U.S. patent application Ser. No. 12/697,043 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,175, Jan. 29, 2010; U.S. patent application Ser. No. 12/684,287, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/684,630, filed Jan. 8, 2010; U.S. patent application Ser. No. 12/723,277 filed Mar. 12, 2010; U.S. patent application Ser. No. 12/696,764, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/696,817 filed Jan. 29, 2010; U.S. patent application Ser. No. 12/697,164, filed Jan. 29, 2010; U.S. patent application Ser. No. 12/796,411, filed Jun. 8, 2010; and, U.S. patent application Ser. No. 12/796,389, filed Jun. 8, 2010.

All of the above-listed applications are incorporated by reference herein.

GOVERNMENT CONTRACT

This invention was made with Government support under Contract No.: B554331 awarded by the Department of Energy. The Government has certain rights in this invention.

BACKGROUND

The invention relates to guaranteeing atomicity in a multiprocessor environment.

In a multiprocessor environment, it is desirable to provide memory operations that guarantee atomicity for a sequence of accesses.

In some multiprocessor environments there may be more than one type of memory access construct that provides atomicity. For instance, in the PowerPC architecture, there is a pair of instructions, called larx/stcx, that requests atomicity for a load/store access pair to the same memory location. Hardware Transactional Memory is another method, one that provides atomicity for a larger sequence of memory accesses.

The field of the invention is speculative execution in a multiprocessor system, and in particular guaranteeing atomicity in such a system.

The following definition of atomicity appears in IBM®, Power ISA™ Version 2.06, Jan. 30, 2009, “1.4 Single-copy Atomicity,” p. 655:

    • An access is single-copy atomic, or simply atomic, if it is always performed in its entirety with no visible fragmentation. Atomic accesses are thus serialized: each happens in its entirety in some order, even when that order is not specified in the program or enforced between processors . . . . An access that is not atomic is performed as a set of smaller disjoint atomic accesses.

This book, which will be referred to herein as the “PowerPC Architecture,” is incorporated by reference in its entirety.

In the PowerPC architecture, atomicity is often thought of in the context of the larx/stcx type instructions. This type of instruction has several forms:

lwarx/stwcx, for word accesses;

lbarx/stbcx, for byte accesses;

lharx/sthcx, for halfword accesses; and

ldarx/stdcx, for doubleword accesses.

These instructions come in pairs that delimit a block of instructions that the programmer wants to complete atomically. If the stcx instruction indicates a failure of atomicity, then the whole block fails. More about an implementation of larx/stcx appears in co-pending application Ser. No. 12/697,799, filed Jan. 29, 2010, which is incorporated herein by reference. This co-pending application is not conceded to be prior art.

With new multiprocessing architectures, new mechanisms for guaranteeing atomicity are desirable.

SUMMARY

In the latest IBM® Blue Gene® architecture, the point of coherence is a directory lookup mechanism in a cache memory. It would be desirable to guarantee a hierarchy of atomicity options within that architecture.

In one embodiment, a multiprocessor system includes a plurality of processors, a conflict checking mechanism, and an instruction implementation mechanism. The processors are adapted to carry out speculative execution in parallel. The conflict checking mechanism is adapted to detect and protect results of speculative execution responsive to memory access requests from the processors. The instruction implementation mechanism cooperates with the processors and the conflict checking mechanism to implement an atomic instruction that performs a load, a modify, and a store with respect to a single memory location in an uninterruptible fashion.

In another embodiment, a system includes a plurality of processors and at least one cache memory. The processors are adapted to issue atomicity related operations. The operations include at least one atomic instruction and at least one other type of operation. The atomic instruction includes sub-operations including a read, a modify, and a write. The other type of operation includes at least one atomicity related operation. The cache memory includes a cache data array access pipeline and a controller. The controller is adapted to prevent the other types of operations from entering the cache data array access pipeline, responsive to an atomic instruction in the pipeline, when those other types of operations compete with the atomic instruction in the pipeline for a memory resource.

In yet another embodiment, a multiprocessor system includes a plurality of processors, a central conflict checking mechanism, and a prioritizer. The processors are adapted to implement parallel speculative execution of program threads and to implement a plurality of atomicity related techniques. The central conflict checking mechanism resolves conflicts between the threads. The prioritizer prioritizes at least one atomicity related technique over at least one other atomicity related technique.

In a further embodiment, a computer method includes issuing an atomic instruction, recognizing the atomic instruction, and blocking other operations. The atomic instruction is issued from one of the processors in a multi-processor system and defines sub-operations that include reading, modifying, and storing with respect to a memory resource. A directory based conflict checking mechanism recognizes the atomic instruction. Other operations seeking to access the memory resource are blocked until the atomic instruction has completed.

Objects and advantages will be apparent from the description that follows.

BRIEF DESCRIPTION OF DRAWING

Embodiments will now be described by way of non-limiting example with reference to the following figures:

FIG. 1 shows an overview of a multiprocessor system within which caching improvements may be implemented.

FIG. 1A shows some software running in a distributed fashion on the multiprocessor system.

FIG. 1B shows a timing diagram with respect to TM type speculative execution.

FIG. 2 shows a map of a cache slice.

FIG. 3 is a schematic of the control unit of an L2 slice.

FIG. 3A shows a request queue and retaining data associated with a previous memory access request.

FIG. 3B shows interaction between the directory pipe and directory SRAM.

FIG. 4 shows structure of the directory SRAM 309.

FIG. 5 shows more detail of operation of the L2 central unit.

FIG. 6 shows operation of a cache data array access pipe with respect to atomicity related functions.

FIG. 7 shows interaction between code sections embodying some different approaches to atomicity.

FIG. 8 is a flowchart relating to queuing atomic instructions.

DETAILED DESCRIPTION

The term “thread” is used herein. A thread can be either a hardware thread or a software thread. A hardware thread within a core processor includes a set of registers and logic for executing a software thread. The software thread is a segment of computer program code.

These threads can be the subject of “speculative execution,” meaning that a thread or threads can be started as a sort of wager or gamble, without knowledge of whether the thread can complete successfully. A given thread cannot complete successfully if some other thread modifies the data that the given thread is using in such a way as to invalidate the given thread's results. The terms “speculative,” “speculatively,” “execute,” and “execution” are terms of art in this context. These terms do not imply that any mental step or manual operation is occurring. All operations or steps described herein are to be understood as occurring in an automated fashion under control of computer hardware or software.

If speculation fails, the results must be invalidated and the thread must be re-run or some other workaround found. Generally, recovery from failure of any kind of speculative execution in the current embodiment relates to undoing changes made by a thread. Once a software thread is committed, the actions taken by the thread become irreversible.

Three modes of speculative execution are supported in the current embodiment: Thread Level Speculation (“TLS”), Transactional Memory (“TM”), and Rollback.

TM occurs in response to a specific programmer request. Generally the programmer will put instructions in a program delimiting sections in which TM is desired. This may be done by marking the sections as requiring atomic execution. “An access is single-copy atomic, or simply “atomic”, if it is always performed in its entirety with no visible fragmentation.” IBM® Power ISA™ Version 2.06, Jan. 30, 2009. In a transactional model, the programmer replaces critical sections with transactional sections at 1601 (FIG. 7), which can manipulate shared data without locking. When the section ends, the program will make another call that ultimately signals the hardware to do conflict checking and reporting.
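
By way of non-limiting illustration, a transactional section might be used from C as in the following sketch. The entry points tm_begin( ) and tm_end( ) are hypothetical names standing in for whatever calls the runtime system exposes; they are not part of the disclosed interface.

    #include <stdint.h>

    extern void tm_begin(void); /* hypothetical: open a transactional section */
    extern int  tm_end(void);   /* hypothetical: request conflict checking and
                                   commit; nonzero means a conflict was found
                                   and the section must be retried */

    static int64_t shared_balance; /* shared data manipulated without locking */

    void deposit(int64_t amount)
    {
        do {
            tm_begin();
            shared_balance += amount; /* reads/writes tracked by the hardware */
        } while (tm_end() != 0);      /* on conflict, re-execute the section */
    }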

Normally, TLS occurs when a programmer has not specifically requested parallel operation. Sometimes a compiler will call for TLS execution in response to a sequential program. When the programmer writes this sequential program, she may insert commands delimiting sections. The compiler can recognize these sections and attempt to run them in parallel.

Rollback occurs in response to “soft errors,” which normally result from cosmic rays or from alpha particles emitted by solder balls. Rollback is discussed in more detail in co-pending application Ser. No. 12/696,780, which is incorporated herein by reference.

The present invention arose in the context of the IBM® Blue Gene® project, which is further described in the applications incorporated by reference above. FIG. 1 is a schematic diagram of an overall architecture of a multiprocessor system in accordance with this project, and in which the invention may be implemented. At 101, there are a plurality of processors operating in parallel along with associated prefetch units and L1 caches. At 102, there is a switch. At 103, there are a plurality of L2 slices. At 104, there is a main memory unit. It is envisioned, for the present embodiment, that the L2 cache should be the point of coherence.

FIG. 1A shows some software running in a distributed fashion over the cores of node 50. An application program is shown at 131. If the application program requests TLS or TM, a runtime system 132 will be invoked. This runtime system exists particularly to manage TM and TLS execution, and it can request domains of IDs from the operating system 133. The operating system configures the hardware to define domains and modes of execution. “Domains” in this context are numerical groups of IDs that can be assigned to a mode of speculation. The runtime system can also be called to request allocation of IDs, to start a speculative section, and to end a section and determine the outcome of the speculation. More about this use of domains, and about the runtime system's allocation and commitment of IDs, can be found in the provisional applications 61/295,669, filed Jan. 15, 2010, and 61/299,911, filed Jan. 29, 2010, incorporated by reference above.

The application program can also request various operation types, for instance as specified in a standard such as the PowerPC architecture. These operation types might include larx/stcx pairs or atomic instructions, to be discussed further below.

FIG. 1B shows a timing diagram explaining how TM execution might work on this system. At 141 the program starts executing. At the end of block 141, a call for TM is made. In 142 the run time system receives this request and conveys it to the operating system. At 143, the operating system confirms the availability of the mode. The operating system can accept, reject, or put on hold any requests for a mode. The confirmation is made to the runtime system at 144. The confirmation is received at the application program at 145. If there had been a refusal, the program would have had to adopt a different strategy, such as serialization or waiting for modes or domains to become available. Because the request was accepted, parallel sections can start running at the end of 145. The runtime system gets speculative IDs from the hardware at 146 and transmits them to the application program at 147, which then uses them. The program knows when to finish speculation at the end of 147. Then the run time system asks for the ID to commit at 148. Any conflict information can be transmitted back to the application program at 149, which then may try again or adopt other strategies. If there is a conflict, an interrupt is raised by the L2. The L2 will send the interrupt to the hardware thread that was using the ID. This hardware thread then has to figure out, based on the state the runtime system is in and the state the L2 central provides indicating a conflict, what to do in order to resolve the conflict. For example, it might execute the transactional memory section again which causes the software to jump back to the start of the transaction.

If the hardware determines that no conflict has occurred, the speculative results of the associated thread can be made persistent.

In response to a conflict, trying again may make sense where another thread has completed successfully, since that may allow the current thread to succeed. If both threads restart, there can be a “livelock,” where both keep failing over and over. In this case, the runtime system may have to adopt other strategies, such as making one thread wait, choosing one transaction to survive and killing the others, or other strategies, all of which are known in the art.

FIG. 2 shows a cache slice. It includes arrays of data storage 201, and a central control portion 202.

FIG. 3 shows features of an embodiment of the central control portion 202 of a cache slice 103.

Coherence tracking unit 301 issues invalidations, when necessary. These invalidations are issued centrally, while in the prior generation of the Blue Gene® project, invalidations were achieved by snooping.

The request queue 302 buffers incoming read and write requests. In this embodiment, it is 16 entries deep, though other request buffers might have more or fewer entries. The addresses of incoming requests are matched against all pending requests to determine ordering restrictions. The queue presents the requests to the directory pipeline 308 based on ordering requirements.

The write data buffer 303 stores data associated with write requests. This buffer passes the data to the cache data array access pipeline, which is here implemented as eDRAM pipeline 305, in case of a write hit or after a write miss resolution.

The directory pipeline 308 accepts requests from the request queue 302, retrieves the corresponding directory set from the directory SRAM 309, matches and updates the tag information, writes the data back to the SRAM and signals the outcome of the request (hit, miss, conflict detected, etc.).

The L2 implements four parallel eDRAM pipelines 305 that operate independently; they may be referred to as eDRAM bank 0 through eDRAM bank 3. The eDRAM pipeline controls the eDRAM access and the dataflow from and to this macro. When only subcomponents of a doubleword are written, or for load-and-increment or store-add operations, the pipeline is responsible for scheduling the necessary read-modify-write (RMW) cycles and providing the dataflow for insertion and increment.

The read return buffer 304 buffers read data from the eDRAM or the memory controller and is responsible for scheduling the data return using the switch 102. In this embodiment it has a 32B-wide data interface to the switch. It is used only as a staging buffer to compensate for backpressure from the switch; it does not serve as a cache.

The miss handler 307 takes over processing of misses determined by the directory. It provides the interface to the DRAM controller and implements a data buffer for write and read return data from the memory controller.

The reservation table 306 registers and invalidates reservation requests.

Per FIG. 3A, the L2 slice 103 includes a request queue 302. At 311, a cascade of modules tests whether pending memory access requests will require data associated with the address of a previous request, that address being stored at 313. These tests might look for memory-mapped flags from the L1 or for some other identification. A result of the cascade 311 is used to create a control input at 314 for selection of the next queue entry for lookup at 315, which becomes an input for the directory lookup module 312.

FIG. 3B shows more about the interaction between the directory pipe 308 and the directory SRAM 309. The vertical lines in the pipe represent time intervals during which data passes through a cascade of registers in the directory pipe. In a first time interval T1, a read is signaled to the directory SRAM. In a second time interval T2, data is read from the directory SRAM. In a third time interval, T3, a table lookup informs writes WR and WR DATA to the directory SRAM. In general, table lookup will govern the behavior of the directory SRAM to control cache accesses responsive to speculative execution. Only one table lookup is shown at T3, but more might be implemented.

FIG. 4 shows the formats of the four directory SRAMs included at 309, to wit:

    • a base directory 321;
    • a least recently used directory 322;
    • a COH/dirty directory 323 and 323′; and
    • a speculative reader directory 324.

In the base directory 321, there are 15 bits that locate the line at 271. Then there is a seven-bit speculative writer ID field 272 and a flag 273 that indicates whether the write is speculative. Then there is a two-bit speculative read flag field 274, indicating whether to invoke the speculative reader directory 324, and a one-bit “current” flag 275. The current flag 275 indicates whether the current line is assembled from more than one way. The processor, A2, does not know about the fields 272-275; these fields are set by the L2 directory pipeline.

If the speculative writer flag is set, then the way has been written speculatively, rather than taken from main memory, and the writer ID field indicates what the writer ID was. If the flag is clear, the writer ID field is irrelevant.

The LRU directory indicates “age,” in other words the period of time since a way was last used. This directory is used for allocating ways in accordance with the Least Recently Used algorithm.

The COH/dirty directory has two uses, and accordingly two possible formats. In the first format, 323, known as “COH,” there are 17 bits, one for each core of the system. This format indicates, when the writer flag is not set, whether the corresponding core has a copy of this line of the cache. In the second format, 323′, there are 16 bits. These bits indicate, if the writer flag is set in the base directory, which parts of the line have been modified speculatively. The line has 128 bytes, but they are recorded at 323′ in groups of 8 bytes, so only 16 bits are used, one for each group of eight bytes.
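
For purposes of illustration only, the directory entry formats described above can be modeled in C as follows. The field widths follow the text, but the packing and field order shown are assumptions made for readability, not the hardware layout.

    #include <stdint.h>

    struct base_dir_entry {       /* base directory 321 */
        uint32_t line_addr  : 15; /* locates the line (271) */
        uint32_t writer_id  : 7;  /* speculative writer ID (272) */
        uint32_t spec_write : 1;  /* set if the way was written speculatively (273) */
        uint32_t spec_read  : 2;  /* whether to invoke reader directory 324 (274) */
        uint32_t current    : 1;  /* line assembled from more than one way (275) */
    };

    union coh_dirty_entry {       /* COH/dirty directory 323 and 323' */
        uint32_t coh_copies : 17; /* writer flag clear: one bit per core that
                                     may hold a copy of the line */
        uint32_t dirty_8b   : 16; /* writer flag set: one bit per 8-byte group
                                     of the 128-byte line written speculatively */
    };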

The operation of the pipe control unit 310 and the EDRAM queue decoupling buffer 300 will be described more below with reference to FIG. 6.

The L2 implements a multitude of decoupling buffers for different purposes, e.g.:

    • The request queue is an intelligent decoupling buffer (with reordering logic) that allows requests to be received from the switches even if the directory pipe is blocked.
    • The write data buffer accepts write data from the switch even if the eDRAM pipe is blocked or the target location in the eDRAM is not yet known.
    • The coherence tracking unit implements two buffers: one decoupling the requests that the directory lookup sends to it from the internal coherence SRAM lookup pipe, and one decoupling the SRAM lookup results from the interface to the switch.
    • The miss handler implements one buffer from the DRAM controller to the eDRAM and one from the eDRAM to the DRAM controller.
    • There are more: almost every subcomponent that can block for any reason is connected via a decoupling buffer to the unit feeding requests to it.

The L2 caches may operate as set-associative caches while also supporting additional functions, such as memory speculation for Speculative Execution (SE—also referred to as TLS), Transactional Memory and local memory rollback, as well as atomic memory transactions. Support for such functionalities includes additional bookkeeping and storage functionality for multiple versions of the same physical memory line.

To reduce main memory accesses, the L2 cache may serve as the point of coherence for all processors. In performing this function, an L2 central unit has responsibilities such as defining domains of speculation IDs, assigning modes of speculative execution to domains, allocating speculative IDs to threads, trying to commit the IDs, sending interrupts to the cores in case of conflicts, and retrieving conflict information. This function includes generating L1 invalidations when necessary. Because the L2 caches are inclusive of the L1 caches, they can remember which processors could possibly have a valid copy of each line, and they can multicast selective invalidations to such processors. The L2 caches are advantageously a synchronization point, so they coordinate synchronization instructions from the PowerPC architecture, such as larx/stcx.

Larx/stcx

The larx and stcx instructions are used to perform a read-modify-write operation to storage. If the store is performed, the use of the larx and stcx instruction pair ensures that no other processor or mechanism has modified the target memory location between the time the larx instruction is executed and the time the stcx instruction completes. The larx instruction loads the word from the location in storage specified by the effective address into a target register. In addition, a reservation on the memory location is created for use by a subsequent stcx instruction. The stcx instruction is used in conjunction with a preceding larx instruction to emulate a read-modify-write operation on a specified memory location.
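
For context, the following is the classic PowerPC retry loop built from this instruction pair, here performing an atomic fetch-and-increment of a word. It is the standard architectural idiom rather than anything specific to the present embodiment.

    #include <stdint.h>

    /* Atomically increment the word at addr; returns the old value. */
    uint32_t fetch_and_increment(uint32_t *addr)
    {
        uint32_t old, tmp;
        __asm__ volatile(
            "1: lwarx  %0,0,%2  \n" /* load word and create a reservation */
            "   addi   %1,%0,1  \n" /* tmp = old + 1 */
            "   stwcx. %1,0,%2  \n" /* store only if the reservation holds */
            "   bne-   1b       \n" /* reservation lost: start over */
            : "=&r"(old), "=&r"(tmp)
            : "r"(addr)
            : "cr0", "memory");
        return old;
    }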

The L2 caches will handle larx/stcx reservations and ensure their consistency. They are a natural location for this responsibility because software locking is dependent on consistency, which is managed by the L2 caches.

The core essentially hands responsibility for larx/stcx consistency and completion off to the external memory system. The core does not maintain an internal reservation, and it avoids complex cache management through simple invalidation: larx is treated like a cache-inhibited load, but invalidates the target line if it hits in the L1 cache; similarly, stcx is treated as a cache-inhibited store and also invalidates the target line in the L1 if it exists.

The L2 cache is expected to maintain reservations for each thread, and no special internal consistency action is taken by the core when multiple threads attempt to use the same lock. To support this, a thread is blocked from issuing any L2 accesses while a larx from that thread is outstanding, and it is blocked completely while a stcx is outstanding. The L2 cache will support larx/stcx as described in the next several paragraphs.

Each L2 slice has 17 reservation registers. Each reservation register consists of a 25-bit address register and a 9-bit thread ID register that identifies which thread has reserved the stored address and indicates whether the register is valid (i.e., in use).

When a larx occurs, the valid reservation thread ID registers are searched to determine whether the thread has already made a reservation; if so, the existing reservation is cleared. In parallel, the registers are searched for matching addresses. If a match is found, an attempt is made to add the thread ID to that register. If no matching address is found, or the thread ID could not be added to a reservation register with a matching address, a new reservation is established. If a register is available, it is used; otherwise a random existing reservation is evicted and a new reservation is established in its place. The larx then continues as an ordinary load and returns data.

Every store searches the valid reservation address registers. All matching registers are simply invalidated. The necessary back-invalidations to cores will be generated by the normal coherence mechanism.

When a stcx occurs, the valid reservation registers 306 are searched for entries with both a matching address and a matching thread ID. If both of these conditions are met, then the stcx is considered a success. Stcx success is returned to the requesting core and the stcx is converted to an ordinary store (causing the necessary invalidations to other cores by the normal coherence mechanism). If either condition is not met, then the stcx is considered a failure. Stcx fail is returned to the requesting core and the stcx is dropped. In addition, for every stcx any pending reservation for the requesting thread is invalidated.

To allow more than 17 reservations per slice, the actual thread ID field is encoded as the core ID plus a vector of 4 bits, each bit representing a thread of the indicated core. When a reservation is established, a check is first made for a matching address and core number in any register. If a register matches on both address and core, the corresponding thread bit is activated. Only when all thread bits are clear is the entire register considered invalid and available for reallocation without eviction.
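
As a non-limiting illustration, this encoding can be modeled in software as follows. The structure layout and helper logic are illustrative only, and the random eviction of a victim register is elided.

    #include <stdint.h>

    #define NUM_RES_REGS 17

    struct reservation {
        uint32_t addr    : 25; /* reserved address */
        uint32_t core_id : 5;  /* which core holds the reservation(s) */
        uint32_t threads : 4;  /* one bit per hardware thread of that core;
                                  all bits clear means the register is free */
    };

    static struct reservation res[NUM_RES_REGS];

    void establish_reservation(uint32_t addr, uint32_t core, uint32_t thread)
    {
        /* First look for a register already matching on address and core. */
        for (int i = 0; i < NUM_RES_REGS; i++) {
            if (res[i].threads != 0 && res[i].addr == addr &&
                res[i].core_id == core) {
                res[i].threads |= 1u << thread; /* activate the thread bit */
                return;
            }
        }
        /* Otherwise claim a register whose thread bits are all clear. */
        for (int i = 0; i < NUM_RES_REGS; i++) {
            if (res[i].threads == 0) {
                res[i].addr    = addr;
                res[i].core_id = core;
                res[i].threads = 1u << thread;
                return;
            }
        }
        /* All registers in use: a random reservation would be evicted here. */
    }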

Atomic Operations

The L2 supports multiple atomic instructions or operations on 8B entities. These operations are sometimes of the type that performs a read, a modify, and a write back atomically; in other words, they combine several frequently used instructions and guarantee that the combination performs successfully. The operation is selected based on address bits, as defined in the memory map, and on the type of access. These operations will typically require RAW, WAW, and WAR checking. The directory lookup phase is somewhat different from that for other instructions, because both a read and a write are contemplated.

FIG. 6 shows aspects of the L2 cache data array access pipeline, implemented as EDRAM pipeline 305 in the preferred embodiment, that are pertinent to atomic operations. In this pipeline, data is typically ready after five cycles. At 461, some read data is ready. Error correcting codes (ECC) are used to make sure that the read data is error free. The read data can then be sent to the core at 463. If the operation is one of the read/modify/write atomic operations or instructions, the data modification is performed at 462, followed by a write back to the eDRAM at 465, which feeds back to the beginning of the pipeline per 464 while other matching requests are blocked from the pipeline, guaranteeing atomicity. Sometimes two such compound instructions will be carried out sequentially, in other words cascaded; in such a case, any number of them can be linked using the feedback at 466. To assemble a line, several iterations of this pipeline structure may be undertaken. More about assembling lines can be found in the provisional applications incorporated by reference above. Thus atomic instructions, which reserve the EDRAM pipeline, can achieve performance results that a sequence of separate operations cannot, while guaranteeing atomicity.

It is possible to feed two atomic operations or instructions to two different addresses together through the EDRAM pipe: read a, read b, then write a and b.

FIG. 7 shows a comparison between approaches to atomicity. At 1601 a thread executing pursuant to a TM model is shown. At 1602 a block of code protected by a larx/stcx pair is shown. At 1603 an atomic operation is shown.

Thread 1601 includes three parts:

a first part 1604 that involves at least one load instruction;

a second part 1605 that involves at least one store instruction; and

a third part 1606 where the system tries to commit the thread.

Arrow 1607 indicates that the reader set directory is active for that part. Arrow 1608 indicates that the writer set directory is active for that part.

Code block 1602 is delimited by a larx instruction 1609 and a stcx instruction 1610. Arrow 1611 indicates that the reservation table 306 is active. When the stcx instruction executes, if there has been any read or write conflict, the whole block 1602 fails.

Atomic operation 1603 is one of the types indicated in the table below, for instance “load increment.” The arrows at 1612 show the arrival of the atomic operation during the periods of time delimited by the double arrows at 1607 and 1611. The atomic operation is guaranteed to complete due to the block on the EDRAM pipe for the relevant memory accesses. Accordingly, if there is a concurrent use by a TM thread 1601 and/or by a block of code protected by larx/stcx 1602, and if those uses access the same memory location as the atomic operation 1603, a conflict will be signaled and the results of the code blocks 1601 and 1602 will be invalidated. An uninterruptible, persistent atomic instruction is thus given priority over a reversible operation, e.g., a TM transaction, or an interruptible operation, e.g., a larx/stcx pair.

As between blocks 1601 and 1602, which succeeds and which is invalidated will depend on the order of operations, if they compete for the same memory resource. For instance, in the absence of 1603, if the stcx instruction 1610 completes before the commit attempt 1606, the larx/stcx block will succeed while the TM thread fails. Alternatively, also in the absence of 1603, if the commit attempt 1606 completes before the stcx instruction 1610, then the larx/stcx block will fail. The TM thread can thus function a bit like multiple larx/stcx pairs taken together.

FIG. 8 shows some issues relating to queuing operations. At 1701, an atomic instruction issues from a processor. It takes the form of a memory access, with the lower bits indicating the address of a memory location and the upper bits indicating which operation is desired. At 1702, the L1D and L1P treat this operation as an ordinary memory access to an address that is not cached. At 1703, in the pipe control unit of the L2 cache slice, the operation is recognized as an atomic instruction responsive to a directory lookup. The directory lookup also determines whether there are multiple versions of the data accessed by the atomic instruction. At 1704, if there are multiple versions, control is transferred to the miss handler.
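
By way of a purely hypothetical sketch, such an access might be formed as follows. The shift amount and opcode value below are invented for illustration; the disclosure specifies only that the lower bits carry the location and the upper bits select the operation.

    #include <stdint.h>

    #define ATOMIC_OP_SHIFT   56u  /* assumed position of the opcode field */
    #define OP_LOAD_INCREMENT 0x2u /* assumed encoding of "load increment" */

    /* Form the special address: memory location in the lower bits, desired
     * operation in the upper bits. A load from this address is recognized
     * by the L2 pipe control unit as an atomic instruction. */
    static inline volatile uint64_t *atomic_address(uint64_t op, uintptr_t loc)
    {
        return (volatile uint64_t *)((op << ATOMIC_OP_SHIFT) | (uint64_t)loc);
    }

    static inline uint64_t load_increment(uintptr_t loc)
    {
        return *atomic_address(OP_LOAD_INCREMENT, loc); /* fetch-and-increment */
    }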

At 1705, the miss handler treats the existence of multiple versions as a cache miss. It blocks further accesses to that set and prevents them from entering the queue, by directing them to the EDRAM decoupling buffer. With respect to the set, the EDRAM pipe is then made to carry out copy/insert operations at 1707 until the aggregation is complete at 1708. This version aggregation loop is used for ordinary memory accesses to cache lines that have multiple versions.

Once the aggregation is complete, or if there are not multiple versions, control passes to 1710, where the current access is inserted into the EDRAM queue. If there is already an atomic instruction relating to this line of the cache in the queue at 1711, then the current operation must wait in the EDRAM decoupling buffer. Non-atomic operations or instructions similarly have to be decoupled if they seek to access a cache line that is currently being accessed by an atomic instruction in the EDRAM queue. If there are no atomic instructions relating to this line in the queue, then control passes to 1713, where the current operation is transferred to the EDRAM queue. Then, at 1714, the atomic instruction traverses the EDRAM queue twice: once for the read and modify, and once for the write. During this traversal, other operations seeking to access the same line may not enter the EDRAM pipe, and are decoupled into the decoupling buffer.
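
The admission rule just described can be summarized by the following illustrative model; it is a software restatement of the flow of FIG. 8, not the hardware implementation.

    #include <stdbool.h>
    #include <stdint.h>

    struct op {
        uint64_t cache_line; /* cache line targeted by the memory access */
        bool     is_atomic;  /* recognized as an atomic instruction? */
    };

    /* Return true if the incoming operation may enter the EDRAM queue, or
     * false if it must wait in the EDRAM decoupling buffer. */
    bool may_enter_edram_queue(const struct op *queued, int n_queued,
                               const struct op *incoming)
    {
        for (int i = 0; i < n_queued; i++) {
            /* Any operation, atomic or not, waits while an atomic
             * instruction already in the queue targets the same line. */
            if (queued[i].is_atomic &&
                queued[i].cache_line == incoming->cache_line)
                return false;
        }
        return true;
    }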

The following atomic instructions are examples that are supported in the present embodiment, though others might be implemented. These operations are implemented in addition to the memory-mapped I/O operations in the PowerPC architecture.

Load-class operations:

    • Opcode 000, Load: Load the current value.

    • Opcode 001, Load Clear: Fetch the current value and store zero.

    • Opcode 010, Load Increment: Fetch the current value and increment storage. 0xFFFF FFFF FFFF FFFF rolls over to 0, so when software uses the counter as unsigned, +2^64 − 1 rolls over to 0. Thanks to two's complement, software can use the counter as signed or unsigned; when using it as signed, +2^63 − 1 rolls over to −2^63.

    • Opcode 011, Load Decrement: Fetch the current value and decrement storage. 0 rolls over to 0xFFFF FFFF FFFF FFFF, so when software uses the counter as unsigned, 0 rolls over to +2^64 − 1. Thanks to two's complement, software can use the counter as signed or unsigned; when using it as signed, −2^63 rolls over to +2^63 − 1.

    • Opcode 100, Load Increment Bounded: The counter is at the address given and the boundary is at the SUBSEQUENT 8 B address. If the counter and boundary values differ, increment the counter and return the old value; else return 0x8000 0000 0000 0000:

        if (*ptrCounter == *(ptrCounter+1)) {
            return 0x8000000000000000; // +2^63 unsigned, −2^63 signed
        } else {
            oldValue = *ptrCounter;
            ++*ptrCounter;
            return oldValue;
        }

      The 8 B counter and its 8 B boundary efficiently support producer/consumer queues, stacks, and deques with multiple producers and multiple consumers. The counter and boundary pair must be within a 32-byte line. Rollover and signed/unsigned software use are as for the Load Increment instruction. On boundary, 0x8000 0000 0000 0000 is returned, so unsigned use is also restricted to the upper value 2^63 − 1, instead of the optimal 2^64 − 1; this factor-of-2 loss is not expected to be a problem in practice.

    • Opcode 101, Load Decrement Bounded: The counter is at the address given and the boundary is at the PREVIOUS 8 B address. If the counter and boundary values differ, decrement the counter and return the old value; else return 0x8000 0000 0000 0000:

        if (*ptrCounter == *(ptrCounter-1)) {
            return 0x8000000000000000; // +2^63 unsigned, −2^63 signed
        } else {
            oldValue = *ptrCounter;
            --*ptrCounter;
            return oldValue;
        }

      Comments as for Load Increment Bounded.

    • Opcode 110, Load Increment If Equal: The counter is at the address given and the compare value is at the SUBSEQUENT 8 B address. If the counter and compare values are equal, increment the counter and return the old value; else return 0x8000 0000 0000 0000:

        if (*ptrCounter != *(ptrCounter+1)) {
            return 0x8000000000000000; // +2^63 unsigned, −2^63 signed
        } else {
            oldValue = *ptrCounter;
            ++*ptrCounter;
            return oldValue;
        }

      The 8 B counter and its compare value efficiently support trylock operations for mutex locks. The counter and compare value pair must be within a 32-byte line. Rollover and signed/unsigned software use are as for the Load Increment instruction. On mismatch, 0x8000 0000 0000 0000 is returned, so unsigned use is also restricted to the upper value 2^63 − 1, instead of the optimal 2^64 − 1; this factor-of-2 loss is not expected to be a problem in practice.

Store-class operations:

    • Opcode 000, Store: Store the given value.

    • Opcode 001, StoreTwin: Store the 8 B value to the 8 B address given and to the SUBSEQUENT 8 B address, if those two locations previously held equal values. Used for fast deque implementations; the address pair must be within a 32-byte line.

    • Opcode 010, Store Add: Add the store value to storage. 0xFFFF FFFF FFFF FFFF and earlier rolls over to 0 and beyond, and vice versa in the other direction, so when software uses the counter as unsigned, +2^64 − 1 and earlier rolls over to 0 and beyond. Thanks to two's complement, software can use the counter and the store value as signed or unsigned; when using them as signed and adding a positive store value, +2^63 − 1 and earlier rolls over to −2^63 and beyond, and vice versa when adding a negative store value.

    • Opcode 011, Store Add/Coherence on Zero: As Store Add, but do not keep the L1 caches coherent unless the storage value reaches zero.

    • Opcode 100, Store Or: Logically OR the value into storage.

    • Opcode 101, Store Xor: Logically XOR the value into storage.

    • Opcode 110, Store Max Unsigned: Store the maximum of the value and storage; values are interpreted as unsigned binary.

    • Opcode 111, Store Max Sign/Value: Store the maximum of the value and storage; values are interpreted as a 1 b sign and a 63 b absolute value. This allows a Max of floating point numbers. If the encoding of either operand represents a NaN, that operand is assumed to be positive for comparison purposes.
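
As an illustration of the producer/consumer use noted for Load Increment Bounded, software might claim queue slots as follows. Here load_increment_bounded( ) is a hypothetical wrapper, written in plain C solely to restate the table's semantics; real code would issue the hardware operation instead.

    #include <stdint.h>

    #define BOUNDED_FAIL 0x8000000000000000ull

    /* Counter at p[0], boundary at the SUBSEQUENT 8 B address p[1]; the
     * pair must lie within one 32-byte line. Plain-C restatement only. */
    uint64_t load_increment_bounded(uint64_t *p)
    {
        if (p[0] == p[1])
            return BOUNDED_FAIL; /* counter has reached the boundary */
        return p[0]++;           /* return the old value, increment storage */
    }

    /* Consumer side of a multi-producer/multi-consumer queue: the head
     * counter chases the tail boundary; failure means the queue is empty. */
    int try_consume(uint64_t *head_and_tail, void (*handle)(uint64_t))
    {
        uint64_t slot = load_increment_bounded(head_and_tail);
        if (slot == BOUNDED_FAIL)
            return -1;    /* nothing to consume right now */
        handle(slot);     /* process the claimed slot number */
        return 0;
    }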

A load increment acts similarly to a load. The instruction provides a destination address to be loaded and incremented. In other words, the load carries a special modification that tells the memory subsystem not simply to load the value, but also to increment it and write the incremented data back to the same location. This instruction is useful in various contexts. For instance, if there is a workload to be distributed to multiple threads, and it is not known how many threads will share the workload or which ones are ready, the workload can be divided into chunks. A function can associate a respective integer value with each of these chunks, and threads can use load-increment to draw a workload by number and process it, as sketched below.
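
The chunk-numbering pattern just described might look as follows; load_increment( ) is the hypothetical wrapper sketched earlier, and the chunk bookkeeping is invented for illustration.

    #include <stdint.h>

    #define NUM_CHUNKS 1024u

    extern uint64_t load_increment(uintptr_t loc); /* hypothetical, see above */
    extern void process_chunk(uint64_t chunk);

    static uint64_t next_chunk; /* shared counter, incremented atomically in the L2 */

    void worker_thread(void)
    {
        for (;;) {
            /* Each call returns a distinct chunk number; no lock is needed
             * and no two threads can draw the same chunk. */
            uint64_t chunk = load_increment((uintptr_t)&next_chunk);
            if (chunk >= NUM_CHUNKS)
                break; /* all chunks have been handed out */
            process_chunk(chunk);
        }
    }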

Each of these instructions acts like a modification of main memory. If any of the core/L1 units has a copy of the modified value, it will get a notification that the memory value has changed, and it evicts and invalidates its local copy. The next time the core/L1 unit needs the value, it has to fetch it from the L2. This process happens each time the location is modified in the L2.

A common pattern is that some of the core/L1 units will be programmed to act when a memory location modified by atomic instructions reaches a specific value. When a core polls for that value, each atomic modification causes a repeating cycle of an L1 invalidation, an L1 miss, and a fetch from the L2.

Store_add_coherence_on_zero reduces the frequency with which the local cache is invalidated and a new copy is fetched from the L2 cache. With this atomic instruction, L1 cache lines are left incoherent and are not invalidated unless the modified value reaches zero. A thread waiting for zero can then keep checking whatever local value is in its L1 cache, even though that value may be stale, until the value actually reaches zero. This means that one thread can modify the value as far as the L2 is concerned without generating a miss for the other threads.
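
A countdown pattern using this instruction might look as follows; store_add_coherence_on_zero( ) is a hypothetical wrapper name for the Store Add/Coherence on Zero operation of the table.

    #include <stdint.h>

    extern void store_add_coherence_on_zero(int64_t *loc, int64_t delta);

    static volatile int64_t tasks_outstanding; /* counts down to zero */

    void finish_task(void)
    {
        /* Decrement in the L2. Other cores' L1 copies are deliberately left
         * stale (incoherent) until the counter actually reaches zero. */
        store_add_coherence_on_zero((int64_t *)&tasks_outstanding, -1);
    }

    void wait_for_all_tasks(void)
    {
        /* Spin on the local L1 copy without taking a miss per decrement;
         * the line is invalidated and refetched only when the value is 0. */
        while (tasks_outstanding != 0)
            ;
    }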

In general, the instructions in the above table, called “atomic,” have an effect that regular loads and stores do not have: they read, modify, and write back in one atomic instruction, even within the context of speculation. This type of operation works in the context of speculation because of the loop back in the EDRAM pipeline. It undergoes conflict checking equivalent to that for a sequence of a load and a store. Before the atomic instruction loads, it performs the version aggregation discussed further in the provisional applications incorporated by reference above.

Although the embodiments of the present invention have been described in detail, it should be understood that various changes and substitutions can be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application, need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention.

The present invention can be realized in hardware, software, or a combination of hardware and software. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and run, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.

Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.

It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.

The word “comprising”, “comprise”, or “comprises” as used herein should not be viewed as excluding additional elements. The singular article “a” or “an” as used herein should not be viewed as excluding a plurality of elements. Unless the word “or” is expressly limited to mean only a single item exclusive from other items in reference to a list of at least two items, then the use of “or” in such a list is to be interpreted as including (a) any single item in the list, (b) all of the items in the list, or (c) any combination of the items in the list. Ordinal terms in the claims, such as “first” and “second” are used for distinguishing elements and do not necessarily imply order of operation.

Items illustrated as boxes in flowcharts herein might be implemented as software or hardware as a matter of design choice by the skilled artisan. Software might include sequential or parallel code, including objects and/or modules. Modules might be organized so that functions from more than one conceptual box are spread across more than one module or so that more than one conceptual box is incorporated in a single module. Data and computer program code illustrated as residing on a medium might in fact be distributed over several media, or vice versa, as a matter of design choice. Such media might be of any suitable type, such as magnetic, electronic, solid state, or optical.

Any algorithms given herein can be implemented as computer program code and stored on a machine readable medium, to be performed on at least one processor. Alternatively, they may be implemented as hardware. They are not intended to be executed manually or mentally.

The use of variable names in describing operations or instructions in a computer does not preclude the use of other variable names for achieving the same function.

Claims

1. A multiprocessor system comprising:

a plurality of processors adapted to carry out speculative execution in parallel;
a conflict checking mechanism adapted to detect and protect results of speculative execution responsive to memory access requests from the processors; and
an instruction implementation mechanism cooperating with the processors and conflict checking mechanism adapted to implement an atomic instruction that includes load, modify, and store with respect to a single memory location in an uninterruptible fashion.

2. The system of claim 1, comprising a cache memory, wherein the cache memory includes:

a directory;
storage locations; and
a directory lookup mechanism adapted to implement the conflict checking.

3. The system of claim 2, wherein the directory lookup mechanism is further adapted to detect the atomic instruction responsive to a memory access to a distinct address.

4. The system of claim 1, wherein

the atomic instruction specifies a memory access relating to a cache line,
the conflict checking mechanism is disposed within a cache memory unit containing the line, and
the system comprises a cache memory including at least one queue along with a blocking mechanism for preventing accesses to the queue corresponding to the cache line.

5. The system of claim 1, wherein the instruction implementation mechanism comprises:

a queue in a cache memory for queuing memory access requests; and
a feedback loop for feeding back at least one later part of the atomic instruction after a first part has completed.

6. The system of claim 5, wherein

the processors are further adapted to carry out at least one atomicity-related function other than the atomic instruction; and
the feedback loop is adapted to override the function and give the atomic instruction priority.

7. The system of claim 6, wherein the other atomicity related function comprises a larx/stcx type pair of instructions.

8. The system of claim 6, wherein the other atomicity related function comprises a thread executing under a TM model.

9. The system of claim 1, wherein the atomic instruction comprises sub-operations including a read, an increment, and a write.

10. A system comprising:

a plurality of processors adapted to issue atomicity related operations including at least one atomic instruction that includes sub-operations, the sub-operations including a read, a modify, and a write, and at least one other type of atomicity related operation; and
at least one cache memory comprising: a cache data array access pipeline; and at least one controller adapted to prevent the other types of operations from entering the cache data array access pipeline, responsive to the atomic instruction in the pipeline, when those other types of operations compete with the atomic instruction in the pipeline for a memory resource.

11. A multiprocessor system comprising:

a plurality of processors adapted to implement parallel speculative execution of program threads, the processors being adapted to implement a plurality of atomicity related techniques;
a central conflict checking mechanism adapted to resolve conflicts between the threads; and
conflict resolution protocol apparatus adapted to prioritize at least one atomicity related technique over at least one other atomicity related technique.

12. A computer method comprising:

issuing an atomic instruction from a processor in a multi-processor system, the atomic instruction defining sub-operations that include loading, modifying, and storing with respect to a memory resource;
recognizing the atomic instruction in a directory based conflict checking mechanism; and
blocking other functions that seek to access the memory resource, until the atomic instruction has completed.

13. The method of claim 12, wherein the directory based conflict checking system is in a cache memory.

14. The method of claim 13, comprising recognizing the atomic instruction as part of a directory lookup in the cache memory.

15. The method of claim 13, wherein the atomic instruction seeks to access a cache line, the cache memory includes at least one queue and blocking comprises preventing operations accessing the cache line from entering the queue.

16. The method of claim 12, comprising:

issuing another atomicity related operation from the processor; and
blocking the other atomicity related operation responsive to the atomic instruction.

17. The method of claim 16, wherein the other atomicity related operation comprises a TM section of a program.

18. The method of claim 16, wherein the other atomicity related operation comprises a larx/stcx pair.

19. The method of claim 12, comprising aggregating versions of a memory resource prior to undertaking the atomic instruction.

20. The method of claim 12, wherein blocking comprises preventing operations from entering a pipeline when there is an atomic instruction in the pipeline.

21. The method of claim 20, comprising feeding back at least one of the sub-operations into the pipeline, during blocking.

22. The method of claim 20, comprising concatenating a plurality of atomic instructions using a same memory resource in the pipeline during blocking.

Patent History
Publication number: 20110219215
Type: Application
Filed: Jan 18, 2011
Publication Date: Sep 8, 2011
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventors: Matthias A. Blumrich (Ridgefield, CT), Dong Chen (Croton On Hudson, NY), Alan Gara (Mount Kisco, NY), Philip Heidelberger (Cortlandt Manor, NY), Martin Ohmacht (Yorktown Heights, NY), Burkhard Steinmacher-Burow (Esslingen)
Application Number: 13/008,546
Classifications
Current U.S. Class: Dynamic Instruction Dependency Checking, Monitoring Or Conflict Resolution (712/216); 712/E09.028
International Classification: G06F 9/30 (20060101);