Mechanism for handling explicit writeback in a cache coherent multi-node architecture
A method and apparatus for a mechanism for handling explicit writeback in a cache coherent multi-node architecture is described. In one embodiment, the invention is a method. The method includes receiving a read request relating to a first line of data in a coherent memory system. The method further includes receiving a write request relating to the first line of data at about the same time as the read request is received. The method further includes detecting that the read request and the write request both relate to the first line. The method also includes determining which request of the read and write request should proceed first. Additionally, the method includes completing the request of the read and write request which should proceed first.
This application is a continuation of U.S. application Ser. No. 10/896,151 filed on Jul. 20, 2004, which is a continuation of U.S. application Ser. No. 09/823,791, filed on Mar. 31, 2001, entitled “Mechanism for Handling Explicit Writeback in a Cache Coherent Multi-Node Architecture.”
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to communications between integrated circuits and more specifically to data transfer and coherency in a multi-node or multi-processor system.
2. Description of the Related Art
Processors and caches have existed since shortly after the advent of the computer. However, the move to using multiple processors has posed new challenges. Previously, data existed in one place (memory for example) and might be copied into one other place (a cache for example). Keeping data coherent between the two possible locations for the data was a relatively simple problem. Utilizing multiple processors, multiple caches may exist, and each may have a copy of a piece of data. Alternatively, a single processor may have a copy of a piece of data which it needs to use exclusively.
If two copies of the data exist, or one copy exists aside from the original, some potential for a conflict in data exists in a multi-processor system. For example, a first processor with exclusive use of a piece of data may modify that data, and subsequently a second processor may request a copy of the piece of data from memory. If the first processor is about to write the piece of data back to memory when the second processor requests the piece of data, stale data may be read from memory, or corrupted data may be read from the write. The stale data results when the write should have completed before the read completed (but did not), thus allowing the read instruction to cause retrieval of the data prior to the update. The corrupted data may result when the read should have completed before the write completed (but did not), thus allowing the read instruction to cause retrieval of the updated data.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation in the accompanying figures.
A method and apparatus for a mechanism for handling explicit writeback in a cache coherent multi-node architecture is described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
A coherent data architecture should reduce conflicts between nodes within the architecture which need to read and write data at about the same time. For example, processor (or node) A may be reading a first data line for purposes of a calculation at the same time that processor B is writing the first data line. In some instances, these conflicts will resolve themselves, but letting the conflicts resolve themselves randomly might lead to a non-deterministic system. Therefore, it is preferable to resolve read-write conflicts such as these in a manner which is predictable.
Read-write conflicts may be resolved by sending reads and writes through some sort of controller or port, such as a scalability port. Within the port, addresses of reads and writes may be compared, such that conflicts may be detected. When a conflict is detected, a decision may be made as to whether to stall the read or the write. Such a decision may be made based on a variety of factors, depending on the design of the system, and may consider such things as when the requests were received by the port, the priority of the requests, the nature of the requests, and other considerations. Once a decision is made, one of the conflicting operations will complete, and then the other will complete. Since the decision making will be hardwired, any given situation will have a predictable result, and users of the system (such as system designers and programmers) may adapt their use to the predictable result.
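The address comparison and stall decision described above can be illustrated with a minimal Python sketch. The names Request, detect_conflict and arbitrate, the arrival and priority fields, and the particular ordering rule (higher priority first, then earlier arrival, then writes before reads) are assumptions chosen for illustration only; the description leaves the actual decision factors to the design of the system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Request:
    kind: str          # "read" or "write"
    address: int       # line address targeted by the request
    arrival: int       # hypothetical arrival time at the port
    priority: int = 0  # hypothetical priority; larger proceeds first


def detect_conflict(a: Request, b: Request) -> bool:
    """A conflict exists when a read and a write target the same line."""
    return a.address == b.address and {a.kind, b.kind} == {"read", "write"}


def arbitrate(a: Request, b: Request) -> Tuple[Request, Request]:
    """Return (first, stalled) using a fixed rule, so the outcome is predictable."""
    def key(r: Request):
        # Assumed ordering: priority, then arrival time, then writes before reads.
        return (-r.priority, r.arrival, 0 if r.kind == "write" else 1)

    first, stalled = sorted((a, b), key=key)
    return first, stalled
```

Because the rule is fixed, the same pair of requests yields the same ordering regardless of the order in which they are presented to arbitrate, which is the predictability property the paragraph above calls for.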
Processors typically have caches incorporated within or associated with them, such that a processor may be viewed as including a cache. In multi-processor systems, it is not uncommon to have caches associated with each processor which maintain data lines in one of four states, those states being exclusive, shared, modified, or invalid. Exclusive state is for data lines in use by that processor and locked or otherwise allowed for use by that processor only within the system. Shared state is for data lines which are in use by the processor but may be used by other processors. Modified state is for data lines in use by the processor which have a data value the processor has modified from its original value. Invalid state is for data lines which have been invalidated within the cache. Invalidation may occur when a processor writes a line to memory or when another processor takes a shared line for exclusive use, thus calling into question the validity of the data in the copy of the line the first processor has.
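As a small illustration of the four line states and the two invalidation cases just mentioned, consider the following Python sketch. The names LineState, after_writeback and after_remote_exclusive_claim are hypothetical, and the sketch covers only the transitions stated in the preceding paragraph; a complete protocol would define many more.

```python
from enum import Enum


class LineState(Enum):
    EXCLUSIVE = "E"
    SHARED = "S"
    MODIFIED = "M"
    INVALID = "I"


def after_writeback(state: LineState) -> LineState:
    """The writer's copy is treated as Invalid once the line is written to memory."""
    return LineState.INVALID


def after_remote_exclusive_claim(state: LineState) -> LineState:
    """Another processor takes the line for exclusive use.

    Only the shared-line case is described in the text; an already invalid
    line simply stays invalid. Other cases are outside this sketch.
    """
    if state in (LineState.SHARED, LineState.INVALID):
        return LineState.INVALID
    raise ValueError("transition not covered by this sketch")
```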
In one embodiment, incoming requests and outgoing requests are generated and responded to by devices outside the scalability port. Each request is routed through the appropriate node controller 405, such that incoming requests (to the port 430) are placed in the IRB 420 and outgoing requests (to the port 430) are placed in the ORB 425. Additionally, within the switch 450, each port 455 receives incoming and outgoing requests which are routed through the switch 460. These requests may be targeted at another node coupled to the switch 450, or may be targeted at a node coupled to another switch 450, in which case the request may either be routed to the appropriate node or ignored respectively. Determining whether the target of the request is coupled to the switch 450 is the function of the snoop filter and table 465, which may be expected to maintain information on what data (by address for example) is being utilized by the nodes coupled to the switch 450.
The scalability port may be utilized to minimize the problem of read-write conflicts, as described below. Note that the discussion of reads and writes focuses on reading and writing lines, which typically refer to lines of data such as those stored in a cache (either onboard or associated with a processor for example). It will be appreciated that lines of data may refer to various amounts of data, depending on how a system is implemented to transfer data.
It will be appreciated that a variety of methods may be used to determine which of the two processes of
The embodiment described in the following section is implemented using a specific protocol. It will be appreciated that such a protocol may be implemented in a variety of ways which will be apparent to one skilled in the art. Furthermore, it will be appreciated that variations on such a protocol may be implemented within the spirit and scope of the invention.
Coherent Request Types
In some embodiments, a particular protocol is implemented by the method or by the apparatus in question. In these embodiments, the coherent requests supported on the scalability port are listed in the following table. The table lists all the requests that are used by the coherence protocol, and those requests are then discussed in the following text. In the discussion in this section, a line indicates the length of a coherence unit.
The Port Read Line (PRLC, PRLD and PRC) requests are used to read a cache line. They are used to both read from memory and snoop the cache line in the caching agent(s) at the target node. The Port Read requests are always targeted to the coherence controller or the home node of a memory block. A node that is not the home of the block addressed by the transaction never receives a Port Read request. The code and data read and read current requests are different to facilitate different cache state transitions. The Port Read Current (PRC) request is used to fetch the most current copy of a cache line without changing the ownership of the cache line from the caching agent (typically used by an I/O node).
The Port Read and Invalidate Line (PRIL and PRILO) requests are used to fetch an exclusive copy of a memory block. They are used to both read from memory and snoop invalidate a cache line in the caching agent(s) at the node. The Port Read and Invalidate requests are always targeted to the coherence controller or the home node of a memory block. A node that is not home of the block addressed by the transaction never receives these requests. These two request types differ in their behavior when the memory block is found in the modified state at the snooped node. For a PRIL request, the data is supplied to the requesting node and the home memory is updated, whereas for a PRILO request, the data is supplied only to the source node and the home memory is not updated (the requesting node must cache the line in “M” state for PRILO).
The Port Invalidate Line (PIL) request is a special case of the PRIL request with zero length. This request is used by the requesting node to obtain exclusive ownership of a memory block already cached at the requesting node (for example when writing to a cache line in Shared state). Data can never be returned as a response to a PIL request on the scalability port. Due to concurrent invalidation requests, if the line is found modified at a remote caching node, then this condition must be detected either by the requesting node controller or the coherence controller and the PIL request must be converted to a PRIL request. The PIL request is always targeted to the coherence controller or the home node of the requested memory block. A node that is not home of the block addressed by the transaction never receives this request.
The Port Flush Cache Line (PFCL) request is a special case of the PIL request used to flush a memory block from all the caching agents in the system and update the home memory if the block is modified at a caching agent. The final state of all the nodes, including the requesting node, is Invalid and home memory has the latest data. This request is used to support the IA64 flush cache instruction. This request is always targeted to the coherence controller or the home node of the memory block. A node that is not home of the block addressed by the transaction never receives this request.
The Port Invalidate Line No Data (PILND) request is used by the requesting node to obtain exclusive ownership of a memory block without requesting data. The memory block may or may not be present at the requesting node. The memory block is invalidated in all other nodes in the system. If the line is modified at a remote caching node, then the home memory is updated but data is not returned to the requesting node. This request is intended to be used for efficient handling of full line writes which the requesting node does not intend to keep in its cache (for example I/O DMA writes). This request is always targeted to the coherence controller of the home node of the requested memory block. A node that is not home of the block addressed by the transaction never receives this request.
The Port Memory Write (PMWI_D, PMWE_D, PMWS_D) requests with Data are used to update the content of home memory and the state of the line in the coherence controller. Corresponding Port Memory Write (PMWI, PMWE, PMWS) requests without data are used to update the state of the line in the coherence controller. A PMW[I/E/S] request with or without data does not snoop the caching agent(s) at the node. These requests are very similar in nature except for the state of the line at the originating node. The PMWI request indicates that the memory block is no longer cached at the originating node, the PMWS request indicates that the line is in a shared state at the originating node and the PMWE request indicates that the line is in exclusive state at the originating node. The PMW[I/E/S] requests are always targeted to the coherence controller or the home node of the memory block.
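The naming scheme for the Port Memory Write requests can be summarized in a short Python sketch. The function name pmw_request and the single-letter state codes are assumptions made for illustration; only the six mnemonics themselves come from the description above.

```python
def pmw_request(state_at_origin: str, with_data: bool) -> str:
    """Return the Port Memory Write mnemonic for the given originating-node state.

    state_at_origin: "I" (no longer cached), "E" (exclusive) or "S" (shared).
    with_data: True when the request also carries data for home memory.
    """
    if state_at_origin not in ("I", "E", "S"):
        raise ValueError("state must be one of I, E or S")
    name = "PMW" + state_at_origin
    return name + "_D" if with_data else name
```

For example, a node that evicts a line and writes its data back would, under this sketch, issue pmw_request("I", True), which is PMWI_D.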
The Port Cache Line Replacement (PCLR, PCLRC) requests are used to indicate to the coherence controller that the node no longer has a copy of the memory block in the caching agents at that node. They are intended to be used only by the originating node of the transaction. These requests are always targeted to the coherence controller to facilitate better tracking of the cache state by the coherence controller. A node can generate a PCLR or PCLRC request only when the state of the cache line at the node changes from S or E to I; generation of these requests when the cache line state at a node is already I is not allowed, in order to avoid starvation or livelock on accesses from other nodes. A PCLR or PCLRC request could be dropped or processed by the receiving agent without affecting its final state. The protocol supports two versions of this request to facilitate implementation optimization depending on the type of network implemented. The PCLR request does not expect any response back from the receiving agent, and the requesting agent can stop tracking this request in its outbound queue as soon as it is sent on the scalability port. The PCLRC request expects a completion response back from the receiving agent and is tracked in the requesting agent until this response is received. An implementation should use the PCLRC request if it cannot guarantee sequential ordering between requests from the requesting node to the coherence controller over the network, in order to properly handle race conditions between this request and subsequent reads to the same line. If the implementation can guarantee sequential ordering between requests over the network between two nodes, it can use the PCLR request to save network bandwidth (no completion response) and to reduce buffer requirements in the outbound queue at the requesting node.
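The rule for when a replacement request may be generated, and which of the two versions to use, can be sketched in Python as follows. The function name replacement_request and the network_ordered flag are hypothetical names introduced here for illustration.

```python
from typing import Optional


def replacement_request(old_state: str, new_state: str,
                        network_ordered: bool) -> Optional[str]:
    """Pick a cache line replacement request, or None if none may be sent.

    A request is permitted only for a transition from S or E to I.
    network_ordered is True when the network guarantees sequential ordering
    between requests from this node to the coherence controller.
    """
    if new_state != "I" or old_state not in ("S", "E"):
        return None  # not a permitted transition (for example, already I)
    # Without an ordering guarantee, the completion-tracked version is used.
    return "PCLR" if network_ordered else "PCLRC"
```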
The Port Snoop (PSLC, PSLD and PSC) requests are used to initiate a snoop request at a caching node. The snoops caused by the code or data snoop request and the read current request are different to facilitate different cache state transitions. The Port Snoop requests could be targeted to any caching node. These requests do not have any effect on the home memory blocks; they only affect the state of a memory block in the caching agents at the target node.
The Port Snoop and Invalidate (PSIL, PSILO and PSILND) requests are used to snoop and invalidate a memory block at a caching node. These requests could be targeted to any caching node. These three request types differ in their behavior when the memory block is found in the modified state at the snooped node. For a PSIL request, the data is supplied to the source node and the home memory is updated. For a PSILO request, the data is supplied only to the source node; the home memory is not updated. For a PSILND request, only the home memory is updated; the data is not supplied to the requesting node.
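The difference among the three snoop-and-invalidate requests when the line is found modified can be captured as a small lookup table. The following Python sketch uses hypothetical names (MODIFIED_DATA_DESTINATIONS, destinations) and simply restates the behavior described above.

```python
# Where modified data goes for each snoop-and-invalidate request type.
MODIFIED_DATA_DESTINATIONS = {
    "PSIL":   {"source_node": True,  "home_memory": True},
    "PSILO":  {"source_node": True,  "home_memory": False},
    "PSILND": {"source_node": False, "home_memory": True},
}


def destinations(request: str) -> list:
    """List the destinations that receive the modified data for a request."""
    row = MODIFIED_DATA_DESTINATIONS[request]  # KeyError for unknown requests
    return sorted(name for name, receives in row.items() if receives)
```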
The Port Snoop Flush Cache Line (PSFCL) request is used to flush a memory block from all the caching agents and update the home memory if the block is modified at a caching agent. This request is used to support the IA64 flush cache instruction and to facilitate backward invalidates due to snoop filter evictions at the coherence controller. The PSFCL request could be targeted to any caching node.
The Port Memory Read (PMR) and Port Memory Read Speculative (PMRS) requests are used to read a home memory block. These requests are used to read memory and do not cause a snoop of caching agent(s) at the home node. They are always targeted to the home node of a memory block. The PMRS request is a speculative request whereas PMR is a non-speculative request. The Port Memory Read Speculative Cancel (PMRSX) request is used to cancel a speculative read request (PMRS) to a cache line. A PMRS request depends on a non-speculative request for the same cache line for confirmation. It is confirmed by a PMR, PRLC, PRLD, PRC, PRIL, or PRILO request for the same cache line. The confirmation request may or may not be due to the same transaction that caused the PMRS request. The PMRS request is cancelled by a PMW[I/E/S] or a PMRSX request for the same cache line. The cancellation request may or may not be due to the same transaction that caused the PMRS request. The PMRS request can be dropped by the responding agent without any functional issue.
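The confirmation and cancellation rules for a speculative read can be sketched in Python as shown below. The function name resolve_speculative_read and the representation of later requests as (mnemonic, line address) pairs are assumptions for illustration. Treating the with-data forms (such as PMWI_D) as cancelling requests is also an assumption of this sketch, since the text names only PMW[I/E/S].

```python
CONFIRMING = {"PMR", "PRLC", "PRLD", "PRC", "PRIL", "PRILO"}
CANCELLING = {"PMWI", "PMWE", "PMWS", "PMRSX"}


def resolve_speculative_read(pmrs_line: int, later_requests) -> str:
    """Return "confirmed", "cancelled" or "pending" for a PMRS request.

    later_requests is an iterable of (mnemonic, line_address) pairs seen
    after the PMRS, in order. Only requests to the same line matter.
    """
    for name, line in later_requests:
        if line != pmrs_line:
            continue
        # Assumption: a "_D" suffix does not change how the request is classified.
        base = name[:-2] if name.endswith("_D") else name
        if base in CONFIRMING:
            return "confirmed"
        if base in CANCELLING:
            return "cancelled"
    return "pending"
```

A result of "pending" is harmless under the protocol as described, because the responding agent may drop a PMRS request without any functional issue.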
Response Types for Coherent Requests
Response types for coherent request transactions on the scalability port are listed in Table 2. These responses are used under normal circumstances or could be combined with special circumstances with proper response status to indicate failed, unsupported or aborted requests.
The Port Snoop Result (PSNR) response is used to convey the result of a snoop back to the requesting node. The PSNR response indicates whether the line was found in Modified state and the final state of the line at the snooped agent. The state of the line could be Invalid (except for PRC or PSC) at the snooped caching agent(s) (PSNRI), Shared (except for PRC or PSC) at the snooped caching agent(s) (PSNRS), Modified transitioning to Invalid (except for PRC or PSC) at the snooped caching agent (PSNRM) or Modified transitioning to Shared at the snooped caching agent (PSNRMS). For a PRC or PSC transaction, if the cache line state at the node is E, S, or I then either a PSNRI or PSNRS response is allowed; if the cache line state is M then either a PSNRM or PSNRMS response is allowed.
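The mapping from snoop outcome to PSNR mnemonic can be sketched in Python as follows. The function names psnr_response and allowed_read_current_responses and the single-letter state codes are hypothetical; the mapping itself restates the paragraph above.

```python
def psnr_response(was_modified: bool, final_state: str) -> str:
    """Mnemonic for a snoop that is not part of a PRC or PSC transaction."""
    if final_state not in ("I", "S"):
        raise ValueError("final state must be I or S")
    if was_modified:
        return "PSNRM" if final_state == "I" else "PSNRMS"
    return "PSNRI" if final_state == "I" else "PSNRS"


def allowed_read_current_responses(line_state: str) -> set:
    """Responses permitted for a PRC or PSC transaction, by current line state."""
    if line_state == "M":
        return {"PSNRM", "PSNRMS"}
    if line_state in ("E", "S", "I"):
        return {"PSNRI", "PSNRS"}
    raise ValueError("unknown line state")
```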
The Port Completion (PCMP) response is used in determining the completion of a transaction under certain protocol conditions. This response can be generated only by the home node of the memory block or by the coherence controller for some transactions such as PRC, PSC, PRILO and PSILO.
The Port Retry (PRETRY) response is the protocol level retry response. The corresponding request is retried from the requesting node. This response is used to resolve conflict cases associated with multiple transactions to the same memory block. When the requesting agent receives the PRETRY response to a PMWx request, it retries the PMWx request if no conflict has been detected. If the requesting agent has already seen the conflict before it receives the PRETRY response, the PMWx request is converted into a response to the incoming request.
The Port Normal Data (PDATA) response is used to return the data requested by the corresponding read request. It does not have any other protocol level state information apart from the source node identifier and the transaction identifier of the request to associate it with the proper request.
The protocol also supports certain combined responses which could be used by the responding node to optimize use of bandwidth on the scalability port (SP). The PSNR[I/S/M/MS]_CMP response is the same as PSNR[I/S/M/MS]+PCMP, the PSNR[I/S/M/MS]_D response is the same as PSNR[I/S/M/MS]+PDATA, the PCMP_D response is the same as PCMP+PDATA and the PSNR[I/S/M/MS]_CMP_D response is the same as PSNR[I/S/M/MS]+PCMP+PDATA.
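The equivalences between combined and individual responses can be sketched as a small Python decoder. The function name expand_combined is hypothetical, and the sketch assumes that every combined mnemonic is written with underscore-separated suffixes, _CMP for a completion and _D for data.

```python
BASE_RESPONSES = {"PSNRI", "PSNRS", "PSNRM", "PSNRMS", "PCMP"}
SUFFIX_MEANING = {"CMP": "PCMP", "D": "PDATA"}


def expand_combined(response: str) -> list:
    """Expand a combined response mnemonic into its individual responses."""
    head, *suffixes = response.split("_")
    if head not in BASE_RESPONSES:
        raise ValueError("unknown base response: " + head)
    expanded = [head]
    for suffix in suffixes:
        if suffix not in SUFFIX_MEANING:
            raise ValueError("unknown suffix: " + suffix)
        expanded.append(SUFFIX_MEANING[suffix])
    return expanded
```

Under these assumptions, expand_combined("PSNRM_CMP_D") yields the three responses PSNRM, PCMP and PDATA, matching the last equivalence listed above.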
Alternative Scalability Port Implementations
The following section addresses some of the alternative scalability port implementations which may be utilized within the spirit and scope of the invention. It will be appreciated that these are exemplary in nature rather than limiting. Other alternative embodiments will be apparent to those skilled in the art.
In the foregoing detailed description, the method and apparatus of the present invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the present invention. In particular, the separate blocks of the various block diagrams represent functional blocks of methods or apparatuses and are not necessarily indicative of physical or logical separations or of an order of operation inherent in the spirit and scope of the present invention. For example, the various blocks of FIGS. 1 or 2 (among others) may be integrated into components, or may be subdivided into components. Similarly, the blocks of
Claims
1. An apparatus comprising:
- an incoming request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, an assigned priority, and a property comprising that the operation involves data that is for exclusive use, shared use, modified use, or is invalidated;
- an outgoing request buffer to store requests relating to read and write operations coupled to the incoming request buffer;
- bus logic configured to interface with a bus, the bus logic coupled to the incoming request buffer and the outgoing request buffer;
- a snoop pending table to contain entries related to cache lines coupled to the incoming request buffer and the outgoing request buffer;
- a snoop filter coupled to the snoop pending table;
- control logic to interface with and coupled to the incoming request buffer, the outgoing request buffer, and the bus logic, the control logic to compare addresses of requests of the incoming request buffer and outgoing request buffer and detect identical addresses among requests of the incoming request buffer and the outgoing request buffer, the control logic to stall a second request of the incoming request buffer and outgoing request buffer pending completion of a first request of the incoming request buffer and outgoing request buffer when the second request and the first request include identical addresses; and
- an arbitration device to determine which request should proceed first based on the property of the requests.
2. The apparatus of claim 1 wherein:
- the outgoing request buffer to receive read requests and write requests from a bus through the bus logic.
3. The apparatus of claim 2 wherein:
- the control logic to pass requests to the outgoing request buffer and incoming request buffer to read data from or write data to a cache associated with a processor.
4. The apparatus of claim 2 further comprising:
- a memory controller to interface with and control a memory, the memory controller coupled to the incoming request buffer, the outgoing request buffer, the bus logic, and the control logic; and wherein:
- the control logic to pass requests to the memory controller to read data from or write data to the memory.
5. The apparatus of claim 1 wherein:
- the arbitration device determines requests relating to a read operation should proceed first based on the property of the requests.
6. The apparatus of claim 1 wherein:
- the arbitration device determines requests relating to a write operation should proceed first based on the property of the requests.
7. A system comprising:
- a first processor;
- a second processor;
- a scalability port coupled through a bus to the first processor and coupled through the bus to the second processor, the scalability port including:
- an incoming request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, an assigned priority, and a property comprising that the operation involves data that is for exclusive use, shared use, modified use, or is invalidated;
- an outgoing request buffer to store requests relating to read and write operations, the requests including addresses to be read or written, coupled to the incoming request buffer;
- bus logic to interface with the bus, the bus logic coupled to the incoming request buffer and the outgoing request buffer;
- a snoop pending table to contain entries related to cache lines coupled to the incoming request buffer and the outgoing request buffer;
- a snoop filter coupled to the snoop pending table;
- control logic to interface with and coupled to the incoming request buffer, the outgoing request buffer, and the bus logic, the control logic to compare addresses of requests of the incoming request buffer and outgoing request buffer and detect identical addresses among requests of the incoming request buffer and the outgoing request buffer, the control logic to stall a second request of the incoming request buffer and the outgoing request buffer pending completion of a first request of the incoming request buffer and the outgoing request buffer when the second request and the first request include identical addresses; and
- an arbitration device to determine which request should proceed first depending on the property of the requests.
8. The system of claim 7 further comprising:
- a memory coupled to the scalability port; and wherein the scalability port further includes:
- a memory controller to interface with and control the memory, the memory controller coupled to the incoming request buffer, the outgoing request buffer, the bus logic, and the control logic; and wherein:
- the control logic to pass requests to the memory controller to read from or write data to the memory.
9. The system of claim 7 wherein:
- the outgoing request buffer and incoming request buffer to receive read requests and write requests from the bus through the bus logic, the read requests and write requests each individually originating from one of the first processor or the second processor.
10. The system of claim 7 wherein:
- the control logic to pass requests to the outgoing request buffer and to the incoming request buffer to write data to or read data from a cache associated with the first processor.
11. The system of claim 10 wherein:
- the control logic further to pass requests to the outgoing request buffer and to the incoming request buffer to write data to or read data from a cache associated with the second processor.
12. The system of claim 7 wherein:
- the arbitration device determines requests relating to a read operation should proceed first based on the property of the requests.
13. The system of claim 7 wherein:
- the arbitration device determines requests relating to a write operation should proceed first based on the property of the requests.
Type: Application
Filed: Dec 28, 2005
Publication Date: May 18, 2006
Inventors: Manoj Khare (Saratoga, CA), Lily Looi (Portland, OR), Akhilesh Kumar (Sunnyvale, CA)
Application Number: 11/321,632
International Classification: G06F 12/00 (20060101);