SELECTIVE DOWNSTREAM CACHE PROCESSING FOR DATA ACCESS
A first request is received to access a first set of data in a first cache. A likelihood that a second request to a second cache for the first set of data will be canceled is determined. Access to the first set of data is completed based on the determining the likelihood that the second request to the second cache for the first set of data will be canceled.
The present disclosure relates to computing systems that employ one or more caches. More particularly, the present disclosure relates to completing data requests based on selective downstream cache processing.
Cache memories in a computing system can improve processor, application, and/or computing system performance by storing data (e.g., a computer instruction, or an operand of a computer instruction) in a memory that has a lower access latency (time to read or write data) as compared to other memories, such as a main memory (e.g., primary RAM) or a non-volatile storage device (e.g., a disk). Cache memory can be included in a processor, and/or between a processor and another memory (e.g., another cache memory and/or a main memory) and can store a copy of data otherwise stored in a main memory. For example, processors can include a local, or “Level 1” (L1), cache, and computing systems can include additional caches, such as “level 2” (L2) and “level 3” (L3) caches, between a processor (or, a local cache of a processor) and another memory (e.g., a main memory).
SUMMARY
Various embodiments are directed to a computer-implemented method, a system, and a computer program product. In some embodiments, the computer-implemented method includes receiving a first request to access a first set of data in a first cache. A likelihood may be determined that a second request to a second cache for the first set of data will be canceled. Access to the first set of data may be completed based on the determining the likelihood that the second request to the second cache for the first set of data will be canceled.
In some embodiments, a system comprises a computing device that includes a processor and at least a first cache and a second cache. The computing device further includes a set predictor configured to predict whether there will be a cache hit or cache miss within the first cache for a first set of data of a first request. The computing device further includes a request buffer configured to at least delay a second request to the second cache for the first set of data when the set predictor module predicts the cache hit, wherein the buffer does not delay the second request when the set predictor module predicts the cache miss. The computing device further includes a directory configured to at least indicate an actual cache hit or miss.
In some embodiments, a computer program product comprises a computer readable storage medium having program instructions embodied therewith. The program instructions are readable or executable by a processor to perform a method. The method comprises receiving a first request to access a first set of data in a first memory. The method further comprises predicting whether there will likely be a hit or a miss in the first memory. The method also comprises initiating, in parallel with the predicting whether there will likely be the hit or the miss in the first memory, a determination of whether there is an actual hit or actual miss in the first memory. Moreover, the method comprises generating, based on the predicting, a first action to facilitate access of the first set of data, the generating of the first action occurring before completion of the determination of whether there is an actual hit or actual miss in the first memory.
The above summary is not intended to describe each illustrated embodiment or every implementation of the present disclosure.
The drawings included in the present application are incorporated into, and form part of, the specification. They illustrate embodiments of the present disclosure and, along with the description, serve to explain the principles of the disclosure. The drawings are only illustrative of certain embodiments and do not limit the disclosure.
While the invention is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
DETAILED DESCRIPTION
Aspects of the present disclosure relate to selective downstream cache processing for data access. While the present disclosure is not necessarily limited to such applications, various aspects of the disclosure may be appreciated through a discussion of various examples using this context.
A processor can determine if a copy of a cache line is included in a local cache, such as when the processor executes an instruction that references a memory location within a particular cache line. As used herein, “cache line” refers interchangeably to a location in a memory, and/or a cache, corresponding to a cache line of data, and data stored within that cache line, as will be clear from the context of the reference. If the cache line is stored (“cached”) within a local cache, the processor can use data from within the cached copy of the cache line. When a requested set of data or cache line is found in a particular cache, this is known as a cache “hit”. If there is no cached copy of the data or cache line in a particular cache, the processor can incur a “cache miss”. A cache “miss” in one level of memory (e.g., L1) can trigger a fetch request to another level of memory (e.g., L2). Accordingly, in response to the cache miss, the processor can fetch the cache line from the corresponding memory location, from another cache, and/or from another processor having a valid (e.g., an unmodified or, alternatively, most recently modified) copy of the cache line in a local cache.
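The hit/miss terminology above can be illustrated with a minimal sketch, assuming each cache level is modeled as a plain dictionary keyed by cache-line address. The names `CacheLevel` and `fetch_line` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CacheLevel:
    """Hypothetical model of one cache level (e.g., L1 or L2)."""
    name: str
    lines: Dict[int, bytes] = field(default_factory=dict)

    def lookup(self, address: int) -> Optional[bytes]:
        # A hit returns the cached line; a miss returns None.
        return self.lines.get(address)


def fetch_line(levels: List[CacheLevel], memory: Dict[int, bytes],
               address: int) -> Tuple[str, bytes]:
    """Walk the hierarchy; a miss at one level triggers a fetch to the next."""
    for level in levels:
        data = level.lookup(address)
        if data is not None:
            return level.name, data      # cache hit at this level
    return "memory", memory[address]     # every cache level missed


l1 = CacheLevel("L1", {0x100: b"a"})
l2 = CacheLevel("L2", {0x200: b"b"})
main_memory = {0x100: b"a", 0x200: b"b", 0x300: b"c"}
```

In this sketch each level is probed strictly in sequence, which is the baseline behavior whose latency the disclosure aims to reduce.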
Typical cache hit/miss processes may cause significant overhead. For example, final L2 cache hit/miss determinations (e.g., as looked up in a cache directory) may cause poor access latency performance such that it may take a long time to process a fetch request. Further, a fetch request to a higher level of cache after a lower level miss also takes a significant amount of time, particularly if it occurs after determining the actual hit/miss result. Moreover, some systems are prone to canceling many fetch requests, and each canceled request also takes a significant amount of time to process.
In embodiments, a processor core can be a component of a processor chip, and the chip can include multiple cores of the same or different type. Embodiments can include one or more processor chips in a processor module. As used herein, in addition to a “processor” including a local cache, “processor” further refers interchangeably to any of a thread, a core, a chip, a module, and/or any other configuration or combination thereof.
In embodiments, an instruction pipeline, such as pipeline 114, can enable a processor, such as 110, to execute multiple instructions, each in various stages of execution, concurrently. To illustrate, pipeline 114 can be an instance of an instruction pipeline such as example pipeline 150. Pipeline 150 comprises a plurality of instruction processing stages for a processor to execute multiple instructions, or portions of a single instruction, concurrently.
While example pipeline 150 is shown comprising 5 stages, each having four units, this is not intended to limit embodiments. Embodiments can include additional (or, fewer) stages, and/or stages within an execution pipeline can contain additional (or, fewer) units in each stage as compared to the example of
In embodiments, instructions under execution by a core can proceed sequentially through an instruction pipeline, such as 150. Fetch stage 160 can fetch multiple instructions for execution using fetch units F1-F4. For example, instructions fetched by fetch stage 160 can proceed to decode stage 162, for concurrent decode using decode units D1-D4. Decoded instructions can be issued for execution via issue units I1-I4 of issue stage 164. Issued instructions can proceed to execution stage 166, and execution units E1-E4 can perform particular execution actions of those issued instructions, such as performing Arithmetic Logic Unit (ALU) or other computation unit operations, and/or loading or storing memory operands of the instructions. Completion units C1-C4 of complete/reject stage 168 can complete, and/or flush or terminate/cancel, instructions from other stages of pipeline 150. In embodiments, a pipelined processor can process a plurality of instructions, or portions of instructions, concurrently by means of the stages and units of the stages comprising an instruction pipeline.
Embodiments can utilize non-pipelined processors (e.g., multi-cycle processors), and these processors can include a local cache. If an operand is not cached in a local cache, the processor can initiate cache miss processing. In such non-pipelined embodiments, cache miss processing can further include stopping or delaying execution of instructions using those operands, and/or instructions that may depend on the results of instructions using those operands.
Alternative embodiments can utilize pipelined processors, such as illustrated in
In embodiments, L1 cache can be an instance of a cache such as illustrated by example cache 120. Cache 120 comprises a request interface module 126 and memory 122. Memory 122 includes cache lines 124-1-124-4 (collectively, “lines 124”), which can, in embodiments, store copies of cache lines in use by core 110. In some embodiments, the request interface module 126 performs one or more operations for cache hit/miss control, as explained in more detail below. In some embodiments, the request interface module 126 performs one or more operations associated with the execution stage 166 and/or the complete/reject stage 168.
The cache 120 also includes directory 128. In embodiments, the directory 128 records the identities (e.g., a memory address, or subset or hash thereof) of cache lines stored in the cache 120. The cache directory 128 can include other information about cache lines 124, such as the most (or, alternatively, least) recent time a cache line was referenced, or the number of times it has been referenced. The directory 128 can include a status associated with each of the cache lines 124 stored in the cache 120. Such status can include, for example, whether the cache line has shared vs. exclusive status, whether the cache line is valid (e.g., contains an unmodified, or most recently modified copy), which processor is associated with the cache line (e.g., which core within a processor chip if, for example, a local cache is shared by multiple cores), and other attributes of the cache line and/or its usage in the cache.
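As a rough illustration of the kind of record such a directory might keep, the following sketch models one entry per cache line and a lookup that reports an actual hit or miss. The class and field names (`DirectoryEntry`, `owner_core`, `reference_count`, and so on) are assumptions made for this example, not names used by the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class LineState(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


@dataclass
class DirectoryEntry:
    """Hypothetical per-line record; field names are illustrative only."""
    tag: int                           # identity of the line (address or hash)
    state: LineState                   # shared vs. exclusive status
    valid: bool = True                 # whether the cached copy is usable
    owner_core: Optional[int] = None   # core associated with the line
    reference_count: int = 0           # number of times referenced


class Directory:
    def __init__(self) -> None:
        self._entries: Dict[int, DirectoryEntry] = {}

    def install(self, entry: DirectoryEntry) -> None:
        self._entries[entry.tag] = entry

    def lookup(self, tag: int) -> bool:
        """Return True for an actual hit, False for an actual miss."""
        entry = self._entries.get(tag)
        if entry is None or not entry.valid:
            return False
        entry.reference_count += 1     # track usage on every hit
        return True
```

A real directory would typically be indexed by set and way in hardware; the dictionary here only stands in for that structure.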
Execution units (e.g., E1) and/or other components (e.g., the request interface module 126) can determine, when using data in a cache line, whether the operands are stored in a local cache, such as L1 in E1. If it is determined that an operand is not cached in L1, the execution unit(s), the request interface module 126, and/or other components of core 110, can initiate cache miss processing. In embodiments, cache miss processing can further include stopping or delaying execution of instructions (or, portions of instructions) using those operands, and/or instructions that may depend on the results of instructions using those operands.
In some embodiments, a processor core, such as 110, can execute an instruction, or portions of an instruction, out of order and/or speculatively. Out of order execution can allow a processor to execute portions of an instruction or program as soon as an execution unit (e.g., a stage in a pipeline) is available, rather than delay execution to wait for completion of other portions of an instruction, or other instructions in a program. In this way, a processor can keep most or all of its execution units busy to improve computing throughput.
Speculative execution can allow a processor to execute an instruction, or a portion of an instruction, based on a likelihood that the processor will execute that instruction (or, portion thereof). For example, a processor can speculatively execute one or more instructions that follow a particular branch path in a program, prior to executing a conditional test that determines that path, based on a likelihood that the program will take that branch. In this way, a processor can utilize otherwise idle elements (e.g., stages of a pipeline) and can achieve higher computational throughput, in the event that the results of the speculatively-executed instruction (or portion thereof) can be used as other instructions (or, portions of an instruction) complete execution.
If the cancel probability generator 207 determines that there is a low probability that request cancel 205 will be sent (e.g., because there is a predicted miss in a current level of cache being analyzed), then another level of cache may automatically be queried (i.e., via the request_valid 209). The request_valid 209 fetch request may be automatically issued to the request resource allocation/request handling 215 regardless of whether the actual hit or miss data result is known (e.g., via a directory lookup). This automated process may occur because fetch requests may take a relatively long amount of time to process and when combined with the amount of time that it takes to determine the actual hit or miss information, it may delay the process even more. Accordingly, the request_valid 209 fetch request may be issued to the next level of cache or memory.
If the cancel probability generator 207 determines that there is a high likelihood that the request_cancel 205 will be sent (e.g., because there was a predicted hit in the current level of cache being analyzed), then the request to the request resource allocation 215 (i.e., the request_valid_late 211) may be delayed 213 (e.g., buffered, temporarily terminated, or discontinued, etc.). The request for the needed data within another level of cache may be delayed because of the high likelihood that a processor will locate the needed data in a current level of cache being analyzed. Accordingly, delaying 213 the request keeps the processing environment 200 from utilizing downstream resources (e.g., other levels of cache) if there is a high likelihood that they are not needed.
The request_cancel 205 request is generated by the request generation logic 201. The request_cancel 205 request is also delayed 213 so it can cancel the request_valid_late 211 request in time. For example, if the request_cancel 205 request was not buffered, and there was an unexpected actual hit/miss result, there may have already been an inaccurate cancel message transmitted to another cache level. Conversely, the delay 213 allows for an actual hit/miss lookup such that even if the result was unexpected, the delay action can be aborted in a buffer instead of re-generating a request_cancel 205 and/or reversing downstream actions already communicated to the request resource allocation 215. In embodiments, the request_cancel 205 is also transmitted directly to the request resource allocation 215 so that non-delayed requests (e.g., the request_valid 209), which were issued because there was a low probability that they would be canceled, can still be canceled.
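The interplay between the immediate request, the delayed request, and the cancel can be sketched as a small routing function. This is a simplified model under stated assumptions: the cancel probability is reduced to a boolean `predicted_hit`, the delay stage is a plain list, and the function name `route_request` is hypothetical.

```python
from typing import List


def route_request(predicted_hit: bool, actual_hit: bool) -> List[str]:
    """Return the messages that reach downstream resource allocation.

    predicted_hit stands in for a high cancel probability; actual_hit is
    the directory result, which becomes known only after the prediction.
    """
    downstream: List[str] = []
    if not predicted_hit:
        # Low cancel probability: issue immediately, before the
        # directory lookup has finished.
        downstream.append("request_valid")
        if actual_hit:
            # Misprediction: the data was local after all.
            downstream.append("request_cancel")
    else:
        # High cancel probability: hold the late request in the delay stage.
        delayed = ["request_valid_late"]
        if actual_hit:
            delayed.clear()            # canceled inside the buffer
        downstream.extend(delayed)     # released only on an actual miss
    return downstream
```

Note that in the predicted-hit, actual-hit case nothing reaches the downstream level at all, which is the resource saving the delay is meant to provide.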
At operation 301 an L2 cache request is generated (e.g., by the request generation logic 201 of
In some embodiments, if the set predictor 307 predicts that there will be a miss 309, then a “request_valid” request 315 (e.g., another fetch request for the same data) may be intercepted by the request arbiter 316. If the set predictor 307 predicts that there will be a hit 311, then a “request_valid_late” request 318 may be transferred to the request buffer 313.
The request buffer 313 is used to initially delay or pause the request_valid_late request 318 (e.g., another request to L3 cache for the same data) from going to arbitration by the request arbiter 316. The request buffer 313 (or the delay 213 of
The request arbiter 316 is able to hold multiple pending fetch requests and determines whether to take or choose a “request_valid” request 315 or a buffered request from the request buffer 313. Regardless of whether the buffered request or the “request_valid” request 315 is selected, the request arbiter 316 translates the request into an “L3 request_valid” request to L3 cache 322 (another fetch request for the same data). In some embodiments, such as in a more general L2 cache design, multiple requests from previous L1 cache levels (e.g., the request_valid request and/or the request_valid_late request) can be pending in the request arbiter 316. This may increase the number of pending fetch requests being held in the request arbiter 316 waiting for arbitration to the next cache level.
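A minimal sketch of such an arbiter follows. The disclosure does not state a selection policy, so first-in, first-out ordering is an assumption of this example, as are the class name `RequestArbiter` and its method names.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Request = Tuple[str, int]  # (kind, cache-line address)


class RequestArbiter:
    """Hypothetical arbiter; FIFO ordering is an assumption of this sketch."""

    def __init__(self) -> None:
        self._pending: Deque[Request] = deque()

    def submit(self, kind: str, address: int) -> None:
        # kind is "request_valid" (direct) or "request_valid_late" (buffered).
        self._pending.append((kind, address))

    def pending_count(self) -> int:
        return len(self._pending)

    def arbitrate(self) -> Optional[Request]:
        """Pick one pending request and translate it for the next level."""
        if not self._pending:
            return None
        _kind, address = self._pending.popleft()
        # Either source is translated into the same downstream request.
        return ("L3 request_valid", address)
```

A hardware arbiter might instead prioritize by age, request type, or resource availability; the queue only shows that both request kinds converge on one downstream request format.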
In some situations it might be beneficial to override the result of the set predictor 307. In some embodiments, for example, the force module/logic 312 can force the prediction result to always indicate a cache hit (regardless of the set predictor results), when it is determined that a cache line is to be promoted from shared status to exclusive status. This means the cache line is already in the cache (cache hit) and does not need to be fetched from the next cache level. The term “exclusive” refers to a processor (or core) that has exclusive rights, or “exclusivity”, to a particular cache line (i.e., the processor does not share access rights to a cache line with any other processor). In embodiments, a processor having exclusivity to a cache line can change the status of the cache line from “shared” to “exclusive”. In some embodiments, while a cache line has exclusive status, a controlling processor can modify, in a local cache, data within that cache line. In some embodiments, the force logic 312 can alternatively force a cache miss prediction (regardless of the set predictor results). For example, if the L2 cache reaches a task threshold or is otherwise busy, it may be desirable to force a cache miss at L2 in order to fetch the same data from the L3 cache, which may not be associated with as many tasks.
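The override described above can be sketched as two small functions: one that decides whether to force a result, and one that applies it to the raw prediction. The names `Force`, `choose_force`, and `effective_prediction` are hypothetical, and giving the shared-to-exclusive promotion precedence over the busy check is an assumption of this example.

```python
from enum import Enum


class Force(Enum):
    NONE = 0
    HIT = 1
    MISS = 2


def choose_force(promote_to_exclusive: bool, pending_tasks: int,
                 task_threshold: int) -> Force:
    """Decide whether to override the set predictor (illustrative policy)."""
    if promote_to_exclusive:
        # The line is already cached; only its status changes.
        return Force.HIT
    if pending_tasks >= task_threshold:
        # The cache is busy; steer the fetch to the next level.
        return Force.MISS
    return Force.NONE


def effective_prediction(predicted_hit: bool, force: Force) -> bool:
    """Apply the override, if any, to the raw prediction."""
    if force is Force.HIT:
        return True
    if force is Force.MISS:
        return False
    return predicted_hit
```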
At block 402, an Lx (e.g., L2 cache) request may be received. For example, a lower cache level, such as L1, may have incurred a cache miss for a first cache line. Consequently, the L1 may transmit and the L2 cache may receive a fetch request for the same first cache line. Per block 404, it may be determined what the Lx predictor result is. The Lx predictor result at block 404 may include a set predictor (e.g., the set predictor 307 of
The Lx predictor may process predictions in any suitable manner. For example, in some embodiments, a pair of cache lines includes a steering bit table (SBT) and a rehash bit that are utilized to render a prediction. In these embodiments, when fetching a cache line entry, the effective address is used to index into the actual cache. A prediction index is used to select a particular steering bit. The steering bits are accessed prior to the cache access. Each entry “steers” references to the appropriate cache block. A rehash bit is utilized to avoid examining another line when that line cannot contain the requested address. A rehash bit reduces the number of probes, which allows misses to be started earlier or reduces the time the cache is busy. Various types of prediction sources may be utilized, such as effective addresses (as described above), register contents and offset (e.g., using contents and offset to form a prediction address), register number and offset (combining register number and offset several cycles before cache access), and/or instruction and previous references (using address of the instruction issuing the reference and variants of the previous cache reference).
Per block 418, if it is predicted that there will be a cache miss, then the Lx request may be allowed to be transferred to arbitration (e.g., as processed by the request arbiter 316 of
Per block 406, if the Lx predictor result predicted a miss at block 404, then a request may be initiated to look up the actual hit/miss result in the Lx directory at block 406 (e.g., using the directory 128 of
Per block 408, if the actual directory lookup result at block 406 is a “hit,” then an Lx+1 cancel request (e.g., the request_cancel 305 of
Per block 410, if it is predicted at block 404 that there will be a cache hit, then an Lx+1 request may still be generated but buffered or temporarily paused (e.g., the delay 213 of
Per block 412, while the Lx+1 request is buffered, the Lx directory lookup result is determined (e.g., by the directory 128 of
At block 502, a first request may be received (e.g., from a processor and at a particular level of cache) to access (e.g., fetch) a first set of data in a first cache. For example, the request generation logic 201 of
Per block 504, a cancel probability that a second request to a second cache for the first set of data will be canceled may be generated (e.g., by the cancel probability generator 207 of
Per block 506, based on the probability at block 504, it may be determined whether the second request to the second cache is likely to be canceled. If the second request is not likely to be canceled, this means that the second request will likely need to be transmitted to another cache/memory to retrieve the first set of data. Per block 508, if it is determined that the second request is not likely to be canceled, then the second request may be transmitted (e.g., by the request generation logic 201 of
Per block 510, it may be determined (e.g., via the cancel probability generator 207) whether to continue processing the second request. For example, it may be determined whether there is an actual cache miss in the first cache and whether the first set of data is exclusive to another processor. Continuing with this example, if there is an actual cache miss at the current level of cache being analyzed and the first set of data is not exclusive, then per block 520, the second request at the second cache may continue to be executed, as the request was initiated at block 508. Continuing with this example, if there is an actual cache hit (inconsistent with the likelihood result at block 506), then per block 512 the second request may be canceled such that the first cache may transmit a third request to the second cache to cancel the second request. And because there is a cache hit at the first cache, per block 522, the first request at the first cache may be executed or completed.
Per block 514, if it is determined at block 506 that the second request to the second cache for the first set of data will likely be canceled, then the second request may be delayed (e.g., the delay 213 of
Per block 516, it may be determined whether to proceed with the cancellation projected at block 506. For example, the block 516 determination may be based on whether there is an actual cache hit in the first cache and the read/write status of the first set of data. Per block 512, if the request should actually be canceled (e.g., if it is determined that there is an actual cache hit in the first cache and/or the read/write status doesn't match the second request), then the second request is canceled such that it is not transmitted to the second cache. For example, the second request may be buffered as part of the delay at block 514. In response to the determining of an actual cache hit at block 516, the second request in the buffer may be deleted or cleared. At block 522, the first request may be executed at the first cache.
Per block 518, if it is determined that the second request should not be canceled, then the second request may be transmitted to the second cache in order to access the first set of data in the second cache. Accordingly, at block 520 the second request at the second cache may be executed. In some situations, the first set of data will also not be located in the second cache, but rather in a third or nth level of cache or memory. In these cases, a process similar to the process 500 may occur with respect to the third or nth level of cache.
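The branching of process 500 can be summarized as a trace of the block numbers visited for a given cancel probability and actual directory result. This is a sketch under stated assumptions: the numeric `threshold` of 0.5 is hypothetical (the disclosure does not give one), and the exclusivity and read/write status checks at blocks 510 and 516 are omitted for brevity.

```python
from typing import List


def process_500(cancel_probability: float, actual_hit: bool,
                threshold: float = 0.5) -> List[int]:
    """Return the sequence of block numbers visited (threshold is assumed)."""
    trace = [502, 504, 506]
    if cancel_probability < threshold:
        # Not likely to be canceled: transmit now, then verify.
        trace += [508, 510]
        trace += [512, 522] if actual_hit else [520]
    else:
        # Likely to be canceled: delay, then verify.
        trace += [514, 516]
        trace += [512, 522] if actual_hit else [518, 520]
    return trace
```

For example, a low cancel probability followed by an actual miss never visits the cancel block, while a high cancel probability followed by an actual miss passes through the delay before the request is transmitted at block 518.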
At block 602, a first request (e.g., a fetch request) to access a first set of data in a first memory (e.g., L1 cache) may be received. Per block 604 it may be predicted (e.g., by the set predict logic 307 of
Per block 606, actual hit/miss processing may be initiated (e.g., beginning a search for the first set of data in a directory). In some computing systems, completing actual hit/miss processing takes a relatively long quantity of time compared to block 608. For example, searching in a cache directory and locating/not locating the first set of data may take twice as long as block 608. Per block 608, a second request may be generated (e.g., by the request generation logic 201 of
Per block 610, the actual hit/miss processing may complete after it has been initiated at block 606 and after the transmission of the second request at block 608. For example, the completion may occur when the first set of data is located in a directory in the first cache or each entry in the directory was searched without locating the first set of data. Per block 612, it may be determined (e.g., via a directory) whether there was an actual miss (and/or hit). If there was an actual miss, then the process 600 may stop. Per block 614, if there was not an actual miss (e.g., there was a hit), the second request may be canceled since the first set of data was located in the first memory. Accordingly, any downstream processing by the second memory should be aborted.
Per block 616, if it was predicted at block 604 that there would be a hit then a second request may be generated and buffered (e.g., in the request buffer 313 of
Per block 618, it is determined (e.g., via a directory) whether there is an actual hit (and/or miss) in the first memory. If there is not a hit (e.g., there is a miss), then per block 620 the second request may be resumed and transferred from the buffer to the second memory in order to access the first set of data. In some embodiments, the process 600 may then repeat for other levels of memory in order to access the first set of data.
Per block 622, if there was an actual hit for the first set of data at the first memory, the second request is cleared from the buffer. The second request is cleared or deleted from the buffer because with an actual hit in the first memory there is no need to transmit the access request for the first set of data in the second memory. Per block 624, in response to the hit, the first set of data is accessed from the first memory to complete the first request.
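The timing benefit of process 600 can be sketched with a simple cycle model. The cycle counts are illustrative inputs only; the defaults of 1 and 2 mirror the example in which the directory search takes twice as long as generating the second request. The names `Outcome` and `process_600_timing` are hypothetical, and the model assumes the prediction, request generation, and directory lookup all start at cycle 0.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    sent_at: Optional[int]   # cycle the second request left, if it ever did
    canceled: bool           # True if it was canceled or cleared


def process_600_timing(predicted_hit: bool, actual_hit: bool,
                       generate_cycles: int = 1,
                       lookup_cycles: int = 2) -> Outcome:
    """Model when the second request leaves for the second memory."""
    if not predicted_hit:
        # Blocks 608-614: sent as soon as it is generated; an actual hit
        # found later causes the request to be canceled downstream.
        return Outcome(sent_at=generate_cycles, canceled=actual_hit)
    if actual_hit:
        # Blocks 616-622: cleared from the buffer, never transmitted.
        return Outcome(sent_at=None, canceled=True)
    # Blocks 618-620: released once the directory reports a miss.
    return Outcome(sent_at=max(generate_cycles, lookup_cycles),
                   canceled=False)
```

With these defaults, a correctly predicted miss sends the second request at cycle 1 rather than waiting until cycle 2 for the directory result, and a correctly predicted hit sends nothing downstream.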
As shown in
Likewise, in some embodiments, L2 30 is a cache similar to cache 120 of
As previously described, a memory and/or a cache can be organized as cache lines of a particular size. For example, MEMORY 40 can be organized as cache lines, and the cache lines can be, for example, 128 bytes in size. In embodiments, a processor (e.g., core 212-3) can include a cache, such as a local cache, and store a copy of data stored in a cache line of a memory, in the L1 cache. For example, MEMORY 40 includes cache line 46, which further contains data at locations 42 and 44. In embodiments, location 42 and/or 44 can be a location, in memory 40, of any unit of data ranging from a minimum size unit of data used by a processor (e.g., one byte) up to and including the amount of data comprising cache line 46 (e.g., 128 bytes).
In the example of
The example of
In other embodiments, CONNECT 22 and/or interconnects 14, 16, and 18 can comprise a combination of buses, links, and/or switches. For example, while not shown, it would be apparent to one of ordinary skill in the art that cores of a processor chip, such as 12-1 through 12-4, can interconnect amongst each other internal to CHIP 10-1 (such as by means of buses, links, and/or switches) and that interconnect 14 can be a single connection between CHIP 10-1 and CONNECT 22. It would be further apparent to one of ordinary skill in the art that CONNECT 22, and the manner of connecting processor cores, chips, modules and/or caches and memories, can comprise a variety of types, combinations, and/or arrangements of interconnection mechanisms such as are known in the art, such as buses, links, and/or switches, and that these can be arranged as centralized, distributed, cascaded, and/or nested elements.
An SMP network, and/or component thereof, can control and/or maintain status of cache lines amongst the plurality of caches. To illustrate, in the example of
Embodiments can implement cache line request/response functions within a centralized unit, such as illustrated by CACHE REQ-RSP 24 in
As used herein, “SMP network” refers interchangeably to an SMP network as a whole (e.g., NETWORK 220) and components of the SMP network (e.g., CACHE REQ-RSP 24), processors (e.g., chips 10 and/or cores 12), and/or caches (e.g., local caches of cores 12 and/or L2 30) used in performing functions associated with cache line requests and responses. Continuing the example of
In embodiments, a processor can operate on data for one or multiple instructions using the cached copy of a memory cache line. For example, with reference to
As previously described, in embodiments, under some circumstances (e.g., when a cache line has shared status), multiple processors in a computing system can cache a copy of a cache line in a respective local cache of the processors. In processing a cache line fetch request, the request can be satisfied by providing a copy of the cache line from one of the processors having a copy. For example, CORE 12-1 can request a copy of cache line 46 and, if a local cache of another core among cores 12 has a valid copy of the cache line, a copy of cache line 46 can be transferred from the local cache of that core to CORE 12-1 to satisfy the fetch request. However, if another core does not have a valid copy of cache line 46, but L2 30 has a valid copy, a copy of cache line 46 can be transferred from L2 30 to CORE 12-1 to satisfy the fetch request. If no caches in the computing system have a valid copy of cache line 46, a copy of cache line 46 can be transferred from MEMORY 40 to CORE 12-1 to satisfy the fetch request.
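The order in which a fetch request is satisfied in this example can be sketched as a simple priority search. The function name `select_source` and the representation of each cache as a set of addresses holding valid copies are assumptions made for illustration.

```python
from typing import Dict, Set


def select_source(requesting_core: str, address: int,
                  core_caches: Dict[str, Set[int]],
                  l2_lines: Set[int]) -> str:
    """Pick where a valid copy of the line comes from, in priority order."""
    for core, lines in core_caches.items():
        if core != requesting_core and address in lines:
            return core        # another core's local cache has a valid copy
    if address in l2_lines:
        return "L2"            # no peer copy, but L2 has one
    return "MEMORY"            # no cache has a valid copy
```

For instance, with `core_caches = {"12-1": set(), "12-2": {46}}`, a request from core "12-1" for line 46 is satisfied from core "12-2".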
From the example of
Transfer latency (time required to receive a cache line following a fetch request) can increase based on which element (e.g., a particular cache or a memory) provides a copy of a cache line to satisfy a fetch request. For example, transferring a cache line from a core within a different chip, or from another cache not local to a processor, can have a much higher latency in comparison to transferring a cache line from a core within the same chip, or from a cache closer (having fewer interconnections) to a requesting processor. High transfer latency can cause a processor to wait longer to perform an operation, or to complete an instruction, that uses data within that cache line, and in turn this can reduce processor performance. For example, fetching data not included in a local cache of a processor can correspond to many hundreds or thousands of processor execution cycles. Accordingly, it can be advantageous to processor and/or overall computing system performance to reduce cache line fetches associated with multiple processors using a cache line.
It is to be understood that although the computer 01 of
Interface 05 can be configured to enable human input, or to couple computer 53 to other input devices, such as described later in regard to components of computer 53. It would be apparent to one of ordinary skill in the art that the interface can be any of a variety of interface types or mechanisms suitable for a computer, or a program operating in a computer, to receive or otherwise access a source netlist.
Processors included in computer 53 are connected by a memory interface 15 to memory 17. In embodiments a “memory” can be a cache memory, a main memory, a flash memory, or a combination of these or other varieties of electronic devices capable of storing information and, optionally, making the information, or locations storing the information within the memory, accessible to a processor. A memory can be formed of a single electronic (or, in some embodiments, other technologies such as optical) module or can be formed of a plurality of memory modules. A memory, or a memory module (e.g., an electronic packaging of a portion of a memory), can be, for example, one or more silicon dies or chips, or can be a multi-chip module package. Embodiments can organize a memory as a sequence of bytes, words (e.g., a plurality of contiguous or consecutive bytes), or pages (e.g., a plurality of contiguous or consecutive bytes or words).
In embodiments, the computer 53 can include a plurality of memories. A memory interface, such as 15, between a processor (or, processors) and a memory (or, memories) can be, for example, a memory bus common to one or more processors and one or more memories. In some embodiments, a memory interface, such as 15, between a processor and a memory can be a point-to-point connection between the processor and the memory, and each processor in the computer can have a point-to-point connection to each of one or more of the memories. In other embodiments, a processor (for example, 13-1) can be connected to a memory (e.g., memory 17) by means of a connection (not shown) to another processor (e.g., 13-2) connected to the memory (e.g., 17 from processor 13-2 to memory 17).
The computer 53 includes an IO bridge 25, which can be connected to a memory interface or (not shown) to a processor, for example. In some embodiments, an IO bridge can be a component of a processor or a memory. An IO bridge can interface the processors and/or memories of the computer (or, other devices) to IO devices connected to the bridge. For example, computer 53 includes IO bridge 25 interfacing memory interface 15 to IO devices, such as IO device 27. In some embodiments, an IO bridge can connect directly to a processor or a memory, or can be a component included in a processor or a memory. An IO bridge can be, for example, a PCI-Express or other IO bus bridge, or can be an IO adapter.
An IO bridge can connect to IO devices by means of an IO interface, or IO bus, such as IO bus 31 of computer 53. For example, IO bus 31 can be a PCI-Express or other IO bus. IO devices can be any of a variety of peripheral IO devices or IO adapters connecting to peripheral IO devices. For example, IO device 29 can be a graphics card, a keyboard or other input device, a hard drive or other storage device, a network interface card, etc. IO device 29 can be an IO adapter, such as a PCI-Express adapter, that connects components (e.g., processors or memories) of a computer to IO devices (e.g., disk drives, Ethernet networks, video displays, keyboards, mice, etc.).
A computer can include instructions executable by one or more of the processors (or, processing elements, such as threads of a processor). The instructions can be a component of one or more programs. The programs, or the instructions, can be stored in, and/or utilize, one or more memories of a computer. As illustrated in the example of
Programs can be “stand-alone” programs that execute on processors and use memory within the computer directly, without requiring another program to control their execution or their use of resources of the computer. For example, computer 53 includes stand-alone program 11-2. A stand-alone program can perform particular functions within the computer, such as controlling, or interfacing (e.g., providing access by other programs to), an IO interface or IO device. A stand-alone program can, for example, manage the operation of, or access to, a memory. A Basic Input/Output System (BIOS), or a computer boot program (e.g., a program that can load and initiate execution of other programs) can be a stand-alone program.
A computer can include one or more operating systems, and an operating system can control the execution of other programs such as, for example, to start or stop a program, or to manage resources of the computer used by a program. For example, computer 53 includes operating systems (OSs) 07-1 and 07-2, each of which can include, or manage execution of, one or more programs, such as OS 07-2 including (or, managing) program 11-1. In some embodiments, an operating system can function as a hypervisor.
A program can be embodied as firmware (e.g., BIOS in a desktop computer, or a hypervisor) and the firmware can execute on one or more processors and, optionally, can use memory, included in the computer. Firmware can be stored in a memory (e.g., a flash memory) of the computer. For example, computer 53 includes firmware 19 stored in memory 17. In other embodiments, firmware can be embodied as instructions (e.g., comprising a computer program product) on a storage medium (e.g., a CD ROM, a flash memory, or a disk drive), and the computer can access the instructions from the storage medium.
The example computer system 800 and computer 53 are not intended to limit embodiments. In embodiments, computer system 800 can include a plurality of processors, interfaces, and inputs, and can include other elements or components, such as networks, network routers or gateways, storage systems, server computers, virtual computers or virtual computing and/or IO devices, cloud-computing environments, and so forth. It would be evident to one of ordinary skill in the art to include a variety of computing devices interconnected in a variety of manners in a computer system embodying aspects and features of the disclosure.
In embodiments, computer 53 can be, for example, a computing device having a processor capable of executing computing instructions and, optionally, a memory in communication with the processor. For example, computer 53 can be a desktop or laptop computer; a tablet computer, mobile computing device, or cellular phone; or, a server computer, a high-performance computer, or a super computer. Computer 53 can be, for example, a computing device incorporated into a wearable apparatus (e.g., an article of clothing, a wristwatch, or eyeglasses), an appliance (e.g., a refrigerator, or a lighting control), a vehicle and/or traffic monitoring device, a mechanical device, or (for example) a motorized vehicle. It would be apparent to one of ordinary skill in the art that a computer embodying aspects and features of the disclosure can be any of a variety of computing devices having processors and, optionally, memories and/or programs.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems and/or methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Described below are particular definitions specific to the present disclosure:
“And/or” is the inclusive disjunction, also known as the logical disjunction and commonly known as the “inclusive or.” For example, the phrase “A, B, and/or C,” means that at least one of A or B or C is true; and “A, B, and/or C” is only false if each of A and B and C is false.
A “set of” items means there exists one or more items; there must exist at least one item, but there can also be two, three, or more items. A “subset of” items means there exists one or more items within a grouping of items that contain a common characteristic.
The terms “receive,” “provide,” “send,” “input,” “output,” and “report” should not be taken to indicate or imply, unless otherwise explicitly specified: (i) any particular degree of directness with respect to the relationship between an object and a subject; and/or (ii) a presence or absence of a set of intermediate components, intermediate actions, and/or things interposed between an object and a subject.
A “module” is any set of hardware, firmware, and/or software that operatively works to do a function, without regard to whether the module is: (i) in a single local proximity; (ii) distributed over a wide area; (iii) in a single proximity within a larger piece of software code; (iv) located within a single piece of software code; (v) located in a single storage device, memory, or medium; (vi) mechanically connected; (vii) electrically connected; and/or (viii) connected in data communication. A “sub-module” is a “module” within a “module.”
The terms first (e.g., first cache), second (e.g., second cache), etc. are not to be construed as denoting or implying order or time sequences. Rather, they are to be construed as distinguishing two or more elements. In some embodiments, the two or more elements, although distinguishable, have the same makeup. For example, a first memory and a second memory may indeed be two separate memories but they both may be RAM devices that have the same storage capacity (e.g., 4 GB). Moreover, a “first cache” and a “second cache,” etc. are not to be construed as particular levels of cache (e.g., L1 and L2), but are to be construed as different caches in general.
As used herein, “processor” refers to any form and/or arrangement of a computing device using, or capable of using, data stored in a cache, including, for example, pipelined and/or multi-cycle processors, graphical processing units (GPUs), and/or neural networks. Also, as used herein, “computing system” refers to a computing system that employs processors utilizing data stored in one or more caches. However, this is not intended to limit embodiments, and it would be appreciated by one of ordinary skill in the art that embodiments can employ other varieties and/or architectures of processors within the scope of the disclosure.
Claims
1. A computer-implemented method comprising:
- detecting that a data fetch request to a level 1 (L1) cache line resulted in an L1 cache miss;
- generating, by logic in the L1 cache and in response to detecting the L1 cache miss, a first request to access a first set of data in a level two (L2) cache;
- transmitting the first request to the L2 cache;
- predicting, by logic in the L2 cache, a location of the first set of data in the L2 cache;
- predicting that there will be a cache miss in the L2 cache for the first request;
- generating, in response to predicting that there will be a cache miss in the L2 cache for the first request, a second request to access the first set of data in a level 3 (L3) cache;
- generating a cancel probability score based on at least one of the workload of a cache, whether a cache line is read-only or write-only, and the exclusive or shared status of a processor;
- determining that the second request to the L3 cache for the first set of data will likely not be canceled based on the cancel probability score;
- transmitting, prior to a directory lookup indicating an actual L2 cache hit or actual L2 cache miss and in response to determining that the second request to the L3 cache for the first set of data will likely not be canceled, the second request to the L3 cache;
- performing a directory lookup for the first set of data in the L2 cache;
- determining, via the directory lookup, that there is an actual cache hit in the L2 cache for the first request;
- transmitting, in response to the determining that there is the actual cache hit in the L2 cache for the first request, a third request to the L3 cache to cancel the second request;
- clearing, by the L3 cache and in response to receiving the third request, the second request from a request buffer for the L3 cache;
- retrieving the first set of data from the L2 cache;
- receiving a fourth request for a second set of data in the L2 cache;
- predicting, by logic in the L2 cache, a location of the second set of data in the L2 cache;
- predicting that there will be a cache hit in the L2 cache for the fourth request;
- transmitting, in response to predicting that there will be a cache hit in the L2 cache for the fourth request, the fourth request to a request buffer for the L2 cache;
- determining, via a second directory lookup, that there is an actual L2 cache miss for the second set of data;
- sending, in response to determining that a directory indicates the actual L2 cache miss for the second set of data, the fourth request from the request buffer to a request arbiter for the L2 cache;
- transmitting the fourth request from the request arbiter to the L3 cache; and
- retrieving the second set of data from the L3 cache.
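As an informal illustration only, and not as part of the claim, the control flow recited in claim 1 can be sketched in Python. The names `L3Cache`, `l2_access`, and `threshold`, the event strings, and the choice to pass the hit prediction and the cancel probability score in as plain arguments are all assumptions made for this sketch. How the score is derived from cache workload, read-only or write-only status, or shared or exclusive status is deliberately left out, and the L2 request buffer and request arbiter are collapsed into a single step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class L3Cache:
    """Hypothetical L3 model with a request buffer."""
    data: Dict[int, str]
    request_buffer: List[int] = field(default_factory=list)

    def enqueue(self, address: int) -> None:
        self.request_buffer.append(address)

    def cancel(self, address: int) -> None:
        # Clear a pending request from the buffer, if one is present.
        if address in self.request_buffer:
            self.request_buffer.remove(address)

    def service(self, address: int) -> str:
        # Complete a pending request and return the data.
        self.request_buffer.remove(address)
        return self.data[address]


def l2_access(
    address: int,
    predicted_hit: bool,
    cancel_score: float,
    l2_lines: Dict[int, str],
    l3: L3Cache,
    threshold: float = 0.5,  # assumed cut-off for "likely to be canceled"
) -> Tuple[str, List[str]]:
    """Return (data, ordered list of events) for one request to the L2."""
    events: List[str] = []
    # Send the L3 request before the directory lookup only when an L2 miss
    # is predicted and the request is unlikely to be canceled afterwards.
    speculative = (not predicted_hit) and cancel_score < threshold
    if speculative:
        l3.enqueue(address)
        events.append("speculative L3 request")

    if address in l2_lines:  # directory lookup reports an actual L2 hit
        if speculative:
            l3.cancel(address)
            events.append("L3 request cancelled")
        events.append("data from L2")
        return l2_lines[address], events

    # Actual L2 miss: forward the request now unless it was already sent.
    if not speculative:
        l3.enqueue(address)
        events.append("L3 request after directory miss")
    events.append("data from L3")
    return l3.service(address), events
```

Calling `l2_access` with a predicted miss, a low score, and an address that is actually present in `l2_lines` exercises the first half of the claim (speculative request, then cancellation); calling it with a predicted hit and an address that is absent exercises the second half (request forwarded to the L3 only after the directory lookup).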
Type: Application
Filed: Mar 19, 2019
Publication Date: Jul 11, 2019
Inventors: Willm Hinrichs (Holzgerlingen), Markus Kaltenbach (Holzgerlingen), Eyal Naor (Tel Aviv), Martin Recktenwald (Schoenaich)
Application Number: 16/358,438