Dynamic fill policy for a shared cache
Technologies are provided in embodiments to dynamically fill a shared cache. At least some embodiments include determining that data requested in a first request for the data by a first processing device is not stored in a cache shared by the first processing device and a second processing device, where a dynamic fill policy is applicable to the first request. Embodiments further include determining to deallocate, based at least in part on a threshold, an entry in a buffer, the entry containing information corresponding to the first request for the data. Embodiments also include sending a second request for the data to a system memory, and sending the data from the system memory to the first processing device. In more specific embodiments, the data from the system memory is not written to the cache based, at least in part, on the determination to deallocate the entry.
This disclosure relates in general to the field of computing architectures, and more particularly, to a dynamic fill policy for a shared cache in a computing architecture.
BACKGROUND
Computing architectures that integrate multiple diverse on-chip processing devices are becoming a dominant computing platform for many types of applications. A system that integrates more than one type of processor or core generally also includes certain memory that is shared between the processors or cores. For example, a last level cache (LLC) may be shared between multiple on-chip processing devices such as a central processing unit (CPU) and a graphics processing unit (GPU). An LLC is a critical resource because it can impact system performance. Designing a system with multiple diverse on-chip processing devices sharing a memory resource, however, can be complex due to conflicting requirements of the devices. For example, a common requirement in computing platforms to maximize resource utilization may be difficult to achieve when trying to minimize shared resource conflicts between a CPU and GPU. Thus, computer architectures that integrate multiple diverse on-chip processing devices could benefit from new solutions that manage conflicting requirements and characteristics of diverse on-chip processing devices.
To provide a more complete understanding of the present disclosure and features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying figures, where like reference numerals represent like parts, in which:
The following disclosure provides various possible embodiments, or examples, for implementing features disclosed in this specification. These features are related to a computing system in which a dynamic fill policy (DFP) is used to manage a shared cache. A dynamic fill policy includes logic that is invoked when requests for data from certain processing devices in a multi-processor architecture are received. The logic is to dynamically determine whether to fill in a shared cache with data from system memory to satisfy future requests for the data, or instead, to bypass the shared cache and provide the data from the system memory directly to the requesting processing device. In an example, a dynamic fill policy may be used in an architecture in which one processing device has a different shared cache sensitivity related to data latency and/or bandwidth than at least one other processing device in the same architecture.
In at least one embodiment, a determination of whether to fill in a shared cache with data from system memory or to bypass the shared cache can be based on a threshold associated with a last level cache (LLC) request buffer (LRB), which holds outstanding requests for data. The threshold can be used as a basis for determining whether the LRB is too full, in which case the LRB entry that corresponds to the relevant request for data can be deallocated and the shared cache is bypassed and therefore not filled with the data that satisfies that request. Rather, the requested data may be provided directly to the requesting processing device from system memory.
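By way of illustration only, the following C++ sketch captures the occupancy check described above; the function and parameter names (should_bypass_llc, free_threshold) are chosen for this example and are not taken from the disclosure.

    #include <cstddef>

    // Illustrative occupancy check: the LRB is treated as "too full" when the
    // number of free entries drops below a configured threshold. In that case
    // the shared cache is bypassed for the current request.
    bool should_bypass_llc(std::size_t lrb_free_entries, std::size_t free_threshold) {
        return lrb_free_entries < free_threshold;
    }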
For purposes of illustrating certain example techniques of a computing system for dynamically filling a last level cache, it is important to understand the activities that may be occurring in such systems with multiple diverse on-chip processing devices. The following foundational information may be viewed as a basis from which the present disclosure may be properly explained.
In typical computer architectures, different types of memory can be used to store data that is accessed by a processor. System memory (also referred to as ‘main memory’) includes a memory element that typically holds current programs and data that are being used by a processor. Dynamic random access memory (DRAM) is often used for system memory in modern computer architectures due to its speed and low cost. A cache is a memory element that may be used to store data from other memory so that future requests for that data can be served more quickly. The other memory can include, for example, memory-mapped input/output devices (MMIO), hard disks, a basic input/output system (BIOS) read only memory (ROM), and/or random access memory (e.g., DRAM, static RAM, etc.). Some caches, such as level 1 (L1) and level 2 (L2) caches can be described as processor side caches, which are provisioned on or close to a processor. Other caches may be referred to as memory side caches, which are provisioned closer to main memory. For example, one or more processor side caches are typically provisioned for each CPU (e.g., L1 cache, L2 cache, etc.) and each GPU (e.g., texture cache, L2 cache, etc.). A memory side cache, also referred to as ‘last level cache’ or ‘LLC’, is also typically provisioned in a computing architecture near the system memory and typically holds more data than the processor side caches.
In computing architectures involving multiple on-chip processing devices (e.g., CPUs, GPUs, etc.), certain resources can be shared. For example, a last level cache (LLC) and cache controller are elements that can be shared by cores of the same processor and by diverse on-chip processing devices, such as CPUs and GPUs. The combination of diverse processing devices in a single computing system, however, can make it difficult to maximize resource utilization while minimizing shared resource conflicts. For example, CPUs usually have fewer parallel threads executing, can have comparatively higher hit rates in an LLC, and can be very sensitive to data access latency. Application memory footprints of CPU workloads typically have good spatial and temporal locality. Temporal locality refers to the likelihood that data that is referenced at a point in time will be referenced again in the near future. Spatial locality refers to the increased likelihood of referencing particular data when other data with a nearby address was recently referenced. Because applications that run on CPUs tend to reuse data and thus tend to have good spatial and temporal locality, a last level cache can become a primary provider to those CPUs of data with low latency and high bandwidth. Accordingly, a cache controller for the LLC is typically optimized to provide a very low access latency for CPUs even in a loaded scenario, where the last level cache is full or nearly full. Hence, when hit rates are high, the cache controller can operate at a very high frequency and can be banked (e.g., divided into instruction cache and data cache) in order to increase the throughput. Conversely, the cache controller can be inefficient when miss rates are high.
In contrast to CPU workloads, application memory footprints of many GPU workloads are large. GPUs usually have a large number of parallel independent threads and comparatively lower hit rates in a last level cache. Thus, GPUs tend to monopolize shared hardware resources such as system memory. Furthermore, GPUs are generally less sensitive to data access latency and are more sensitive to the overall bandwidth delivered. Consequently, GPU workloads with poor hit rates (also referred to as ‘high miss rates’) in the LLC can cause the cache controller to operate very inefficiently both in terms of power and performance. The extra dynamic power is wasted because new requests continue to access the cache controller every cycle, only to find that it does not have enough resources to make forward progress. The performance degrades since the LLC request buffer is not sufficiently sized to hold enough outstanding read requests (or indications of the outstanding read requests) to cover the system memory latency.
Due to the characteristics of CPUs and GPUs, CPU workloads need to be optimized more for latency and high LLC hit rates, while GPU workloads need to be optimized for overall bandwidth delivery and low LLC hit rates. These conflicting requirements can make it difficult to optimize a cache controller for a shared cache. With multiple parallel threads from a GPU, a system memory scheduler (e.g., DRAM scheduler), which may be part of a memory controller, needs a large number of outstanding requests from the GPU in order to exploit locality and hence optimize bandwidth. Requests are considered to have locality when they are located in the same memory page (e.g., a DRAM page). A greater number of read requests is indicative of greater locality (i.e., a greater number of requests located in the same memory page), and greater locality enables the system memory to optimize bandwidth.
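For illustration, the short C++ sketch below shows one way outstanding request addresses could be grouped by memory page so that a scheduler may exploit this locality; the 4 KiB page size and the container types are assumptions made for the example, not details from the disclosure.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Illustration of request locality: requests that fall in the same memory
    // page (assumed 4 KiB here) can be grouped so a scheduler may service them
    // back to back while the page is open, improving delivered bandwidth.
    std::unordered_map<std::uint64_t, std::vector<std::uint64_t>>
    group_by_page(const std::vector<std::uint64_t>& request_addresses) {
        constexpr std::uint64_t kPageBits = 12;  // 4 KiB page, an assumption
        std::unordered_map<std::uint64_t, std::vector<std::uint64_t>> pages;
        for (std::uint64_t addr : request_addresses) {
            pages[addr >> kPageBits].push_back(addr);
        }
        return pages;
    }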
When a request is made from a processing device (e.g., a CPU, a GPU), the LLC may be searched for data to fill the request. This search is also referred to herein as a ‘lookup’. If there is a hit (i.e., requested data is found in the LLC) during a lookup of the LLC, then the data found in the LLC can be used to fulfill the request to the requesting device without having to access system memory. If there is not a hit (i.e., requested data is not found in the LLC) during the LLC lookup, then the request is considered outstanding and system memory can be accessed to obtain the requested data. If system memory is accessed, the requested data is written to the LLC and then provided to the requesting device. When requests are filled from system memory to the LLC, an LLC request buffer (LRB) tracks these outstanding requests to the system memory. Hence, the size of the LRB defines the fabric depth throughput and the extent of look-ahead achieved by the request streams at system memory. Consequently, the LRB becomes the limiter in bandwidth delivery.
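The baseline flow described in this paragraph may be sketched, for illustration only, as follows; the map and set containers are stand-ins for the LLC and the LRB and do not reflect any particular hardware organization.

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    // Baseline fill flow: on a hit the line is served from the LLC; on a miss
    // the request is tracked in the LRB until the line is filled from system
    // memory, after which the entry is deallocated.
    struct BaselineController {
        std::unordered_map<std::uint64_t, std::uint64_t> llc;   // address -> data
        std::unordered_set<std::uint64_t> lrb;                  // outstanding misses

        std::uint64_t read(std::uint64_t addr,
                           const std::unordered_map<std::uint64_t, std::uint64_t>& memory) {
            auto hit = llc.find(addr);
            if (hit != llc.end()) {
                return hit->second;              // served from the shared cache
            }
            lrb.insert(addr);                    // track the outstanding request
            std::uint64_t data = memory.at(addr);
            llc[addr] = data;                    // fill the LLC from system memory
            lrb.erase(addr);                     // request satisfied, deallocate
            return data;
        }
    };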
Current approaches to maximize resource utilization by diverse on-chip processing devices having conflicting requirements have not been adequate. For example, choosing not to fill the LLC when a GPU request misses during an LLC lookup, so that the LRB does not have to cover the system memory latency, can still negatively impact the performance of some GPU workloads. Specifically, while caching only CPU requests and not GPU requests may improve the performance of GPU workloads having poor hit rates in an LLC, the performance of GPU workloads having low memory footprints with comparatively higher hit rates in the LLC can suffer. In another scenario, adding a high bandwidth memory side cache for the GPU requests can result in a high cost of area and power and is not feasible for all power envelopes. In yet another scenario, increasing the size of the LRB can necessitate higher silicon area and higher power. As a result, the operating frequency can be limited, the hardware cost increased, and static and dynamic power raised. In addition, increasing the size of the LRB can potentially lead to reduced frequency or higher latency of operation. Thus, an approach is needed to intelligently utilize and optimize shared resources, such as an LLC, by diverse on-chip processing devices having conflicting requirements and characteristics.
Embodiments disclosed herein can resolve the aforementioned issues (and more) associated with computing systems that include diverse on-chip processing devices that utilize shared resources and have conflicting requirements. An embodiment of computing system 10 implements a dynamic fill policy to dynamically determine whether to fill a last level cache for processing device read requests based on bandwidth demand. In at least one embodiment, the dynamic fill policy can be applicable to requests for data from processing devices that may have low hit rates, such as GPUs, but not applicable to requests for data from processing devices with typically high hit rates, such as CPUs. For a read request to which the dynamic fill policy is applicable, a cache controller dynamically decides whether to fill in the cache with data from system memory based on current hit rates in the cache. The current hit rates can be inferred by comparing the number of outstanding (i.e., filled) or remaining (i.e., free) entries in the LLC request buffer (LRB) to an appropriate threshold. Since an LRB does not track requests that do not fill in the LLC, dynamically determining whether to fill a last level cache based on the LRB allows for more outstanding requests at the system memory (e.g., DRAM), thus enabling the system memory to optimize bandwidth delivery.
More specifically, when the dynamic fill policy is applicable to a request for data from a processing device (also referred to herein as a ‘DFP processing device’) and the LRB becomes too full or is ‘in pressure’ (e.g., near its maximum capacity), requests from the processing device do not cause the LLC to be filled in with data to serve the requests. In at least some embodiments, when a read request is received from a DFP processing device and the LRB is too full, a determination is made, based at least in part on a threshold, to deallocate an entry in the LRB that corresponds to the read request. The threshold may be predetermined and selected based on a fill limit that indicates the LRB is unable to proceed with new requests. In at least some embodiments, the fill limit is an amount that is near, but not equal to, the maximum capacity of the LRB. In addition, in at least some embodiments, requests from DFP processing devices do not fill in the LLC only when the LRB is too full (e.g., reaches near its maximum capacity) and is not able to proceed with new requests. Furthermore, alternative techniques may be used to determine whether to deallocate a relevant entry in the LRB and will be further described herein.
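A minimal sketch of such a configuration, assuming the policy applies to GPU requestors and that the fill limit sits a small margin below the LRB capacity, might look as follows; the specific capacity and margin values are illustrative assumptions.

    #include <cstddef>

    enum class Requestor { kCpu, kGpu };

    // Assumed configuration: the policy applies to low-hit-rate requestors (GPUs
    // here), and the fill limit sits a small margin below the LRB capacity,
    // i.e., near but not equal to the maximum capacity.
    constexpr std::size_t kLrbCapacity = 64;
    constexpr std::size_t kFillLimit   = kLrbCapacity - 4;

    bool dfp_applies(Requestor who) { return who == Requestor::kGpu; }

    bool lrb_in_pressure(std::size_t filled_entries) {
        return filled_entries >= kFillLimit;   // unable to proceed with many new requests
    }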
For workloads of DFP processing devices that have good hit rates in the LLC, the LRB does not reach the fill limit often, since the residency time of hit requests in the LRB is low. Consequently, the dynamic fill policy is not invoked for such workloads. For workloads of DFP processing devices that have poor hit rates in the LLC, however, long-residency pressure can build in the LRB as it is drained only at the system memory bandwidth. In embodiments disclosed herein, this can trigger the dynamic fill policy logic to cause a read request to not fill in the LLC and to deallocate the entry in the LRB for that request, thus relieving pressure on the LRB. Hence, when the dynamic fill policy is invoked, outstanding requests at system memory increase and enable the system memory scheduler to optimize system memory bandwidth by exploiting locality. Invoking the dynamic fill policy can also reduce cross-interference in fabric queues between requests that do not have high LLC hit rates (e.g., from GPUs) and requests that have high LLC hit rates (e.g., from CPUs).
Computing system 10 implementing a dynamic fill policy provides several advantages. Embodiments herein contribute to overall higher bandwidth delivery at a cost of reduced LLC hit rates. Embodiments described herein can also improve the power and performance of a cache controller for high bandwidth and low LLC hit rate device workloads (e.g., GPU workloads) while not impairing the performance for high LLC hit rate device workloads (e.g., CPU workloads). For embodiments in which a dynamic fill policy is applied to one or more graphics processing units, the last level cache and memory efficiency can be improved, contributing in turn to higher graphics performance. Embodiments of computing system 10 can also enable software to extract the maximum benefit from an LLC by providing intelligent information regarding LLC re-use.
Turning to
A brief discussion is now provided about some of the possible infrastructure that may be included in computing system 10. Computing system 10 includes at least two types of processing devices (also referred to herein as ‘processors’), such as CPUs and GPUs in the present example. CPUs 20(1)-20(N) may be capable of performing certain computing tasks including, but not limited to, executing instructions of a computer program including performing calculations, logical operations, control operations, and input/output operations. GPUs 30(1)-30(M) may perform mathematically intensive operations to create images (e.g., 3-D images) intended for output to a display. CPU side caches 25(1)-25(N) can include one or more levels of cache. For example, the CPU side caches may include level 1 (L1) and level 2 (L2) caches. GPU side caches 35(1)-35(M) may also include one or more types of cache. For example, the GPU side caches may include level 1 (L1) caches and texture caches for graphics processing. In an example, the CPU and GPU side caches may be implemented as static random access memory (SRAM).
An on-chip interconnect 40 can provide communication paths between the CPUs and other components and between the GPUs and other components. In particular, the interconnect may provide communication between the multiple diverse processing devices and the shared resources, such as LLC 50 and cache controller 60. Although interconnect 40 is also illustrated as a shared resource, it should be apparent that other implementations, such as a partitioned interconnect (e.g., one for CPUs and one for GPUs) could also be implemented in embodiments disclosed herein. Interconnect 40 may also be coupled to other elements (not shown) including, but not necessarily limited to, input devices (e.g., keyboard, mouse, touch screen, trackball, etc.), output devices (e.g., graphics controller coupled to a video display, printer, etc.), and storage devices (e.g., hard disks, floppy disks, universal serial bus (USB) devices, etc.).
LLC 50 is shared by the multiple diverse processing devices (e.g., CPUs and GPUs). LLC 50 is typically bigger than the CPU and GPU side caches. In an embodiment, LLC 50 can be partitioned into cache lines, with each cache line holding a block from system memory 70. In at least one embodiment, system memory 70 may be implemented as dynamic random access memory (DRAM), and may be logically partitioned into blocks that can be mapped to cache lines in LLC 50. At any given time, however, only a subset of blocks from system memory 70 are mapped to LLC 50.
In at least some examples, LLC 50 can provide hardware-based coherency or software-based coherency (or some suitable combination of both) between the diverse processing devices, such as CPUs 20(1)-20(N) and GPUs 30(1)-30(M). The hardware-based coherency between CPUs and GPUs may need to be modified to account for cases where, as a result of the dynamic fill policy logic, a cache hit will not result in data being forwarded from the cache but instead will be accompanied by a memory reference. For example, this may be the case in an implementation of an inclusive cache (i.e., data in higher level caches are included in lower level caches). In one possible implementation, an inclusive cache may be modified to hold a control bit for each cache line, where the control bit indicates whether the tag entry points to a valid data segment or not.
In other cases, the hardware-based coherency protocol may not be changed as a result of the dynamic fill policy logic, since it already includes an option for HW-based coherency without always allocating data segments into the cache. For example, this may be the case in an implementation of a non-inclusive cache. In this latter example, the cache controller may decide during a cache-miss request phase whether to issue a memory reference that will result in a cache fill or not, based on the dynamic fill policy logic. In this case, the cache controller may further notify the processing device on its decision to fill or not-fill the data into the cache, leaving an option to the processing device to utilize the cache as a victim-cache in case the data was not filled.
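For illustration, the per-line state and the fill notification described in the two preceding paragraphs could be modeled along the following lines; the structure and field names are assumptions made for this example and are not taken from the disclosure.

    #include <cstdint>

    // Sketch of the per-line state suggested above for an inclusive LLC: the tag
    // can remain valid for coherency tracking while a control bit records whether
    // a valid data segment is actually present in the cache.
    struct LlcLineState {
        std::uint64_t tag        = 0;
        bool          tag_valid  = false;   // line participates in coherency
        bool          data_valid = false;   // control bit: data segment present?
    };

    // For the non-inclusive case, the controller's fill decision could be echoed
    // back so the requestor may keep the line in a victim cache when not filled.
    struct MissResponse {
        std::uint64_t data;
        bool          filled_in_llc;         // notification of fill / no-fill
    };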
Cache controller 60 is a device that manages the flow of data between processing devices, system memory and cache memory. For example, this device can be computer hardware in the form of special digital circuitry. Cache controller 60 can receive and process memory access requests (e.g., read request, write request) from processing devices such as CPUs 20(1)-20(N) and GPUs 30(1)-30(M). Cache controller 60 may be implemented separately from the processing devices and communicatively coupled to the processing devices via a communication channel (e.g., interconnect 40). Cache controller 60 can also access system memory 70, for example, to send memory access requests when a read request from a processing device cannot be served by the data in LLC 50. Memory controller 72, which can include a scheduler (not shown), may be a digital circuit that manages the flow of data going to and from system memory 70. Cache controller 60 can communicate with memory controller 72 to access system memory 70. Memory controller 72 can also access processing devices directly, for example, when cache controller 60 indicates that data should be provided directly to a processing device rather than filling LLC 50.
In at least one embodiment, cache controller 60 is provisioned with LLC fill logic 62 and dynamic fill policy logic 64. LLC fill logic 62 enables cache controller 60 to receive memory access requests from processing devices, perform LLC lookups, and respond when a lookup in the LLC results in a hit. Generally, data requested by the CPUs and GPUs can be stored in LLC 50, and when subsequent requests are made for the same data, LLC fill logic 62 can retrieve the data from the LLC and provide the data to the requesting device rather than accessing system memory 70 again for the data. When a lookup in the LLC results in a miss, the request is typically held in LLC request buffer 55 until the data is filled into the LLC from system memory 70.
In embodiments disclosed herein, dynamic fill policy logic 64 enables cache controller 60 to operate more efficiently when an LLC lookup results in a miss for particular processing devices. When an LLC lookup for requested data results in a miss, dynamic fill policy logic 64 is invoked if a dynamic fill policy is applicable to the data request (e.g., read request) based on the requesting device. For example, dynamic fill policy logic 64 can be invoked for data requests received from devices that have low hit rates, such as GPUs 30(1)-30(M), relative to other processing devices in the same system. Dynamic fill policy logic 64 enables determining whether to fill LLC 50 based on the occupancy of LLC request buffer (LRB) 55. In one example, when a request for data is received from a processing device, a comparison of the actual filled amount of LRB 55 to a threshold that represents a fill limit can indicate whether to deallocate a relevant entry of the LRB because the LRB is too full (e.g., when it contains outstanding requests nearing maximum capacity). If the LRB is not too full, as determined based on the threshold, then the request (or an indication of the request) is held in the LRB until the data is filled into the LLC from system memory 70. Thus, the outstanding request is tracked in the LRB while cache controller 60 accesses system memory 70 to obtain the requested data. However, if the LRB is too full due to too many entries being filled, as determined based on the threshold, then the entry corresponding to the request is removed from the LRB. In addition, the data retrieved from the system memory is sent directly to the requesting device from system memory 70 and is not filled in the LLC.
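A sketch of this miss-path decision, with names such as on_llc_miss and send_direct_to_requestor invented for the example, is shown below; it is one possible rendering of dynamic fill policy logic 64 as described above, not a definitive implementation.

    #include <cstddef>
    #include <cstdint>

    struct ReadRequest  { std::uint64_t address; bool dfp_applicable; };

    struct MemoryRequest {
        std::uint64_t address;
        bool          send_direct_to_requestor;   // bypass the LLC fill
    };

    // Miss-path sketch: when the policy applies and the LRB occupancy is at or
    // above the threshold, the LRB entry is released and the memory request is
    // marked to return data directly to the requesting device. Otherwise the
    // entry is held and the LLC is filled from system memory.
    MemoryRequest on_llc_miss(const ReadRequest& req,
                              std::size_t lrb_filled_entries,
                              std::size_t fill_threshold,
                              bool& keep_lrb_entry) {
        if (req.dfp_applicable && lrb_filled_entries >= fill_threshold) {
            keep_lrb_entry = false;                       // deallocate the entry
            return {req.address, /*send_direct_to_requestor=*/true};
        }
        keep_lrb_entry = true;                            // track until the fill completes
        return {req.address, /*send_direct_to_requestor=*/false};
    }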
As used herein, an entry in an LRB is considered to ‘correspond to’ a request from a processing device during the period that information indicating the request is stored in the entry. In one non-limiting example, the entry could include a request for data generated by the cache controller to be sent to the system memory. In another example, the entry could include the request for data (e.g., read request) received from the processing device. Generally, the entry could include any information that is associated with and provides an indication of the request for data.
In an example, dynamic fill policy heuristics that favor higher cache hit rates versus better memory bandwidth utilization can be controlled by software. For example, software may be configured to control DFP-related parameters, such as a threshold used to determine whether to deallocate an entry in the LRB and not fill the last level cache with data from system memory. Such parameters can be statically set or adaptively modified until a desired optimum run-time configuration is achieved. Such adaptive modifications could be based on run-time information (e.g., performance, bandwidth, latency, etc.) collected during actual run-times in which the dynamic fill policy is invoked for one or more requests for data to determine whether to deallocate a relevant entry in the LRB and not fill the last level cache with the requested data.
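As one hypothetical example of such adaptive control, software might nudge the threshold based on measured bandwidth, as in the sketch below; the step size, bounds, and chosen metric are assumptions for illustration and are not values taken from the disclosure.

    #include <cstddef>

    // Hypothetical adaptive tuning: lower the filled-entry threshold (bypass the
    // LLC sooner) when delivered bandwidth falls short of a target, and raise it
    // (fill the LLC more often) when the bandwidth target is being met.
    std::size_t adapt_threshold(std::size_t current_threshold,
                                double measured_bandwidth_gbps,
                                double target_bandwidth_gbps,
                                std::size_t min_threshold,
                                std::size_t max_threshold) {
        if (measured_bandwidth_gbps < target_bandwidth_gbps && current_threshold > min_threshold) {
            return current_threshold - 1;   // favor bandwidth delivery
        }
        if (measured_bandwidth_gbps >= target_bandwidth_gbps && current_threshold < max_threshold) {
            return current_threshold + 1;   // favor LLC hit rate
        }
        return current_threshold;
    }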
It should be appreciated that a computing system with diverse processing devices such as GPUs and CPUs offers a particular, non-limiting example implementation in which a dynamic fill policy may be advantageously applied to the GPUs, due to their comparatively lower hit rates. References to GPUs and CPUs are used herein for ease of explanation and are not intended to limit the broad application of the concepts contained herein. Accordingly, it should be noted and appreciated that a dynamic fill policy could be advantageously applied to a computing system that incorporates any types of diverse processing devices where at least one processing device incorporated in the computing system has a different shared cache sensitivity than one or more other processing devices incorporated in the computing system. For example, accelerated processing units (APUs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), and other similar elements could potentially benefit from the broad concepts disclosed herein related to a dynamic fill policy. Thus, references herein to ‘processing device’ are intended to include any such elements.
Requests are a form of electronic communications in computing system 10. In at least one embodiment, requests include read requests and write requests. Generally, a read request is a request to access data. The requested data is typically stored in system memory and may or may not also be stored in cache at any given time. Electronic communications (also referred to herein as ‘communications’), may be inclusive of signals, bits, bytes, data, objects, etc., and can be sent and received by components of computing system 10 according to any suitable communication messaging protocols. Suitable communication messaging protocols can include bus protocols, pipelined protocols, etc. The term ‘data’ as used herein, refers to any type of binary, numeric, voice, video, textual, photographic, or script data, or any type of source or object code, or any other suitable information in any appropriate format that may be communicated from one point to another in a computing system and/or networks. Like requests, other types of information such as messages, responses, replies, queries, etc. are also forms of communications.
Turning to
At 202, a read request is received by cache controller 60 from a processing device, such as CPUs 20(1)-20(N) or GPUs 30(1)-30(M). In this example, the read request may be requesting access to data. An entry in an LLC request buffer, such as LRB 55, can be allocated and filled with information corresponding to the read request. At 204, an LLC lookup is performed. The LLC lookup is a search of LLC 50 to determine whether data responsive to the read request is present in the last level cache. At 206, a determination is made as to whether a hit resulted from the search. If it is determined that a hit resulted from the search (i.e., data requested by the read request is found in the LLC), then at 208, the data is directly returned from the LLC to the requesting device. In at least one embodiment, the data is returned by the cache controller. For example, if CPU 20(1) requested the data, then at 208, the requested data is provided by cache controller 60 directly to CPU 20(1) from LLC 50. Similarly, if GPU 30(1) requested the data, then at 208, the requested data is provided by cache controller 60 directly to GPU 30(1) from LLC 50. At 210, the entry in LRB 55 that corresponds to the read request is deallocated. Accordingly, the information that indicates the satisfied read request is removed from LRB 55. It should be noted, however, that the LRB may still contain other outstanding requests from processing devices.
If, at 206, it is determined that a miss resulted from the search of LLC 50, then at 212, a determination may be made as to whether a dynamic fill policy is applicable to the read request. In an example, the dynamic fill policy may be applicable to a read request received from a processing device that has a comparatively lower hit rate. In some embodiments, processing devices in a computing system may be predetermined to have low hit rates. For example, GPUs 30(1)-30(M) may be predetermined to have lower hit rates and less sensitivity to data access latency than the other on-chip processing devices. Thus, the dynamic fill policy can be predetermined to be applicable to read requests from all of the GPUs. In this example, CPUs 20(1)-20(N) may be predetermined to have higher hit rates and more sensitivity to data access latency than other on-chip processing devices. Thus, the dynamic fill policy can be predetermined to be not applicable to read requests from the CPUs. Accordingly, in this example at 212, a determination could be made as to whether the read request was received from a GPU with a predetermined low hit rate or from any GPU if all GPUs have predetermined low hit rates. In other implementations, a dynamic determination may be made as to whether a read request was issued by a processing device having a low hit rate based on a hit-rate threshold.
In yet another embodiment, at 212, a read request may be evaluated to determine whether it is marked to indicate that the dynamic fill policy is to be applied to that read request. In this embodiment, a software driver may be configured to exploit additional information available from a software application that is causing read requests to be generated. The software application may provide coarser, and comparatively less accurate, information regarding LLC re-use of certain software buffers. The dynamic fill policy can be invoked only for requests that the software marks as having a lesser chance of getting re-used from the LLC. Thus, the software may mark a request for data based on a probability threshold that the data will be re-used from the LLC. This can maximize the benefits from LLC re-use as well as achieve peak system memory (e.g., DRAM) bandwidth.
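By way of example only, a driver-side marking step of this kind might be sketched as follows; the hint field and the 0.25 probability cutoff are illustrative assumptions rather than details from the disclosure.

    #include <cstdint>

    // Hypothetical driver-side marking: a request is tagged for the dynamic fill
    // policy when the application-provided estimate of LLC re-use falls below a
    // probability threshold.
    struct MarkedRequest {
        std::uint64_t address;
        bool          dfp_hint;   // true: lesser chance of re-use from the LLC
    };

    MarkedRequest mark_request(std::uint64_t address, double reuse_probability) {
        constexpr double kReuseThreshold = 0.25;   // assumed cutoff for illustration
        return {address, reuse_probability < kReuseThreshold};
    }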
If at 212, a determination is made that a dynamic fill policy is applicable to the read request, then at 213, a determination can be made as to whether to deallocate an entry in the buffer based, at least in part, on a threshold. The entry to potentially be deallocated contains information indicating the read request. In at least one example, the determination at 213 of whether to deallocate the entry is a determination of whether the LRB is getting too full (e.g., nearing maximum capacity) according to a threshold. It will be apparent that this determination can achieve the same result using various types of evaluations.
For example, in a first technique, the threshold could represent an amount of occupied or filled entries in the LRB. Specifically, this threshold could represent a maximum number of filled entries or percentage of filled entries that the LRB may contain without being in pressure or too full. Thus, if the actual number of filled entries or percentage of filled entries in the LRB exceeds the threshold, then the determination at 213 can be that the relevant entry in the LRB (i.e., the entry that corresponds to the read request) is to be deallocated because the LRB is too full. If the actual number of filled entries or percentage of filled entries in the LRB does not exceed the threshold, then the determination at 213 can be that the relevant entry in the LRB is not to be deallocated because the LRB is not too full.
In a variation, the threshold can represent a minimum number of filled entries or percentage of filled entries in the LRB that indicate the LRB is too full. Thus, if the actual number of filled entries or percentage of filled entries in the LRB meets or exceeds the threshold, then the determination at 213 can be that the relevant entry in the LRB is to be deallocated because the LRB is too full. If the actual number of filled entries or percentage of filled entries in the LRB does not meet or exceed the threshold, then the determination at 213 can be that the relevant entry in the LRB is not to be deallocated because the LRB is not too full.
In a second technique, the threshold could represent an amount of unoccupied or free entries in the LRB. Specifically, this threshold could represent a minimum number of free entries or percentage of free entries that indicate the LRB is not in pressure or too full. Thus, if the actual number of free entries or percentage of free entries in the LRB meets or exceeds the threshold, then the determination at 213 can be that the relevant entry in the LRB (i.e., the entry that corresponds to the read request) is not to be deallocated because the LRB is not too full. If the actual number of free entries or percentage of free entries in the LRB does not meet or exceed the threshold, however, then the determination at 213 can be that the relevant entry in the LRB is to be deallocated because the LRB is too full.
In a variation, the threshold can represent a maximum number of free entries or percentage of free entries in the LRB that indicate the LRB is in pressure or too full. Thus, if the actual number of free entries or percentage of free entries in the LRB does not exceed the threshold, then the determination at 213 can be that the relevant entry in the LRB is to be deallocated because the LRB is too full. If the actual number of free entries or percentage of free entries in the LRB exceeds the threshold, then the determination at 213 can be that the relevant entry in the LRB is not to be deallocated because the LRB is not too full.
It should also be noted that the unit of measurement is described in terms of entries in the LRB. However, any other suitable unit of measurement may also be used. For example, number or percentages of bits, bytes, lines, blocks, etc. may be compared to a threshold based on the same unit of measurement. Furthermore, the above techniques for determining whether to deallocate an LRB entry based on the LRB being in pressure or too full are for illustrative purposes only, and are not intended to limit the broad scope of this disclosure. For example, any other suitable evaluation may be used to determine whether the LRB is in pressure or too full according to the present disclosure.
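For illustration, the four evaluations described above can be expressed as equivalent predicates over the LRB occupancy; the function and parameter names below are chosen for this example.

    #include <cstddef>

    // Given an LRB with `capacity` entries of which `filled` are occupied, each
    // predicate below reports whether the LRB is "too full", using one of the
    // threshold formulations described above.
    bool too_full_filled_exceeds(std::size_t filled, std::size_t max_filled) {
        return filled > max_filled;            // threshold: maximum tolerable filled entries
    }
    bool too_full_filled_meets(std::size_t filled, std::size_t min_filled_when_full) {
        return filled >= min_filled_when_full; // threshold: minimum filled entries indicating pressure
    }
    bool too_full_free_below(std::size_t capacity, std::size_t filled, std::size_t min_free) {
        return (capacity - filled) < min_free; // threshold: minimum free entries when not in pressure
    }
    bool too_full_free_at_most(std::size_t capacity, std::size_t filled, std::size_t max_free_when_full) {
        return (capacity - filled) <= max_free_when_full; // threshold: maximum free entries indicating pressure
    }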
If a determination is made at 213 that the relevant entry in the LRB is not to be deallocated based on the threshold, or if a determination is made at 212 that the dynamic fill policy is not applicable to the read request, then the read request (or an indication of the read request) is held in the LRB while a request for the data is sent to the system memory and the data is filled into the LLC from system memory. More specifically, at 214, a request is sent to the system memory for the data. In at least one embodiment, the request sent to the system memory is generated by the cache controller based on the read request. At 216, the data is written from the system memory to the LLC. At 218, the data is sent by the cache controller from the LLC to the requesting device. At 220, the entry in LRB 55 corresponding to the read request is deallocated. Thus, once the read request has been satisfied, the information that indicates the read request is removed from the LRB.
If a determination is made at 213 that the relevant entry in the LRB is to be deallocated based on the threshold, and if a determination is made at 212 that the dynamic fill policy is applicable to the read request, then the LLC is not filled with more data to serve the request. Instead, the request is served directly from system memory. More specifically, at 222, the entry in LRB 55 corresponding to the read request is deallocated. Thus, the information that indicates the outstanding read request is removed from the LRB. At 224, a request for the data is sent to the system memory. In at least one embodiment, the request sent to the system memory is generated by the cache controller based on the read request. The request sent to the system memory provides an indication that the requested data is to be sent directly to the requesting device.
Based on the received request, at 226, the requested data is sent directly from the system memory to the requesting device (e.g., GPU) and is not stored in the LLC. In at least one embodiment, memory controller 72 of system memory 70 understands the indication in the request from cache controller 60 to send the requested data directly to the requesting device. Accordingly, the requested data may be retrieved from system memory and sent to the requesting device by memory controller 72. In at least one embodiment, each subsequent request from a processing device is processed using the same or similar flows as described with reference to
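A memory-controller-side sketch of this behavior, assuming the request carries a flag indicating where the data should be returned, might look as follows; the message structures are illustrative stand-ins for the fabric messages described above.

    #include <cstdint>
    #include <unordered_map>

    enum class Destination { kLlcFill, kRequestingDevice };

    // The request from the cache controller carries an indication of where the
    // data should be returned: to the LLC for a fill, or directly to the
    // requesting device when the fill is bypassed.
    struct DataReturn {
        std::uint64_t data;
        Destination   destination;
    };

    DataReturn service_request(std::uint64_t address,
                               bool send_direct_to_requestor,
                               const std::unordered_map<std::uint64_t, std::uint64_t>& dram) {
        std::uint64_t data = dram.at(address);   // stand-in for a DRAM access
        return {data, send_direct_to_requestor ? Destination::kRequestingDevice
                                               : Destination::kLlcFill};
    }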
A benefit of filling a cache line in LLC 50 from system memory 70 is that a future reference to this cache line can be fetched from the LLC at a lower latency and higher bandwidth. However, this comes at the cost of occupying LRB entries for longer, since they must cover the system memory latency. In general, the LRB occupancy can provide an indication of the LLC hit rates. When the hit rates are high for a certain application phase, there will be fewer outstanding requests to system memory and hence, more LRB entries will be free. When the LLC hit rates are very poor, however, most of the LRB entries will be used to cover the system memory latency for filling the cache lines of the LLC. In the scenario where LLC hit rates are low, the size of the LRB limits the number of outstanding requests to system memory, which determines the fabric depth and the look-ahead required for the memory controller to achieve higher bandwidth. In this scenario where the size of the LRB limits the system memory bandwidth, dynamic fill policy logic 64 of cache controller 60 converts requests to not fill in cache lines of the LLC and deallocates the entries in the LRB that correspond to these requests (i.e., the information contained in each entry being deallocated is removed). This relieves the LRB pressure, which helps the system memory scheduler achieve higher bandwidth, at a cost of sacrificing potential LLC hit rates. As shown by graphs in
In embodiments according to the present disclosure, as shown in
The figures described below detail exemplary architectures and systems to implement embodiments of the above. In some embodiments, one or more hardware components and/or instructions described above are emulated as detailed below, or implemented as software modules.
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
In
The front end unit 530 includes a branch prediction unit 532 coupled to an instruction cache unit 534, which is coupled to an instruction translation lookaside buffer (TLB) 536, which is coupled to an instruction fetch unit 538, which is coupled to a decode unit 540. The decode unit 540 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 540 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, core 590 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 540 or otherwise within the front end unit 530). The decode unit 540 is coupled to a rename/allocator unit 552 in the execution engine unit 550.
The execution engine unit 550 includes the rename/allocator unit 552 coupled to a retirement unit 554 and a set of one or more scheduler unit(s) 556. The scheduler unit(s) 556 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 556 is coupled to the physical register file(s) unit(s) 558. Each of the physical register file(s) units 558 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit(s) 558 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 558 is overlapped by the retirement unit 554 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 554 and the physical register file(s) unit(s) 558 are coupled to the execution cluster(s) 560. The execution cluster(s) 560 includes a set of one or more execution units 562 and a set of one or more memory access units 564. The execution units 562 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 556, physical register file(s) unit(s) 558, and execution cluster(s) 560 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 564). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 564 is coupled to the memory unit 570, which includes a data TLB unit 572 coupled to a data cache unit 574 coupled to a level 2 (L2) cache unit 576. In one exemplary embodiment, the memory access units 564 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 572 in the memory unit 570. The instruction cache unit 534 is further coupled to a level 2 (L2) cache unit 576 in the memory unit 570. The L2 cache unit 576 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 500 as follows: 1) the instruction fetch 538 performs the fetch and length decoding stages 502 and 504; 2) the decode unit 540 performs the decode stage 506; 3) the rename/allocator unit 552 performs the allocation stage 508 and renaming stage 510; 4) the scheduler unit(s) 556 performs the scheduling stage 512; 5) the physical register file(s) unit(s) 558 and the memory unit 570 perform the register read/memory read stage 514; the execution cluster 560 performs the execute stage 516; 6) the memory unit 570 and the physical register file(s) unit(s) 558 performs the write back/memory write stage 518; 7) various units may be involved in the exception handling stage 522; and 8) the retirement unit 554 and the physical register file(s) unit(s) 558 perform the commit stage 524.
The core 590 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 590 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 534/574 and a shared L2 cache unit 576, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
The local subset of the L2 cache 604 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 604. Data read by a processor core is stored in its L2 cache subset 604 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 604 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data-path can be 1012-bits wide per direction.
Thus, different implementations of the processor 700 may include: 1) a CPU with the special purpose logic 708 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 702A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 702A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 702A-N being a large number of general purpose in-order cores. Thus, the processor 700 may be a general-purpose processor, coprocessor, or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 700 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 706, and external memory (not shown) coupled to the set of integrated memory controller units 714. The set of shared cache units 706 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 712 interconnects the integrated graphics logic 708, the set of shared cache units 706, and the system agent unit 710/integrated memory controller unit(s) 714, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 704A-N and cores 702A-N.
In some embodiments, one or more of the cores 702A-N are capable of multithreading. The system agent 710 includes those components coordinating and operating cores 702A-N. The system agent unit 710 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 702A-N and the special purpose logic 708, such as integrated graphics logic. The display unit is for driving one or more externally connected displays.
The cores 702A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 702A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Typically, the dynamic fill policy concepts disclosed herein can be implemented in a heterogeneous system. For example, in a system with integrated GPU and CPU cores, the dynamic fill policy may be applicable to data requests from the GPU and not applicable to data requests from the CPU cores (e.g., simultaneous multithreading (SMT) cores). It may be possible, however, that different cores of a homogenous system, such as processor 700 in
Code 804, which may be one or more instructions to be executed by processor 800, may be stored in memory 802. Code 804 can include instructions of various logic and components that may be stored in software, hardware, firmware, or any suitable combination thereof, or in any other internal or external component, device, element, or object where appropriate and based on particular needs. In one example, processor 800 can follow a program sequence of instructions indicated by code 804. Each instruction enters a front-end logic 806 and is processed by one or more decoders 808. The decoder may generate, as its output, a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals that reflect the original code instruction. Front-end logic 806 also includes register renaming logic 810 and scheduling logic 812, which generally allocate resources and queue the operation corresponding to the instruction for execution.
Processor 800 can also include execution logic 814 having a set of execution units 816-1 through 816-X. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. Execution logic 814 can perform the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back-end logic 818 can retire the instructions of code 804. In one embodiment, processor 800 allows out of order execution but requires in order retirement of instructions. Retirement logic 820 may take a variety of known forms (e.g., re-order buffers or the like). In this manner, processor 800 is transformed during execution of code 804, at least in terms of the output generated by the decoder, hardware registers and tables utilized by register renaming logic 810, and any registers (not shown) modified by execution logic 814.
Although not shown in
Referring now to
The optional nature of additional processors 915 is denoted in
The memory 940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 920 communicates with the processor(s) 910, 915 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface such as QuickPath Interconnect (QPI), or similar connection 995.
In one embodiment, the coprocessor 945 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 920 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 910, 915 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, processor 910 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 945. Accordingly, the processor 910 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 945. Coprocessor(s) 945 accepts and executes the received coprocessor instructions.
Referring now to
Computer system 1000 includes a first processor 1070 and a second processor 1080 coupled via a point-to-point interconnect 1050. Each of the processors 1070 and 1080 may be some version of the processing devices (e.g., CPUs 20(1)-20(N), 30(1)-30(M), processors 700, 800, 910, 915, 945, core 590, etc.) described herein. In at least one embodiment, processors 1070 and 1080 are respectively processors 910 and 915, while coprocessor 1038 is coprocessor 945. In another embodiment, processors 1070 and 1080 are respectively processor 910 and coprocessor 945.
Processors 1070 and 1080 may each include one or more cores 1074a-1074b and 1084a-1084b, respectively. Processors 1070 and 1080 may also include respective integrated memory controller units (MC) 1072 and 1082, which couple the processors to respective memories, such as a memory 1032 and a memory 1034. In alternative embodiments, memory controller units 1072 and 1082 may be discrete logic separate from processors 1070 and 1080. Memories 1032 and/or 1034 may store various data to be used by processors 1070 and 1080 in achieving operations outlined herein. In an embodiment, memories 1032 and 1034 may be at least portions of main memory (e.g., system memory 70) locally coupled to their respective processors.
Processors 1070 and 1080 may be any type of processor, such as those discussed with reference to CPUs 20(1)-20(N), GPUs 30(1)-30(M), and processors 700, 800, 910, 915, 945, and core 590. Processors 1070 and 1080 may exchange information via a point-to-point (PtP) interface 1050 using point-to-point interface circuits 1078 and 1088, respectively. Processors 1070 and 1080 may each exchange information with a chipset 1090 via individual point-to-point interfaces 1052 and 1054 using point-to-point interface circuits 1076, 1086, 1094, and 1098. As shown herein, chipset 1090 is separated from processing elements 1070 and 1080. However, in an embodiment, chipset 1090 is integrated with processing elements 1070 and 1080. Also, chipset 1090 may be partitioned differently with fewer or more integrated circuits. Additionally, chipset 1090 may optionally exchange information with a coprocessor 1038 via a high-performance interface 1039, using an interface circuit 1092, which could be a PtP interface circuit. In one embodiment, the coprocessor 1038 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. Optionally, chipset 1090 may also communicate with a display 1033 for displaying data that is viewable by a human user.
A shared cache (e.g., 1071 or 1081) may be included in either processor 1070 or 1080, and/or may be outside of both processors and of other processors such as coprocessor 1038, yet coupled to the processors via, for example, a PtP interconnect. This shared cache may be used to store the processors' local cache information (e.g., data requested by a processor), for example, if a processor is placed into a low power mode. This shared cache may include a last level cache, such as LLC 50, which was previously described herein at least with reference to
Chipset 1090 may be coupled to a first bus 1010 via an interface circuit 1096. In an embodiment, first bus 1010 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments disclosed herein is not so limited. Various I/O devices 1016 may be coupled to first bus 1010, along with a bus bridge 1018, which couples first bus 1010 to a second bus 1020. In an embodiment, one or more additional processor(s) 1015, such as coprocessors, high-throughput MIC processors, GPGPU's, accelerators (e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, can be coupled to first bus 1010. In an embodiment, second bus 1020 may be a low pin count (LPC) bus. Second bus 1020 may be in communication with other devices such as a keyboard/mouse 1012 or other input devices (e.g., a touch screen, trackball, joystick, etc.), communication devices 1026 (e.g., modems, network interface devices, or other types of communication devices that may communicate through a computer network 1060), audio I/O devices 1014, and/or a storage unit 1028 (e.g., a disk drive or other mass storage device, which may include instructions/code and data 1030). In alternative embodiments, any portions of the bus architectures could be implemented with one or more PtP links.
In one example, code and data 1030 of storage unit 1028 may contain a software driver that can be run to exploit information from a software application that is causing read requests to be generated. The information may be related to LLC re-use of certain software buffers. As previously discussed herein, the dynamic fill policy may be invoked only for requests that the software marks as having a lesser chance of being re-used from the LLC, in order to maximize the benefits of LLC re-use while also achieving peak system memory performance.
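A minimal sketch, assuming a hypothetical driver data structure, of how such a driver might tag read requests: requests that fall within software buffers the application flags as having low LLC re-use are marked as eligible for the dynamic fill policy, while other requests follow the normal fill path. The structure and function names below are invented for illustration and are not part of any real driver API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request descriptor with a DFP-eligibility hint. */
typedef struct {
    uint64_t addr;
    bool     dfp_eligible;   /* hint consumed by the cache controller */
} read_request;

/* Hypothetical per-buffer hint supplied by the application. */
typedef struct {
    uint64_t base;
    uint64_t size;
    bool     low_llc_reuse;  /* set for buffers unlikely to be re-used from the LLC */
} sw_buffer_hint;

/* Mark a read request as DFP-eligible if it falls in a buffer the
 * application expects to have little LLC re-use. */
static void driver_tag_request(read_request *req,
                               const sw_buffer_hint *hints, int n_hints) {
    req->dfp_eligible = false;
    for (int i = 0; i < n_hints; i++) {
        if (req->addr >= hints[i].base &&
            req->addr <  hints[i].base + hints[i].size) {
            req->dfp_eligible = hints[i].low_llc_reuse;
            return;
        }
    }
}

int main(void) {
    sw_buffer_hint hints[] = { { 0x1000, 0x1000, true } };  /* streaming buffer */
    read_request   req    = { 0x1800, false };
    driver_tag_request(&req, hints, 1);
    printf("dfp_eligible = %d\n", req.dfp_eligible);
    return 0;
}
```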
Other software may also be stored in code and data 1030 to enable configuration and control of DFP-related parameters. In one example, DFP-related parameters may be manually configured by a user, for example, via input devices (e.g., keyboard/mouse 1012) in conjunction with a user interface displayed on a display screen (e.g., display 1033). One example of a DFP-related parameter is the threshold used to determine whether to deallocate an entry in an LRB if the LRB is too full. Another parameter could be a hit-rate threshold that can be used to determine whether the dynamic fill policy is applicable to a data request from a particular processing device based on its actual hit rates during a run-time or averaged over one or more prior run-times. The software can allow DFP-related parameters, such as the threshold for determining whether to deallocate an LRB entry and/or the hit-rate threshold, to be statically set or adaptively modified, for example, until a desired optimum run-time configuration is achieved.
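The following sketch illustrates, under assumed names and an assumed adaptive rule, how such DFP-related parameters (an LRB deallocation threshold and a hit-rate threshold) could be statically set or adaptively nudged at run time. It is not the configuration interface of any particular product; the adaptive rule is purely illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative DFP-related parameters; names are assumptions. */
typedef struct {
    int  lrb_dealloc_threshold;   /* e.g., number of filled LRB entries */
    int  hit_rate_threshold_pct;  /* DFP applies to devices below this hit rate */
    bool adaptive;                /* statically set vs. adaptively modified */
} dfp_params;

static void dfp_set_static(dfp_params *p, int lrb_thresh, int hit_thresh) {
    p->lrb_dealloc_threshold  = lrb_thresh;
    p->hit_rate_threshold_pct = hit_thresh;
    p->adaptive = false;
}

/* One possible adaptive rule: if many bypassed requests are queuing at
 * system memory, raise the LRB threshold so fewer requests bypass the LLC;
 * if system memory has headroom, lower it. */
static void dfp_adapt(dfp_params *p, int mem_queue_occupancy_pct) {
    if (!p->adaptive) return;
    if (mem_queue_occupancy_pct > 80 && p->lrb_dealloc_threshold < 64)
        p->lrb_dealloc_threshold++;
    else if (mem_queue_occupancy_pct < 40 && p->lrb_dealloc_threshold > 1)
        p->lrb_dealloc_threshold--;
}

int main(void) {
    dfp_params p;
    dfp_set_static(&p, 16, 30);
    p.adaptive = true;
    dfp_adapt(&p, 90);                 /* memory heavily loaded: raise threshold */
    printf("lrb_dealloc_threshold = %d\n", p.lrb_dealloc_threshold);
    return 0;
}
```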
The computing system depicted in
Referring now to
Referring now to
Embodiments of the dynamic cache filling mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the present disclosure related to the dynamic fill policy may be implemented as digital circuitry and/or as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 1030 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which, when read by a machine, causes the machine to fabricate logic to perform one or more of the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the present disclosure also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
Turning to
In this example of
SOC 1300 may also include a subscriber identity module (SIM) I/F 1330, a boot read-only memory (ROM) 1335, a synchronous dynamic random access memory (SDRAM) controller 1340, a flash controller 1345, a serial peripheral interface (SPI) master 1350, a suitable power control 1355, a dynamic RAM (DRAM) 1360, and a flash memory 1365. In addition, one or more example embodiments include one or more communication capabilities, interfaces, and features such as instances of Bluetooth™ 1370, a 3G (or other nG or cellular technology) modem 1375, a global positioning system (GPS) 1380, and an 802.11 Wi-Fi 1385.
In operation, the example of
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
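As a toy illustration of the 1:1 and 1:N conversions an instruction converter may perform, the following C sketch maps invented source opcodes to one or more invented target opcodes. It is not a real binary translator; the opcode names and the mapping are assumptions made for the example.

```c
#include <stdio.h>

/* Invented opcode sets for a source and a target instruction set. */
enum src_op { SRC_ADD, SRC_MUL_ADD };
enum tgt_op { TGT_ADD, TGT_MUL };

/* Convert one source instruction into one or more target instructions. */
static int convert(enum src_op in, enum tgt_op out[], int max) {
    switch (in) {
    case SRC_ADD:                       /* 1:1 translation */
        out[0] = TGT_ADD;
        return 1;
    case SRC_MUL_ADD:                   /* 1:N translation (emulated sequence) */
        if (max < 2) return 0;
        out[0] = TGT_MUL;
        out[1] = TGT_ADD;
        return 2;
    }
    return 0;
}

int main(void) {
    enum tgt_op buf[4];
    int n = convert(SRC_MUL_ADD, buf, 4);
    printf("converted into %d target ops\n", n);
    return 0;
}
```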
Regarding possible structures associated with embodiments disclosed herein, diverse processors (e.g., CPUs, GPUs, FPGAs, APUs, DSPs, ASICs, etc.) are connected to a memory element (e.g., system memory 70), which represents one or more types of memory including volatile and/or nonvolatile memory elements for storing data and information, including instructions, logic, and/or code, to be accessed by the processor. Computing system 10 may keep data and information in any suitable memory element (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), a disk drive, a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, an application specific integrated circuit (ASIC), or other types of nonvolatile machine-readable media that are capable of storing data and information), software, hardware, firmware, or in any other suitable component, device, element, or object where appropriate and based on particular needs. Any of the memory items discussed herein (e.g., processor side caches 25(1)-25(N) and 35(1)-35(M), last level cache 50, LLC request buffer 55, system memory 70) should be construed as being encompassed within the broad term ‘memory element.’
In an example implementation, cache controller 60 includes logic to achieve (or to foster) the dynamic fill policy activities, as outlined herein. In some embodiments, at least some of these dynamic fill policy activities may be carried out by hardware (e.g., a digital circuit), implemented externally to the cache controller, or included in some other component coupled to processing devices (e.g., CPUs, GPUs) and/or the cache controller to achieve the intended functionality. The cache controller may also include logic (or reciprocating logic) that can coordinate with other components in order to achieve the intended functionality, as outlined herein. In still other embodiments, one or several elements may include any suitable algorithms, hardware, firmware, software, components, modules, interfaces, or objects that facilitate the operations thereof. Logic may be suitably combined or partitioned in any appropriate manner, which may be based on particular configuration and/or provisioning needs.
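To make the decision flow concrete, here is a minimal sketch in C of the dynamic fill choice outlined above, assuming hypothetical structure and function names: on a read miss to which the DFP applies, the LRB occupancy is compared against the threshold; if the buffer is too full, the corresponding entry is deallocated and the LLC is bypassed so the data returns directly from system memory; otherwise the entry is kept and the returning data fills the LLC. This is a sketch under those assumptions, not the cache controller's actual implementation.

```c
#include <stdio.h>

/* Hypothetical model of the LLC request buffer (LRB). */
typedef struct {
    int filled_entries;       /* outstanding requests currently held in the LRB */
    int capacity;
} llc_request_buffer;

typedef enum { FILL_LLC, BYPASS_LLC } fill_decision;

/* Decide, for a read miss to which the DFP applies, whether to fill the
 * LLC with the returning data or bypass it. Here 'threshold' counts filled
 * LRB entries; per the examples below, it could equally be a percentage or
 * be expressed in terms of free entries. */
static fill_decision dfp_on_miss(llc_request_buffer *lrb, int threshold) {
    if (lrb->filled_entries >= threshold) {
        /* LRB too full: deallocate the entry now and bypass the LLC; the
         * memory controller returns the data directly to the requester. */
        lrb->filled_entries--;
        return BYPASS_LLC;
    }
    /* Otherwise keep the entry; the data fills the LLC and the entry is
     * deallocated after the fill completes. */
    return FILL_LLC;
}

int main(void) {
    llc_request_buffer lrb = { .filled_entries = 30, .capacity = 32 };
    fill_decision d = dfp_on_miss(&lrb, 28);      /* over threshold: bypass */
    printf("%s\n", d == BYPASS_LLC ? "bypass LLC" : "fill LLC");
    return 0;
}
```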
The architectures presented herein are provided by way of example only, and are intended to be non-exclusive and non-limiting. Furthermore, the parts disclosed are intended to be logical divisions only (e.g., cache controller 60, LLC fill logic 62, dynamic fill logic 64), and may represent integrated hardware and/or software or physically separate hardware and/or software. Certain computing systems may include the cache controller as a separate chip or integrated into another chip, such as being placed on the same die or as an integral part of a processing device (e.g., CPU, GPU, APU, FPGA, DSP, ASIC, etc.). In yet other computing systems, the cache controller may be separately provisioned or combined with other cache controllers (e.g., other cache memory controllers, memory controllers, DRAM controllers, etc.).
It is also important to note that the operations in the preceding flowcharts and diagrams illustrating interactions (e.g.,
As used herein, unless expressly stated to the contrary, use of the phrase ‘at least one of’ refers to any combination of the named elements, conditions, or activities. For example, ‘at least one of X, Y, and Z’ is intended to mean any of the following: 1) at least one X, but not Y and not Z; 2) at least one Y, but not X and not Z; 3) at least one Z, but not X and not Y; 4) at least one X and at least one Y, but not Z; 5) at least one X and at least one Z, but not Y; 6) at least one Y and at least one Z, but not X; or 7) at least one X, at least one Y, and at least one Z. Additionally, unless expressly stated to the contrary, the numbering adjectives ‘first’, ‘second’, ‘third’, etc., are intended to distinguish the particular terms (e.g., element, condition, module, activity, operation, claim element, etc.) they precede, but are not intended to indicate any type of order, rank, importance, temporal sequence, or hierarchy of the modified term. For example, ‘first X’ and ‘second X’ are intended to designate two separate X elements that are not necessarily limited by any order, rank, importance, temporal sequence, or hierarchy of the two elements. Also, references in the specification to “one embodiment,” “an embodiment,” “some embodiments,” etc., indicate that the embodiment(s) described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment.
Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. In order to assist the United States Patent and Trademark Office (USPTO) and, additionally, any readers of any patent issued on this application in interpreting the claims appended hereto, Applicant wishes to note that the Applicant: (a) does not intend any of the appended claims to invoke paragraph six (6) of 35 U.S.C. section 112 as it exists on the date of the filing hereof unless the words “means for” or “step for” are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any way that is not otherwise reflected in the appended claims.
OTHER NOTES AND EXAMPLES
The following examples pertain to embodiments in accordance with this specification.
Example S1 may include a system, comprising: a first processing device; a second processing device; a cache controller coupled to a cache shared by the first and second processing devices, the cache controller to determine that data requested in a first request for the data by the first processing device is not stored in the cache, where a dynamic fill policy is applicable to the first request, determine to deallocate, based at least in part on a threshold, an entry in a buffer, the entry containing information corresponding to the first request for the data, and send a second request for the data to a system memory; and a memory controller to send the data from the system memory to the first processing device.
In Example S2, the subject matter of Example S1 can optionally include that the data from the system memory is not written to the cache based, at least in part, on the determination to deallocate the entry.
In Example S3, the subject matter of any one of Examples S1-S2 can optionally include that, prior to sending the second request, the entry in the buffer is deallocated in response to determining to deallocate the entry.
In Example S4, the subject matter of any one of Examples S1-S3 can optionally include that the second request includes an indication that the data is to be sent directly to the first processing device from the system memory.
In Example S5, the subject matter of any one of Examples S1-S4 can optionally include that the cache controller is further to determine that the dynamic fill policy is applicable to the first request based on the first processing device, where the dynamic fill policy is not applicable to requests for data from the second processing device.
In Example S6, the subject matter of any one of Examples S1-S5 can optionally include that the first processing device is a graphics processing unit (GPU) and the second processing device has a higher hit rate in the cache for read requests than the first processing device.
In Example S7, the subject matter of any one of Examples S1-S6 can optionally include that the first request for the data is a read request.
In Example S8, the subject matter of Example S7 can optionally include that the cache controller is further to search for the data in the cache upon receiving the read request.
In Example S9, the subject matter of any one of Examples S1-S8 can optionally include that the threshold is one of a whole number or a percentage associated with filled entries contained in the buffer.
In Example S10, the subject matter of any one of Examples S1-S8 can optionally include that the threshold is one of a whole number or a percentage associated with free entries contained in the buffer.
In Example S11, the subject matter of any one of Examples S1-S10 can optionally include that the cache controller is further to, based on determining not to deallocate the entry, write the data from the system memory to the cache and deallocate the entry subsequent to writing the data to the cache.
In Example S12, the subject matter of any one of Examples S1-S11 can optionally include one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the system to either statically set the threshold or adaptively modify the threshold.
In Example S13, the subject matter of Example S12 can optionally include that the threshold is adaptively modified based, at least in part, on run-time information associated with sending the second request for the data to the system memory.
In Example S14, the subject matter of any one of Examples S1-S11 can optionally include one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the system to mark the first request to indicate the dynamic fill policy is applicable to the first request based on a probability of the data being re-used from the cache.
Example A1 may include an apparatus, comprising: a cache shared by at least a first processing device and a second processing device; a cache controller coupled to the cache to receive a first request for data from the first processing device, where a dynamic fill policy is applicable to the first request, determine that the data is not stored in the cache, determine to deallocate, based at least in part on a threshold, an entry in a buffer, the entry containing information corresponding to the first request for the data, and send a second request for the data to a system memory; and a memory controller to send the data from the system memory to the first processing device.
In Example A2, the subject matter of Example A1 can optionally include that the data from the system memory is not written to the cache based, at least in part, on the determination to deallocate the entry.
In Example A3, the subject matter of any one of Examples A1-A2 can optionally include that, prior to sending the second request, the entry in the buffer is deallocated in response to determining to deallocate the entry.
In Example A4, the subject matter of any one of Examples A1-A3 can optionally include that the second request includes an indication that the data is to be sent directly to the first processing device from the system memory.
In Example A5, the subject matter of any one of Examples A1-A4 can optionally include that the cache controller is further to determine that the dynamic fill policy is applicable to the first request based on the first processing device, where the dynamic fill policy is not applicable to requests for data from the second processing device.
In Example A6, the subject matter of any one of Examples A1-A5 can optionally include that the first processing device is a graphics processing unit (GPU) and the second processing device has a higher hit rate in the cache for read requests than the first processing device.
In Example A7, the subject matter of any one of Examples A1-A6 can optionally include that the first request for the data is a read request.
In Example A8, the subject matter of Example A7 can optionally include that the cache controller is further to search for the data in the cache upon receiving the read request.
In Example A9, the subject matter of any one of Examples A1-A8 can optionally include that the threshold is one of a whole number or a percentage associated with filled entries contained in the buffer.
In Example A10, the subject matter of any one of Examples A1-A8 can optionally include that the threshold is one of a whole number or a percentage associated with free entries contained in the buffer.
In Example A11, the subject matter of any one of Examples A1-A10 can optionally include that the cache controller is further to, based on determining not to deallocate the entry, write the data from the system memory to the cache and deallocate the entry subsequent to writing the data to the cache.
In Example A12, the subject matter of any one of Examples A1-A11 can optionally include one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the apparatus to either statically set the threshold or adaptively modify the threshold.
In Example A13, the subject matter of Example A12 can optionally include that the threshold is adaptively modified based, at least in part, on run-time information associated with sending the second request for the data to the system memory.
In Example A14, the subject matter of any one of Examples A1-A11 can optionally include one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the apparatus to mark the first request to indicate the dynamic fill policy is applicable to the first request based on a probability of the data being re-used from the cache.
The following examples pertain to embodiments in accordance with this specification. Example M1 provides a method, at least one machine-readable storage medium including instructions, and/or hardware-, firmware-, and/or software-based logic, where Example M1 comprises determining that data requested in a first request for the data by a first processing device is not stored in a cache shared by the first processing device and a second processing device, where a dynamic fill policy is applicable to the first request, determining to deallocate, based at least in part on a threshold, an entry in a buffer, the entry containing information corresponding to the first request for the data, sending a second request for the data to a system memory; and sending the data from the system memory to the first processing device.
In Example M2, the subject matter of Example M1 can optionally include that the data from the system memory is not written to the cache based, at least in part, on the determination to deallocate the entry.
In Example M3, the subject matter of any one of Examples M1-M2 can optionally include, prior to sending the second request, deallocating the entry in the buffer in response to determining to deallocate the entry.
In Example M4, the subject matter of any one of Examples M1-M3 can optionally include that the second request includes an indication that the data is to be sent directly to the first processing device from the system memory.
In Example M5, the subject matter of any one of Examples M1-M4 can optionally include determining that the dynamic fill policy is applicable to the first request based, at least in part, on a hit rate of the first processing device according to a hit-rate threshold.
In Example M6, the subject matter of any one of Examples M1-M5 can optionally include that the first processing device is a graphics processing unit (GPU) and the second processing device has a higher hit rate in the cache for read requests than the first processing device.
In Example M7, the subject matter of any one of Examples M1-M6 can optionally include that the first request for the data is a read request.
In Example M8, the subject matter of Example M7 can optionally include searching for the data in the cache upon receiving the read request.
In Example M9, the subject matter of any one of Examples M1-M8 can optionally include that the threshold is one of a whole number or a percentage associated with filled entries contained in the buffer.
In Example M10, the subject matter of any one of Examples M1-M8 can optionally include that the threshold is one of a whole number or a percentage associated with free entries contained in the buffer.
In Example M11, the subject matter of any one of Examples M1-M10 can optionally include, based on determining not to deallocate the entry, writing the data from the system memory to the cache and deallocating the entry subsequent to writing the data to the cache.
In Example M12, the subject matter of any one of Examples M1-M11 can optionally include statically setting the threshold or adaptively modifying the threshold.
In Example M13, the subject matter of Example M12 can optionally include that the threshold is adaptively modified based, at least in part, on run-time information associated with sending the second request for the data to the system memory.
In Example M14, the subject matter of any one of Examples M1-M13 optionally includes marking the first request to indicate the dynamic fill policy is applicable to the first request based on a probability of the data being re-used from the cache.
Example Y1 provides an apparatus for dynamically filling a cache, where the apparatus comprises means for performing the method of any one of Examples M1-M14.
In Example Y2, the subject matter of Example Y1 can optionally include that the means for performing the method comprise at least one digital circuit.
In Example Y3, the subject matter of any one of Examples Y1-Y2 can optionally include that the means for performing the method comprise a memory element, the memory element comprising machine readable instructions that when executed, cause, at least in part, the apparatus to perform the method of any one of Examples M1-M14.
In Example Y4, the subject matter of any one of Examples Y1-Y3 can optionally include that the apparatus is one of a computing system or a system-on-a-chip.
Example Y5 provides at least one machine readable storage medium comprising instructions for dynamically filling a cache, where the instructions when executed realize an apparatus, realize a system, or implement a method as in any one of the preceding Examples.
Claims
1. An apparatus, the apparatus comprising:
- a cache shared by at least a first processing device and a second processing device;
- a cache controller coupled to the cache, the cache controller to: receive a first request for data from the first processing device; determine that the data is not stored in the cache; determine whether a dynamic fill policy is to be applied to the first request for the data based, at least in part, on which processing device sent the first request for the data, wherein applying the dynamic fill policy to the first request for the data is to: determine whether to deallocate an entry in a buffer based on a threshold related to the buffer, the entry containing information indicating the first request for the data; and send a second request for the data to a system memory; and
- a memory controller to send the data from the system memory to the first processing device.
2. The apparatus of claim 1, wherein the data from the system memory is not written to the cache based, at least in part, on a determination to deallocate the entry.
3. The apparatus of claim 1, wherein, prior to sending the second request, the entry in the buffer is deallocated in response to a determination to deallocate the entry.
4. The apparatus of claim 1, wherein, based on a determination to deallocate the entry, the second request includes an indication that the data requested in the second request for the data is to be sent directly to the first processing device from the system memory.
5. The apparatus of claim 1, wherein the cache controller is further to:
- determine that the dynamic fill policy is to be applied to requests for data from at least one processing device of a plurality of processing devices in the apparatus; and
- determine that the dynamic fill policy is not to be applied to requests for data from at least one other processing device of the plurality of processing devices in the apparatus.
6. The apparatus of claim 5, wherein the at least one processing device has a higher hit rate in the cache for read requests than the at least one other processing device.
7. The apparatus of claim 1, wherein the first request for the data is a read request.
8. The apparatus of claim 7, wherein the cache controller is further to:
- search for the data in the cache upon receiving the read request.
9. The apparatus of claim 1, wherein the threshold is one of a whole number or a percentage associated with filled entries contained in the buffer.
10. The apparatus of claim 1, wherein the threshold is one of a whole number or a percentage associated with free entries contained in the buffer.
11. The apparatus of claim 1, wherein the cache controller is further to, based on a determination not to deallocate the entry:
- write the data from the system memory to the cache; and
- deallocate the entry subsequent to writing the data to the cache.
12. The apparatus of claim 1, further comprising:
- one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the apparatus to either statically set the threshold or adaptively modify the threshold.
13. The apparatus of claim 12, wherein the threshold is adaptively modified based, at least in part, on run-time information associated with sending the second request for the data to the system memory.
14. The apparatus of claim 1, further comprising:
- one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the apparatus to: mark the first request for the data to indicate the dynamic fill policy is to be applied to the first request for the data based on a probability of the data being re-used from the cache.
15. A system, the system comprising:
- a first processing device;
- a second processing device;
- a cache controller coupled to a cache shared by the first and second processing devices, the cache controller to: determine that data requested in a first request for the data by the first processing device is not stored in the cache; determine whether a dynamic fill policy is to be applied to the first request for the data based, at least in part, on which processing device sent the first request for the data, wherein applying the dynamic fill policy to the first request for the data is to: determine whether to deallocate an entry in a buffer based on a threshold related to the buffer, the entry containing information indicating the first request for the data; and send a second request for the data to a system memory; and
- a memory controller to send the data from the system memory to the first processing device.
16. The system of claim 15, wherein the data from the system memory is not written to the cache based, at least in part, on a determination to deallocate the entry.
17. The system of claim 15, wherein, prior to sending the second request, the entry in the buffer is deallocated in response to a determination to deallocate the entry.
18. The system of claim 15, wherein the cache controller is further to:
- determine that the dynamic fill policy is to be applied to requests for data from at least one processing device of a plurality of processing devices in the system; and
- determine that the dynamic fill policy is not to be applied to requests for data from at least one other processing device of the plurality of processing devices in the system.
19. The system of claim 15, wherein the cache controller is further to:
- search for the data in the cache upon receiving the first request.
20. The system of claim 15, wherein the threshold is associated with either filled entries or free entries contained in the buffer.
21. The system of claim 15, further comprising:
- one or more memory elements including a set of instructions that when executed, are to cause at least one processing device of the system to either statically set the threshold or adaptively modify the threshold.
22. A method, the method comprising:
- determining that data requested in a first request for the data by a first processing device is not stored in a cache shared by the first processing device and a second processing device;
- determining whether a dynamic fill policy is to be applied to the first request for the data based, at least in part, on which processing device sent the first request for the data, wherein applying the dynamic fill policy to the first request for the data includes: determining whether to deallocate an entry in a buffer based on a threshold related to the buffer, the entry containing information indicating the first request for the data; and sending a second request for the data to a system memory; and
- sending the data from the system memory to the first processing device.
23. The method of claim 22, wherein the data from the system memory is not written to the cache based, at least in part, on a determination to deallocate the entry.
24. The method of claim 22, wherein, based on a determination to deallocate the entry, the second request includes an indication that the data is to be sent directly to the first processing device from the system memory.
25. The method of claim 22, further comprising:
- determining that the dynamic fill policy is to be applied to the first request based, at least in part, on a hit rate of the first processing device according to a hit-rate threshold.
Type: Grant
Filed: Mar 31, 2017
Date of Patent: Mar 12, 2019
Patent Publication Number: 20180285261
Assignee: Intel Corporation (Santa Clara, CA)
Inventors: Ayan Mandal (Karnataka), Eran Shifer (Tel Aviv), Leon Polishuk (Haifa)
Primary Examiner: Than Nguyen
Application Number: 15/476,816
International Classification: G06F 12/08 (20160101); G06F 12/084 (20160101); G06F 12/0846 (20160101); G06F 12/0855 (20160101); G06F 9/50 (20060101); G06F 12/02 (20060101); G06F 12/0888 (20160101); G06F 12/1027 (20160101); G06F 3/06 (20060101); G06F 9/455 (20180101);