Adaptive Caching of Input/Output Data
To improve caching techniques, so as to realize greater hit rates within available memory, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. The method of the present invention for caching data involves detecting a data I/O request, relative to a data object, and then selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge. Such selecting may occur dynamically or manually. The method further involves estimating an entropy of a first data block to be cached in response to the data I/O request; selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; and storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark. The method can also include the step of prefetching a data block using gap prediction with an applied entropy bias, wherein the data block is the same as the first data block to be cached or is a separate second data block. The method can also involve the following additional steps: adaptively adjusting the plurality of ranges of entropy values; scheduling a flush of the data block from the cache; and suppressing operating system flushes in conjunction with the foregoing scheduling step.
This application is a continuation of U.S. patent application Ser. No. 11/152,363, filed on Jun. 14, 2005, entitled “Adaptive Input/Output Compressed System and Data Cache and System Using Same”, invented by John E. Kellar, which claims benefit of priority of U.S. provisional application Ser. No. 60/579,344, titled “Adaptive Input/Output Cache and System Using Same,” filed Jun. 14, 2004, all of which are hereby incorporated by reference in their entirety as though fully and completely set forth herein.
FIELD OF THE INVENTION
The present invention relates, in general, to data processing systems and more particularly to adaptive data caching in data processing systems to reduce transfer latency or increase transfer bandwidth of data movement within these systems.
DESCRIPTION OF THE RELATED ART
In modern data processing systems, the continual increase in processor speeds has outpaced the rate of increase of data transfer rates from peripheral persistent data storage devices and sub-systems. In systems such as enterprise-scale server systems, in which substantial volumes of volatile or persistent data are manipulated, the speed at which data can be transferred may be the limiting factor in system efficiency. Commercial client/server database environments are emblematic of such systems. These environments are usually constructed to accommodate a large number of users performing a large number of sophisticated database queries and operations against a large distributed database. These compute-, memory- and I/O-intensive environments put great demands on database servers. If a database client or server is not properly balanced, then the number of database transactions per second that it can process can drop dramatically. A system is considered balanced for a particular application when the CPU(s) tends to saturate at about the same time as the I/O subsystem.
Continual improvements in processor technology have been able to keep pace with ever-increasing performance demands, but the physical limitations imposed on retrieving data from disk has caused I/O transfer rates to become an inevitable bottleneck. Bypassing these physical limitations has been an obstacle to overcome in the quest for better overall system performance.
In the computer industry, this bottleneck, known as a latency gap because of the speed differential, has been addressed in several ways. Caching the data in memory is known to be an effective way to diminish the time taken to access the data from a rotating disk. Unfortunately, memory resources are in high demand on many systems, and traditional cache designs have not made the best use of the memory devoted to them. For instance, many conventional caches simply cache data existing ahead of the last host request. Implementations such as these, known as Read Ahead caching, can work in certain situations, but for non-sequential read requests, data is fruitlessly brought into the cache memory. This blunt approach to caching has nevertheless become quite common due to the simplicity of its design. In fact, this approach has been put to use as read buffers within persistent data storage systems such as disks and disk controllers.
Encoding or compressing cached data in operating system caches increases the logical effective cache size and cache hit rate, and thus improves system response time. On the other hand, compressed data requires variable-length record management, free-space searching and garbage collection. This overhead may negate the performance improvements achieved by increasing effective cache size. Thus, there is a need for a new method of managing operating system file, data and buffer caches with low overhead, transparent to the operating system's conventional data managing methods. With such an improved method, it is expected that the effective, logically accessible memory available for file and data buffer caches will increase by 30% to 400%, effectively improving system cost-performance.
Ideally, a client should not notice any substantial degradation in response time for a given transaction even as the number of transactions requested per second by other clients to the database server increases. The availability of main memory plays a critical role in a database server's ability to scale for this application. In general, a database server will continue to scale up until the point that the application data no longer fits in main memory. Beyond this point, the buffer manager resorts to swapping pages between main memory and storage sub-systems. The amount of this paging increases exponentially as a function of the fraction of main memory available, causing application performance and response time to degrade exponentially as well. At this point, the application is said to be I/O bound.
When a user performs a sophisticated data query, thousands of pages may be needed from the database, which is typically distributed across many storage devices, and possibly distributed across many systems. To minimize the overall response time of the query, access times must be as small as possible to any database pages that are referenced more than once. Access time is also negatively impacted by the enormous amount of temporary data that is generated by the database server, which normally cannot fit into main memory, such as the temporary files generated for sorting. If the buffer cache is not large enough, then many of those pages will have to be repeatedly fetched to and from the storage sub-system.
Independent studies have shown that when only 70% to 90% of the working data fits in main memory, most applications will run several times slower than when all of it fits. When only 50% fits, most run 5 to 20 times slower. Typical relational database operations run 4 to 8 times slower when only 66% of the working data fits in main memory. The need to reduce or eliminate application page faults and data or file system I/O is compelling. Unfortunately for system designers, the demand for more main memory by database applications will continue to far exceed the rate of advances in memory density. Coupled with this demand from the application area come competing demands from the operating system, as well as associated I/O controllers and peripheral devices. Cost-effective methods are needed to increase the apparent, effective size of system memory.
It is difficult for I/O-bound applications to take advantage of recent advances in CPU, processor cache, Front Side Bus (FSB) speeds, >100 Mbit network controllers, and system memory performance improvements (e.g., DDR2), since they are constrained by the high latency and low bandwidth of volatile or persistent data storage subsystems. The most common way to reduce data transfer latency is to add memory. Adding memory to database servers may be expensive, since these applications demand a lot of memory, or may even be impossible due to physical system constraints such as slot limitations. Alternatively, adding more disks and disk caches with associated controllers, Network Attached Storage (NAS) and network controllers, or even Storage Area Network (SAN) devices with Host Bus Adapters (HBAs) can increase storage sub-system request and data bandwidth. It may even be necessary to move to a larger server with multiple, higher-performance I/O buses. Memory and disks are added until the database server becomes balanced.
First, the memory data encoding/compression increases the effective size of the system-wide file and/or buffer cache by encoding and storing a large block of data in a smaller space. The effective available reach of these caches is typically doubled, where reach is defined as the total immediately accessible data requested by the system, without recourse to out-of-core (not in main memory) storage. This allows client/server applications, which typically work on data sets much larger than main memory, to execute more efficiently due to the decreased number of volatile or persistent storage data requests. The number of data requests to the storage sub-systems is reduced because pages or disk blocks that have been accessed before are statistically more likely to still be in main memory when accessed again, due to the increased capacity of cache memory. A secondary effect of such compression or encoding is reduced latency in data movement due to the reduced size of the data. Basically, the average compression ratio must be balanced against the original data block size, as well as the internal cache hash bucket size, in order to reap the greatest benefit from this tradeoff. The Applicant of the present invention believes that an original uncompressed block size of 4096 bytes with an average compression ratio of 2:1, stored internally in the cache, in a data structure known as an open hash, in blocks of 256 bytes, results in the greatest benefit towards reducing data transfer latency for data movement across the north and south bridge devices, as well as to and from the processors across the Front Side Bus. The cache must be able to modify these values in order to reap the greatest benefit from this second-order effect.
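The arithmetic behind this tradeoff can be illustrated with a short sketch. The following Python fragment is a hypothetical illustration only: the constants mirror the values cited above, but the helper names and the uniform 2:1 assumption are assumptions made for this sketch. It computes how many 256-byte open-hash buckets one compressed 4096-byte block occupies and the resulting effective cache reach.

```python
# Hypothetical illustration of the block-size / bucket-size / compression-ratio
# tradeoff described above; not the patented implementation.

BUCKET_BYTES = 256          # internal open-hash storage granularity
SOURCE_BLOCK_BYTES = 4096   # original uncompressed I/O block size

def buckets_needed(compressed_size: int) -> int:
    """Number of 256-byte buckets consumed by one compressed block."""
    return -(-compressed_size // BUCKET_BYTES)   # ceiling division

def effective_reach(cache_bytes: int, avg_ratio: float = 2.0) -> int:
    """Approximate uncompressed bytes reachable from cache_bytes of cache
    memory, assuming every block compresses at avg_ratio:1."""
    compressed_size = int(SOURCE_BLOCK_BYTES / avg_ratio)
    per_block_cost = buckets_needed(compressed_size) * BUCKET_BYTES
    return (cache_bytes // per_block_cost) * SOURCE_BLOCK_BYTES

if __name__ == "__main__":
    one_gib = 1 << 30
    # At 2:1 a 4096-byte block fills eight 256-byte buckets (2048 bytes), so
    # 1 GiB of cache memory presents roughly 2 GiB of logically accessible data.
    print(effective_reach(one_gib))   # 2147483648
```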
There is a need to improve caching techniques, so as to realize greater hit rates within the available memory of modern systems. Current hit rates, from methods such as LRU (Least Recently Used), LFU (Least Frequently Used), GCLOCK and others, have increased very slowly in the past decade, and many of these techniques do not scale well with the large amounts of memory that modern computer systems have available today. To help meet this need, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. This signature is produced from the entropy estimation function described herein and stored in the tag structure of the cache. This signature provides a unique way to group previously seen data; this grouping is then used to bias or alter the pre-fetching gaps produced by the prefetching function described below. Empirical evidence shows that this entropy signature improves pre-fetching operations over large data sets (greater than 4 GBytes of addressable space) by approximately 11% over current techniques that do not have this feature available.
There is also a need for user applications to be able to access the capabilities for reducing transfer latency or increasing transfer bandwidth of data movement within these systems. There is a further need to supply these capabilities to these applications in a transparent way, allowing an end-user application to access these capabilities without requiring any recoding or alteration of the application. The Applicant of the present invention believes this may be accomplished through an in-core file-tracking database maintained by the invention. Such an in-core file-tracking database would offer seamless access to the capabilities of the invention by monitoring file open and close requests at the user-application/operating system interface, decoding the file access flags while maintaining an internal list of the original file object names and flags, and offering the capabilities of the invention to appropriate file accesses. The in-core file-tracking database would also allow the end-user to override an application's caching request and either allow or deny write-through, write-back, non-conservative or no-caching behavior to an application on a file-by-file basis, through the use of manual file tracking, or on a system-wide basis, through the use of dynamic file tracking. This capability could also be offered in a more global, system-wide way by allowing caching of file system metadata; this caching technique (the caching of file system metadata specifically) is referred to throughout this document as “non-conservative caching.”
There is a further need to allow an end-user application to seamlessly access PAE (Physical Address Extension) memory for use in file caching/data buffering, without the need to re-code or modify the application in any way. The PAE memory addressing mode is limited to the Intel x86 architecture. There is a need for a replacement of the underlying memory allocator to allow a PAE memory addressing mode to function on other processor architectures. This would allow end-user applications to utilize modern memory addressing capabilities without the need to re-code or modify the end-user application in any way, providing transparent, seamless access to PAE memory, for use by the buffer and data cache, without user intervention or system modification.
Today, large numbers of storage sub-systems are added to a server system to satisfy the high I/O request rates generated by client/server applications. As a result, it is common that only a fraction of the storage space on each storage device is utilized. By effectively reducing the I/O request rate, fewer storage sub-system caches and disk spindles are needed to queue the requests, and fewer disk drives are needed to serve these requests. The reason that storage sub-system space is not efficiently utilized is that, on today's hard-disk storage systems, access latency increases as the data written to the storage sub-system moves further inward from the edge of the magnetic platter. In order to keep access latency at a minimum, system designers over-design storage sub-systems to take advantage of this phenomenon. This results in under-utilization of available storage. There is a need to reduce average latency to the point that this trade-off is not needed, so that the storage space associated with each disk can be more fully utilized at an equivalent or reduced latency penalty.
In addition, by reducing the size of data to be transferred between local and remote persistent storage and system memory, the I/O and Front Side Buses (FSB) are utilized less. This reduced bandwidth requirement can be used to scale system performance beyond its original capabilities, or allow the I/O subsystem to be cost reduced due to reduced component requirements based on the increased effective bandwidth available.
Thus, there is a need in the art for mechanisms to balance the increases in clock cycles of the CPU and data movement latency gap without the need for adding additional volatile or persistent storage and memory sub-systems or increasing the clock cycle frequency of internal system and I/O buses. Furthermore, there is a need to supply this capability transparently to end user applications so that they can take advantage of this capability in both a dynamic and a directed way.
SUMMARY OF THE INVENTION
There is a need to improve caching techniques, so as to realize greater hit rates within the available memory of modern systems. Current hit rates, from methods such as LRU (Least Recently Used), LFU (Least Frequently Used), GCLOCK and others, have increased very slowly in the past decade, and many of these techniques do not scale well with the large amounts of memory that modern computer systems have available today. To help meet this need, the present invention utilizes an entropy signature from the compressed data blocks to supply a bias to pre-fetching operations. This signature is produced from the entropy estimation function described herein and stored in the tag structure of the cache. This signature provides a unique way to group previously seen data; this grouping is then used to bias or alter the pre-fetching gaps produced by the prefetching function described below. Empirical evidence shows that this entropy signature improves pre-fetching operations over large data sets (greater than 4 GBytes of addressable space) by approximately 11% over current techniques that do not have this feature available.
The method for caching data in accordance with the present invention involves detecting a data input/output request, relative to a data object, and then selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge. Such selecting may occur dynamically or manually. The method of the present invention further involves estimating an entropy of a data block to be cached in response to the data input/output request; selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark; and storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark. The method for caching data in accordance with the present invention can also include the step of prefetching a data block using gap prediction with an applied entropy bias, wherein the data block is the data block to be cached, as referenced above, or is a separate second data block. The method of the present invention can also involve the following additional steps: adaptively adjusting the plurality of ranges of entropy values; scheduling a flush of the data block from the cache; and suppressing operating system flushes in conjunction with the foregoing scheduling step.
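As a concrete, non-limiting illustration of this sequence, the following self-contained Python sketch estimates a block's entropy from its byte histogram, then stores the block either uncompressed (when its entropy falls in the first range at or above the watermark) or compressed by a compressor chosen from the entropy value. The watermark value, the band edge, and the use of zlib levels to stand in for distinct compressors are assumptions made for this sketch, not the claimed implementation.

```python
# Minimal sketch of the claimed flow: estimate entropy, choose a compressor
# (or none) from the entropy value, and store the block.  Thresholds and the
# use of zlib levels as stand-ins for distinct compressors are illustrative.
import math
import zlib

def estimate_entropy(block: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0..8)."""
    if not block:
        return 0.0
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def store_block(cache: dict, key, block: bytes, watermark: float = 7.5) -> None:
    e = estimate_entropy(block)
    if e >= watermark:
        # First range: nearly incompressible data is cached uncompressed.
        cache[key] = (e, False, block)
    else:
        # Lower-entropy ranges map to progressively stronger compressors.
        level = 1 if e > watermark - 2.0 else 6
        cache[key] = (e, True, zlib.compress(block, level))

cache = {}
store_block(cache, ("file.dat", 0), b"A" * 4096)   # very low entropy -> compressed
```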
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter, which form the subject of the claims of the invention.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference number throughout the several figures designates a like or similar element.
In the following description, numerous specific details are set forth such as specific word or byte lengths, etc. to provide a thorough understanding of the present invention. However, it will be obvious to those skilled in the art that the present invention may be practiced without such specific details. In other instances, well-known circuits have been shown in block diagram form in order not to obscure the present invention in unnecessary detail. For the most part, details concerning timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.
Refer now to the drawings wherein depicted elements are not necessarily shown to scale and wherein like or similar elements are designated by the same reference numeral through the several views.
In accordance with the present inventive principles, filter driver 118 intercepts the operating system file access and performs caching operations, described further herein below, transparently. That is, the caching, file tracking and, in particular, the compression associated therewith, is transparent to the application 104. Data selected for caching is stored in a (compressed) cache (denoted as ZCache 120). (The “ZCache” notation is used as a mnemonic device to call attention to the fact that the cache in accordance with the present invention is distinct from the instruction/data caches commonly employed in modern microprocessor systems, and typically denoted by the nomenclature “L1”, “L2” etc. cache. Furthermore the Z is a common mnemonic used to indicate compression or encoding activity.) In an embodiment of the present invention, ZCache 120 may be physically implemented as a region in main memory. Filter 118 maintains a file tracking database (DB) 122 which contains information regarding which files are to be cached or not cached, and other information useful to the management of file I/O operations, as described further herein below. Although logically part of filter driver 118, physically, file tracking DB 122 may be included in ZCache 120.
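One way to picture an entry in the file tracking database is sketched below in Python; the field names and the dictionary-backed layout are hypothetical illustrations, not the structure defined by the specification.

```python
# Hypothetical in-core file-tracking record and database; field names are
# illustrative only.
from dataclasses import dataclass

@dataclass
class FileTrackingEntry:
    file_object_name: str              # original name captured at open time
    access_flags: int                  # decoded flags from the open request
    cache_enabled: bool = True         # per-file override: cache or ignore
    write_policy: str = "write-back"   # or "write-through" / "no-cache"
    non_conservative: bool = False     # allow caching of file-system metadata

class FileTrackingDB:
    """Minimal dictionary-backed tracking database, keyed by file object name."""

    def __init__(self) -> None:
        self._entries: dict[str, FileTrackingEntry] = {}

    def on_open(self, name: str, flags: int) -> FileTrackingEntry:
        entry = self._entries.setdefault(name, FileTrackingEntry(name, flags))
        entry.access_flags = flags
        return entry

    def on_close(self, name: str) -> None:
        self._entries.pop(name, None)

db = FileTrackingDB()
db.on_open(r"\Device\HarddiskVolume1\data\table.db", 0x3)
```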
A few notes on the driver layering shown in the referenced figure:
1) The preferred embodiment of the File system driver layers itself between boxes #2 (I/O Manager Library) and #18 (FS Driver).
2) The disk filter layers itself between boxes #18 (FS Driver) and the boxes in the peer group depicted by #19 (Disk Class), #20 (CD-ROM Class), and #21 (Class).
3) The ZCache module exists as a stand-alone device driver adjunct to the file system filter and disk filter device drivers.
4) A TDI Filter Driver, with connection tracking for network connections that operates the same as the file tracking modules in the compressed data cache, is inserted between box (TDI) 8 and the peer group of modules that consists of (AFD) 3, (SRV) 4, (RDR) 5, (NPFS) 6, and (MSFS) 7. A complete reference on TDI is available on the Microsoft MSDN website at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/network/hh/network/303tdi_519j.asp, which is incorporated herein by reference.
5) An NDIS intermediate cache driver is inserted between the bottom edge of the transport drivers and the upper edge of the NDIS components.
The definitions of the cache-line states shown in the referenced state diagram are as follows (see also the sketch following this list):
1) Invalid: The cache line does not contain valid data.
2) Shared: The cache line contains data that is consistent with the backing store in the next level of the memory hierarchy.
3) Modified: The cache line contains the most recent data, which is different from the data contained in the backing store.
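Expressed as a data type, these three baseline states might look like the following sketch; the enumeration is illustrative and simply mirrors the definitions above.

```python
# The three baseline cache-line states defined above, as a simple enumeration.
from enum import Enum, auto

class LineState(Enum):
    INVALID = auto()    # the line holds no valid data
    SHARED = auto()     # data is consistent with the backing store
    MODIFIED = auto()   # data is newer than the backing store and must be written back
```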
Other caching protocol factors to consider are (see the policy sketch following this list):
1) Read/Write ordering consistency
2) Allocate on Write Policy
3) Write-through, Write-Back, and Non-cacheable attributes
4) Blocking vs. a Non-Blocking design
5) Support for hardware codecs
6) Squashing support to save I/O Requests
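These factors can be pictured as a per-cache (or per-file) policy descriptor. The grouping below is a hypothetical sketch; the attribute names and defaults are invented for illustration and do not appear in the specification.

```python
# Hypothetical grouping of the protocol factors listed above into one policy
# descriptor; names and defaults are illustrative.
from dataclasses import dataclass

@dataclass
class CachePolicy:
    ordered_read_write: bool = True   # enforce read/write ordering consistency
    allocate_on_write: bool = True    # allocate a cache line on a write miss
    write_mode: str = "write-back"    # "write-through", "write-back", or "non-cacheable"
    non_blocking: bool = True         # let later requests proceed past a miss
    hardware_codec: bool = False      # offload compression to a codec if one is present
    squash_duplicates: bool = True    # merge overlapping in-flight I/O requests
```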
Another important item to consider when applying this concept to the invention's cache protocol is the high latency associated with issuing and completing disk I/Os. It is necessary to break apart the MSI Shared and Modified states to take into consideration the following cases:
1) A cache line is allocated, but the associated disk I/O may not complete for hundreds, if not thousands, of microseconds. During this time, additional I/O requests could be made against the same allocated cache line.
2) Dynamically changing cache policies based on file-stream attributes, in different process contexts.
3) Taking maximum advantage of the asynchronous I/O model.
Application of these considerations is shown in the referenced state diagram and sketched below.
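A minimal sketch of how the Shared and Modified states might be split to cover in-flight disk I/O follows; the state names and the completion handler are assumptions made for illustration, since the actual transitions are defined by the referenced state diagram.

```python
# Hypothetical refinement of the MSI states to account for long-latency disk
# I/O; the *_PENDING states mark lines whose backing I/O has not yet completed.
from enum import Enum, auto

class ZCacheLineState(Enum):
    INVALID = auto()
    SHARED_PENDING = auto()     # allocated, read from backing store still in flight
    SHARED = auto()             # read complete, consistent with backing store
    MODIFIED_PENDING = auto()   # dirty, write-back to backing store in flight
    MODIFIED = auto()           # dirty, no write-back outstanding

def on_io_complete(state: ZCacheLineState) -> ZCacheLineState:
    """Asynchronous I/O completion moves a pending line to its stable state."""
    if state is ZCacheLineState.SHARED_PENDING:
        return ZCacheLineState.SHARED
    if state is ZCacheLineState.MODIFIED_PENDING:
        return ZCacheLineState.MODIFIED
    return state
```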
Many operating systems have features that can be exploited for maximum performance benefit. As previously mentioned, some of these features are asynchronous I/O models; I/O Request Packets (IRPs) that can be pended, managed and queued by intermediate drivers; and internal list manipulation techniques, such as look-aside lists or buddy lists. These features may vary slightly from operating system to operating system; none of these features are precluded or required by the present inventive design principles.
Refer now to the flow diagrams described below, which illustrate methodology 200. Methodology 200 watches for I/O operations involving data block moves, in step 502.
Firstly, the user may specify a list of files to be ignored. If, in step 204, the subject file of the data move is in the “ignored” list, process 200 returns to step 208 to continue to watch for data block moves. Otherwise, in step 206, it is determined if caching is turned off in accordance with a global caching policy. As discussed in conjunction with
In decision block 210, it is determined if dynamic, manual or alternatively, non-conservative tracking is set. This may be responsive to a value of Dynamic flag 28,
Untracked files include metadata and files that may have been opened before the caching process started. Metadata files are files that contain descriptions of data, such as information concerning the location of files and directories, log files used to recover corrupt volumes, and flags which indicate bad clusters on a physical disk. Metadata can represent a significant portion of the I/O to a physical persistent store because the contents of small files (e.g., <4,096 bytes) may be completely stored in metadata files. In step 214 it is determined if non-conservative caching is enabled. In an embodiment of the present invention using file tracking flags 21,
In step 214, it is determined if the subject file is a pagefile. If so, in step 214 it is determined if caching of pagefiles is enabled. The flag 28 (
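Taken together, the decision steps described above amount to a short eligibility chain before any data is admitted to the cache. The sketch below approximates that chain; the flag names, helper arguments, and ordering are assumptions that follow the narrative rather than the exact figure.

```python
# Approximate caching-eligibility chain for process 200; names and ordering
# are illustrative, following the narrative above.
from dataclasses import dataclass, field

@dataclass
class CachingPolicyState:
    ignored_files: set = field(default_factory=set)   # user-specified ignore list
    global_caching_enabled: bool = True                # global caching policy
    non_conservative: bool = False                     # cache file-system metadata?
    cache_pagefiles: bool = False                      # cache pagefile I/O?

def should_cache(name: str, is_metadata: bool, is_pagefile: bool,
                 policy: CachingPolicyState) -> bool:
    if name in policy.ignored_files:
        return False
    if not policy.global_caching_enabled:
        return False
    if is_metadata and not policy.non_conservative:
        return False      # metadata is cached only under non-conservative caching
    if is_pagefile and not policy.cache_pagefiles:
        return False
    return True

policy = CachingPolicyState(ignored_files={"pagefile.sys"})
print(should_cache("table.db", False, False, policy))   # True
```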
Process 200 having determined that the subject data is to be cached, in step 220 file object information is extracted from the I/O request packet and stored in the file tracking DB, step 222 (
In step 226 (
In step 232 (
In step 234, which may be viewed as a three-way decision block if three levels of compression are provided, the subject data block is compressed using an entropy estimate based compressor selection. This may be further understood by referring to
Moreover, the bands may be adaptively adjusted. If, for example, the CPU is being underutilized, it may be advantageous to use a more aggressive compressor, even if the additional compression might not otherwise be worth the tradeoff. In this circumstance, the width of bands 404a, b and 406a, b may be expanded. Conversely, if CPU cycles are at a premium relative to memory, it may be advantageous to increase the width of bands 402a, b, and shrink the width of bands 406a, b. A methodology for adapting the compressor selection is described in conjunction with
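The band mechanism and its adaptation might be sketched as follows; the watermark value, the band widths, the compressor labels, and the CPU-utilization thresholds are all assumptions made for this illustration.

```python
# Hypothetical band-based compressor selection with adaptive band widths.
# Band edges are offsets below the entropy watermark; widening the aggressive
# band when the CPU is idle (and the fast band when CPU cycles are scarce)
# follows the tradeoff described above.
from dataclasses import dataclass

@dataclass
class EntropyBands:
    watermark: float = 7.5    # bits/byte at or above which data is stored uncompressed
    fast_band: float = 1.5    # width of the band handled by a fast, light compressor
    heavy_band: float = 3.0   # width of the band handled by an aggressive compressor

    def select(self, entropy: float) -> str:
        if entropy >= self.watermark:
            return "store-uncompressed"                 # nearly incompressible
        if entropy >= self.watermark - self.fast_band:
            return "fast-compressor"                    # cheap, modest ratio
        if entropy >= self.watermark - self.fast_band - self.heavy_band:
            return "aggressive-compressor"
        return "strongest-compressor"                   # very low entropy, worth the cycles

    def adapt(self, cpu_utilization: float) -> None:
        """Widen the aggressive band when the CPU is idle; favor the fast band when busy."""
        if cpu_utilization < 0.5:
            self.heavy_band += 0.25
        elif cpu_utilization > 0.9:
            self.fast_band += 0.25
            self.heavy_band = max(self.heavy_band - 0.25, 0.5)

bands = EntropyBands()
print(bands.select(6.5))   # "fast-compressor"
```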
Returning to step 288 in
If the block has been previously read, in step 260 it is determined if a gap prediction is stored in the tag (e.g., gap prediction member 318,
Otherwise, in step 264 the next sequential block is prefetched and a prefetch counter is set for the block. Referring to
Returning to step 258, if the block has not been previously read, in
Similarly, if there is no gap prediction, a prefetch based solely on entropy is performed via the “No” branch of decision block 260.
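A simple sketch of gap-predicted prefetching with an entropy bias follows. The tag layout and the rule for applying the bias (fetching deeper ahead when the stored entropy signature matches the entropy of the current access group) are assumptions made for illustration; the specification states only that the gap prediction and entropy signature are kept in the cache tag.

```python
# Illustrative gap-predicted prefetch with an entropy bias; the tag layout and
# the bias rule are assumptions for this sketch.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheTag:
    entropy_signature: float               # stored when the block was cached
    gap_prediction: Optional[int] = None   # predicted stride between reads, in blocks
    prefetch_count: int = 0

def next_prefetch(block_no: int, tag: CacheTag, group_entropy: float) -> int:
    """Choose the next block number to prefetch for a previously seen block."""
    if tag.gap_prediction is not None:
        gap = tag.gap_prediction
        # Bias: a block whose entropy signature matches the current access group
        # is assumed to belong to the same stream, so fetch deeper ahead.
        if abs(tag.entropy_signature - group_entropy) < 0.25:
            gap *= 2
        return block_no + gap
    # No gap prediction yet: fall back to the next sequential block.
    tag.prefetch_count += 1
    return block_no + 1

tag = CacheTag(entropy_signature=4.1, gap_prediction=3)
print(next_prefetch(100, tag, group_entropy=4.0))   # 106 (gap doubled by the bias)
```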
In step 204 the read is returned.
Preferred implementations of the invention include implementations as a computer system programmed to execute the method or methods described herein, and as a computer program product. According to the computer system implementation, sets of instructions for executing the method or methods are resident in the random access memory 714 of one or more computer systems configured generally as described above. These sets of instructions, in conjunction with the system components that execute them, may perform operations in conjunction with data block caching as described hereinabove. Until required by the computer system, the set of instructions may be stored as a computer program product in another computer memory, for example, in disk drive 720 (which may include a removable memory such as an optical disk or floppy disk for eventual use in the disk drive 720). Further, the computer program product can also be stored at another computer and transmitted to the user's workstation by a network or by an external network such as the Internet. One skilled in the art would appreciate that the physical storage of the sets of instructions physically changes the medium upon which it is stored so that the medium carries computer-readable information. The change may be electrical, magnetic, chemical, biological, or some other physical change. While it is convenient to describe the invention in terms of instructions, symbols, characters, or the like, the reader should remember that all of these and similar terms should be associated with the appropriate physical elements.
Note that the invention may describe terms such as comparing, validating, selecting, identifying, or other terms that could be associated with a human operator. However, for at least a number of the operations described herein which form part of at least one of the embodiments, no action by a human operator is desirable. The operations described are, in large part, machine operations processing electrical signals to generate other electrical signals.
Claims
1. A method for caching data comprising:
- detecting a data input/output (I/O) request, relative to a data object;
- selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge;
- estimating an entropy of a data block to be cached in response to the data input/output request;
- selecting a compressor using a value of the entropy of the data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark;
- storing the data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the data block from the estimating step falls in a first range of entropy values relative to the entropy watermark; and
- prefetching the data block using gap prediction with an applied entropy bias.
2. The method of claim 1 further comprising adaptively adjusting the plurality of ranges of entropy values.
3. The method of claim 1 further comprising scheduling a flush of the data block from the cache.
4. The method of claim 3 further comprising suppressing operating system flushes in conjunction with the scheduling step.
5. The method of claim 1, wherein said selecting occurs dynamically.
6. The method of claim 1, wherein said selecting occurs manually.
7. A method for caching data comprising:
- detecting a data input/output (I/O) request, relative to a data object;
- selecting appropriate I/O to cache, wherein said selecting can occur with or without user input, or with or without application or operating system preknowledge;
- estimating an entropy of a first data block to be cached in response to the data input/output request;
- selecting a compressor using a value of the entropy of the first data block from the estimating step, wherein each compressor corresponds to one of a plurality of ranges of entropy values relative to an entropy watermark;
- storing the first data block in a cache in compressed form from the selected compressor, or in uncompressed form if the value of the entropy of the first data block from the estimating step falls in a first range of entropy values relative to the entropy watermark; and
- prefetching a second data block using gap prediction with an applied entropy bias.
8. The method of claim 7 further comprising adaptively adjusting the plurality of ranges of entropy values.
9. The method of claim 7 further comprising scheduling a flush of the data block from the cache.
10. The method of claim 9 further comprising suppressing operating system flushes in conjunction with the scheduling step.
11. The method of claim 7, wherein said selecting occurs dynamically.
12. The method of claim 7, wherein said selecting occurs manually.
13. One or more computer program products readable by a machine and containing instructions for performing the method contained in claim 1.
14. One or more computer program products readable by a machine and containing instructions for performing the method contained in claim 7.
Type: Application
Filed: Sep 8, 2008
Publication Date: Feb 26, 2009
Inventor: John E. Kellar (Georgetown, TX)
Application Number: 12/206,051
International Classification: G06F 12/08 (20060101); G06F 12/00 (20060101); G06F 12/12 (20060101);