Accelerated RAID with rewind capability
A method for storing data in a fault-tolerant storage subsystem having an array of failure independent data storage units, by dividing the storage area on the storage units into a logical mirror area and a logical stripe area, such that when storing data in the mirror area, duplicating the data by keeping a duplicate copy of the data on a pair of storage units, and when storing data in the stripe area, storing data as stripes of blocks, including data blocks and associated error-correction blocks.
The present invention relates to data protection in data storage devices, and in particular to data protection in disk arrays.
BACKGROUND OF THE INVENTION
Storage devices of various types are utilized for storing information such as in computer systems. Conventional computer systems include storage devices such as disk drives for storing information managed by an operating system file system. With decreasing costs of storage space, an increasing amount of data is stored on individual disk drives. However, in case of disk drive failure, important data can be lost. To alleviate this problem, some fault-tolerant storage devices utilize an array of redundant disk drives (RAID).
In typical data storage systems including storage devices such as primary disk drives, the data stored on the primary storage devices is backed-up to secondary storage devices such as tape, from time to time. However, any change to the data on the primary storage devices before the next back-up, can be lost if one or more of the primary storage devices fail.
True data protection can be achieved by keeping a log of all writes to a storage device, on a data block level. In one example, a user data set and a write log are maintained, wherein the data set has been completely backed up and thereafter a log of all writes is maintained. The backed-up data set and the write log allows returning to the state of the data set before the current state of the data set, by restoring the backed-up (baseline) data set and then executing all writes from that log up until that time.
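The baseline-plus-log recovery described above can be illustrated with a minimal Python sketch. The names `LogEntry` and `rewind`, the dictionary model of the data set, and the numeric time stamps are assumptions made for illustration only; they do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogEntry:
    timestamp: float  # time the write was logged
    address: int      # block address in the data set
    data: bytes       # block contents


def rewind(baseline: Dict[int, bytes], log: List[LogEntry], until: float) -> Dict[int, bytes]:
    """Rebuild the data set as it stood at time `until`."""
    state = dict(baseline)  # start from the backed-up (baseline) data set
    for entry in sorted(log, key=lambda e: e.timestamp):
        if entry.timestamp > until:
            break           # stop once the selected time is passed
        state[entry.address] = entry.data
    return state


baseline = {0: b"A0", 1: b"B0"}
log = [
    LogEntry(1.0, 0, b"A1"),
    LogEntry(2.0, 1, b"B1"),
    LogEntry(3.0, 0, b"A2"),
]
state_at_2 = rewind(baseline, log, until=2.0)
```

Here `state_at_2` holds the blocks as they were after the second logged write and before the third.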
To protect the log file itself, RAID configured disk arrays provide protection against data loss by protecting against a single disk drive failure. Protecting the log file stream using RAID has been achieved by either a RAID mirror (known as RAID-1) shown by example in
Referring back to
On the other hand, the RAID mirror configuration (“mirror”) allows writing the log file stream to disk faster than the RAID stripe configuration (“stripe”). A mirror is faster than a stripe since in the mirror, each write activity is independent of other write activities, in that the same block can be written to the mirroring disk drives at the same time. However, a mirror configuration requires that the capacity to be protected be matched on another disk drive. This is costly as the capacity to be protected must be duplicated, requiring double the number of disk drives. A stripe reduces such capacity to 1/n where n is the number of disk drives in the disk drive array. As such, protecting data with parity across multiple disk drives makes a stripe slower than a mirror, but more cost effective.
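As a rough illustration of the trade-off above, the following Python sketch computes single-parity protection for one stripe using byte-wise XOR (one common form of error-correction block; the text does not prescribe a specific code) and compares the redundancy fraction of the two layouts. The helper names `xor_blocks` and `redundancy_fraction` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives, with parity on a fourth drive.
data_blocks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data_blocks)

# If the second drive fails, its block is rebuilt from the survivors and parity.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])


def redundancy_fraction(n_drives, layout):
    """Fraction of raw capacity spent on redundancy (assumed simple model)."""
    if layout == "mirror":
        return 0.5            # every block is stored twice
    if layout == "stripe":
        return 1 / n_drives   # one parity block per stripe of n blocks
    raise ValueError(layout)
```

Under this simple model, a five-drive stripe spends one fifth of its raw capacity on redundancy, while a mirror always spends half. The write penalty arises because updating one data block also requires updating the parity block of its stripe.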
There is, therefore, a need for a method and system of providing cost effective data protection with better data read/write performance than a conventional RAID system. There is also a need for such a system to provide the capability of returning to a desired previous data state.
BRIEF SUMMARY OF THE INVENTION
The present invention satisfies these needs. In one embodiment, the present invention provides a method for storing data in a fault-tolerant storage subsystem having an array of failure independent data storage units, by dividing the storage area on the storage units into a hybrid of a logical mirror area (i.e., RAID mirror) and a logical stripe area (i.e., RAID stripe). When storing data in the mirror area, the data is duplicated by keeping a duplicate copy of the data on a pair of storage units, and when storing data in the stripe area, the data is stored as stripes of blocks, including data blocks and associated error-correction blocks.
In one version of the present invention, a log file stream is maintained as a log cache in the RAID mirror area for writing data from a host to the storage subsystem, and then data is transferred from the log file in the RAID mirror area to the final address in the RAID stripe area, preferably as a background task. In doing so, the aforementioned write latency performance penalty associated with writes to a RAID stripe can be masked from the host.
To further enhance performance, according to the present invention, a memory cache (RAM cache) is added in front of the log cache, wherein incoming host blocks are first written to RAM cache quickly and the host is acknowledged. The host perceives a faster write cycle than is possible if the data were written to a data storage unit while the host waited for an acknowledgement. This further enhances the performance of the above hybrid RAID subsystem.
While the data is en-route to a data storage unit through the RAM cache, power failure can result in data loss. As such, according to another aspect of the present invention, a flashback module (backup module) is added to the subsystem to protect the RAM cache data. The flashback module includes a non-volatile memory, such as flash memory, and a battery. During normal operations, the battery is trickle charged. Should any power failure then occur, the battery provides power to transfer the contents of the RAM cache to the flash memory. Upon restoration of power, the flash memory contents are transferred back to the RAM cache, and normal operations resume.
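The save-and-restore cycle described above might be modeled as follows. This is only a sketch: the class name `FlashbackModule` is borrowed from the text, but its methods, the dictionary model of the RAM cache, and the assumption that the battery always lasts long enough for the copy are illustrative.

```python
class FlashbackModule:
    """Hypothetical model: battery-backed copy of the RAM cache to flash."""

    def __init__(self):
        self.flash = {}  # stands in for the non-volatile flash memory

    def on_power_failure(self, ram_cache):
        # The battery is assumed to supply power for the duration of this copy.
        self.flash = dict(ram_cache)
        ram_cache.clear()  # volatile RAM loses its contents without power

    def on_power_restored(self, ram_cache):
        ram_cache.update(self.flash)  # copy the saved blocks back to RAM
        self.flash.clear()


ram_cache = {12: b"pending write"}
module = FlashbackModule()
module.on_power_failure(ram_cache)
saved_in_flash = dict(module.flash)
module.on_power_restored(ram_cache)
```

After the cycle, the RAM cache again holds the write that had not yet reached disk, so normal operation can resume without loss.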
Read performance is further enhanced by pressing a data storage unit (e.g., disk drive) normally used as a spare data storage unit (“hot spare”) in the array, into temporary service in the hybrid RAID system. In a conventional RAID subsystem, any hot spare lies dormant but ready to take over if one of the data storage units in the array should fail. According to the present invention, rather than lying dormant, the hot spare can be used to replicate the data in the mirrored area of the hybrid RAID subsystem. Should any data storage unit in the array fail, this hot spare could immediately be delivered to take the place of that failed data storage unit without increasing exposure to data loss from a single data storage unit failure. However, while all the data storage units of the array are working properly, the replication of the mirror area would make the array more responsive to read requests by allowing the hot spare to supplement the mirror area.
The mirror area acts as a temporary store for the log, prior to storing the write data in its final location in the stripe area. In another version of the present invention, prior to purging the data from the mirror area, the log can be written sequentially to an archival storage medium such as tape. If a baseline backup of the entire RAID subsystem stripe is created just before the log files are archived, each successive state of the RAID subsystem can be recreated by re-executing the write requests within the archived log files. This would allow any earlier state of the stripe of the RAID subsystem to be recreated (i.e., infinite roll-back or rewind). This is beneficial in allowing recovery from e.g. user error such as accidentally erasing a file, from a virus infection, etc.
As such, the present invention provides a method and system of providing cost effective data protection with better data read/write performance than a conventional RAID system, and also provides the capability of returning to a desired previous data state.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other features, aspects and advantages of the present invention will become understood with reference to the following description, appended claims and accompanying figures where:
Referring to
In the example of
Referring to the example steps in the flowchart of
Referring to
As the write log 26 may grow large, it is preferably offloaded to secondary storage devices such as tape drives, to free up disk space to log more changes to the data set 24. As such, the disk array 17 (
Referring to the example steps in the flowchart of
As such, the stripe area 22 is used for flushing the write log data, thereby permanently storing the data set in the stripe area 22, and also used to read data blocks that are not in the write log cache 26 in the mirror area 20. The hybrid RAID system 16 is an improvement over a conventional RAID stripe without a RAID mirror, since according to the present invention most recently written data is likely in the log 26 stored in the mirror area 20, which provides a faster read than a stripe. The hybrid RAID system provides the equivalent of RAID mirror performance for all writes and for most reads, since most recently written data is most likely to be read back. As such, the RAID stripe 22 is only accessed to retrieve data not found in the log cache 26 stored in the RAID mirror 20, whereby the hybrid RAID system 16 essentially provides the performance of a RAID mirror, but at the cost effectiveness of a RAID stripe.
Therefore, if the stripe 22 is written to as a foreground process (e.g., real-time), then there is write performance penalty (i.e. the host is waiting for an acknowledgement that the write is complete). The log cache 26 permits avoidance of such real-time writes to the stripe 22. Because the disk array 17 is divided into two logical data areas (i.e., a mirrored log write area 20 and a striped read area 22) using a mirror configuration for log writes avoids the write performance penalty of a stripe. Provided the mirror area 20 is sufficiently large to hold all log writes that occur during periods of peak activity, updates to the stripe area 22 can be performed in the background. The mirror area 20 is essentially a write cache, and writing the log 26 to the mirror area 20 with background writes to the stripe area 22 allows the hybrid subsystem 16 to match mirror performance at stripe-like cost.
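A minimal sketch of this write path, under stated assumptions, is given below. The class `HybridArray`, its method names, and the in-memory stand-ins for the mirror and stripe areas are hypothetical; a real controller would issue disk I/O and run the flush as a background task.

```python
from collections import deque


class HybridArray:
    """Hypothetical model of the mirrored log area and the striped data set."""

    def __init__(self):
        self.mirror_log = deque()  # sequential write log (mirror area)
        self.stripe = {}           # data set keyed by block address (stripe area)
        self.clock = 0             # stand-in for a time stamp source

    def write(self, address, data):
        """Foreground path: append to the log and acknowledge at once."""
        self.clock += 1
        self.mirror_log.append((self.clock, address, data))
        return "ack"

    def flush_one(self):
        """Background path: move the oldest log entry to its final address."""
        if not self.mirror_log:
            return False
        _, address, data = self.mirror_log.popleft()
        self.stripe[address] = data
        return True


array = HybridArray()
reply = array.write(7, b"x")
pending_before_flush = len(array.mirror_log)
array.flush_one()
```

The host receives its acknowledgement before the stripe is touched; the slower parity update happens later in `flush_one`, outside the host's write cycle.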
Referring to the example steps in the flowchart of
However, power failure while the data is en-route to disk (e.g., to the write log cache on disk) through the RAM write cache 32 can result in data loss because RAM is volatile. Therefore, as shown in the example block diagram of another embodiment of a hybrid RAID subsystem 16 in
The module 34 includes a non-volatile memory 36 such as Flash memory, and a battery 38. Referring to the example steps in the flowchart of
To minimize the size (and the cost) of the RAM write cache 32 (and thus the corresponding size and cost of flash memory 36 in the flashback module 34), write data should be transferred to disk as quickly as possible. Since sequential throughput of a hard disk drive is substantially better than random performance, the fastest way to transfer data from the RAM write cache 32 to disk is via the log file 26 (i.e., a sequence of address/data pairs above) in the mirror area 20. This is because when writing a data block to the mirror area 20, the data block is written to two different disk drives. Depending on the physical disk address of the incoming blocks from the host to be written, the disk drives of the mirror 20 may be accessed randomly. However, as a log file is written sequentially based on entries in time, the blocks are written to the log file in a sequential manner, regardless of their actual physical location in the data set 24 on the disk drives.
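To illustrate why the log is written sequentially regardless of the target addresses, the following Python sketch appends address/data pairs to a byte stream. The 12-byte header layout (8-byte address, 4-byte length) and the function names are assumptions for illustration; the text does not specify an on-disk format.

```python
import io
import struct

# Assumed entry header: 8-byte block address, 4-byte payload length.
HEADER = struct.Struct("<QI")


def append_entry(log, address, data):
    """Append one address/data pair at the end of the log; return its offset."""
    offset = log.seek(0, io.SEEK_END)
    log.write(HEADER.pack(address, len(data)))
    log.write(data)
    return offset


def read_entries(log):
    """Yield (address, data) pairs in the order they were written."""
    log.seek(0)
    while True:
        header = log.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        address, length = HEADER.unpack(header)
        yield address, log.read(length)


log = io.BytesIO()  # stands in for the log area on disk
offsets = [append_entry(log, addr, data)
           for addr, data in [(900, b"aaaa"), (3, b"bbbb"), (512, b"cccc")]]
entries = list(read_entries(log))
```

Although the block addresses (900, 3, 512) are scattered, the log offsets increase monotonically, so the drives see one sequential stream.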
In the above hybrid RAID system architecture according to the present invention, data requested by the host 29 from the RAID subsystem 16 can be in the RAM write cache 32, in the log cache area 26 in the mirror 20 area or in the general purpose stripe area 22. Referring to the example steps in the flowchart of
Since data in the mirror area 20 is replicated, twice the number of actuators are available to pursue read data requests, effectively doubling responsiveness. While this mirror benefit is generally recognized, the benefit may be enhanced because the mirror does not contain random data but rather data that has recently been written. As discussed, because the likelihood that data will be read is probably inversely proportional to the time since the data has been written, the mirror area 20 may be more likely to contain the desired data. A further acceleration can be realized if the data is read back in the same order it was written, regardless of the potential randomness of the final data addresses, since the mirror area 20 stores data in the written order and a read in that order creates a sequential stream.
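The lookup order for a read request (RAM write cache, then log cache in the mirror area, then stripe area) can be sketched as below. The function `read_block`, the returned source labels, and the list/dictionary models of the three tiers are hypothetical and serve only to make the order of checks concrete.

```python
def read_block(address, ram_cache, mirror_log, stripe):
    """Return (data, source) for a block, checking the fastest tier first."""
    if address in ram_cache:
        return ram_cache[address], "ram"
    # Scan the log newest-first so the most recent write to the block wins.
    for logged_address, data in reversed(mirror_log):
        if logged_address == address:
            return data, "mirror"
    if address in stripe:
        return stripe[address], "stripe"
    return None, "miss"


ram_cache = {1: b"r"}
mirror_log = [(2, b"old"), (2, b"new")]  # time-ordered (address, data) pairs
stripe = {2: b"stale", 3: b"s"}

from_ram = read_block(1, ram_cache, mirror_log, stripe)
from_mirror = read_block(2, ram_cache, mirror_log, stripe)
from_stripe = read_block(3, ram_cache, mirror_log, stripe)
```

Block 2 is served from the log rather than the stripe because the log holds a newer version that has not yet been flushed. A practical implementation would likely index the log by address instead of scanning it.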
According to another aspect of the present invention, read performance of the subsystem 16 can further be enhanced. In a conventional RAID system, one of the disk drives in the array can be reserved as a spare disk drive (“hot spare”), wherein if one of the other disk drives in the array should fail, the hot spare is used to take the place of that failed drive. According to the present invention, read performance can be further enhanced by pressing a disk drive normally used as a hot spare in the disk array 17, into temporary service in the hybrid RAID subsystem 16.
Referring to the example steps in the flowchart of
As such, in
Depending upon the size of the mirrored area 20, the hot spare 18a may be able to provide multiple redundant data copies for further performance boost. For example, if the hot spare 18a matches the capacity of the mirrored area 20 of the array 17, the mirrored area data can then be replicated twice on the hot spare 18a. For example, in the hot spare 18a data can be arranged wherein the data is replicated on each concentric disk track (i.e., one half of a track contains a copy of that which is on the other half of that track). In that case, rotational latency of the hot spare 18a in response to random requests is effectively halved (i.e., smaller read latency).
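The halving of rotational latency can be checked with a small numerical sketch. Angles are expressed as fractions of one revolution, the head position is sampled uniformly, and the function names are hypothetical; seek time and transfer time are ignored.

```python
def rotational_wait(head_angle, copy_angles):
    """Fraction of a revolution until the nearest copy reaches the head."""
    return min((angle - head_angle) % 1.0 for angle in copy_angles)


def mean_wait(copy_angles, samples=1000):
    """Average wait over evenly spaced head positions around the track."""
    total = sum(rotational_wait(i / samples, copy_angles) for i in range(samples))
    return total / samples


one_copy = mean_wait([0.0])         # a single copy on the track
two_copies = mean_wait([0.0, 0.5])  # copies on opposite halves of the track
```

With one copy the average wait is about half a revolution; with two copies placed half a revolution apart it drops to about a quarter, which matches the halving stated above. Choosing the minimum in `rotational_wait` corresponds to selecting the copy that can be provided with the least read latency.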
As such, the hot spare 18a is used to make the mirror area 20 of the hybrid RAID subsystem 16 faster.
Because a hot spare disk drive should match capacity of other disk drives in the disk array (primary array) and since in this example the mirror area data (M0-M5) is half the capacity of a disk drive 18, the hot spare 18a can replicate the mirror area 20 twice. If the hot spare 18a includes a replication of the mirror area, the hot spare 18a can be removed from the subsystem 16 and backed-up. The backup can be performed off-line, not using network bandwidth. A new baseline could be created from the hot spare 18a.
If for example, previously a full backup of the disk array has been made to tape, and that the hot spare 18a contains all writes since that backup, then the backup can be restored from tape to a secondary disk array and then all writes from the log file 26 written to the stripe 22 of the secondary disk array. To speed this process only the most recent update to a given block need be written. The order of writes need not take place in a temporal order but can be optimized to minimize time between reads of the hot spare and/or writes to the secondary array. The stripe of the secondary array is then in the same state as that of the primary array, as of the time the hot spare was removed from the primary array. Backing up the secondary array to tape at this point creates a new baseline that can then be updated with newer hot spares over time to create newer baselines facilitating fast emergency restores. Such new baseline creation can be done without a host but rather with an appliance including a disk array and a tape drive. If the new baseline tape backup fails, the process can revert to the previous baseline and a tape backup of the hot spare.
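The optimization of writing only the most recent update to each block can be sketched as follows. The function name `coalesce` is hypothetical, and sorting the result by block address is merely one assumed way of ordering the writes; the text leaves the ordering open.

```python
def coalesce(log):
    """Keep only the latest write per block address.

    `log` is a time-ordered list of (address, data) pairs.
    """
    latest = {}
    for address, data in log:
        latest[address] = data  # later entries overwrite earlier ones
    # Assumed ordering: ascending address, to favor sequential target writes.
    return sorted(latest.items())


log = [(5, b"a"), (2, b"b"), (5, b"c")]
writes_to_apply = coalesce(log)
```

Three logged writes reduce to two writes on the secondary array, and the final state is the same as if all three had been replayed in time order.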
Another embodiment of a hybrid RAID subsystem 16 according to the present invention provides data block service and can be used as any block device (e.g., single disk drive, RAID, etc.). Such a hybrid RAID subsystem can be used in any system wherein a device operating at a data block level can be used.
The present invention provides further example enhancements to the hybrid RAID subsystem, described herein below. As mentioned, the mirror area 20 (
The present invention further provides compressing the data in the log 26 stored in the mirror area 20 of the hybrid RAID system 16 for cost effectiveness. Compression is not employed in a conventional RAID subsystem because of variability in data redundancy. For example, a given data block is to be read, modified and rewritten. If the read data consumes the entire data block and the modified data does not contain as much redundancy as did the original data, then the compressed modified data cannot fit in the data block on disk.
However, a read/modify/write operation is not a valid operation in the mirror area 20 in the present invention because the mirror area 20 contains a sequential log file of writes. While a given data block may be read from the mirror area 20, after any modification, the writing of the data block would be appended to the existing log file stream 26, not overwritten in place. Because of this, variability in compression is not an issue in the mirror area 20. Modern compression techniques can e.g. halve the size of typical data, whereby use of compression in the mirror area 20 effectively e.g. doubles its size. This allows doubling the mirror area size or cutting the actual mirror area size in half, without reducing capacity relative to a mirror area without compression. The compression technique can similarly be performed for the RAM write cache 32.
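The following Python sketch illustrates why variable compressed sizes are harmless in an append-only log. It uses the standard `zlib` module as a stand-in compressor (the text does not name a compression technique), and the `index` structure and function names are assumptions for illustration.

```python
import zlib

log = bytearray()  # stands in for the sequential log in the mirror area
index = []         # (address, offset, compressed_length), in write order


def append_compressed(address, block):
    """Compress a block and append it; its size need not match any slot."""
    packed = zlib.compress(block)
    index.append((address, len(log), len(packed)))
    log.extend(packed)


def read_latest(address):
    """Return the most recently logged version of a block, or None."""
    for addr, offset, length in reversed(index):
        if addr == address:
            return zlib.decompress(bytes(log[offset:offset + length]))
    return None


highly_redundant = bytes(4096)           # all zeros, compresses very well
less_redundant = bytes(range(256)) * 16  # same size, compresses less well

append_compressed(5, highly_redundant)
append_compressed(5, less_redundant)     # the "modified" block is appended
```

The second version of block 5 occupies more compressed bytes than the first, which would break an overwrite-in-place scheme. Because it is appended rather than written over the earlier entry, no fixed-size slot has to accommodate it.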
For additional data protection, in another version of the present invention, the data in the RAID subsystem 16 may be replicated to a system 16a (
The present invention goes beyond standard RAID by protecting data integrity, not just providing device reliability. Infinite roll-back provides protection during the window of vulnerability between backups. A hybrid mirror/stripe data organization results in improved performance. With the addition of the flashback module 34, a conventional RAID mirror is outperformed at a cost which approaches that of a stripe. Further performance enhancement is attained with replication on an otherwise dormant hot spare and that hot spare can be used by a host-less appliance to generate a new baseline backup.
The present invention can be implemented in various data processing systems such as Enterprise systems, networks, SAN, NAS, medium and small systems (e.g., in a personal computer a write log is used, and data transferred to the user data set in background). As such in the description herein, the “host” and “host system” refer to any source of information that is in communication with the hybrid RAID system for transferring data to, and from, the hybrid RAID subsystem.
The present invention has been described in considerable detail with reference to certain preferred versions thereof; however, other versions are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the preferred versions contained herein.
Claims
1-55. (canceled)
56. A method for storing data in a fault-tolerant storage subsystem having an array of failure independent data storage units, comprising the steps of:
- dividing the data storage area on the data storage units into a logical mirror area and a logical stripe area, such that when storing data in the mirror area, duplicating the data by keeping a duplicate copy of the data on a pair of storage units, and when storing data in the stripe area, storing data as stripes of blocks, including data blocks and associated error-correction blocks;
- storing a data set in the stripe area, and storing an associated log cache in the mirror area;
- in response to a request from a host to write data to the storage subsystem: storing the host data in the log cache in the mirror area, and acknowledging completion of the write to the host;
- copying said host data from the log cache in the mirror area to the data set in the stripe area.
57. The method of claim 56, wherein:
- the log cache comprises a write log having multiple time-sequential entries, each entry including a data block, the data block address in the data set, and a data block time stamp.
58. The method of claim 57, wherein:
- said request from the host includes said host data and a block address in the data set for storing the host data;
- the step of storing the host data in the log cache in response to said host request further includes the steps of entering the host data, said block address and a time stamp in an entry in the log cache.
59. The method of claim 57, wherein:
- the step of copying said host data from the log cache in the mirror area to the data set in the stripe area, further comprises the steps of: copying the host data in said log cache entry in the mirror area to said block address in the data set in the stripe area.
60. The method of claim 57, further comprising the steps of:
- archiving said log cache entry in an archive; and
- purging said entry from the cache log.
61. The method of claim 58 further comprising the steps of:
- in response to a request to recreate a state of the data set at a selected time: obtaining a copy of the data set created at a back-up time prior to the selected time; obtaining a cache log associated with said data set copy, the associated cache log including entries created time-sequentially immediately subsequent to said back-up time; and time-sequentially transferring each data block in each entry of said associated cache log, to the corresponding block address in the data set copy, until said selected time stamp is reached in an entry of the associated cache log.
62. The method of claim 57, wherein the storage subsystem further includes a cache memory, the method further comprising the steps of:
- in response to a request to write data to the storage subsystem: storing the data in the cache memory, acknowledging completion of the write, and copying the data from the cache memory to the log cache in the mirror area.
63. The method of claim 62, further comprising the steps of:
- copying said data from the log cache in the mirror area to the data set in the stripe area.
64. The method of claim 63, further comprising the steps of:
- in response to a request to read data from the storage subsystem: determining if the requested data is in the cache memory, and if so, providing the requested data from the cache memory, otherwise, determining if the requested data is in the log cache in the mirror area, and if so, providing the requested data from the log cache, otherwise, determining if the requested data is in the data set in the stripe area, and if so, providing the requested data from the data set.
65. The method of claim 57, further comprising the steps of compressing the data stored in the mirror area.
66. The method of claim 57, wherein the data storage units comprise data disk drives.
67. A fault-tolerant storage subsystem comprising:
- an array of failure independent data storage units;
- a controller that logically divides the data storage area on the data storage units into a logical mirror area and a logical stripe area, wherein the controller stores data in the mirror area by duplicating the data and keeping a duplicate copy of the data on a pair of storage units, and the controller stores data in the stripe area as stripes of blocks, including data blocks and associated error-correction blocks;
- the controller further maintains a data set in the stripe area, and an associated log cache in the mirror area; and
- in response to a request to write incoming data to the storage subsystem, the controller stores the incoming data in the log cache in the mirror area, and acknowledges completion of the write, and the controller copies said incoming data from the log cache in the mirror area to the data set in the stripe area.
68. The storage subsystem of claim 67, wherein:
- the log cache comprises a write log having multiple time sequential entries, each entry including a data block, the data block address in the data set, and time stamp.
69. The storage subsystem of claim 68, wherein:
- said request includes said incoming data and a block address in the data set for storing the incoming data; and
- the controller enters the incoming data, said block address and a time stamp in an entry in the log cache.
70. The storage subsystem of claim 69, wherein in response to a request to read data from the data set, the controller further:
- determines if the requested data is in the log cache in the mirror area, and if so, provides the requested data from the log cache,
- otherwise, the controller determines if the requested data is in the data set in the stripe area, and if so, provides the requested data from the data set.
71. The storage subsystem of claim 69, wherein:
- the controller copies said incoming data from the log cache in the mirror area to the data set in the stripe area, by copying the incoming data in said log cache entry in the mirror area to said block address in the data set in the stripe area.
72. The storage subsystem of claim 69, further comprising a cache memory, wherein:
- in response to a request to write data to the data set, the controller stores the data in the cache memory, and acknowledges completion of the write; and
- the controller further copies the data from the cache memory to the log cache in the mirror area.
73. The storage subsystem of claim 72, wherein the controller further copies said data from the log cache in the mirror area to the data set in the stripe area.
74. The storage subsystem of claim 73, wherein in response to a request to read data from the data set, the controller further:
- determines if the requested data is in the cache memory, and if so, provides the requested data from the cache memory,
- otherwise, the controller determines if the requested data is in the log cache in the mirror area, and if so, provides the requested data from the log cache,
- otherwise, the controller determines if the requested data is in the data set in the stripe area, and if so, provides the requested data from the data set.
75. The storage subsystem of claim 68, wherein the controller further compresses the data stored in the mirror area.
76. A data organization manager for a fault-tolerant storage subsystem having an array of failure independent data storage units, the data organization manager comprising:
- a controller that logically divides the data storage area on the data storage units into a hybrid of logical mirror area and a logical stripe area, wherein the controller stores data in the mirror area by duplicating the data and keeping a duplicate copy of the data on a pair of storage units, and the controller stores data in the stripe area as stripes of blocks, including data blocks and associated error-correction blocks;
- the controller maintains a data set in the stripe area, and an associated log cache in the mirror area, and in response to a request to write data to the storage subsystem, the controller further: stores the data in the log cache in the mirror area, acknowledges completion of the write, and copies said data from the log cache in the mirror area to the data set in the stripe area.
77. The data organization manager of claim 76, wherein:
- the log cache comprises a write log having multiple time sequential entries, each entry including a data block, the data block address in the data set, and time stamp;
- said request includes said data and a block address in the data set for storing the data; and
- the controller enters the data, said block address and a time stamp in an entry in the log cache.
78. The data organization manager of claim 76, wherein in response to a request to read data from the storage subsystem, the controller further:
- determines if the requested data is in the log cache in the mirror area, and if so, provides the requested data from the log cache;
- otherwise, the controller determines if the requested data is in the data set in the stripe area, and if so, provides the requested data from the data set.
79. The data organization manager of claim 77, wherein:
- the controller copies said data from the log cache in the mirror area to the data set in the stripe area, by copying the data in said log cache entry in the mirror area to said block address in the data set in the stripe area.
80. The data organization manager of claim 79, wherein in response to a request to recreate a state of the data set at a selected time, the controller further:
- obtains a copy of the data set created at a back-up time prior to the selected time;
- obtains a cache log associated with said data set copy, the associated cache log including entries created time sequentially immediately subsequent to said back-up time; and
- time sequentially transfers each data block in each entry of said associated cache log, to the corresponding block address in the data set copy, until said selected time stamp is reached in an entry of the associated cache log.
81. The data organization manager of claim 77, further comprising a cache memory, wherein:
- in response to a request to write data to the data set, the controller stores the data in the cache memory, and acknowledges completion of the write; and
- the controller further copies the data from the cache memory to the log cache in the mirror area.
82. The data organization manager of claim 81, wherein the controller further copies said data from the log cache in the mirror area to the data set in the stripe area.
83. The data organization manager of claim 76, wherein in response to a request to read data from the data set, the controller further:
- determines if the requested data is in the cache memory, and if so, provides the requested data from the cache memory,
- otherwise, the controller determines if the requested data is in the log cache in the mirror area, and if so, provides the requested data from the log cache,
- otherwise, the controller determines if the requested data is in the data set in the stripe area, and if so, provides the requested data from the data set.
84. The data organization manager of claim 81, further comprising a memory backup module including non-volatile memory and a battery, wherein the storage subsystem is normally powered from a power supply;
- wherein, upon detecting power failure from the power supply, the controller powers the cache memory and the non-volatile memory from the battery instead, and copies the data content of the cache memory to the non-volatile memory, and upon detecting restoration of power from the power supply, the controller copies back said data content from the non-volatile memory to the cache memory.
85. The data organization manager of claim 84, wherein said cache memory comprises random access memory (RAM), and said non-volatile memory comprises flash memory (FLASH).
86. The data organization manager of claim 84, wherein said battery comprises a rechargeable battery that is normally trickle charged by the power supply.
87. The data organization manager of claim 76, wherein the controller further reserves one of the storage units as a spare for use in case one of the other storage units fails, such that while the spare storage unit is not in use, the controller further:
- replicates the log cache data stored in the mirror area into the spare storage unit, such that multiple copies of that data are stored in the spare storage unit; and
- upon receiving a request to read data from the data set, the controller determines if the requested data is in the spare storage unit, and if so, the controller selects a copy of the requested data in the spare storage unit that can be provided with minimum read latency relative to other copies of the selected data, and provides the selected copy of the requested data.
88. The data organization manager of claim 77, wherein the controller further compresses the data stored in the mirror area and the cache.
Type: Application
Filed: May 13, 2006
Publication Date: Sep 14, 2006
Applicant: Quantum Corporation (San Jose, CA)
Inventor: Tim Orsley (San Jose, CA)
Application Number: 11/433,152
International Classification: G06F 12/16 (20060101);