Detecting and managing missing parents between primary and secondary data stores for content addressed storage
A method and system for locating and eliminating missing parent files within a primary storage device. The method includes identifying a placeholder file on a primary storage device, the placeholder file having an offline reference and being associated with file identification data, identifying a secondary file on a secondary storage device using a content address of the secondary file included in the offline reference of the placeholder file, identifying an online reference of the secondary file, determining if the online reference of the secondary file properly references the file identification data of the placeholder file, and updating the online reference of the secondary file so that it includes the file identification data of the placeholder file.
This application is related to U.S. patent application Ser. No. 11/538,307 filed on Oct. 3, 2006 and entitled “DETECTING AND MANAGING ORPHAN FILES BETWEEN PRIMARY AND SECONDARY DATA STORES FOR CONTENT ADDRESSED STORAGE,” which application is incorporated herein by reference in its entirety.
BACKGROUND
1. The Field of the Invention
The present invention relates to data migration storage systems. More specifically, the present invention relates to methods and systems for locating and eliminating missing parent files and/or orphan files from data migration storage systems.
2. The Relevant Technology
In this society where many personal and business interactions are data driven, the ability to store a large amount of data has become increasingly important. Modern computer systems allow for the interchange of data and resources through network environments. For example, a modern computer network may include a number of interconnected client computers. The computer network may further include resources for storing large amounts of data, including but not limited to, direct attached storage (DAS), network attached storage (NAS), a storage area network (SAN), and other storage environments.
Data migration systems have been developed to facilitate the efficient storage of large amounts of data. For example, a data migration system may include a primary storage location, a secondary storage location, and a policy engine server. The primary storage location is typically used for storing files that are commonly accessed by a client computer, while the secondary storage location is used for storing files that are less commonly accessed by the client computer. The policy engine server is responsible for migrating data files from the primary storage to the secondary storage, and recalling data files from the secondary storage to the primary storage.
In order for a data migration system to function efficiently, certain relationships should exist between the files stored on primary storage and the files stored on secondary storage. When a file is accessed from the primary storage and one or more properties of the accessed file are altered, many of the relationships that should exist between the primary storage and secondary storage can be destroyed or invalidated.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.
To further clarify the features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
In the following detailed description of various embodiments of the invention, reference is made to the accompanying drawings which form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
Embodiments of the present invention provide for methods and systems for locating and eliminating missing parent files within a primary storage device. One exemplary method may be practiced in a distributed computing system having a primary storage device, a secondary storage device and a policy engine server, the policy engine server being configured to manage migration of files between the primary and secondary storage devices. An exemplary method for locating and eliminating missing parent files within the primary storage device includes identifying a placeholder file on a primary storage device, the placeholder file having an offline reference and being associated with file identification data, identifying a secondary file on a secondary storage device using a content address of the secondary file included in the offline reference of the placeholder file, identifying an online reference of the secondary file, determining if the online reference of the secondary file properly references the file identification data of the placeholder file, and updating the online reference of the secondary file so that it includes the file identification data of the placeholder file. The result is that the online reference of the secondary file contains a proper reference to the placeholder file (i.e., parent file).
As used herein, the term “data” may include, but is not limited to, directories (e.g., volumes, file systems, and the like), user data, system data, applications, services, operating systems, instructions, and the like, that can be stored on one or more storage devices of a computer. The term “data object” refers to a subset of “data.” Data may be organized in logical directories that do not necessarily correspond to a particular storage device.
Certain embodiments described herein will involve electronic communication between a client computer system (hereinafter referred to as a “client”) and a primary storage device to which the client requests access. Accordingly, the client sends a request to the primary storage device for access to its system resources, wherein if the client is authorized and validated, the primary storage device responds with a response message providing the desired information. Of course, other messaging patterns between client and primary storage device are available, as are well known in the art.
As used herein, the term “user” may refer to a person operating the client computer. Users are able to access information on the primary storage device. Alternatively, the term “user” may refer to a person operating the policy engine server.
Data Migration System
Referring now to
The policy engine server 118 includes a data migration module 120 used for migrating data between the primary storage device 104 and the secondary storage device 106. The data migration module 120 is configured to automatically move data from primary storage 104 to secondary storage 106 according to a predefined migration policy. In one embodiment, the primary storage device 104 is a faster storage device than the secondary storage device 106, and thus may be more expensive. Therefore, files that are most often used may generally be stored in primary storage 104, while files that are seldom accessed are migrated to secondary storage 106. A file that has been migrated to secondary storage 106 is referred to as an “offline file” or “secondary file” 110, whereas a file residing on primary storage 104 is referred to as an “online file,” “primary file” or “data file” 107. Secondary files 110 that have been migrated to secondary storage 106 can later be recalled to primary storage 104.
The types of data objects that can be stored on primary storage 104 include both data files 107 and placeholder files 108. When a data file 107 is migrated and stored as a secondary file 110 on secondary storage 106, the data file 107 is deleted and a small placeholder file 108, commonly referred to as a “stub”, is created and stored on the primary storage 104 to assist in locating the secondary file 110. Each placeholder file 108 includes an offline reference 109 which identifies or references the corresponding migrated secondary file 110 on the secondary storage device 106. Each of the secondary files 110 is also associated with an online reference 116. The online reference 116 identifies or references the corresponding placeholder file 108 on the primary storage 104. The information contained in the offline reference 109 and online reference 116 depends on the storage scheme of the primary and secondary storage devices 104 and 106. This will be discussed in further detail below.
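By way of illustration only, the relationship between a placeholder file and its secondary file may be sketched in Python as follows. The sketch assumes location addressable storage on both devices and models each device as a dictionary keyed by path. The names `Placeholder`, `SecondaryFile` and `migrate` are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Placeholder:
    """Stub left on primary storage after migration (hypothetical model)."""
    path: str               # where the stub lives on primary storage
    offline_reference: str  # path of the secondary file it points to


@dataclass
class SecondaryFile:
    """Migrated file on secondary storage (hypothetical model)."""
    path: str              # where the file lives on secondary storage
    content: bytes
    online_reference: str  # path of the placeholder that points back


def migrate(primary: dict, secondary: dict, path: str, target: str) -> None:
    """Move primary[path] to secondary[target] and leave a stub behind."""
    content = primary.pop(path)  # the data file is deleted from primary
    secondary[target] = SecondaryFile(target, content, online_reference=path)
    primary[path] = Placeholder(path, offline_reference=target)


# Assumed example paths; any naming scheme would do.
primary = {"/docs/report.txt": b"quarterly numbers"}
secondary = {}
migrate(primary, secondary, "/docs/report.txt", "/archive/0001")
```

After the call, the primary dictionary holds only the stub, and the stub and the secondary file reference one another by path.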
Thus, a placeholder file 108 is a representation of a data file 107 that has been migrated and is only needed in that situation. Therefore, for a particular data file, generally either a data file 107 or a placeholder file 108 will be stored on primary storage 104, but not both. However, in some cases, as explained below, a data file 107 and placeholder file 108 may reside at the primary storage 104 for a particular data file. Further, more than one placeholder file 108 may exist on primary storage 104 that reference the same secondary file 110 on secondary storage 106.
The client computer 102 may access the online data files 107 directly from the primary storage 104, and may access the offline secondary files 110 via the placeholder file 108 that was created to represent and take the place of a migrated data file (formerly one of the data files 107) and reference the secondary file 110. The primary storage 104 displays the placeholder files 108 along with the data files 107 and a user is generally unaware of the distinction between the data files 107 and placeholder files 108. To access one of the secondary files 110, a user of the client computer 102 selects the corresponding placeholder file 108. The primary storage 104 uses the offline reference 109 in the placeholder file 108 to access the appropriate secondary file 110 from secondary storage 106, the location of the secondary file being identified by the offline reference 109. The placeholder files 108 can be read, modified, deleted or renamed in the same manner as the data files 107. Therefore, online files (i.e., data files 107) and offline files (i.e., secondary files 110) are equally accessible from the perspective of the client computer 102, although it may require slightly more processing time to access an offline file.
Each of the placeholder files 108 and secondary files 110 also includes information identifying itself. This information is generally referred to as file identification data 114 and 117. Primary and secondary storage 104 and 106 may include a flat file or database for storing file identification data 114 and 117 that are associated with the particular file. Such file identification data may include filenames. In certain operating systems, such as Unix, data files may be associated with multiple filenames. In such situations, each of the data objects located on the primary storage 104 and the secondary storage 106 may be associated with at least one filename.
In addition, file identification data 114 and 117 can include information identifying a related group of files. In one embodiment, a handle or other unique identifier can be used to identify a related group of files. In one embodiment, a group of related files may include one or more placeholder files 108 on primary storage 104 and one or more secondary files 110 on secondary storage 106. In another embodiment, a group of files includes one or more placeholder files 108 and a single secondary file 110. While each of the placeholder files 108 and/or secondary files 110 in the file group may have different filenames, all of the placeholder files 108 and secondary files 110 in the file group share a common handle which is used to identify the related files in the file group. Therefore, if any question exists as to whether a file on primary storage 104 or secondary storage 106 is related to another file, the handles of both files can be compared to one another. If they share a common handle, then the two files are associated with one another. (Note that data files 107 may also have file identification information such as at least one filename and/or handle.)
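The handle comparison described above may be sketched as follows. This is a minimal illustration, assuming the file identification data is reduced to a filename and a handle. The class `FileIdentification`, the function `are_related` and the handle values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileIdentification:
    """Hypothetical stand-in for file identification data 114 / 117."""
    filename: str
    handle: str  # shared by every file in the same file group


def are_related(a: FileIdentification, b: FileIdentification) -> bool:
    """Two files belong to the same group when their handles match."""
    return a.handle == b.handle


stub = FileIdentification("report.txt", handle="h-42")
renamed_stub = FileIdentification("report-final.txt", handle="h-42")
other = FileIdentification("notes.txt", handle="h-7")
```

Here `stub` and `renamed_stub` carry different filenames but the same handle, so they are treated as members of one file group, while `other` is not.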
Policy engine server 118 also includes file management module 119 to manage various scenarios that occur in the instance of a missing parent or an orphan file. As such, policy engine server 118 includes an orphan manager 122 and a missing parent manager 124, the functionality of which will be described in detail below. File management module 119 also includes a database 126 that assists in the management of missing parents and/or orphan files. In one embodiment, the database 126 can store content addresses 128 of data objects stored on primary storage 104 and/or secondary storage 106.
The primary and secondary storage devices 104 and 106 may employ various storage schemes. For example, in one embodiment, the primary and secondary storage devices 104 and 106 employ a location addressable storage scheme. Location addressable storage is the typical storage scheme employed by most local and networked storage devices, where each element of data is stored onto a physical medium, and its location is recorded for later use. The storage device often keeps a list, or directory, of these locations. When a request is made to access a location addressable data object, the request typically includes only the location (for example, path and filenames) of the data. The storage device then uses this information to locate and retrieve the data on the physical medium. When new information is written into a location-addressed storage device, it is typically stored in available free space, without regard to its content. The information at a given location can usually be altered or completely overwritten without any special action on the part of the storage device.
Either or both of primary storage 104 and secondary storage 106 can employ location addressable storage. Thus, in embodiments where the secondary storage employs location addressable storage, the offline reference 109 of placeholder files 108 will contain the path of the secondary file 110 located on secondary storage 106. Similarly, where the primary storage 104 employs location addressable storage, the online reference 116 of secondary files 110 will contain the path of the placeholder file 108 located on primary storage 104. Further, where the primary and/or secondary storage 104 and 106 employ location addressable storage, the file identification data 114 and/or 117 will include a handle or other unique identifier.
In another embodiment, the primary and secondary storage devices 104 and 106 employ a content addressable storage scheme. Content addressable storage is a storage scheme for storing information that can be retrieved based on its content as opposed to its storage location. When a data object is stored into a content addressable storage system, the system typically records a content address, which is an identifier based on the information contained in the data object itself. In one embodiment, the content address may be produced by applying a cryptographic hash function, such as MD5, to the content of the data object, and the result is used as the content address.
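As a minimal sketch of how such a content address might be derived, the following Python uses the standard `hashlib` module. The function name `content_address` is an assumption made for illustration; MD5 is used only because it is the hash function named above.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a content address derived from the object's bytes.

    MD5 is used here only because the text names it; it is not
    collision resistant, so another hash could be substituted.
    """
    return hashlib.md5(data).hexdigest()


original = content_address(b"hello world")
modified = content_address(b"hello world!")
```

The same bytes always yield the same address, while even a one-character change yields a different one.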
A request to retrieve information from a content addressable storage system includes identifying the content address of the data object, from which the system can determine the physical location of the data object and retrieve it. Because the content addresses are based on content, any change to a data object will typically also change its content address. Content addressable storage is also referred to as “single instance storage”, because a single instance of a data object may be referenced by multiple filenames. Thus, each data object (i.e., data files 107, placeholder files 108, or secondary files 110) will have a single instance of storage that is identifiable using the content address. One exemplary content addressable storage system includes a Centera platform, although other content addressable storage systems may currently exist or may be developed in the future.
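The single instance behavior may be illustrated with a toy, in-memory store. This sketch is not the interface of any particular platform; the class `ContentStore` and its methods are hypothetical, and MD5 is assumed as the hash function.

```python
import hashlib


class ContentStore:
    """Toy content addressable store (illustrative, in-memory only)."""

    def __init__(self):
        self._objects = {}  # content address -> stored bytes

    def put(self, data: bytes) -> str:
        """Store data and return its content address."""
        address = hashlib.md5(data).hexdigest()
        self._objects[address] = data  # identical content maps to one entry
        return address

    def get(self, address: str) -> bytes:
        """Retrieve data by content address alone; no path is needed."""
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"same bytes")
second = store.put(b"same bytes")           # stored once: single instance
changed = store.put(b"same bytes, edited")  # new content, new address
```

Storing identical content twice returns the same address and occupies one entry, whereas the edited content is stored under a new address.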
Either or both of primary storage 104 and secondary storage 106 may employ content addressable storage. In embodiments where the secondary storage 106 employs content addressable storage, offline reference 109 of the placeholder file 108 includes a content address of the secondary file 110 located on the secondary storage 106. Conversely, where the primary storage 104 employs content addressable storage, the online reference 116 of the secondary file 110 contains the content address of the corresponding placeholder file 108 on the primary storage 104. Further, where the primary storage 104 and/or secondary storage 106 employ content addressable storage, the file identification data 114 and/or 117 will contain the content address of the particular file.
Data objects on a content addressable storage system can be identified by performing an itemized scan for the data objects and reviewing the metadata associated with each data object to identify the path/location of the data object. Alternatively, policy engine server 118 can include a database 126 that stores the content addresses 128 of data objects located on primary storage 104 (i.e., data files 107 and placeholder file 108) and/or secondary storage 106 (i.e., secondary files 110). In another embodiment, the content addresses of data objects can be stored on a database located on their respective storage devices.
Thus, combinations of location addressable storage and content addressable storage may exist in the data migration system 100 illustrated in
In summary, the policy engine server 118 includes a data migration module 120 for managing the efficient storage of data in primary and secondary storage devices 104 and 106. When a data file 107 is migrated from the primary storage 104 to the secondary storage 106, a placeholder file 108 is created in the primary storage 104 to replace the migrated data file 107. The placeholder files 108 are associated with offline references 109 which identify the corresponding migrated secondary files 110. Likewise, the secondary files 110 include online references 116 for identifying the corresponding placeholder files 108. When placeholder files 108 and/or secondary files 110 become disassociated to form missing parents or orphan files, the file management module 119 of the policy engine server 118 operates to resolve these situations.
Referring now to
As described previously, the references associated with the placeholder files 202 and 204 and the secondary file 212 may be referred to as the offline references and the online references, respectively. For example, in an embodiment where location addressable storage is employed, the references 214 and 216 may include pointers that describe the location (e.g., directory path and filename) of the corresponding placeholder file 202 and 204 or secondary file 212 being referenced. In another embodiment, where content addressable storage is employed, the references 214 and 216 may include a content address of the corresponding placeholder file 202 and 204 or secondary file 212 being referenced.
The unidirectional arrows 218, 220 and 222 indicate that the secondary file 212 contains a database that may contain references to the filenames 206, 208 and 210 on the primary storage device 104 that are associated with the secondary file 212. The unidirectional arrows 224, 226 and 228 illustrate that a relationship exists between the filenames 206, 208 and 210 and the placeholder files 202 and 204 that the filenames identify.
As illustrated in
As described above,
Primary Scan
The primary scan is performed to locate and resolve missing parent files residing on the primary storage device 104. Missing parent files include placeholder files 108 located on primary storage 104 that are not properly referenced by an online reference 116 of any secondary file 110 located on the secondary storage device 106.
In the embodiment where the primary storage 104 implements content addressable storage, the online reference 316 of secondary file 310 includes a content address for identifying a placeholder file 308 located on the primary storage 104. The effect of the movement of placeholder file 308a to placeholder file 308b is that placeholder file 308b has a different/new content address from the content address of placeholder file 308a. Thus, the secondary file 310 which originally referenced the content address of placeholder file 308a in the online reference 316 does not reference placeholder file 308b.
The missing parent manager 124 of the policy engine server 118, illustrated in
As described previously, the primary and secondary storage devices 104 and 106 may employ various different storage schemes, including location addressable storage and content addressable storage. The process and algorithms for identifying missing parent files differ depending on whether content addressable storage or location addressable storage is being employed by the storage devices 104 and 106. In embodiments where the primary storage 104 implements location addressable storage, the missing parent manager 124 of the policy engine server 118 identifies placeholder files 308 and secondary files 310 that share a common handle. The missing parent manager 124 determines whether the online reference 316 correctly references the placeholder file 308b. That is, whether the online reference 316 includes the correct path of the placeholder file 308b. If the online reference 316 does not correctly reference the placeholder file 308b, the online reference 316 is updated to include the correct reference.
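The location addressable case may be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each placeholder is paired with at most one secondary file by handle, the online reference is modeled as a plain path string, and the names `Placeholder`, `SecondaryFile` and `repair_online_references` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Placeholder:
    path: str    # current location on primary storage
    handle: str  # shared with the related secondary file


@dataclass
class SecondaryFile:
    handle: str
    online_reference: str  # path the secondary file records for its parent


def repair_online_references(placeholders, secondary_files):
    """Fix stale paths; return handles whose references were updated."""
    by_handle = {s.handle: s for s in secondary_files}
    repaired = []
    for stub in placeholders:
        secondary = by_handle.get(stub.handle)
        if secondary is None:
            continue  # no related secondary file; outside this sketch
        if secondary.online_reference != stub.path:
            secondary.online_reference = stub.path  # missing parent resolved
            repaired.append(stub.handle)
    return repaired


# The first stub was moved from /a/report.txt to /b/report.txt after migration.
stubs = [Placeholder("/b/report.txt", "h-42"), Placeholder("/c/notes.txt", "h-7")]
archive = [SecondaryFile("h-42", "/a/report.txt"), SecondaryFile("h-7", "/c/notes.txt")]
fixed = repair_online_references(stubs, archive)
```

Only the secondary file whose recorded path no longer matches its placeholder is updated; the correctly referenced pair is left untouched.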
Where a content addressable storage scheme is employed, the missing parent manager 124 of the policy engine server 118 performs a check of the online reference 316 of the secondary file 310 to determine if the online reference properly identifies a content address of a placeholder file 308. If the content address of a placeholder file 308 is not referenced by any secondary file 310, then the missing parent manager 124 identifies the placeholder file 308b as a missing parent file. The placeholder file 308b should have the same offline reference 109 (i.e., path or content address, depending on the storage scheme of secondary storage 106) as the placeholder file 308a, which points back to the secondary file 110. Therefore, the missing parent manager 124 is able to identify a secondary file 310 that corresponds with the missing parent placeholder file 308b. The missing parent manager 124 proceeds to update the content address of online reference 316 of the secondary file 310 so that it properly identifies the content address of placeholder file 308b.
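The content addressable case may be sketched in a similar way. The sketch assumes both devices use content addresses, represented here by short made-up strings rather than real hash values. The names `Placeholder`, `SecondaryFile` and `resolve_missing_parents` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Placeholder:
    content_address: str    # address of the stub itself on primary storage
    offline_reference: str  # content address of the secondary file


@dataclass
class SecondaryFile:
    content_address: str
    online_reference: str   # content address of the parent placeholder


def resolve_missing_parents(placeholders, secondary_files):
    """Repair online references; return addresses of former missing parents."""
    referenced = {s.online_reference for s in secondary_files}
    by_address = {s.content_address: s for s in secondary_files}
    resolved = []
    for stub in placeholders:
        if stub.content_address in referenced:
            continue  # some secondary file already points at this stub
        # The stub is a missing parent: follow its offline reference.
        secondary = by_address.get(stub.offline_reference)
        if secondary is not None:
            secondary.online_reference = stub.content_address
            resolved.append(stub.content_address)
    return resolved


# Stub "p-old" was moved and now has address "p-new"; its offline
# reference still names secondary file "s-1".
stubs = [Placeholder("p-new", "s-1"), Placeholder("p-2", "s-2")]
archive = [SecondaryFile("s-1", "p-old"), SecondaryFile("s-2", "p-2")]
resolved = resolve_missing_parents(stubs, archive)
```

The placeholder that no secondary file references is detected as a missing parent, and the unchanged offline reference is what allows the corresponding secondary file to be found and corrected.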
In another example, the placeholder file may be assigned an additional filename. As described in
In yet another example, a user may delete a placeholder file, and later restore the deleted placeholder file from a backup location. When the placeholder file is restored, it may be restored with a content address that is not identified by the online reference of the corresponding secondary file. Because the placeholder file is not identified by the secondary file's online reference, the placeholder file is a missing parent. Therefore, the missing parent manager 124 may update the content address of the placeholder file in the online reference of the secondary file so that it properly identifies the newly restored placeholder file.
Those skilled in the art will recognize that the present invention may also be described in terms of methods of locating and eliminating missing parent files within primary storage devices employing content addressable storage. Referring now to
In one embodiment, both the primary storage and secondary storage employ location addressable storage. In another embodiment, one of the primary storage device or secondary storage device employs a content addressable storage scheme, such that the location and content of the primary files are defined by content addresses, while the other employs location addressable storage. In another embodiment, both primary storage and secondary storage devices employ content addressable storage. The information contained in the online reference, offline reference, and file identification data and how that information is accessed differs depending on the storage scheme employed by the primary storage and/or secondary storage.
The method 400 includes, at 402, identifying a placeholder file on a primary storage device. In one embodiment, file management module 119 can perform an item by item scan of each placeholder file in the primary storage 104 before proceeding to the next step. Alternatively, the file management module 119 can identify a first placeholder file and then perform the rest of the method, and then return to 402 to repeat the process for a second placeholder file.
At 404, the method includes identifying a secondary file on a secondary storage device related to the placeholder file. In one embodiment, the secondary file can be identified via an offline reference on the placeholder file, using path information where the secondary storage uses location addressable storage or content address information where the secondary storage employs content addressable storage. As illustrated in
At 406, the method includes identifying the online reference of the secondary file associated with the placeholder file. In embodiments where the primary storage employs location addressable storage, identifying the online reference includes determining a path of a placeholder file. In embodiments where the primary storage implements content addressable storage, identifying the online reference includes identifying a content address of a placeholder file.
At 408, the method 400 further includes determining if the online reference of the secondary file properly references the file identification data of the placeholder file. If the online reference of the secondary file does not properly identify the placeholder file, the placeholder file is a “missing parent” and steps should be taken to remedy the situation. In one embodiment, determining whether the online reference of the secondary file properly identifies the file identification data of the placeholder file includes using the online reference data to request access to the primary storage to determine if the pointer back to the primary storage obtains access to the correct placeholder file. If the placeholder file can be accessed by a request using the online reference data, then the online reference data properly identifies the placeholder file. Accessing the placeholder file can further include analyzing the file identification data of the placeholder file to determine if it matches the online reference data of the secondary file.
In one embodiment, where the online reference is a content address, the placeholder file can be accessed by sending the content address contained in the online reference to database 126 of the policy engine server 118 (see
At 410, if the online reference does not properly reference the placeholder file, the method 400 includes updating the online reference of the secondary file so that it includes a proper reference to the placeholder file. Thus, as illustrated in the
Secondary Scan
A secondary scan is performed to locate and eliminate orphan files from the secondary storage device 106. In general, orphan files are secondary files 110 residing on secondary storage 106 that are no longer needed in order for the system 100 to function properly.
Thus, the orphan manager 122 looks at whether a secondary file is appropriately identified in an offline reference of a placeholder file. If the secondary file is not correctly referenced in any placeholder file, then the secondary file is considered to be an “orphan file.” In embodiments where the secondary storage device 106 implements location addressable storage, the paths of the secondary files 510, 560 and 590 are not included in an offline reference of a placeholder file. Therefore, the secondary files 510, 560 and 590 are considered to be orphan files. In embodiments where the secondary storage device 106 employs content addressable storage, the content addresses of the secondary files 510, 560 and 590 are not included in an offline reference of a placeholder file, rendering the secondary files inaccessible.
Prior to permanently deleting the secondary file 510, 560 or 590 from the secondary storage device 106, the orphan manager 122 of
Another test that the orphan manager 122 may perform prior to deleting the secondary file 510, 560 or 590 from the secondary storage device 106 is to confirm that no other placeholder files exist that contain offline references identifying the secondary file marked for deletion. As described in reference to
Those skilled in the art will recognize that the present invention may also be described in terms of methods of locating and eliminating orphan files within storage devices employing content addressable storage. Referring now to
In one embodiment, both the primary storage and secondary storage employ location addressable storage. In another embodiment, one of the primary storage device or secondary storage device employs a content addressable storage scheme, such that the location and content of the primary files are defined by content addresses, while the other employs location addressable storage. In another embodiment, both primary storage and secondary storage devices employ content addressable storage. The information contained in the online reference, offline reference, and file identification data and how that information is accessed differs depending on the storage scheme employed by the primary storage and/or secondary storage.
The method 600 includes, at 602, identifying a secondary file on a secondary storage device. In one embodiment, file management module 119 can perform an item by item scan of each secondary file in the secondary storage 106 before proceeding to the next step. Alternatively, the file management module 119 can identify a first secondary file and then perform the rest of the method, and then return to 602 to repeat the process for a second secondary file.
At 604, the method 600 further includes identifying the file identification data of the secondary file. In embodiments where the secondary storage is location addressable storage, the file identification data can include a path and/or handle. In embodiments where the secondary storage implements content addressable storage, identifying the file identification data of the secondary file may include determining the content address of the secondary file, the location and content of the secondary file being determined by its content address. In one embodiment, where the file identification data is a content address, the orphan manager 122 identifies the secondary file 110 by analyzing the content addresses 128 for secondary files 110 located in the secondary storage 106. The secondary file 110 can also be identified by performing an itemized scan of the data objects on the secondary storage 106, and reviewing the metadata associated with each data object to identify the path or location of the secondary file.
The method 600 further includes, at 606, determining if the secondary file is validly referenced by a placeholder file on the primary storage device. If the secondary file is not validly referenced by a placeholder file, then the secondary file is considered an “orphan file.” Data may be determined to be invalidly referenced by a placeholder file for a variety of reasons. For example, as previously described in
In another example, as previously described in
In yet another example, as previously described in
Thus, determining whether a secondary file is validly referenced by a placeholder file can include identifying an offline reference of the placeholder. If the offline reference of the placeholder validly points to a secondary file, then the secondary file is not an orphan file. Where the secondary storage is location addressable storage, validly pointing to a secondary file can include that the offline reference of the placeholder file contains a path or handle that matches the path or handle of the secondary file that was previously identified in the file identification data of the secondary file. For secondary storage devices that implement content addressable storage, validly pointing to a secondary file can include that the offline reference of the placeholder file contains a content address of the secondary file that was previously identified in the file identification data of the secondary file. Each placeholder file can be analyzed to determine whether it validly references a secondary file. If the secondary file is not validly referenced by a placeholder file, then it cannot be accessed by a client computer.
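The validity check at 606 might be sketched as follows. This is an illustrative Python model, not the implementation of the orphan manager 122: the `Placeholder` and `SecondaryFile` types and the `is_orphan` function are hypothetical, and the identifier strings stand in for either a path/handle or a content address.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Placeholder:
    name: str
    offline_reference: str  # path/handle or content address of a secondary file


@dataclass(frozen=True)
class SecondaryFile:
    file_id: str  # file identification data: path/handle or content address


def is_orphan(secondary: SecondaryFile, placeholders: Iterable[Placeholder]) -> bool:
    """True when no placeholder's offline reference matches the secondary file."""
    return not any(p.offline_reference == secondary.file_id for p in placeholders)


placeholders = [Placeholder("report.doc", "ca-1")]
print(is_orphan(SecondaryFile("ca-1"), placeholders))  # False: validly referenced
print(is_orphan(SecondaryFile("ca-2"), placeholders))  # True: no placeholder points to it
```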
If the secondary file is not validly referenced by any placeholder file, at 608, the orphan manager 122 marks the secondary file for deletion. In one embodiment, when the secondary file is marked for deletion, the secondary file is immediately deleted from the secondary storage device. In another embodiment, marking the secondary file for deletion further includes recording the time at which the secondary file is marked for deletion, and then deleting the marked secondary file following the completion of a predefined amount of time (e.g., one month). The predefined amount of time may be programmable by a user of the policy engine server.
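The two deletion embodiments at 608 can be modeled with a single configurable delay. The following Python sketch is an assumption-laden illustration: the `DeletionQueue` class, its method names, and the 30-day default are hypothetical, and a delay of zero corresponds to immediate deletion.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List


class DeletionQueue:
    """Tracks secondary files marked for deletion and when they were marked."""

    def __init__(self, grace_period: timedelta = timedelta(days=30)) -> None:
        self.grace_period = grace_period  # user-programmable delay
        self.marked_at: Dict[str, datetime] = {}

    def mark(self, file_id: str, now: datetime) -> None:
        # Record the time of marking; keep the earliest time if marked again.
        self.marked_at.setdefault(file_id, now)

    def due(self, now: datetime) -> List[str]:
        """Return the marked files whose grace period has elapsed as of `now`."""
        return sorted(f for f, t in self.marked_at.items() if now - t >= self.grace_period)

    def purge(self, now: datetime, delete: Callable[[str], None]) -> List[str]:
        """Delete every due file via the supplied callback and forget it."""
        expired = self.due(now)
        for file_id in expired:
            delete(file_id)
            del self.marked_at[file_id]
        return expired
```

Passing the deletion routine in as a callback keeps the timing policy separate from whatever mechanism the secondary storage device actually uses to remove a file.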
In one embodiment, the method 600 may further include determining if any placeholder files located on the primary storage device include offline references identifying the secondary file prior to marking the secondary file for deletion. This step ensures that a secondary file that is associated with multiple placeholder files is not deleted merely because one of the placeholder files is no longer valid.
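One way to picture this safeguard is as a reference count over the offline references of all placeholder files. The Python sketch below is hypothetical (the names `reference_counts` and `safe_to_mark` do not come from the description) and represents each placeholder as a `(name, offline_reference)` pair.

```python
from collections import Counter
from typing import Iterable, Tuple


def reference_counts(placeholders: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many placeholders hold an offline reference to each secondary file."""
    return Counter(offline_ref for _name, offline_ref in placeholders)


def safe_to_mark(secondary_id: str, placeholders: Iterable[Tuple[str, str]]) -> bool:
    """A secondary file may be marked only when no placeholder still references it."""
    # Counter returns 0 for identifiers that never appear.
    return reference_counts(placeholders)[secondary_id] == 0
```

With two placeholders referencing the same secondary file, removing one of them leaves a count of one, so the secondary file is not yet eligible for deletion.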
It will be appreciated that steps of the missing parent process and the orphan file management process may overlap and/or precede each other. For example, analyzing secondary files to identify missing parents (e.g., block 404) may occur simultaneously with the analysis of secondary files to identify orphan files (e.g., block 602). Similarly, analyzing placeholder files to identify missing parents (e.g., block 402) may be part of the same step as analyzing placeholder files to identify orphan files (e.g., block 606).
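A combined pass of this kind might look like the following Python sketch. It is a simplified model under stated assumptions: placeholders are a mapping from placeholder name to offline reference, secondary files are a mapping from identifier to the set of placeholder names in their online references, and the function name `scan` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def scan(
    placeholders: Dict[str, str],
    secondaries: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Return (missing_parents, orphans) from one pass over each store."""
    missing_parents: List[str] = []
    referenced: Set[str] = set()
    for name, offline_ref in placeholders.items():
        online_refs = secondaries.get(offline_ref)
        if online_refs is None:
            continue  # dangling offline reference; not handled in this sketch
        referenced.add(offline_ref)
        if name not in online_refs:
            # The secondary file exists but does not point back to this placeholder.
            missing_parents.append(name)
    # Secondary files that no placeholder points to are orphans.
    orphans = [sid for sid in secondaries if sid not in referenced]
    return missing_parents, orphans


placeholders = {"report.doc": "ca-1", "renamed.doc": "ca-2"}
secondaries = {"ca-1": {"report.doc"}, "ca-2": {"old.doc"}, "ca-3": {"gone.doc"}}
print(scan(placeholders, secondaries))  # (['renamed.doc'], ['ca-3'])
```

Here the renamed placeholder is reported as a missing parent because the online reference of its secondary file still names the old file, while the secondary file that no placeholder references is reported as an orphan.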
Embodiments herein may comprise a special purpose or general-purpose computer including various computer hardware. Embodiments may also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of computer-readable media.
Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims
1. In a distributed computing system having a primary storage device, a secondary storage device and a policy engine server, the policy engine server being configured to manage migration of files between the primary and secondary storage devices by migrating data files on the primary storage device to the secondary storage device and replacing the data files on the primary storage device with placeholder files, a method for locating placeholder files within the primary storage device that are not properly referenced by secondary files stored on the secondary storage device, the method comprising:
- identifying a placeholder file on a primary storage device that has replaced a data file, the placeholder file including an offline reference to the data file stored as a secondary file on the secondary storage device, wherein the offline reference includes file identification data including a content address of the secondary file and the file identification data describes the data file;
- identifying the secondary file on the secondary storage device using the content address of the secondary file included in the offline reference of the placeholder file;
- identifying an online reference included in the secondary file, wherein the online reference includes an address to a particular placeholder file stored on the primary storage device;
- determining that the placeholder file is a missing parent when the address to the particular placeholder file does not identify the placeholder file; and
- notifying a user of the policy engine server when the online reference of the secondary file does not properly reference the placeholder file.
2. The method as recited in claim 1, further comprising at least one of:
- indicating the reason for the improper reference; or
- updating the online reference of the secondary file so that it includes the file identification data of the placeholder file and properly references the placeholder file.
3. The method as recited in claim 1, wherein identifying the secondary file on the secondary storage device using the content address of the secondary file included in the offline reference of the placeholder file further comprises:
- accessing a database on a policy engine server, the database storing the content address of the secondary file; and
- locating the secondary file by comparing the content address in the offline reference of the placeholder file with the content address of the secondary file stored in the database.
4. The method as recited in claim 1, wherein identifying the secondary file on the secondary storage device using the content address of the secondary file included in the offline reference of the placeholder file further comprises:
- performing an itemized scan of data objects on the secondary storage device;
- analyzing metadata of each data object to identify the location of the data object; and
- determining whether the data object contains the same content address as the content address in the offline reference of the placeholder file.
5. The method as recited in claim 1, wherein the primary storage device is further configured to communicate with a client computer system, the client computer system able to access the secondary file via the placeholder file on the primary storage device.
6. The method as recited in claim 2, wherein the file identification data of the placeholder file includes a filename that has been renamed, and wherein the online reference of the secondary file is updated to include the renamed filename of the placeholder file.
7. The method as recited in claim 2, wherein the file identification data of the placeholder file includes an additional filename, and wherein the online reference of the secondary file is updated to include the additional filename of the placeholder file.
8. The method as recited in claim 2, wherein the placeholder file has been restored from a backup location to a different location in a file system of the primary storage device than a previous placeholder file such that new file identification data of the placeholder file is created, and wherein the online reference of the secondary file is updated to include the new file identification data of the placeholder file.
9. The method as recited in claim 1, wherein the online reference of the secondary file includes a content address of the placeholder file.
10. A content addressable data migration storage system for eliminating missing parent files that are not properly identified by secondary files, the storage system comprising:
- a primary storage device employing one of location addressable storage or content addressable storage and configured to store and detect a placeholder file, the detected placeholder file having replaced a data file that was migrated to a secondary storage device, the placeholder file including an offline reference with file identification data for identifying the placeholder file, the offline reference referencing the data file which is stored as a secondary file on the secondary storage device, wherein the offline address includes an address of the secondary file;
- the secondary storage device employing content addressable storage and configured to store the secondary file, the secondary file including file identification data for identifying the secondary file, the file identification data of the secondary file related to the offline reference of the detected placeholder file, and the secondary file including at least one online reference to a referenced placeholder file on the primary storage device, wherein the detected placeholder file is properly identified if the detected placeholder file is the referenced placeholder file; and
- a policy engine server including a file management module for detecting that the detected placeholder file is not properly identified by the at least one online reference of the secondary file, wherein the detected placeholder file is a missing parent file that is eliminated from the primary storage device when not properly identified by the at least one online reference of the secondary file, wherein the file management module notifies a user of the policy engine server that the detected placeholder file is not properly identified by the at least one online reference of the secondary file.
11. The system as recited in claim 10, wherein the file management module of the policy engine server is further configured for performing at least one of:
- eliminating the missing parent file by updating the at least one online reference of the secondary file such that the online reference refers to the detected placeholder file.
12. The system as recited in claim 10, wherein prior to updating the at least one online reference of the secondary file such that the online reference refers to the detected placeholder file, the file identification data of the detected placeholder file included at least one of a renamed filename, an added filename, or new file identification data from the detected placeholder file being restored from a backup location to a different location in a file system of the primary storage device than a previous placeholder file.
13. The system as recited in claim 10, wherein the file identification data of the secondary file further includes one or more filenames associated with the detected placeholder file.
14. The system as recited in claim 10, wherein the policy engine server further includes a data migration module for moving files between the primary storage device and the secondary storage device, the data migration module configured to delete a data file from the primary storage device, create the detected placeholder file to represent the deleted data file, and copy the data file to the secondary storage device in the form of the secondary file.
15. The system as recited in claim 10, further comprising a client computer system configured to access the placeholder file from the primary storage device and access the secondary file via the placeholder file.
16. The system as recited in claim 10, wherein the content address is generated from a hash function.
17. In a distributed computer having a primary storage device, a secondary storage device and a policy engine server, the policy engine server being configured to manage migration of files between the primary storage device and the secondary storage device by migrating data files on the primary storage device to the secondary storage device and replacing the data files on the primary storage device with placeholder files, a computer program product configured to implement a method of locating and eliminating placeholder files within the primary storage device that are not properly referenced by secondary files stored on the secondary storage device, the computer program product comprising one or more computer readable media having stored thereon computer executable instructions that, when executed by a processor, cause the policy engine server to perform the following:
- identify a placeholder file on a primary storage device that has replaced a data file, the placeholder file including an offline reference to the data file stored as a secondary file on the secondary storage device, wherein the offline reference includes a content address of the secondary file;
- identify the secondary file on the secondary storage device using the content address of the secondary file included in the offline reference of the placeholder file;
- identify an online reference of the secondary file, wherein the online reference includes an address to a particular placeholder file stored on the primary storage device;
- determine that the placeholder file is a missing parent when the particular placeholder file is not the placeholder file; and
- notify a user of the policy engine server when the online reference of the secondary file does not properly reference the placeholder file or the file identification data of the placeholder file.
18. The method as recited in claim 17, wherein the computer executable instructions that cause the server system to identify a secondary file on a secondary storage device using a content address of the secondary file included in the offline reference of the placeholder file further comprise instructions to cause the policy engine server to perform at least one of the following:
- indicating the reason for the improper reference; or
- update the online reference of the secondary file so that it includes the file identification data of the placeholder file.
19. The method as recited in claim 17, wherein the computer executable instructions that cause the server system to identify a secondary file on a secondary storage device using a content address of the secondary file included in the offline reference of the placeholder file further comprise instructions to cause the policy engine server to perform the following:
- perform an itemized scan of data objects on the secondary storage device;
- analyze metadata of each data object to identify the location of the data object; and
- determine whether the data object contains the same content address as the content address in the offline reference of the placeholder file.
20. The method as recited in claim 17, wherein the computer executable instructions that cause the server system to identify a secondary file on a secondary storage device using a content address of the secondary file included in the offline reference of the placeholder file further comprise instructions to cause the policy engine server to perform at least one of the following:
- wherein the file identification data of the placeholder file includes a filename that has been renamed, and wherein the online reference of the secondary file is updated to include the renamed filename of the placeholder file;
- wherein the file identification data of the placeholder file includes an additional filename, and wherein the online reference of the secondary file is updated to include the additional filename of the placeholder file; or
- wherein the placeholder file has been restored from a backup location such that new file identification data of the placeholder file is created, and wherein the online reference of the secondary file is updated to include the new file identification data of the placeholder file.
Type: Grant
Filed: Oct 3, 2006
Date of Patent: Oct 6, 2009
Assignee: EMC Corporation (Hopkinton, MA)
Inventors: Carl F. Hagerstrom (Boulder, CO), Thomas Dixon Hutchinson (Boulder, CO), Shridhar Bharthulwar (Longmont, CO), Paul E. Tinius (Broomfield, CO)
Primary Examiner: James Trujillo
Assistant Examiner: Albert Phillips
Attorney: Workman Nydegger
Application Number: 11/538,313
International Classification: G06F 12/00 (20060101);