Database change data capture within a storage array
A log is identified which includes one or more transactions to a source database. The log is ingested using a processor included in a storage processor complex, where the storage processor complex includes storage and a processor associated with the storage. The ingested information is transformed into change data appropriate for a target database stored at least in part on the storage processor complex. The change data is applied to the target database.
Latest EMC Corporation Patents:
- Combining explicit and implicit feedback in self-learning fraud detection systems
- System and method to enhance phrase search with nested thesaurus parsing
- Systems and methods for bi-directional replication of cloud tiered data across incompatible clusters
- Method and system for adaptive wear leveling in solid state memory
- Handling deletes with distributed erasure coding
Change data capture is a technique in which changes made to one database are recorded and those records are used to replicate the same changes in a second database. For example, the second database may be a backup/recovery database or an analytic database. The amount of data being managed by databases is even larger than ever, which negatively affects the performance of change data capture systems. New change data capture systems with improved performance would be desirable.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
A technique for performing change data capture (CDC) processing is described herein. The technique is performed at least in part by a storage processor complex which includes storage and a processor associated with the storage. A log which includes one or more transactions to a source database is identified. In various embodiments, the log is located in the source database or the storage processor complex. The log is ingested using a processor included in a storage processor complex. The storage processor complex is associated with a target database (e.g., the target database is located at least in part on the storage processor complex). The ingested information is transformed into change data appropriate for a target database stored at least in part on the storage processor complex. In various embodiments, temporary storage is used to store the ingested information and/or transform the ingested information. In one example, the target database is an analytic database. The change data is then applied to the target database. In some embodiments, the change is applied to the target database by a processor included in the storage processor complex (e.g., one that performed a previous step or a new processor not previously involved).
In some embodiments, step 100 includes identifying the physical locations of components (such as blocks) which make up or comprise a log. For example, this may enable a process (e.g., which accesses or reads the log) to access the components directly at the component level and/or without having to go through an operating system or file system. For a variety of reasons, it may be difficult or impossible to pass a higher level command or instructions (e.g., “read <log>” or “open <log>”) to an operating system or file system in order to access a log. Some examples of this are described in further detail below.
In some embodiments, step 100 includes identifying a format or type associated with a log. For example, different source databases may use different log formats or log types when writing a log and in order to properly interpret the information in a log, the format or type is determined or identified. The type or format may, for example, be used to select an appropriate downstream process from a plurality of downstream processes, each of which is configured to work with a particular log type.
At 102, the log is ingested using a processor included in a storage processor complex, wherein the storage processor complex includes the processor and storage associated with the processor. In one example, Informatica Data Replication is used at step 102 to ingest a log.
A storage processor complex includes one or more units (e.g., devices) of storage and one or more associated processors. In some embodiments, a given processor is dedicated to or paired with a given storage device. In some such embodiments, a processor is co-located with its associated storage (e.g., they are in the same device). Alternatively, a processor and its associated storage may be on separate devices and they are connected together via some (e.g., local) connection. In some embodiments, a given processor is not necessarily dedicated to or paired with a given storage device. One example of a storage processor complex is EMC's Symmetrix VMAX. Some storage processor complex embodiments are described in further detail below.
At 104, the ingested information is transformed into change data appropriate for a target database stored at least in part on the storage processor complex. In some embodiments, a target database provider (e.g., EMC Corporation if the target database is a Greenplum database) provides applications, tools, schema, and/or descriptions which enable the transformation of the ingested information into a form or style that can be understood by and/or is relevant to the target database. This may include, for example, grooming attribute types (which are permitted in the source database) into ones that are permitted in the target database. In another example, a source database may have different rules regarding names or paths (e.g., names are case sensitive/names are not case sensitive, spaces permitted in names/spaces not allowed in names, etc.) and this is cleaned up during transformation at 104. In one example, a source database is a transactional database and a target database is a data warehouse and transforming includes transforming the transactional data into a form appropriate for the data warehouse. In some embodiments, no transformation is performed, for example because any data grooming tasks are performed internally by the target database functions.
In some embodiments, a processor which performs step 102 also performs step 104. In some other embodiments, some other processor in a storage processor complex performs step 104.
The change data is applied to the target database at 106. In some embodiments, the same processor which ingested the log at 102 and/or which transformed the ingested information at 104 performs step 106. In some embodiments, some other processor in the storage processor complex performs step 106. For example, a processor dedicated to or associated with the target database may perform step 106 but does not perform step 102 and/or step 104. The process of
Since the target database is stored at least in part on the storage processor complex and at least some of the steps in
In some embodiments, the example process of
In some embodiments, the process of
In various embodiments, various systems perform the example process of
In this example, source database 200 stores its log (i.e., log 246c) on storage processor complex 240. In some embodiments, a source database (e.g., 200) is an operational database. For example, a source database may be owned by a retailer and the source database stores and manages transactions (e.g., purchases, returns, or exchanges) which are updated in substantially real-time as transactions are performed. In some embodiments, a source database is an online transaction processing (OLTP) database. In one example, a source database is an Oracle database.
In this example, processor 248a accesses log 246c and sends it to temporary storage 246a. Processor 248a in this example is paired with temporary storage 246a and in this example “ingests” log 246c. Processor 248a (paired with temporary storage 246a) then transforms the ingested data in temporary storage 246a into change data. Either processor 248a or processor 248b then applies the change data in temporary storage 246a to target database 246b. The processors which are described as performing certain steps (above) are merely exemplary and any processor in storage processor complex 240 may be used for any step. Target database 246b may already have data or information stored in it prior to applying change data. For example, log 246c may record changes to source database 200 between points in time t1 and t2 or between states s1 and s2. Target database 246b, prior to application of the change data, may be in a state or time corresponding to t1 or s1.
In some embodiments, a target database (e.g., 246b) is a specialized database where the processes and/or structure of the database are optimized or specialized for a special purpose or function. For example, a target database may be an analytic database which is designed for analyzing large amounts of data (sometimes referred to as big data analytics). To continue the retailer example from above, the source database may be an operational database used to record transactions of the retailer in substantially real-time. The target database is (at least in this example) an analytic database (e.g., a Greenplum database) where data analysis is performed on the transactions. Some example data analysis performed on the target database include fraud detection, inventory management, earnings forecasting, and so on. This example configuration enables big data analysis to be performed on up-to-date retail transactions, without slowing down a retailer's operational server with analysis. It also enables big data analysis to be performed on a database system which is specifically designed for that task (e.g., the target database may be optimized for fast transaction servicing or to always be “up” since transaction down time would be disastrous for a retailer). In some embodiments, a target database is an online analytical processing (OLAP) database. In some embodiments, a target database is a backup and/or recovery database.
One advantage to the system shown in
In some embodiments, a processor in a storage processor complex has prioritized access to storage in a storage processor complex over a host or other processor not in the storage processor complex. For example, if processor 248a (or processors 248b or 248c) in storage processor complex 240 and application host 260 both desire access to target database 246b in storage processor complex 240, processor 248a (or processors 248b or 248c) will be serviced first in such embodiments. This makes processing done by processor(s) on a storage processor complex faster than processing done on or by application host 260, in addition to or as an alternative to the performance improvement described above.
Another advantage to using processor(s) in a storage processor complex to perform CDC related processing is that there may be fewer potential points of failure. For example, if CDC processing is performed by application host 260 and application host 260 or some other device between network 220 and application host 260 fails, then the CDC process will fail. In contrast, CDC processing which uses one or more of processors 248a-248c will be successful even if application host 260 fails. There are, in other words, fewer potential points of failure. Additionally, when the software components required for CDC related processing are tested to run together in the same closed environment, there is less potential for overall processing failure.
In this example, internal connection 346 is internal to storage processor complex 340 and is not exposed to devices which are not part of storage processor complex 340. In one example, internal connection 346 includes an InfiniBand connection or network. Internal connection 346 may enable processing done by processor 344a or 344b to be faster than processing done by a processor outside of the storage processor complex (e.g., application host 360) when access to/from storage in storage processor complex 340 is desired (e.g., to/from log 342a, temporary storage 342b, and/or target database 342c).
Log 342a is located on storage processor complex 340. For example, source database 300 writes log 342a to storage processor complex 340 via network 320. In some embodiments, network 320 includes a SAN and source database 300 goes through the SAN to write log 342a on storage processor complex 340.
Referring back to
For the same reasons described above, the example system shown in
After CDC changes are applied to a target database (e.g., using the exemplary systems shown in
Returning to
At 400, one or more files associated with a source database are identified.
Returning to
At 404 in
In this example, the log file is stored on a mirrored system and so the log file in question (i.e., /usr/oracle/oradata/64817/redo01.log) is located on more than one mirror. Section 604 of response 602 includes information about the mirror physical extents (e.g., primary mirrors) and section 606 of response 602 shows information about the mirror physical devices (e.g., secondary mirrors). In various embodiments, block locations associated with mirror physical extents (e.g., section 604) and/or block locations associated with mirror physical devices (e.g., section 606) are used at step 404 in
In some mirrored systems, it is desirable to access a block from a mirror with the highest availability (e.g., based on references, I/O accesses, etc.). In some embodiments, when blocks are requested from a mirrored system, the mirrored system selects which mirror has the highest availability and obtains the requested blocks from the selected mirror. The mirrored system may have the best idea which of its mirrors has the highest availability at any given time and it may be convenient and/or performance may be best when the mirrored system itself selects which mirror to use.
An advantage to the technique described above is that it works even if a log is located on a device which has a mountable operating system. As such, the technique works for both mountable operating systems and unmountable operating systems, which makes the technique attractive because it can be universally used. For example, in
Depending upon the particular commands available to a system, various alternates to
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
Claims
1. A method, comprising:
- identifying a log which includes one or more transactions to a source database, wherein: the log is stored on a plurality of mirrors in a mirrored system; and identifying the log includes generating a block map associated with the log, wherein the block map includes one or more block locations where one or more blocks which comprise the log reside within the mirrored system, including one or more of the following: (1) a block location associated with a mirror physical extent or (2) a block location associated with a mirror physical device;
- ingesting the log using a processor included in a storage processor complex, including by using the block locations in the block map to obtain, from a mirror that has a highest availability in the plurality of mirrors, the blocks which comprise the log;
- transforming the ingested log into change data appropriate for a target database stored at least in part on the storage processor complex; and
- applying the change data to the target database.
2. The method of claim 1, wherein the source database includes one or more of the following: an operational database or an online transaction processing (OLTP) database.
3. The method of claim 1, wherein the target database includes one or more of the following: an analytic database or an online analytical processing (OLAP) database.
4. The method of claim 1, wherein the storage processor complex includes one or more of the following: an internal connection or an InfiniBand connection.
5. The method of claim 1, wherein:
- obtaining includes requesting from the mirrored system; and
- the mirrored system, in response to the request, selects which mirror has the highest availability.
6. The method of claim 5, wherein the mirrored system selects which mirror has the highest availability using one or more of the following: a number of references to a particular mirror or a number of I/O accesses to a particular mirror.
7. The method of claim 1, wherein:
- the source database writes the log to a first storage device in the storage processor complex, wherein a first processor is co-located with the first storage device;
- the processor which ingests the log is a second processor, wherein the second processor is co-located with a second storage device; and
- the second storage device is used to store the ingested log and the change data, at least temporarily.
8. The method of claim 1, wherein:
- the source database writes the log to a first storage device in the storage processor complex;
- a second storage device in the storage processor complex is used to store the ingested log and the change data, at least temporarily; and
- the first storage device, the processor which ingests the log, and the second storage device are connected by an internal connection which is internal to the storage processor complex and is not exposed to devices outside of the storage processor complex.
9. The method of claim 1, wherein generating the block map includes:
- identifying one or more files associated with the source database;
- identifying one or more logical volumes associated with the source database;
- mapping the logical volumes to physical blocks; and
- determining which of the physical blocks are associated with a log file.
10. A system, comprising:
- one or more processors, included in a storage processor complex, which are configured to: identify a log which includes one or more transactions to a source database, wherein: the log is stored on a plurality of mirrors in a mirrored system; and identifying the log includes generating a block map associated with the log, wherein the block map includes one or more block locations where one or more blocks which comprise the log reside within the mirrored system, including one or more of the following: (1) a block location associated with a mirror physical extent or (2) a block location associated with a mirror physical device; ingest the log, including by using the block locations in the block map to obtain, from a mirror that has a highest availability in the plurality of mirrors, the blocks which comprise the log; transform the ingested log into change data appropriate for a target database stored at least in part on the storage processor complex; and apply the change data to the target database; and
- storage, included in the storage processor complex, which is configured to store the target database.
11. The system of claim 10, wherein the source database includes one or more of the following: an operational database or an online transaction processing (OLTP) database.
12. The system of claim 10, wherein the target database includes one or more of the following: an analytic database or an online analytical processing (OLAP) database.
13. The system of claim 10, wherein the storage processor complex includes one or more of the following: an internal connection or an InfiniBand connection.
14. The system of claim 10, wherein:
- obtaining includes requesting from the mirrored system; and
- the mirrored system, in response to the request, selects which mirror has the highest availability.
15. The system of claim 14, wherein the mirrored system selects which mirror has the highest availability using one or more of the following: a number of references to a particular mirror or a number of I/O accesses to a particular mirror.
16. The system of claim 10, wherein:
- the source database writes the log to a first storage device in the storage processor complex, wherein a first processor is co-located with the first storage device;
- the processor which ingests the log is a second processor, wherein the second processor is co-located with a second storage device; and
- the second storage device is used to store the ingested log and the change data, at least temporarily.
17. The system of claim 10, wherein:
- the source database writes the log to a first storage device in the storage processor complex;
- a second storage device in the storage processor complex is used to store the ingested log and the change data, at least temporarily; and
- the first storage device, the processor which ingests the log, and the second storage device are connected by an internal connection which is internal to the storage processor complex and is not exposed to devices outside of the storage processor complex.
18. The system of claim 10, wherein generating the block map includes:
- identifying one or more files associated with the source database;
- identifying one or more logical volumes associated with the source database;
- mapping the logical volumes to physical blocks; and
- determining which of the physical blocks are associated with a log file.
19. A computer program product, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
- identifying a log which includes one or more transactions to a source database, wherein: the log is stored on a plurality of mirrors in a mirrored system; and identifying the log includes generating a block map associated with the log, wherein the block map includes one or more block locations where one or more blocks which comprise the log reside within the mirrored system, including one or more of the following: (1) a block location associated with a mirror physical extent or (2) a block location associated with a mirror physical device;
- ingesting the log using a processor included in a storage processor complex, including by using the block locations in the block map to obtain, from a mirror that has a highest availability in the plurality of mirrors, the blocks which comprise the log;
- transforming the ingested log into change data appropriate for a target database stored at least in part on the storage processor complex; and
- applying the change data to the target database.
20. The computer program product of claim 19, wherein:
- obtaining includes requesting from the mirrored system; and
- the mirrored system, in response to the request, selects which mirror has the highest availability.
21. The computer program product of claim 20, wherein the mirrored system selects which mirror has the highest availability using one or more of the following: a number of references to a particular mirror or a number of I/O accesses to a particular mirror.
22. The computer program product of claim 19, wherein:
- the source database writes the log to a first storage device in the storage processor complex, wherein a first processor is co-located with the first storage device;
- the processor which ingests the log is a second processor, wherein the second processor is co-located with a second storage device; and
- the second storage device is used to store the ingested log and the change data, at least temporarily.
23. The computer program product of claim 19, wherein:
- the source database writes the log to a first storage device in the storage processor complex;
- a second storage device in the storage processor complex is used to store the ingested log and the change data, at least temporarily; and
- the first storage device, the processor which ingests the log, and the second storage device are connected by an internal connection which is internal to the storage processor complex and is not exposed to devices outside of the storage processor complex.
24. The computer program product of claim 19, wherein generating the block map includes:
- identifying one or more files associated with the source database;
- identifying one or more logical volumes associated with the source database;
- mapping the logical volumes to physical blocks; and
- determining which of the physical blocks are associated with a log file.
20040107296 | June 3, 2004 | Donker |
20050193034 | September 1, 2005 | Kitsuregawa et al. |
20070299885 | December 27, 2007 | Pareek et al. |
20110167229 | July 7, 2011 | Szalay et al. |
Type: Grant
Filed: Mar 8, 2013
Date of Patent: May 3, 2016
Assignee: EMC Corporation (Hopkinton, MA)
Inventors: Kenneth J Taylor (Franklin, MA), Balakrishnan Ganeshan (Foster City, CA), Lance Fang (Elk Grove, CA)
Primary Examiner: Kris Mackes
Application Number: 13/790,341
International Classification: G06F 17/30 (20060101);