Method and apparatus for integrating relational and hierarchical data
Methods and apparatus for integrating relational and hierarchical data, schema definitions, and queries in a data processing system are provided. It is determined if one or more schema definitions or one or more query expressions are provided as input to the data processing system. The one or more schema definitions are converted into an intermediate schema language component of an intermediate data language when one or more schema definitions are provided. The one or more query expressions are converted into an intermediate query language component of the intermediate data language when one or more query expressions are provided. The intermediate schema language component or the intermediate query language component is compiled in an intermediate data language processing engine into a run-time representation in accordance with a relational-hierarchical analysis.
Latest IBM Patents:
The present invention relates generally to data processing techniques and, more particularly, to techniques for integrating relational and hierarchical data, schema definitions, and queries in a data processing system.
BACKGROUND OF THE INVENTIONTwo main standards of describing and querying data have evolved. One of these standards is based on a relational model that is used by most modern databases. The other is based on a hierarchical model, examples of which include, XML (Extensible Markup Language), XML Schema Language (XSD) and XQuery Language.
XML is a specification language created to describe data interchange formats and data semantics. An XML document consists of data annotation tags that represent relationships between data values. An XML schema is an auxiliary document describing the structure of an XML document making it easier to interpret. XQuery is a language for querying information from XML documents.
Before the inception of XML, the majority of data was stored in relational tables. A relational table is a data structure that represents a mathematical mapping between one or more types of data. Relational databases store information by organizing data in normalized tables where the stored information can be retrieved through querying languages based on Relational Algebra, an example being Structured Query Language (SQL).
As XML continues to gain popularity, the need for effective integration of hierarchical data expressions and relational data expressions grows. Effective integration between the two has proven difficult because of key differences between them. For example, XML documents organize data in a hierarchical structure with multiple levels of nesting, while the relational model organizes data in flat tables with inter-table functional dependencies. Additionally, in hierarchical data expressions, document order of a node (the position each node occurs in the document) is important, while in relational data expressions document order is not relevant.
Previous attempts have been made at developing techniques to effectively integrate hierarchical data schemas and relational data schemas. These attempts have suffered from problems such as excessive use of the computationally very expensive “join” operation. Such attempts include, XML shredding, as described in P. Bohannon et al., “LegoDB: Customizing Relational Storage for XML Documents,” 2002, mapping XML data values to a set of predefined tables based on node type, and mapping XML data to a relational table by number-encoding each of the XML data values.
Attempts have also been made to convert queries written over hierarchical data into queries over relational data. These attempts have suffered shortfalls similar to those described above. These attempts are described in Y. Diao et al., “Towards an Internet-Scale XML Dissemination Service,” VLDB, 2004, and C. Koch et al., “FluXQuery: An Optimizing XQuery Processor for Streaming XML Data,” VLDB, 2005. These attempts include a pure XML engine to handle processing, and translating hierarchical queries into relational queries.
Various techniques have been proposed for specifying “continuous queries” over steams. In these environments, data is not fixed, but arrives one message at a time in one or more continuous streams. Queries define views over the entire history of one or more streams. Rather than receiving a single result set, subscribers to continuous queries receive a continuously updated result set reflecting how the view changed as a result of the changes to the streams on which it depends. In a mixed environment, any of the following combinations are possible: schemas defined in a relational (SQL) or hierarchical style (XML); messages delivered in a relational (flat) or hierarchical format (XML); and queries written in a relation language (SQL) or hierarchical language (XQUERY/XSLT).
SUMMARY OF THE INVENTIONThe present invention provides techniques for integrating relational and hierarchical data, schema definitions, and queries in a data processing system through the use of an intermediate data language. While not limited thereto, such techniques have been developed and tested for use with XML documents and schemas, and XQuery and SQL language for querying.
By way of example, in one aspect of the invention, a method for integrating relational and hierarchical data, schema definitions, and queries in a data processing system is provided. It is determined if one or more schema definitions or one or more query expressions are provided as input to the data processing system. The one or more schema definitions are converted into an intermediate schema language component of an intermediate data language when one or more schema definitions are provided. The one or more query expressions are converted into an intermediate query language component of the intermediate data language when one or more query expressions are provided. The intermediate schema language component or the intermediate query language component is compiled in an intermediate data language processing engine into a run-time representation in accordance with a relational-hierarchical analysis.
In an additional embodiment of the present invention, the one or more schema definitions may be relational schema or hierarchical schema. Further, the one or more query expressions may be one or more relational query expressions or one or more hierarchical query expressions.
In another additional embodiment of the present invention, the compiling step may include analyzing the intermediate schema language component or the intermediate query language component to capture relationships between at least one of relational tuples and hierarchical data. The steps of determining, converting and compiling may be repeated for additional input data, and the compiling step may be performed in accordance with relationships between at least one of relational tuples and hierarchical data captured from previously input data.
In further embodiments of the present invention, the analyzing step may include the step of computing functional dependency information for augmentation with the intermediate schema language component or the intermediate query language component. The functional dependency information may be utilized to determine redundant cells and a hierarchical representation of the intermediate schema language component or the intermediate query language component.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The following description will illustrate the present invention using exemplary data processing system architecture. It should be understood, however, that the invention is not limited to use with any particular system architecture. The invention is instead more generally applicable to any task that would benefit from the integration of hierarchical data expressions and relational data expressions.
As will be illustrated below the present invention introduces techniques for integrating relational and hierarchical data, schema definitions, and queries through the use of an intermediate data language.
As discussed herein the term “hierarchical data” may generally refer to a structure of data having several levels arranged in a tree-like structure. The term “relational data,” as used herein, may refer to a structure of data that is represented as a series of mathematical relations. By way of example, a relational database stores information by organizing data into normalized flat tables, without the multiple levels of nesting seen in a database with a hierarchical structure. The term “functional dependency” as used herein may refer to a mathematical relation between sets of columns in a given database. If a first column set depends on a second column set such that for a unique combination of values for the first column set there is at most one value for each column of the second column set, then a functional dependency exists between the two columns.
Referring initially to
As shown in
Relational query expressions 106 are input into a relational query to IDL query converter 116 which converts the data into an intermediate query language 124. Once in intermediate query language 124, the data can be input into block 130 in accordance with the steps described above. Hierarchical query expressions 108 are input into a hierarchical query to IDL query converter 118 which converts the data into intermediate query language 124. Once in intermediate query language 124, the data can be input into block 130 in accordance with the steps described above.
The intermediate data language is based on classical relational algebra. This intermediate data language is extended however, to be substantially compatible with nested data. The intermediate data language encompasses core relational operators for queries, including, the “select”, “extend”, “project”, “top k”, “join”, “merge”, “combine” and “split” operators in addition to several other arithmetic, logical and comparison operators.
Referring now to
Referring now to
The set of nested relations is also equivalent to a denormalized flat relation which is the join of the set of normalized flat relations. The intermediate language treats the query as if it were executed over this denormalized relation, however, at execution time, the compiler actually represents the data more compactly either as a hierarchical structure, or as a set of normalized tables. The denormalized form is one that would be very inefficient if actually materialized, but it allows queries either in an SQL-like form or in an XQUERY-like form to be interpreted appropriately. This assumes that the compiler retains the functional dependencies.
Referring now to
Referring now to
The upper table of
Referring now to
In a streaming system, the compiler for the intermediate data language has the additional task of generating efficient code for continuously updating views of stream data as messages appending new tuples to streams arrive. A type analysis step computes additional properties of columns of relation given the intermediate language expression that derived that relation, and given the properties of the relation or relations that were input to that expression. Starting with the user-specified schema of the input streams, the system will successively apply steps of type analysis to views derived from these streams, and then to view derived from these views, until these properties are derived for all views. These properties allow the run-time to efficiently compute not just the current value of each row and column of the relation, but also will compute whether and to what degree that value can change.
This information is used both to advantageously compute whether an intermediate value needs to be saved, and it can also be used to signal to the consumer of such a view whether the value is final. A value which cannot change any more is final, and once it has been propagated to any views which need to know the vale it can be discarded. A consumer may wish to distinguish between the case where the number of responses received within the deadline is currently zero and the case where the number of responses received within the deadline is finally zero, because the deadline has passed.
The additional information computed by type analysis includes: the maximum positive and negative components of values of aggregate types and the maximum number of steps needed to reach finality; and whether the column is masked as a result of another Boolean selection value, as in SELECT*FROM T WHERE X>Y, in which an intermediate column representing the Boolean intermediate value of X>Y is created, and each column is typed as being masked by this intermediate value.
Functional dependencies alone can specify how many values are in a given column. For example, if column X depends on columns (K1, K2,) and there are 100 values for K1 and 4 values for K2, there could be at most 4*100 values of X.
Referring now to
If it is determined that the data is not schematized in block 702, it is assumed that the data is a query. Block 710 then determines whether the data is relational. If the data is relational, then in block 712 a relational query converter converts data to an intermediate query language. If the data is not relational, it is assumed to be hierarchical, and in block 714 a hierarchical query converter converts data to intermediate query language. Once the data is in intermediate form, block 716 processes the data through an intermediate data language processing engine.
Referring now to
As shown, the computer system may be implemented in accordance with a processor 810, a memory 812, I/O devices 814, and a network interface 816, coupled via a computer bus 818 or alternate connection arrangement.
It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc.
In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, scanner, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, printer, etc.) for presenting results associated with the processing unit.
Still further, the phrase “network interface” as used herein is intended to include, for example, one or more transceivers to permit the computer system to communicate with another computer system via an appropriate communications protocol.
Software components including instructions or code for performing the methodologies described herein may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU.
Although illustrative embodiments of the present invention have been described herein with references to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.
-
- The following listing of claims will replace all prior versions and listings of claims in the above-referenced application:
Claims
1. A method for integrating relational and hierarchical data, schema definitions, and queries in a data processing system, comprising the steps of:
- converting one or more schema definitions into an intermediate schema language component of an intermediate data language when one or more schema definitions are provided;
- converting one or more query expressions into an intermediate query language component of the intermediate data language when one or more query expressions are provided; and
- compiling, in an intermediate data language processing engine, at least one of the intermediate schema language component and the intermediate query language component into a run-time representation in accordance with a relational-hierarchical analysis.
2. The method of claim 1, wherein, in the step of converting the one or more schema definitions, the one or more schema definitions comprise at least one of relational schema and hierarchical schema.
3. The method of claim 1, wherein, in the step of converting the one or more query expressions, the one or more query expressions comprise at least one of one or more relational query expressions and one or more hierarchical query expressions.
4. The method of claim 1, wherein the compiling step comprises the step of analyzing at least one of the intermediate schema language component and the intermediate query language component to capture relationships between at least one of relational tuples and hierarchical data.
5. The method of claim 4, further comprising the step of choosing a preferred run-time representation for the intermediate schema language component in accordance with the analysis of at least one of the intermediate schema language component and the intermediate query language component.
6. The method of claim 4, further comprising the step of repeating the converting and compiling steps for additional input data, wherein the compiling step is performed in accordance with relationships between at least one of relational tuples and hierarchical data captured from previously input data.
7. The method of claim 4, wherein the analyzing step comprises the step of computing functional dependency information for augmentation with at least one of the intermediate schema language component and the intermediate query language component.
8. The method of claim 7, wherein, in the analyzing step, the functional dependency information is utilized to determine redundant cells and a hierarchical representation of at least one of the intermediate schema language component and the intermediate query language component.
9. The method of claim 7, wherein at least one of the intermediate schema language component and the intermediate query language component comprises inner column names prepended with the names of the inner relations.
10. The method of claim 7, wherein the analyzing step comprises the step of computing at least one of maximum value ranges, maximum steps to value finality, and masking of columns as a result of a Boolean selection value.
11. The method of claim 1, wherein the one or more query expressions comprise continuous queries over streaming data.
12. The method of claim 1, wherein at least one of the intermediate schema language component and the intermediate query language component comprises classical relational algebra.
13. The method of claim 1, wherein the intermediate schema language component comprises one or more of the core relational operators.
14. Apparatus for integrating relational and hierarchical data, schema definitions, and queries in a data processing system, comprising:
- a memory; and
- at least one processor coupled to the memory and operative to: (i) convert one or more schema definitions into an intermediate schema language component of an intermediate data language when one or more schema definitions are provided; (ii) convert one or more query expressions into an intermediate query language component of the intermediate data language when one or more query expressions are provided; and (iii) compile, in an intermediate data language processing engine, at least one of the intermediate schema language component and the intermediate query language component into a run-time representation in accordance with a relational-hierarchical analysis.
15. The apparatus of claim 14, wherein, in the operation of converting the one or more schema definitions, the one or more schema definitions comprise at least one of relational schema and hierarchical schema.
16. The apparatus of claim 14, wherein, in the operation of converting the one or more query expressions, the one or more query expressions comprise at least one of one or more relational query expressions and one or more hierarchical query expressions.
17. The apparatus of claim 14, wherein the compiling operation comprises the step of analyzing at least one of the intermediate schema language component and the intermediate query language component to capture relationships between at least one of relational tuples and hierarchical data.
18. The apparatus of claim 17, further comprising the operation of repeating the converting and compiling steps for additional input data, wherein the compiling operation is, performed in accordance with relationships between at least one of relational tuples and hierarchical data captured from previously input data.
19. The apparatus of claim 17, wherein the analyzing operation comprises the step of computing functional dependency information for augmentation with at least one of the intermediate schema language component and the intermediate query language component.
20. An article of manufacture for integrating relational and hierarchical data, schema definitions, and queries in a data processing system, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
- converting one or more schema definitions into an intermediate schema language component of an intermediate data language when one or more schema definitions are provided;
- converting one or more query expressions into an intermediate query language component of the intermediate data language when one or more query expressions are provided; and
- compiling, in an intermediate data language processing engine, at least one of the intermediate schema language component and the intermediate query language component into a run-time representation in accordance with a relational-hierarchical analysis.
Type: Application
Filed: Sep 29, 2006
Publication Date: Apr 3, 2008
Applicant: International Business Machines Corporation (Armonk, NY)
Inventors: Andrey Khorlin (Sunnyvale, CA), Robert Evan Strom (Ridgefield, CT), Lu Tian (Mountain View, CA)
Application Number: 11/541,260
International Classification: G06F 17/30 (20060101);