Systems and methods for implementing a computer language type system
The invention provides systems and methods for implementing a computer language type system by augmenting finite state automata algorithms to accommodate symbols having both subtype relationships and nested types. To make the classical automata algorithms work for a type system with subtypes, the finite state automaton for a data type is augmented with additional transitions that include secondary symbols, wherein the secondary symbols are subtypes of symbols in the alphabet of the finite state automaton. When one data type is compared to another, both the names and the contents must be compared.
This application claims priority to U.S. Provisional Patent Application No. 60/573,401, entitled SYSTEMS AND METHODS FOR IMPLEMENTING A COMPUTER LANGUAGE TYPE SYSTEM, by Paul J. Lucas, Daniela D. Florescu, and Fabio Riccardi, filed May 21, 2004, which is hereby incorporated herein by reference.
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD OF THE INVENTION
The invention generally relates to implementing a type system for a computer language, and in particular to implementing a complicated type system for a computer language, including but not limited to a programming language and a data processing language, by augmenting finite state automata.
BACKGROUND
Types are used by programming languages to determine whether particular data is in a particular form. Conventional programming languages define type systems having types such as integer, string, etc., and user-defined structures. XQuery, a language for querying XML documents, defines a rich and complicated type system. In addition to the types found in conventional programming languages, the type system of languages like XQuery allows new types to be created using sequences (e.g., integer followed by string), alternation (e.g., integer or string), shuffle-product (e.g., integer and string in either order), and occurrences of those (zero or one, zero or more, and one or more).
Types are used by XQuery to determine whether XML data is in the required form. To determine the type of XML data for atomic types, an XQuery processor can use a class hierarchy to model the type hierarchy as defined in the XQuery specification. For example, the built-in or atomic types like xs:integer, xs:boolean, xs:date, etc. can be represented by singleton instances of classes comprising an inheritance hierarchy.
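The class-hierarchy approach described above can be sketched as follows. This is a minimal illustration only, not the processor's actual implementation: the class names, the instance names, and the helper is_subtype are hypothetical, and only three atomic types are modeled.

```python
# Hypothetical classes mirroring a small part of the atomic type hierarchy.
class AnyAtomicType:
    name = "xs:anyAtomicType"


class XsDecimal(AnyAtomicType):
    name = "xs:decimal"


class XsInteger(XsDecimal):  # xs:integer is derived from xs:decimal
    name = "xs:integer"


class XsBoolean(AnyAtomicType):
    name = "xs:boolean"


# One shared (singleton-style) instance per atomic type.
DECIMAL = XsDecimal()
INTEGER = XsInteger()
BOOLEAN = XsBoolean()


def is_subtype(t, u):
    """True if the type modeled by t is the same as, or derived from, u's type."""
    return isinstance(t, type(u))
```

With this arrangement the language's own inheritance check answers the atomic subtype question, e.g. is_subtype(INTEGER, DECIMAL) holds while is_subtype(DECIMAL, INTEGER) does not.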
The invention provides systems and methods for implementation of a computer language type system by augmenting finite state automata algorithms to accommodate symbols having both subtype relationships and nested types.
In various embodiments of the invention, augmented finite state automata are used to determine whether data is in the required form by answering questions such as: Are two types equal? Is one type a subset of another? Do two types intersect?
The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an,” “one” and “various” embodiments in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
XQuery allows complex types to be created from atomic types using sequences, alternation, shuffle-product, occurrences, etc. A complex XQuery type such as (xs:integer|xs:string)*, meaning zero or more integers or strings, resembles the regular expressions used in Unix shells and other programming languages. Regular expressions, and XQuery types, can be represented internally using trees. The previous example can be represented using a tree as shown in
In prior art systems, complex types can be represented as finite state automata (FSA). The previous example tree can be represented using a non deterministic FSA (NFA) as shown in
NFAs are generally not easy for computers to deal with because of their intrinsically nondeterministic nature. Every NFA can be converted into a deterministic FSA (DFA) using classical algorithms. The previous example NFA as shown in
To answer the question, “Are two types equal?”, one can use the equation: T = U ≡ T ⊂ U and U ⊂ T, that is: type T is equal to type U if and only if T is a subtype of U and U is also a subtype of T.
To answer the question, “Is one type a subtype of another?”, one can use the equation: T ⊂ U ≡ T ∩ U^c = ∅, that is: type T is a subtype (⊂) of type U only if the intersection (∩) of T and the complement of U (U^c) is empty (∅). The complement of a DFA can be computed by merely flipping its accepting and non-accepting states, provided the DFA has a transition for every symbol in every state. A DFA is empty only if there is no way to transition from its start state to any accepting state.
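The subtype test above can be sketched with classical DFA operations. The following is a minimal illustration under stated assumptions, not the patented implementation: the DFA class and its field names are hypothetical, both automata are assumed to be complete and to share one alphabet, and the intersection is computed directly with a product construction rather than through a union. The symbols "i" and "s" stand in for xs:integer and xs:string.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict          # (state, symbol) -> state; must be total
    start: object
    accepting: frozenset


def complement(d):
    # Flip accepting and non-accepting states (valid for a complete DFA).
    return DFA(d.states, d.alphabet, d.delta, d.start, d.states - d.accepting)


def intersection(a, b):
    # Product construction: run both automata in lock step.
    assert a.alphabet == b.alphabet
    states = frozenset((p, q) for p in a.states for q in b.states)
    delta = {((p, q), s): (a.delta[p, s], b.delta[q, s])
             for (p, q) in states for s in a.alphabet}
    accepting = frozenset((p, q) for (p, q) in states
                          if p in a.accepting and q in b.accepting)
    return DFA(states, a.alphabet, delta, (a.start, b.start), accepting)


def is_empty(d):
    # Breadth-first search from the start state for any accepting state.
    seen, queue = {d.start}, deque([d.start])
    while queue:
        state = queue.popleft()
        if state in d.accepting:
            return False
        for s in d.alphabet:
            nxt = d.delta[state, s]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def is_subtype(t, u):
    # T is a subtype of U when T intersected with the complement of U is empty.
    return is_empty(intersection(t, complement(u)))


ALPHABET = frozenset({"i", "s"})

# T accepts i* (zero or more integers); "dead" is a non-accepting sink state.
T_DFA = DFA(frozenset({0, "dead"}), ALPHABET,
            {(0, "i"): 0, (0, "s"): "dead",
             ("dead", "i"): "dead", ("dead", "s"): "dead"},
            0, frozenset({0}))

# U accepts (i|s)* (zero or more integers or strings).
U_DFA = DFA(frozenset({0}), ALPHABET,
            {(0, "i"): 0, (0, "s"): 0},
            0, frozenset({0}))
```

Here is_subtype(T_DFA, U_DFA) holds because every sequence of integers is also a sequence of integers or strings, while the reverse check fails on any sequence containing a string.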
To answer the question, “Do types intersect?”, one can use one of DeMorgan's laws since performing a union (∪) and an extra complement is less expensive than performing an intersection: T ∩ U^c = (T^c ∪ U)^c.
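The identity can be checked on ordinary finite sets. The sketch below is illustrative only: the universe and the sets T and U are made-up values standing in for the languages accepted by two types, and complement is a hypothetical helper.

```python
# Hypothetical finite universe standing in for "all possible values".
UNIVERSE = frozenset(range(10))
T = frozenset({1, 2, 3, 4})
U = frozenset({3, 4, 5})


def complement(s):
    # Complement relative to the finite universe above.
    return UNIVERSE - s


# Left-hand side: intersect T with the complement of U directly.
direct = T & complement(U)

# Right-hand side: one union plus an extra complement, per DeMorgan's law.
via_union = complement(complement(T) | U)
```

Both expressions yield the same set, which is why an implementation that already has union and complement does not need a separate intersection routine.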
Since all the questions can be answered by performing a union, this union needs to be performed where T and U are DFAs representing types. To do this one can add a new start and end state (state zero and state one in
In classical regular languages, the symbols of the alphabet of a language have no relationship to each other. In languages with a complex type system like XQuery's, even an atomic type can be a subtype of another, e.g., xs:integer is a subtype of xs:decimal. This affects the result of intersection. In classical regular languages, a language L accepting xs:integer and another language R accepting xs:decimal have no intersection. However, in XQuery, because xs:integer is a subtype of xs:decimal, the intersection of L and R is xs:integer.
In various embodiments, the classical automata algorithms can be made to work by augmenting the joint alphabet with all the symbols that are subtypes of the original symbols in L∪R. Then, for each transition ‘t’ from state s_i to state s_j, a transition ‘u’ from s_i to s_j can be added such that:
- symbol(u) ⊂ symbol(t) and symbol(u) ∈ symbols(L) ∪ symbols(R).
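The augmentation rule can be sketched as follows. This is a simplified illustration, not the actual implementation: transitions are modeled as hypothetical (source, symbol, target) triples, and the SUBTYPES table, which maps a symbol to its proper subtypes, is an assumed stand-in for the real type hierarchy. The sketch also does not check whether the added transitions keep the automaton deterministic.

```python
def augment(transitions, subtypes, joint_symbols):
    """Return the transitions plus, for each (si, symbol, sj), an extra
    (si, sub, sj) for every proper subtype `sub` of `symbol` that also
    appears in the joint alphabet of the two languages."""
    augmented = set(transitions)
    for (si, symbol, sj) in transitions:
        for sub in subtypes.get(symbol, ()):
            if sub in joint_symbols:
                augmented.add((si, sub, sj))
    return augmented


# Assumed subtype table: symbol -> set of its proper subtypes.
SUBTYPES = {"xs:decimal": {"xs:integer"}}

# L accepts a single xs:integer; R accepts a single xs:decimal.
L = {(0, "xs:integer", 1)}
R = {(0, "xs:decimal", 1)}

# Joint alphabet: every symbol used by either language.
joint = {symbol for (_, symbol, _) in L | R}

L_aug = augment(L, SUBTYPES, joint)
R_aug = augment(R, SUBTYPES, joint)

# Symbols on which both augmented automata can move from state 0 to state 1.
common = ({s for (_, s, _) in L_aug} & {s for (_, s, _) in R_aug})
```

After augmentation R also carries an xs:integer transition, so the two automata share the symbol xs:integer and a classical intersection no longer comes out empty.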
In various embodiments, for the above example, language R can be augmented to accept xs:integer in addition to xs:decimal prior to performing an intersection as shown in
In classical regular languages, the symbols of the alphabet of languages are atomic, in that they do not comprise smaller or nested components. In the XQuery type system, however, element and attribute types can have wildcards in their names and can have contents. For example, the XQuery type:
- element *:ZipCode {xs:integer}
(which means the element ZipCode in any namespace having a single integer for content) when compared to another element type must compare both names and the contents.
In various embodiments, the name of an element or attribute is a name-test that can have two parts: a prefix and a local name, separated by a colon. A name-test can be of any of the forms: name:name, ε:name, *:name, name:*, and * (which is short for *:*), where name is a constant like ZipCode, ‘ε’ means “empty” and ‘*’ is a wildcard that matches any name or ‘ε’. The strict ordering for subtyping each part of a name-test is given by the two rules:
- name ⊂ * and ε ⊂ *
Hence a name ‘n’ is a subtype of ‘m’ only if both the prefix and the local name of ‘n’ are subtypes of the prefix and the local name of ‘m’, respectively.
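The name-test subtyping rules can be sketched as follows. This is an illustrative sketch only: name-tests are modeled as hypothetical (prefix, local name) tuples, the empty string stands in for ‘ε’, and equal parts are assumed to satisfy the part-wise check so that, for example, foo:bar counts as a subtype of foo:*.

```python
WILDCARD = "*"
EMPTY = ""  # stands in for ε, the empty prefix


def part_is_subtype(a, b):
    # name ⊂ * and ε ⊂ *; identical parts are assumed to qualify as well.
    return a == b or b == WILDCARD


def name_test_is_subtype(n, m):
    # Both the prefix and the local name must be subtypes, part by part.
    return part_is_subtype(n[0], m[0]) and part_is_subtype(n[1], m[1])
```

For instance, name_test_is_subtype(("foo", "bar"), ("foo", "*")) holds, while the reverse comparison does not, because the wildcard is not a subtype of a constant name.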
In various embodiments, the strict subtyping rules can also be used to implement the intersection of the name sets. The intersection of two name-tests can be obtained by intersecting the prefixes and the local names separately, then combining the results into a new name-test. If the given parts in both name-tests are constant names, then they must be equal. Otherwise, the result for each part is the more specific of the two. For any subtype relationship A ⊂ B, A is more specific than B. Hence “name” is the more specific of “name” and “*”, and “ε” is the more specific of “ε” and “*”.
Examples of the strict typing rules are:
- foo:bar ∩ foo:* = foo:bar,
- ε:bar ∩ *:bar = ε:bar, and
- foo:bar ∩ bar:bif = ∅.
In the example “foo:bar ∩ foo:* = foo:bar”, “bar” is picked over “*” because it is more specific.
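The name-test intersection described above can be sketched as follows. This is a minimal illustration, not the actual implementation: name-tests are modeled as hypothetical (prefix, local name) tuples, the empty string stands in for ‘ε’, and None stands in for the empty intersection.

```python
WILDCARD = "*"
EMPTY = ""  # stands in for ε, the empty prefix


def intersect_part(a, b):
    """Intersect one part (prefix or local name) of two name-tests."""
    if a == b:
        return a
    if a == WILDCARD:
        return b      # b is the more specific part
    if b == WILDCARD:
        return a      # a is the more specific part
    return None       # two different non-wildcard parts never intersect


def intersect_name_test(n, m):
    """Intersect prefixes and local names separately, then recombine."""
    prefix = intersect_part(n[0], m[0])
    local = intersect_part(n[1], m[1])
    if prefix is None or local is None:
        return None   # empty intersection
    return (prefix, local)
```

Applied to the three examples above, intersect_name_test(("foo", "bar"), ("foo", "*")) gives ("foo", "bar"), the ε:bar case gives (EMPTY, "bar"), and foo:bar against bar:bif gives None.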
In various embodiments, when two element types are compared, as per the rule, their content also needs to be compared. This can be handled through recursion. The complete rules for subtyping and intersecting two elements ‘e’ and ‘f’ are:
- e ⊂ f = name-test(e) ⊂ name-test(f) and content(e) ⊂ content(f)
- e ∩ f = name-test(e) ∩ name-test(f) and content(e) ∩ content(f)
An example of the complete rule for intersecting two elements ‘e’ and ‘f’ is:
- e = element *:ZipCode {xs:integer}
- f = element USPostalService:* {xs:decimal}
- e ∩ f = element USPostalService:ZipCode {xs:integer}
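The recursive element intersection can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the Element class and the ATOMIC_PARENT table are hypothetical, content is restricted to either a single atomic type name or one nested element, and None stands in for the empty intersection. A full implementation would intersect content automata instead.

```python
from dataclasses import dataclass

WILDCARD = "*"

# Assumed fragment of the atomic hierarchy: child type -> parent type.
ATOMIC_PARENT = {"xs:integer": "xs:decimal"}


def atomic_is_subtype(a, b):
    # Walk up the parent chain from a, looking for b.
    while a is not None:
        if a == b:
            return True
        a = ATOMIC_PARENT.get(a)
    return False


@dataclass(frozen=True)
class Element:
    prefix: str
    local: str
    content: object  # an atomic type name (str) or a nested Element


def intersect_part(a, b):
    if a == b:
        return a
    if a == WILDCARD:
        return b
    if b == WILDCARD:
        return a
    return None


def intersect_content(c, d):
    if isinstance(c, Element) and isinstance(d, Element):
        return intersect_elements(c, d)  # recurse into nested elements
    if isinstance(c, str) and isinstance(d, str):
        if atomic_is_subtype(c, d):
            return c                     # c is the more specific type
        if atomic_is_subtype(d, c):
            return d
    return None


def intersect_elements(e, f):
    # e ∩ f = name-test(e) ∩ name-test(f) and content(e) ∩ content(f)
    prefix = intersect_part(e.prefix, f.prefix)
    local = intersect_part(e.local, f.local)
    content = intersect_content(e.content, f.content)
    if prefix is None or local is None or content is None:
        return None
    return Element(prefix, local, content)


e = Element("*", "ZipCode", "xs:integer")
f = Element("USPostalService", "*", "xs:decimal")
result = intersect_elements(e, f)
```

For the example above, result is the element USPostalService:ZipCode with content xs:integer: each name part takes the more specific of the two, and the content takes the more specific atomic type.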
The present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.
In some embodiments, the present invention includes a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.
Claims
1. A method for implementing a data type system for use with a computer language, comprising the steps of:
- accessing a computer language document using a programming language, wherein the programming language defines a plurality of complex data types and subtypes that specify whether data elements within the document are in a correct form;
- creating a minimized deterministic finite state automata for use with the data types and subtypes used by the programming language, wherein the finite state automata includes states that represent language data types, and transitions between the states, and wherein each of the transitions includes a symbol of the alphabet of the minimized deterministic finite state automata;
- augmenting the minimized deterministic finite state automata with additional transitions that include secondary symbols, wherein each of the secondary symbols corresponds in the programming language to a subtype of one of the data types; and
- using the augmented minimized deterministic finite state automata with automata algorithms that include intersection and union, to compare and determine if data types and subtypes in the document are one of equal, subsets or intersect.
2. A method according to claim 1, the computer language document is an XML document and the programming language is XQuery.
3. A method according to claim 1, wherein: said data type is created from atomic data types.
4. A method according to claim 1, wherein: said data type is created from atomic data types using sequences, alteration, shuffle-product and occurrences.
5. A method according to claim 1, wherein: said data type comprises wildcards in the name.
6. A method according to claim 1, wherein: said data type comprises nested data type.
7. A method according to claim 1, wherein: said computer language is a programming language.
8. A method according to claim 1, wherein: said computer language is a data processing language.
9. A computer system for implementation of a computer language data type system, comprising:
- a computer processor and computer readable storage medium having stored thereon a computer language document using a programming language, wherein the programming language defines a plurality of complex data types and subtypes that specify whether data elements within the document are in a correct form;
- a first language processor component that creates a minimized deterministic finite state automata for use with the data types and subtypes used by the programming language, wherein the finite state automata includes states that represent language data types, and transitions between the states, and wherein each of the transitions includes a symbol of the alphabet of the minimized deterministic finite state automata;
- a second language processor component that augments the minimized deterministic finite state automata with additional transitions that include secondary symbols, wherein each secondary symbol corresponds in the programming language to a subtype of one of the data types;
- a third language processor component that uses the augmented minimized deterministic finite state automata with automata algorithms that include intersection and union, to compare and determine if data types and subtypes in the document are one of equal, subsets or intersect; and
- wherein the first, second and third language processing component can be one the same language processing component and different language processing components.
10. A system according to claim 9, the computer language document is an XML document and the programming language is XQuery.
11. A system according to claim 9, wherein: said data type is created from atomic data types.
12. A system according to claim 9, wherein: said data type is created from atomic data types using sequences, alteration, shuffle-product and occurrences.
13. A system according to claim 9, wherein: said data type comprises wildcards in the name.
14. A system according to claim 9, wherein: said data type comprises nested data type.
15. A system according to claim 9, wherein: said computer language is a programming language.
16. A system according to claim 9, wherein: said computer language is a data processing language.
17. A method for implementing a data type system for use with XQuery, comprising the steps of:
- accessing an XML document using XQuery, wherein the XML comprises data elements and wherein XQuery defines a plurality of complex data types and subtypes;
- creating a minimized deterministic finite state automata for use with the XQuery data types and subtypes, wherein the finite state automata includes states that represent XQuery data types, and transitions between the states;
- augmenting the minimized deterministic finite state automata with additional transitions corresponds to XQuery subtypes; and
- using the augmented minimized deterministic finite state automata to compare and determine if data types and subtypes in the XML document are one of equal, subsets or intersect.
18. The method of claim 17, wherein at least some of the plurality of complex data types and subtypes are comprised of atomic data types.
19. The method of claim 18 wherein the atomic data types can include string and integer data types.
20. The method of claim 19 wherein at least some of the plurality of complex data types are a combination of both string and integer atomic data types.
21. The method of claim 17 wherein the step of using the augmented minimized deterministic finite state automata to compare and determine if data types and subtypes in the XML document are one of equal, subsets or intersect includes the steps of determining an intersection between the first data type or subtype and the complement of the second data type or subtype.
22. The method of claim 17 wherein the step of using the augmented minimized deterministic finite state automata to compare and determine if data types and subtypes in the XML document are one of equal, subsets or intersect includes the steps of determining the complement of the union between the complement of the first data type or subtype and the second data type or subtype.
23. The method of claim 17 wherein each of the data elements in the document have a type, namespace and data associated therewith and wherein the method comprises determining namesets of the data elements and using nameset information in comparing the data types and subtypes.
24. The method of claim 17 wherein at least some of the data elements have a prefix and a local name component, and wherein the method further comprises comparing two data elements by comparing their prefix and local name components.
Type: Grant
Filed: Dec 3, 2004
Date of Patent: Mar 24, 2009
Patent Publication Number: 20060010124
Assignee: Bea Systems, Inc. (Redwood Shores, CA)
Inventors: Paul J. Lucas (Mountain View, CA), Daniela D. Florescu (Palo Alto, CA), Fabio Riccardi (Palo Alto, CA)
Primary Examiner: Tuan Q Dam
Assistant Examiner: Hanh T Bui
Attorney: Fliesler Meyer LLP
Application Number: 11/004,462
International Classification: G06F 9/44 (20060101);