COMPRESSING XML DOCUMENTS USING STATISTICAL TREES GENERATED FROM THOSE DOCUMENTS
Compressing data from a markup language document such as an XML document includes the steps of creating from the document a path based statistical tree built according to a given set of rules, and compressing the document by using the statistical tree. In an embodiment, the statistical tree includes a multitude of paths, and a single bit represents each of said paths. Also, the document may include both enumerated data and non-enumerated data, and the enumerated data is compressed by using the statistical tree. In an embodiment, the document includes a multitude of document nodes, and the step of creating the path based statistical tree includes the step of forming said tree with a multitude of tree nodes, each of the tree nodes representing one of the document nodes.
XML (eXtensible Markup Language) is a standard for creating markup languages, which allow the description of different types of data and simplify sharing of structured information. XML is used as a standard for documents sent over the Internet and in other multimedia fields.
XML documents tend to be quite large compared to other forms of data representation, and this has been a cause for concern among people who want to use XML for data representation. Some existing techniques restructure the document and then run a regular text compressor, e.g., gzip, on the result to obtain compression.
SUMMARY OF THE INVENTION
Embodiments of the invention provide for compressing data from a markup language document such as an XML document. An embodiment of a method includes creating from the document a path based statistical tree built according to a given set of rules, and compressing the document by using the statistical tree. In an embodiment, the statistical tree includes a multitude of paths, and a single bit represents each of said paths. Also, the document may include both enumerated data and non-enumerated data, and the enumerated data is compressed by using the statistical tree. In this embodiment, the path based statistical tree is formed with a multitude of tree nodes, each of which represents one of the document nodes. More specifically, the tree is started by identifying a root node of the document and creating a tree node denoting that root node.
An embodiment of the invention, described in detail below, is based on the principle of using the XML Document directly to create a path based tree built on a certain set of rules and using this tree to encode the XML body and the enumerated data in the XML document. A distinguishing feature of this tree is that each path is represented by a single bit, i.e., 0 or 1, which helps to represent the whole XML document structure in as little as a few bytes of data. The only data that is not considered in this compression is the data that is defined by the user, i.e., non-enumerated data, which can be compressed using normal text-based compression techniques.
An important aspect of this approach is that the compression of the structure is independent of the compression used for the data and the compression achieved is very high. If the XML document contains only enumerated data, then a separate compression process for the data will not be required since this compression will take care of this completely. This technique also makes querying for data in the partially compressed XML document easier.
In the basic architecture of an embodiment of the invention, the XML document which needs to be compressed is passed to a Statistical Tree Generator which processes the XML document and builds the Statistical Trees from the information read from the XML document. A structural compressor uses this tree to compress the XML records retrieved by a parser and produces the compressed structure and enumerated data along with the uncompressed data, which can be used as an input to the next level of compression or can be stored directly.
An approach for statistical XML structural compression involves four parts: Statistical tree generation directly from the XML Document; Optimizing the generated tree so that the elements that occur most often in the XML document get the shortest bit trace; Encoding the XML Document using the Statistical Trees generated; and Decoding the Encoded XML Document using the same Statistical Trees generated during Encoding.
The eXtensible Markup Language (XML) is a standard for creating markup languages. Languages based on XML can describe different types of data in addition to text data and can simplify sharing of structured information on, for example, the Internet. A program without knowledge of the language itself may process documents written in an XML-based language. XML and other data description languages allow software developers to specify fundamental language syntax by defining a document type definition (DTD) that specifies constraints on the document structure. A typical DTD employed for interpretation of an XML document specifies allowable XML elements, attributes, and allowable attribute values. Alternatively, an XML schema may be defined.
An XML file is a text file, which must conform to various XML syntax rules. Particularly, an XML document must include a declaration that declares an identifier, which specifies the document as an XML-compliant file. A declaration can be considered as a definition. Identifiers may be declared. In XML, several things are said to be declared, e.g., namespaces, data types, and version. For example, the XML declaration may identify an XML version and may specify a character-encoding format. XML encoding generally defaults to 8-bit Unicode Transformation Format (UTF-8), using the following declaration:
<?xml version="1.0" encoding="UTF-8" ?>
In XML terms, this is technically a so-called “processing instruction,” that in effect says, “This is an XML document that conforms to XML specification version 1.0 and is encoded in a character set called UTF-8.”
An XML-compliant document comprises a single root element, and elements containing data entries must be delineated with both a start tag, e.g., “<element_a>”, and an end tag, e.g., “</element_a>”. Additionally, attribute values are delineated with quotations, and nested (but not overlapping) tags are permissible.
This disclosure describes compressing data from a markup language document such as an XML document. According to an embodiment, a method comprises creating from the document a path based statistical tree built according to a given set of rules, and compressing the document by using the statistical tree.
The approach for statistical XML structural compression involves four parts: Statistical tree generation directly from the XML Document; Optimizing the generated tree so that the elements that occur most often in the XML document get the shortest bit trace; Encoding the XML Document using the Statistical Trees generated; and Decoding the Encoded XML Document using the same Statistical Trees generated during Encoding. Each of these parts is discussed below in detail.
Statistical Tree Generation
This involves generating a statistical tree using the information available in the XML document. The information available in the XML can be represented in the form of a tree. This tree will describe the paths in the XML document using a single bit, i.e., 0 or 1, and will help represent the whole XML structure using a few bytes by defining the paths to each of the elements. The process of building the tree will be described below using the sample XML given in
A statistical tree can be generated from the XML document if the document satisfies basic XML properties; in particular, the XML in use should be a well-formed XML document.
Every parent node which is an element may have at least one child node that is a leaf node, and whenever there are multiple children that need to be accommodated into a particular node, that node will be branched using the sub-tree shown at 30 in
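As an illustration of the well-formedness precondition, the following is a minimal Python sketch that uses the standard library parser to reject documents that are not well formed before any tree is built. The function name is_well_formed is hypothetical and is not part of the described embodiment.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Unbalanced tags, overlapping tags, missing root element, etc.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<a><b/></a>"))    # properly nested tags
    print(is_well_formed("<a><b></a>"))     # <b> is never closed
```

A document that fails this check would not be passed on to the Statistical Tree Generator in such a sketch.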
Every node in the XML tree has a few intrinsic characteristic properties (e.g., Namespace definition, Namespace adherence, etc.), which may be tagged with the Node, so that these characteristic properties can be carried along with the node.
Further, such trees are built for each Genuine Node, which has child nodes other than the Leaf Node, and these individual Statistical Trees are referred to as “Node Statistical Trees”. These Node trees will be uniquely identifiable using their names just as in the case of normal XML Dom Structure. Further, any tag/element/attribute that has a reference to some other tag/element/attribute is considered to be of the same type as its reference, and its properties will be a union of its and its reference's properties.
Every XML document would start with a “Root Node”, so the first Genuine Node encountered will be the Root Node. In the XML Snippet shown in
The CommonBaseEvents node has three attributes and one child element. As this Node contains children other than the Leaf nodes, a placeholder needs to be provided for the children under the Parent CommonBaseEvent. This is done by splitting the node identified by (1) under CommonBaseEvents genuine node as shown at 60 in
When generating the statistical tree, it may be useful to represent data that would generally be present in the XML document. When traversing the statistical trees, data will need to be retrieved from the data source. This may be represented in the Node Statistical Tree as shown at 70 in
On modifying the figure for CommonBaseEvents, the statistical tree shown in
As the xmlns has only text data associated with it, this is considered as one of the leaf nodes of the CommonBaseEvents node, so the data is picked up and sent for data processing, and the node information of (1) is updated with attribute xmlns properties, as shown at 91 in
In the above Statistical Tree for CommonBaseEvents, CommonBaseEvent is a child element, which has one or more children. So, as shown in
An “Element Counter” is initialized when a node is first added to the Statistical Tree. The purpose of this counter is to keep track of the number of times the same node appears in the XML Document being processed. Every time a node is identified, a look up is done in the already built Statistical Trees to check if this node has already appeared and has been included in the Statistical Tree of its Parent node. If yes, then the Element counter of the node identified is incremented by 1.
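The Element Counter bookkeeping can be sketched in Python as follows. This is a simplified illustration only: it keeps one counter table per parent element name (standing in for that element's Node Statistical Tree), the function name count_children is hypothetical, and the sample document and its attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def count_children(xml_text):
    """Map each parent element name to a Counter of its attribute and child names."""
    root = ET.fromstring(xml_text)
    counters = defaultdict(Counter)
    for parent in root.iter():          # the root element and every descendant
        for attr_name in parent.attrib:
            # First sighting creates the entry with count 1; later sightings add 1.
            counters[parent.tag][attr_name] += 1
        for child in parent:
            counters[parent.tag][child.tag] += 1
    return counters


if __name__ == "__main__":
    sample = (
        '<CommonBaseEvents version="1">'
        '<CommonBaseEvent severity="10"/>'
        '<CommonBaseEvent severity="20"/>'
        "</CommonBaseEvents>"
    )
    for parent_name, counter in count_children(sample).items():
        print(parent_name, dict(counter))
```

Note that the standard library parser consumes namespace declarations such as xmlns rather than reporting them as attributes, so a fuller implementation would need a parser that exposes them.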
Once the Statistical trees are built, Optimization of the Statistical trees is done. Optimization of the Statistical Tree involves rearranging of the nodes in each Statistical tree based on the value in individual “Element Counter”.
The nodes in a Statistical Tree may be arranged in a decreasing order of the value in individual Element counter, i.e., the node with highest Element counter value would come to the top of the tree and the node with the least Element counter value would go to the bottom of the tree. This Optimization step may ensure that the nodes that occur more times in an XML Document would get the least bit trace in the tree. If the Element counter for two or more nodes is same, then the nodes that have same Element counter value are, for example, rearranged in their alphabetical order.
- Element counter=1 for xmlns node 111,
- Element counter=1 for xmlns:xsi node 112,
- Element counter=1 for xsi:schemaLocation node 113, and
- Element counter=2 for CommonBaseEvent node 114.
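The ordering rule described above (decreasing Element counter, with ties broken alphabetically) can be sketched in Python using the four counter values just listed. The function name optimize_order is hypothetical, and the tie-break uses Python's default string comparison, which is an assumption about what "alphabetical" means for names that mix case or punctuation.

```python
def optimize_order(counts):
    """Return node names sorted by decreasing count, then by name for ties."""
    return sorted(counts, key=lambda name: (-counts[name], name))


if __name__ == "__main__":
    counts = {
        "xmlns": 1,
        "xmlns:xsi": 1,
        "xsi:schemaLocation": 1,
        "CommonBaseEvent": 2,
    }
    print(optimize_order(counts))
    # ['CommonBaseEvent', 'xmlns', 'xmlns:xsi', 'xsi:schemaLocation']
```

CommonBaseEvent moves to the top because it has the highest counter, and the three attributes with equal counters keep their alphabetical order.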
After the Statistical Tree is optimized based on the counts for each node in the tree, then, for every Statistical Tree formed, the nodes denoted by the active node and its peer are collapsed with their parent, so that the peer's child is directly the child of its parent node. For example, the last node xsi:schemaLocation and the active node represented by (0) are considered to be the peer nodes under observation. These two nodes branch from the node denoted by (0), which is a peer of xmlns:xsi. Because the peer (0) node of xmlns:xsi is represented by two nodes, these two nodes can be merged into a single node representing xsi:schemaLocation as a peer to xmlns:xsi instead of node (0). This is depicted at 121 in
Building the Statistical Tree from the XML Document
Step 135 is to check if the Parent node Statistical tree of the XML fragment under observation already has any child node that represents the XML fragment obtained in step 134. If so, then, at step 136, the count associated with that node in the Parent Statistical Tree is incremented by 1. However, if at step 135, the Parent node Statistical Tree does not have a child node that represents the XML fragment, then, at step 137, the current active node of the Parent Statistical Tree is split into two nodes, i.e., two child nodes (0 and 1) are added and the child labeled “1” is set as the active node. Also, at step 137, a node representing this XML Fragment is added as a child of the node labeled “1”, and the node labeled “0” is then set as the Active node.
Step 138 is to check if the XML fragment is an Element. If it is, then step 139 is to check if there is a Statistical Tree already created for this Element, i.e., a separate Statistical Tree with the Element under observation as the root node of the Statistical Tree. If it is not present, then a new Statistical Tree is built with this element as the Root Node of the Statistical Tree, as shown in step 131. The attributes, child elements and text data of this element are processed, and the XML fragments which directly come under this element are appended as nodes to the Statistical tree of this element, starting from step 134 recursively.
Steps 134-139 are repeated until the children of the current root node are processed. When this is done, the generated Statistical Trees are optimized by following the Method for Optimization discussed below.
Optimizing the Statistical Trees Built Using the Above Method:
For each of the Statistical Trees generated, the following method, shown in the flowchart of
Specifically, step 141 is to reorder the children based on the number of occurrences (the count associated with individual nodes in the Statistical Trees built), i.e., arrange the nodes in decreasing order of their count values, so that the children that occur more often encode to a shorter binary code (the nodes with higher count values move closer to the root node of the Statistical Tree). If the count is the same for two or more nodes, then, at step 142, the nodes are ordered alphabetically.
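The check, increment and split logic of steps 135-137 can be sketched in Python as below. This is a simplified model under stated assumptions: the tree is stored as bit paths (strings of "0" and "1") rather than as linked node objects, the root is treated as the initial active node, and the class name StatisticalTree with its attributes is hypothetical. It covers a single Node Statistical Tree and omits the recursion into child elements (steps 138-139).

```python
class StatisticalTree:
    """Simplified model of one Node Statistical Tree (hypothetical class)."""

    def __init__(self, root_name):
        self.root_name = root_name
        self.counts = {}        # fragment name -> element counter
        self.paths = {}         # fragment name -> bit path from the root
        self.active_path = ""   # bit path of the current active node

    def add_fragment(self, name):
        if name in self.counts:
            # Steps 135-136: fragment already represented, so bump its counter.
            self.counts[name] += 1
            return
        # Step 137: split the active node into children "0" and "1",
        # hang the new fragment under "1", and make "0" the new active node.
        self.paths[name] = self.active_path + "1"
        self.counts[name] = 1
        self.active_path += "0"


if __name__ == "__main__":
    tree = StatisticalTree("CommonBaseEvents")
    for fragment in ["xmlns", "xmlns:xsi", "xsi:schemaLocation",
                     "CommonBaseEvent", "CommonBaseEvent"]:
        tree.add_fragment(fragment)
    print(tree.paths)
    print(tree.counts)
```

In this model each new fragment lands one level deeper than the previous one, which is why the later optimization step reorders the nodes so that frequent fragments sit nearest the root.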
After this, at step 143, for each Statistical Tree formed, the nodes denoted by the active node and its peer are collapsed with its parent, so that the peer's child is directly the child of its parent node.
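One way to read the collapse step is that, once the children are ordered, every child except the last sits under a "1" branch at increasing depth, and the last child takes the place of the final active "0" node. The Python sketch below assigns bit codes under that reading. The function name assign_codes is hypothetical, the string representation of codes is an assumption, and keeping the code "1" for a tree with a single child is a choice made here for illustration, not something the description specifies.

```python
def assign_codes(ordered_names):
    """Assign a bit code to each child, collapsing the last child into the active node."""
    codes = {}
    last = len(ordered_names) - 1
    for depth, name in enumerate(ordered_names):
        if depth == last and last > 0:
            # Collapsed: the last child replaces the trailing active "0" node.
            codes[name] = "0" * depth
        else:
            codes[name] = "0" * depth + "1"
    return codes


if __name__ == "__main__":
    order = ["CommonBaseEvent", "xmlns", "xmlns:xsi", "xsi:schemaLocation"]
    for name, code in assign_codes(order).items():
        print(name, code)
```

With the ordering used in the example, xsi:schemaLocation ends up as the direct peer of xmlns:xsi, and no code is a prefix of another, so a decoder can walk the bits without separators.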
Encoding and Decoding
Any suitable procedures may be used for Encoding and Decoding of the XML document. As the Statistical Trees are generated directly from the XML Document, and as the same Statistical Trees are needed to Decode the XML Document back, the generated Statistical Trees may, in this embodiment, be serialized together with the encoded bits of the XML Document.
At the decoding end, the Statistical Trees are obtained by first de-serializing the corresponding part of the encoded stream, and these Statistical Trees are then used to Decode the XML Document back.
The Statistical Trees could also be compressed using any compression technology and then serialized with the encoded bits. At the decoding end, the Statistical Tree part of the bit stream is extracted first, and the decompression technique corresponding to the compression used on the encoding side is run on it to get back the Statistical Trees.
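As one possible illustration of serializing compressed Statistical Trees together with the encoded bits, the Python sketch below uses a length-prefixed layout. Everything about the layout is an assumption made for the example: the trees are represented as a JSON-serializable dictionary, compressed with zlib, and preceded by a 4-byte big-endian length; the function names pack and unpack are hypothetical.

```python
import json
import struct
import zlib


def pack(trees, encoded_bits):
    """Prefix the encoded payload with a compressed, length-tagged tree header."""
    header = zlib.compress(json.dumps(trees, sort_keys=True).encode("utf-8"))
    return struct.pack(">I", len(header)) + header + encoded_bits


def unpack(blob):
    """Recover the trees first, then return them with the remaining payload."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = blob[4:4 + header_len]
    trees = json.loads(zlib.decompress(header).decode("utf-8"))
    return trees, blob[4 + header_len:]


if __name__ == "__main__":
    trees = {"CommonBaseEvents": {"CommonBaseEvent": "1", "xmlns": "01"}}
    payload = bytes([0b10100000])   # placeholder bytes standing in for encoded bits
    restored_trees, restored_payload = unpack(pack(trees, payload))
    print(restored_trees == trees, restored_payload == payload)
```

The decoder reads the length, decompresses the header to recover the trees, and treats the remaining bytes as the encoded document, mirroring the order of operations described above.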
Embodiments of the invention may be generally implemented by a computer executing a sequence of program instructions for carrying out the steps of the method and may be embodied in a computer program product comprising media storing the program instructions. For example,
Although not required, embodiments of the invention can be implemented via an application-programming interface (API), for use by a developer, and/or included within the network browsing software, which will be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers, or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. Moreover, those skilled in the art will appreciate that embodiments of the invention may be practiced with other computer system configurations.
Other well known computing systems, environments, and/or configurations that may be suitable for use with embodiments of the invention include, but are not limited to, personal computers (PCs), server computers, hand-held or laptop devices, multi-processor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 210 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 210 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 210.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 230 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 231 and random access memory (RAM) 232. A basic input/output system 233 (BIOS), containing the basic routines that help to transfer information between elements within computer 210, such as during start-up, is typically stored in ROM 231. RAM 232 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 220. By way of example, and not limitation,
The computer 210 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 210 through input devices such as a keyboard 262 and pointing device 261, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 220 through a user input interface 260 that is coupled to the system bus 221, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB).
A graphics interface 282, such as Northbridge, may also be connected to the system bus 221. Northbridge is a chipset that communicates with the CPU, or host-processing unit 220, and assumes responsibility for accelerated graphics port (AGP) communications. One or more graphics processing units (GPUs) 284 may communicate with graphics interface 282. In this regard, GPUs 284 generally include on-chip memory storage, such as register storage, and GPUs 284 communicate with a video memory 286. GPUs 284, however, are but one example of a coprocessor and thus a variety of co-processing devices may be included in computer 210. A monitor 291 or other type of display device is also connected to the system bus 221 via an interface, such as a video interface 290, which may in turn communicate with video memory 286. In addition to monitor 291, computers may also include other peripheral output devices such as speakers 297 and printer 296, which may be connected through an output peripheral interface 295.
The computer 210 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 280. The remote computer 280 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 210, although only a memory storage device 281 has been illustrated in
When used in a LAN networking environment, the computer 210 is connected to the LAN 271 through a network interface or adapter 270. When used in a WAN networking environment, the computer 210 typically includes a modem 272 or other means for establishing communications over the WAN 273, such as the Internet. The modem 272, which may be internal or external, may be connected to the system bus 221 via the user input interface 260, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 210, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
One of ordinary skill in the art can appreciate that a computer 210 or other client device can be deployed as part of a computer network. In this regard, embodiments of the invention pertain to any computer system having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units or volumes. Embodiments of the invention may apply to an environment with server computers and client computers deployed in a network environment, having remote or local storage. Embodiments of the invention may also apply to a standalone computing device, having programming language functionality, interpretation and execution capabilities.
As will be readily apparent to those skilled in the art, embodiments of the invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer/server system(s)—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, carries out the respective methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks, could be utilized.
Embodiments may be embodied in a computer program product, which comprises the respective features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
It will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the scope of the disclosure.
Claims
1. A method of compressing data from a markup language document, comprising:
- creating from the document a path based statistical tree built according to a given set of rules; and
- compressing said document by using said statistical tree.
2. The method according to claim 1, wherein said document is an XML document, the statistical tree includes a multitude of paths, and each of said paths is represented by a single bit.
3. The method according to claim 1, wherein the document includes enumerated data and non-enumerated data, and the compressing said document by using said statistical tree includes compressing said enumerated data by using said statistical tree.
4. The method according to claim 1, wherein the document includes a multitude of document nodes, and the creating from said document a path based statistical tree includes forming said tree with a multitude of tree nodes, each of the tree nodes representing one of the document nodes.
5. The method according to claim 4, wherein the forming the statistical tree includes:
- identifying a root node of the document; and
- creating the tree with a node denoting the root node of the document.
6. The method according to claim 5, wherein the forming the statistical tree further includes adding two children nodes to the root node of the tree, and designating one of the child nodes as active.
7. The method according to claim 6, wherein the forming the statistical tree further includes:
- getting a text fragment from the document; and
- checking the statistical tree to determine if the tree already has a node representing said fragment.
8. The method according to claim 7, wherein the forming the statistical tree further includes:
- if the statistical tree already has a node representing said fragment, then incrementing a counter; and
- if the statistical tree does not already have a node representing said fragment, then splitting a currently active node of the tree into two new nodes, and using one of said new nodes to represent said fragment.
9. The method according to claim 4, wherein the compressing the document by using said statistical tree includes:
- optimizing the statistical tree to form an optimized statistical tree, by ordering the tree nodes according to a specified optimization rule; and
- compressing the document by using the optimized statistical tree.
10. The method according to claim 9, wherein the ordering the tree nodes includes ordering the tree nodes based on the frequency of occurrence of the document nodes in the document.
11. A system for compressing data from a markup language document, comprising:
- a statistical tree generator for processing the document and for building a path based statistical tree from information read from the document and according to a given set of rules; and
- a structural compressor for using said statistical tree to compress said document.
12. The system according to claim 11, further comprising a parser for parsing the document into text segments, and for feeding the text segments to the statistical tree generator and to the structural compressor.
13. The system according to claim 11, wherein said document is an XML document, the statistical tree includes a multitude of paths, and each of said paths is represented by a single bit.
14. The system according to claim 11, wherein the document includes a multitude of document nodes, and the path based statistical tree includes a multitude of tree nodes, each of the tree nodes representing one of the document nodes, and wherein the statistical tree generator identifies a root node of the document and creates the statistical tree with a node denoting the root node of the document.
15. The system according to claim 14, wherein:
- the Statistical Tree generator includes an optimizer for optimizing the statistical tree to form an optimized statistical tree, by ordering the tree nodes according to a specified optimization rule; and
- the structural compressor compresses the document by using the optimized statistical tree.
16. An article of manufacture comprising:
- at least one computer usable medium having computer readable program code logic to execute a machine instruction in a processing unit for compressing data from a markup language document, the computer readable program code logic when executing performing the following steps:
- creating from the document a path based statistical tree built according to a given set of rules; and
- compressing said document by using said statistical tree.
17. The article of manufacture according to claim 16, wherein the document includes a multitude of document nodes, and the step of creating the path based statistical tree includes the step of forming said tree with a multitude of tree nodes, each of the tree nodes representing one of the document nodes.
18. The article of manufacture according to claim 17, wherein the step of forming the statistical tree includes the steps of:
- identifying a root node of the document; and
- creating the tree with a node denoting the root node of the document.
19. The article of manufacture according to claim 18, wherein the step of forming the statistical tree includes the further steps of:
- getting a text fragment from the document;
- checking the statistical tree to determine if the tree already has a node representing said fragment;
- if the statistical tree already has a node representing said fragment, then incrementing a counter; and
- if the statistical tree does not already have a node representing said fragment, then splitting a currently active node of the tree into two new nodes, and using one of said new nodes to represent said fragment.
20. The article of manufacture according to claim 16, wherein said document is an XML document, the statistical tree includes a multitude of paths, and each of said paths is represented by a single bit.
Type: Application
Filed: Aug 20, 2008
Publication Date: Feb 25, 2010
Applicant: International Business Machines Corporation (Armonk, NY)
Inventors: Umesh Kumar Balegar (Bangalore), Rohit Shetty (Bangalore)
Application Number: 12/194,599
International Classification: G06F 7/00 (20060101); G06F 17/30 (20060101);