DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES
Methods and systems are described that involve usage of dictionaries for compressing a large set of variable-length string values with fixed-length integer keys in column stores. The dictionary supports updates (e.g., inserts of new string values) without changing codes for existing values. Furthermore, a shared-leaves approach is described for indexing such a dictionary that compresses the dictionary itself while offering access paths for encoding and decoding.
Embodiments of the invention generally relate to the software arts, and more specifically, to data structures that support an order-preserving dictionary compression for string attributes with a large domain size that may change over time.
BACKGROUND
In the field of computing, a database management system (DBMS) is a set of software programs that controls the organization, storage, management, and retrieval of data in a database storage unit. Traditionally, the DBMS is a row-oriented database system; however, there are database systems that are column-oriented. The column-oriented database systems store their content by column rather than by row. This may have advantages for databases where aggregates are computed over large numbers of similar data items. A column-oriented implementation of a DBMS stores the values of a given column in sequence, with the end of one column followed by the beginning of the next column. Column-oriented database systems may be more efficient when an aggregate has to be computed over many rows but only for a smaller subset of all columns of data. This may be so at least because reading that smaller subset of data can be faster than reading all data. Column-oriented database systems may also be more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and can replace old column data without interfering with any other columns for the rows.
SUMMARY
Methods and systems are described that involve data structures that support order-preserving dictionary compression of variable-length string attributes where the domain size is large or not known in advance. In one embodiment, the method includes propagating a plurality of string values to the compressed data of a shared-leaves structure of a dictionary via an encode index. A plurality of order-preserving integer codes is obtained for the plurality of string values via a lookup operation. If a subset of the plurality of integer codes was not found during the obtainment, a subset of the plurality of string values for which the subset of the plurality of integer codes was not found is inserted into the shared-leaves structure. The method also includes generating the subset of the plurality of integer codes for the corresponding subset of the plurality of string values. Finally, a list of the order-preserving plurality of integer codes is provided, wherein the list includes the generated subset of the plurality of integer codes as well.
In one embodiment, the system includes a column-oriented database system and a dictionary-based storage unit specifying a mapping between a plurality of variable-length string values and a plurality of integer codes in the column-oriented database system. Further, the system includes shared-leaves data structures that hold the data of the dictionary-based storage unit in sort order in their leaves. In addition, a processor in communication with the dictionary-based storage unit is included, wherein the processor is operable to encode the plurality of variable-length string values to the plurality of integer codes and decode the plurality of integer codes to the plurality of variable-length string values using the shared-leaves data structures.
These and other benefits and features of the embodiments of the invention will be apparent upon consideration of the following detailed description of preferred embodiments thereof, presented in connection with the following drawings in which like reference numerals are used to identify like elements throughout.
The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
Embodiments of the invention relate to data structures that support an order-preserving dictionary compression for string attributes with a large domain size that is likely to change over time. It is believed that the column-oriented database systems perform better than the traditional row-oriented database systems on analytical workloads. Lightweight compression schemes for column-oriented database systems may enable query processing on compressed data and thus improve query processing performance. Dictionary encoding replaces variable-length attribute values with shorter, fixed-length integer codes. To compress column data in this way, existing column stores usually create a dictionary array of distinct values and then store each attribute value as an index into that array. Dictionaries may be used in column stores if the domain size is small.
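As a minimal sketch of the dictionary-array scheme described above (not the data structure of the embodiments), the following Python fragment builds an array of distinct values and stores each attribute value as an index into that array. Keeping the array sorted is an assumption made here so that the indexes are order-preserving; the function name `dictionary_encode` and the sample strings are illustrative only.

```python
def dictionary_encode(column):
    """Return (dictionary, codes) for a list of string values.

    dictionary: sorted list of the distinct values (assumption: sorted,
                so that index order matches string order).
    codes:      one integer index into `dictionary` per input value.
    """
    dictionary = sorted(set(column))
    position = {value: index for index, value in enumerate(dictionary)}
    codes = [position[value] for value in column]
    return dictionary, codes


column = ["Whole Milk - Quart", "Whole Milk - Gallon", "Whole Milk - Quart"]
dictionary, codes = dictionary_encode(column)

# Decoding is a plain array access per code.
decoded = [dictionary[code] for code in codes]
```

Such an array works well while the domain is small and static; inserting a new value into the middle of the sorted array shifts the indexes of all larger values, which is the limitation the embodiments address.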
Bit packing can be used on top of a dictionary to compress the data further by calculating the minimal number of bits needed to code the maximal index into the dictionary. Bit packing is useful if the size of the domain is stable or known in advance, but in application scenarios the domain size may increase over time. If the domain size is not known in advance, a column store may analyze the first bulk load of data to find out the current domain size of a given attribute and then derive the minimal number of bits (for bit packing). If subsequent bulk loads contain new values, all the previously loaded data can be decoded and then encoded again with the new load using more bits.
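The bit-width calculation can be illustrated with a short Python sketch. It assumes dictionary indexes run from 0 to domain_size - 1; the helper names `bits_needed`, `pack`, and `unpack` are hypothetical and chosen only for this example.

```python
def bits_needed(domain_size):
    """Minimal number of bits that can represent the largest index."""
    if domain_size < 1:
        raise ValueError("domain_size must be at least 1")
    # The largest index is domain_size - 1; use at least one bit.
    return max(1, (domain_size - 1).bit_length())


def pack(codes, width):
    """Pack integer codes into one Python integer, `width` bits per code."""
    packed = 0
    for i, code in enumerate(codes):
        packed |= code << (i * width)
    return packed


def unpack(packed, count, width):
    """Recover `count` codes of `width` bits each from a packed integer."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


codes = [3, 0, 5, 2]
width = bits_needed(6)          # six distinct values, indexes 0..5
restored = unpack(pack(codes, width), len(codes), width)
```

Under these assumptions, a domain of 256 values needs 8 bits per code, while a 257th value raises the width to 9 bits, which is why previously loaded data would have to be decoded and encoded again.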
Column stores often use order-preserving compression schemes to speed up expensive query operations because the operations can then be executed directly on the encoded data. However, such compression schemes generate either variable-length codes that are expensive or fixed-length codes that are difficult to extend. For large dictionaries where the domain size is not known in advance, sorted arrays and fixed-length integer codes for indexes are too expensive.
Further, process 100 includes query compilation 120. To execute analytical queries on encoded data, it is necessary to rewrite the query predicates. A predicate is a phrase template that describes a property of objects or a relationship among objects represented by the variables. For example, string dictionary 125 includes a list of string values: Whole Milk—Gallon, Whole Milk—Quart, etc., wherein “Whole Milk” is the predicate of these strings. Query compilation 120 involves rewriting a string constant in an equality predicate (e.g., p_name=“Whole Milk—Gallon”) or in a range predicate (e.g., p_name≧“Whole Milk—Gallon”) with the corresponding integer code. An order-preserving encoding scheme allows the string constants of equality and range predicates to be replaced by integer codes, and prefix predicates (e.g., p_name=“Whole Milk*”) to be mapped to range predicates. For example, original query 140 is rewritten as query 145 by rewriting the prefix predicate p_name=“Whole Milk*” into the range predicate 32100≧p_name≧32000. In an embodiment, the string dictionary 125 supports lookup operations to rewrite string constants as well as string prefixes. Process 100 also includes query execution 130. During query execution 130, encoded query results 150 are decoded using the dictionary 125. In an embodiment, the string dictionary 155 supports decoding of the encoded query results 150 given as a list of integer codes to generate query results as string values 160. The encoded query results 150 are decoded to a list of non-encoded query results 160 that represents string values 105.
In an embodiment, the string dictionary is a table T with two attributes: T=(value, code). Table T defines a mapping of variable-length string values (defined by the attribute value) to fixed-length integer codes (defined by the attribute code) and vice versa. The dictionary supports the following operations for encoding and decoding string values and for rewriting query predicates: 1) encode: values→codes; 2) decode: codes→values; 3) lookup: (value, type)→code; and 4) lookup: prefix→(mincode, maxcode). The “encode: values→codes” operation is used during data loading 110 to encode the data of a string column (i.e., the values) with the corresponding integer codes (i.e., the codes). This operation includes the lookup of codes for those strings that are already in the dictionary, the insertion of new string values, and the generation of order-preserving codes for these new values. The “decode: codes→values” operation is used during query execution 130 to decode bulk results using the corresponding string values. The “lookup: (value, type)→code” operation is used during query compilation 120 to rewrite a string constant in an equality-predicate (e.g., p_name=“Whole Milk—Gallon”) or in a range-predicate (e.g., p_name≧“Whole Milk—Gallon”) with the corresponding integer code. The parameter “type” specifies whether a dictionary should execute an exact-match lookup or return the integer code for the next smaller string value. The “lookup: prefix→(mincode, maxcode)” operation is used during query compilation 120 to rewrite the prefix of a prefix-predicate (e.g., p_name=“Whole Milk*”) with the corresponding integer range (i.e., the mincode and the maxcode).
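The decode and lookup operations listed above can be sketched over a table T held as two parallel sorted lists. This is an illustrative assumption, not the shared-leaves structure of the embodiments; the class name `SortedDictionary`, the `exact` flag standing in for the “type” parameter, and the sample values and codes are all hypothetical.

```python
from bisect import bisect_left


class SortedDictionary:
    """Table T=(value, code) kept in sort order; codes are order-preserving."""

    def __init__(self, pairs):
        pairs = sorted(pairs)
        self.values = [value for value, _ in pairs]
        self.codes = [code for _, code in pairs]

    def lookup(self, value, exact=True):
        """Return the code of `value`; if exact is False and the value is
        absent, return the code of the next smaller string (or None)."""
        i = bisect_left(self.values, value)
        if i < len(self.values) and self.values[i] == value:
            return self.codes[i]
        if exact or i == 0:
            return None
        return self.codes[i - 1]

    def lookup_prefix(self, prefix):
        """Return (mincode, maxcode) of all values starting with `prefix`."""
        low = bisect_left(self.values, prefix)
        high = low
        while high < len(self.values) and self.values[high].startswith(prefix):
            high += 1
        if low == high:
            return None
        return self.codes[low], self.codes[high - 1]

    def decode(self, codes):
        """Map a list of codes back to their string values."""
        result = []
        for code in codes:
            i = bisect_left(self.codes, code)
            if i == len(self.codes) or self.codes[i] != code:
                raise KeyError(code)
            result.append(self.values[i])
        return result


t = SortedDictionary([
    ("Skim Milk - Quart", 31000),
    ("Whole Milk - Gallon", 32000),
    ("Whole Milk - Quart", 32100),
    ("Yogurt", 33000),
])
code_range = t.lookup_prefix("Whole Milk")   # (32000, 32100)
```

Because both lists follow the same sort order, one binary search serves either direction; a prefix predicate becomes a range predicate over the returned pair of codes.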
Since the string dictionary uses an order-preserving encoding scheme, the string values and the integer codes in table T follow the same sorting order. As both attribute values of table T can be kept in sorting order inside the leaves, the leaves can provide access paths for both lookup directions (i.e., for the encoding and decoding) using a standard search method for sorted data (e.g., binary search or interpolation search). Moreover, as is the case for direct indexes, using the shared-leaves for indexing the dictionary means that table T does not have to be kept explicitly in main memory because the leaves hold all the data of table T.
In an embodiment, the shared leaves also support rewriting predicates inside a dictionary. For rewriting equality and range predicates, the encode index propagates the string values to the corresponding leaves and a search operation on the leaves returns the integer codes. For rewriting prefix predicates, the encode index propagates the prefix to the leaves containing the minimal and maximal string values for the prefix; the leaves map the minimal and maximal strings to the integer codes.
The data structures of the dictionary (i.e., leaves and indexes) are optimized for encoding or decoding bulks and are cache-aware. The operations, encoding and decoding, are easy to parallelize. The leaf structure differs from the index structure. All structures, leaf structures and index structures, reside in memory. The leaf structure holds the string values and the integer codes in sorting order. A leaf supports the encoding of variable-length string values and supports efficient bulk loads and bulk updates. The indexes for encoding and decoding keep the keys in sort order for efficient lookup over the sorted leaves. The encode index provides propagation of string constants and string prefixes to the leaves.
In addition to the lookup of the integer codes for string values that are already a part of the dictionary, it might be necessary to insert new string values into the dictionary (e.g., update the leaves as well as both indexes for encoding and decoding) and generate new order-preserving codes for these values. In an embodiment, the lookup and insert operations are combined into one operation. Two strategies can support this approach: the all-bulked strategy, which updates the encode and decode indexes after generation of any new integer codes, and the hybrid strategy, which updates the encode index during propagation of the string values.
During leaf structure data loading, the string values are first compressed and then written into the leaf together with their codes in a forward way (e.g., starting from memory position 0 and incrementing the position for each new entry). To enable searching for string values inside a leaf, each n-th string (e.g., each third string) is stored in an uncompressed way and the positions of the uncompressed strings are saved as anchors in an offset vector at the end of the leaf (to be found during search). However, when loading data into a leaf, the exact size of the compressed string values may be unknown before all data is written into the leaf. Thus, the offset vector 505 may be stored from the last memory position in a reverse way by decrementing the position for each new anchor.
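The forward-written data and the reverse-written offset vector can be illustrated with the following Python sketch. The byte layout is an assumption made for this example only: each entry is a one-byte shared-prefix length, a one-byte suffix length, a four-byte code, and the suffix bytes, so strings longer than 255 bytes are not handled. The names `build_leaf`, `read_leaf`, and `anchors` are hypothetical.

```python
import struct

ENTRY_HEADER = struct.Struct("<BBI")  # shared-prefix length, suffix length, code
OFFSET = struct.Struct("<I")          # position of one uncompressed (anchor) entry


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_leaf(sorted_pairs, leaf_size=128, anchor_every=3):
    """Write entries forward from position 0 and anchor offsets backward
    from the end of a fixed-size leaf. Returns (leaf, head, tail)."""
    leaf = bytearray(leaf_size)
    head, tail = 0, leaf_size
    prev = b""
    for i, (value, code) in enumerate(sorted_pairs):
        raw = value.encode("utf-8")
        is_anchor = i % anchor_every == 0
        # Anchor entries are stored uncompressed (no shared prefix).
        shared = 0 if is_anchor else common_prefix_len(prev, raw)
        suffix = raw[shared:]
        entry = ENTRY_HEADER.pack(shared, len(suffix), code) + suffix
        needed = len(entry) + (OFFSET.size if is_anchor else 0)
        if head + needed > tail:
            raise ValueError("leaf is full")
        if is_anchor:
            tail -= OFFSET.size
            OFFSET.pack_into(leaf, tail, head)
        leaf[head:head + len(entry)] = entry
        head += len(entry)
        prev = raw
    return leaf, head, tail


def read_leaf(leaf, head):
    """Decompress all (value, code) pairs stored before position `head`."""
    pairs, pos, prev = [], 0, b""
    while pos < head:
        shared, length, code = ENTRY_HEADER.unpack_from(leaf, pos)
        start = pos + ENTRY_HEADER.size
        raw = prev[:shared] + bytes(leaf[start:start + length])
        pairs.append((raw.decode("utf-8"), code))
        pos, prev = start + length, raw
    return pairs


def anchors(leaf, tail):
    """Read the offset vector, first anchor first (it sits at the very end)."""
    return [OFFSET.unpack_from(leaf, off)[0]
            for off in range(len(leaf) - OFFSET.size, tail - 1, -OFFSET.size)]
```

Writing the two regions toward each other means neither size has to be known in advance; the leaf is full when `head` would cross `tail`.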
For bulk lookup, the leaf structure supports one lookup operation to look up the integer code for a given string value and another to look up the string value for a given code. To look up the code for a given string value, an algorithm may be performed for sequential search over the incrementally encoded values that does not require decompression of the leaf data. In an embodiment, the algorithm may be as described with reference to table 1 below.
The values are incrementally decompressed during the sequential search when looking up a value for a given code. For bulk lookup, the lookup probe is sorted to reduce the search overhead. For the initial bulk load of a leaf with a list of string values, the string values are sorted first. Then, the leaf data is written sequentially from the beginning of the leaf and the offset vector is written in reverse order from the end. If the string values do not occupy all the memory allocated for them, the offset vector is moved forward and the unused memory released. For bulk update, a list of new string values is first sorted and then inserted into an existing leaf. Then, a sort merge of the new string values and the existing leaf is performed to create a new leaf.
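One common way to search incrementally encoded values without decompressing them is to track how many characters of the search key are already known to match the previous entry. The sketch below shows this general technique under simplifying assumptions (distinct sorted values, entries held as Python tuples, no anchors); it is offered as an illustration and is not asserted to be the algorithm of the embodiments. The names `front_code` and `find_code` are hypothetical.

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_values):
    """Encode each value as (shared_prefix_len, suffix) relative to the
    value before it."""
    entries, prev = [], ""
    for value in sorted_values:
        shared = common_prefix_len(prev, value)
        entries.append((shared, value[shared:]))
        prev = value
    return entries


def find_code(entries, codes, key):
    """Return the code stored for `key`, or None, comparing only suffixes."""
    matched = 0  # leading characters of key that equal the previous entry
    for i, (shared, suffix) in enumerate(entries):
        if shared > matched:
            continue        # entry repeats the mismatch: still smaller than key
        if shared < matched:
            return None     # entry diverges earlier: already larger than key
        rest = key[matched:]
        k = common_prefix_len(suffix, rest)
        if k == len(suffix) and k == len(rest):
            return codes[i]                       # exact match
        if k < len(suffix) and (k == len(rest) or suffix[k] > rest[k]):
            return None                           # entry is larger than key
        matched += k                              # entry is smaller; go on
    return None


values = ["apple", "applet", "apply", "banana", "band"]
codes = [10, 20, 30, 40, 50]
entries = front_code(values)
```

In a leaf with anchors, a binary search over the uncompressed anchor strings would first narrow the range, and a scan of this kind would then cover only the entries between two anchors.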
In an embodiment, a cache-conscious index structure may be used on top of the leaf structure for encoding and decoding. These indexes support the all-bulked 401 and hybrid 402 strategies. For the encoding index, a cache-sensitive (CS) version of the Patricia trie, the CS array trie, that supports the hybrid strategy 402 is defined. The Patricia trie, or radix tree, is a specialized set data structure based on the trie (a prefix tree, that is, an ordered tree data structure used to store an associative array where the keys are usually strings) that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labeled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. In addition, a new cache-sensitive version of the prefix B-tree (a tree data structure that keeps data sorted and is optimized for systems that read and write large bulks of data), the CS prefix tree, is defined to support the all-bulked 401 update strategy. As the decoding index, a CS search (CSS) tree may be used. The CSS tree may be created over the leaves of the dictionary using the minimal integer codes of each leaf as keys of the index. The CSS tree can be bulk loaded efficiently bottom-up from the leaves of the dictionary. A CS array trie may be used as an encode index to propagate string lookup probes and updates to the leaves. The CS array trie uses read-optimized cache-aware data structures for the index nodes and does not decompose the strings completely.
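The role of the decode index, routing an integer code to the leaf that holds it by using each leaf's minimal code as a key, can be sketched as follows. A flat sorted list with binary search stands in for the CSS tree here, which is a deliberate simplification; the leaf contents and the function name `decode` are illustrative assumptions.

```python
from bisect import bisect_right

# Each leaf holds (code, value) pairs in sort order.
leaves = [
    [(10, "apple"), (20, "apply")],
    [(30, "banana"), (40, "band")],
    [(50, "cherry")],
]

# Bulk load the index bottom-up: one key per leaf, its minimal code.
min_codes = [leaf[0][0] for leaf in leaves]


def decode(code):
    """Find the leaf whose minimal code is the largest one not exceeding
    `code`, then search that leaf for the string value."""
    leaf_number = bisect_right(min_codes, code) - 1
    if leaf_number < 0:
        raise KeyError(code)
    for leaf_code, value in leaves[leaf_number]:
        if leaf_code == code:
            return value
    raise KeyError(code)
```

A CSS tree would replace the flat `min_codes` list with a cache-line-sized node hierarchy, but the keys and the routing decision are the same.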
In an embodiment, the CS array trie may be used to implement the hybrid update strategy 402 for bulk encoding of string values during data loading 110, as shown at 630. The string values 640 are propagated (in preorder) to the leaves using variable buffers (e.g., buffers 645, 650, and 650) at each trie node to increase cache locality for lookup. Using buffers at each node, the array of characters stored at a node grows only once per bulk. This reduces cache misses. To estimate the expected leaf size, the uncompressed size of all new strings in a buffer page, together with the size of their new codes (without eliminating duplicates), can be added to the current leaf size. When all string values 640 are propagated to their leaves, new integer codes 660 are generated for the new string values by analyzing the number of strings inserted between existing string values 640.
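Generating codes by counting the new strings that fall between two existing values can be sketched as below. The even spacing of new codes within each gap, the exclusive upper bound `max_code`, and the function name `assign_codes` are assumptions of this example; the source does not specify how codes are spaced.

```python
def assign_codes(existing, new_values, max_code=2**32):
    """Merge new strings into a sorted list of (value, code) pairs.

    Existing codes are never changed. Each run of new values between two
    existing neighbors receives evenly spaced codes from the free gap.
    """
    known = {value for value, _ in existing}
    pending = sorted(set(new_values) - known)
    result, run = [], []
    low = -1  # code just below the smallest usable code
    i = 0
    # A sentinel entry supplies the upper bound for the last gap.
    for value, code in existing + [(None, max_code)]:
        while i < len(pending) and (value is None or pending[i] < value):
            run.append(pending[i])
            i += 1
        if run:
            step = (code - low) // (len(run) + 1)
            if step == 0:
                raise ValueError("gap exhausted; re-encoding would be needed")
            for j, new_value in enumerate(run, start=1):
                result.append((new_value, low + j * step))
            run = []
        if value is not None:
            result.append((value, code))
        low = code
    return result


merged = assign_codes([("b", 100), ("d", 200)], ["a", "c", "cc", "e", "d"],
                      max_code=1000)
```

Because each gap is handled independently, runs that fall into different leaves could be processed without coordinating with each other.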
The CS array trie supports efficient predicate rewrite. For equality and range predicates, the constants are propagated through the trie without buffering. For prefix predicates, the prefix is used to find the minimal and maximal string values that match it. Propagation of string values from the root of the trie to the leaves is parallelized. New integer codes are generated in parallel, without locking any data structures, by determining which leaves hold contiguous new string values. Lookup of the new string values can also be parallelized without locking any data structures.
The CS prefix tree can only be bulk loaded bottom-up, so it is mainly suitable for the all-bulked update strategy 401. To encode the first bulk load of string values, the string values are used to build the complete leaf level. A CS array trie may be used to partition the string values into buckets sorted using multi-key quick-sort. Then, the sorted string values are used to create and fill in leaves 720 to the maximum leaf size. From these leaves 720, a new encode index is bulk loaded bottom-up.
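The claims characterize the keys of the CS prefix tree as the shortest prefixes that distinguish the largest value of one leaf from the smallest value of the next. A minimal sketch of that computation, and of routing a string with the resulting keys, is given below; a flat list of separators stands in for a tree node, and the names `shortest_separator` and `route` are hypothetical.

```python
from bisect import bisect_right


def shortest_separator(left_max, right_min):
    """Shortest prefix p of right_min such that left_max < p <= right_min."""
    for i in range(len(right_min)):
        if i >= len(left_max) or left_max[i] != right_min[i]:
            return right_min[:i + 1]
    raise ValueError("right_min must sort after left_max")


# Leaf level, filled from sorted string values.
leaves = [["apple", "apply"], ["banana", "band"], ["cherry"]]

# One separator between each pair of adjacent leaves (bottom-up load).
separators = [shortest_separator(left[-1], right[0])
              for left, right in zip(leaves, leaves[1:])]


def route(value):
    """Index of the leaf that holds, or would hold, `value`."""
    return bisect_right(separators, value)
```

With the sample leaves, the separators are the single characters "b" and "c", far shorter than the full strings a plain B-tree would store as keys.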
For subsequent bulk loads, the existing CS prefix tree may be used to propagate the string values to the leaves. The string values are buffered at leaf level and then the existing leaf is sort-merged with the new string values stored in the buffers. If the new string values in the buffers and the values of the existing leaf do not fit into one leaf, another leaf may be created. Query predicates can be rewritten using the CS prefix tree. For an equality predicate or a range predicate, a simple lookup can be performed with the string constants. For a prefix predicate, the prefix can be used to find the minimal string value that matches the prefix. The lookup for the prefix finds a leaf containing the value even if the value is not in the dictionary. From that leaf on, a sequential search can be executed for the maximum string value matching the prefix.
Memory may be allocated in advance in contiguous blocks. The maximum amount of memory that all the tree nodes need can be calculated by setting an arbitrary limit on the maximum length of the keys in the tree. Then, the minimum number of keys that fit in one node is calculated and hence the maximal number of nodes needed to store the data. Using a mix of pointers and offset arithmetic identifies the correct child and thus allows use of multiple blocks of memory. A CS prefix tree may be more expensive to build than a CS array trie because the data is first sorted and then loaded bottom-up. But the CS prefix tree performs better than the CS array trie for lookup workloads.
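The node-count estimate can be worked through with a small calculation. The sketch assumes a simple model that the source does not spell out: fixed-size nodes, every key charged at the maximum key length, and a node with k keys having k + 1 children. The function name `max_index_nodes` and the sample sizes are hypothetical.

```python
def max_index_nodes(num_leaves, node_bytes, max_key_bytes):
    """Upper bound on the number of index nodes above `num_leaves` leaves."""
    min_keys = node_bytes // max_key_bytes   # worst case: all keys at max length
    if min_keys < 1:
        raise ValueError("a node must hold at least one key")
    fanout = min_keys + 1                    # k keys separate k + 1 children
    total, level = 0, num_leaves
    while level > 1:
        level = -(-level // fanout)          # ceiling division
        total += level
    return total


nodes = max_index_nodes(num_leaves=1000, node_bytes=64, max_key_bytes=16)
bytes_to_reserve = nodes * 64
```

Under these assumptions, 1000 leaves with 64-byte nodes and keys capped at 16 bytes give levels of 200, 40, 8, 2, and 1 nodes, so 251 nodes (16,064 bytes) could be reserved in advance.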
In the experiment, a 16 MB leaf structure is compared to two cache-sensitive read-optimized index structures using two different workloads. For encoding the string values, the leaf structure is compared to a compact-chain hash table (i.e., bulk lookup 920). For decoding integer codes, the leaf structure is compared to a CSS tree (i.e., bulk lookup 930). The result shows that the optimal leaf structure size is about 512 KB (medium) and the performance of the leaf structure is comparable to read-optimized index structures yet uses less memory.
Lightweight compression schemes can improve the query processing performance of column-oriented database systems. In one such scheme, a dictionary replaces long (variable-length) values with shorter (fixed-length) integer codes. To improve performance further, column stores can use order-preserving compression schemes. New data structures may be used to support order-preserving dictionary compression for variable-length string attributes with a large domain size that can change over time. A dictionary can be modeled as a table mapping string values to arbitrary integer codes. A new indexing approach may be used for efficient access to such a dictionary using compressed index data. The data structures are at least as fast as other data structures for dictionaries but occupy less memory.
The processor 1210 is capable of processing instructions for execution within the system 1200. The processor is in communication with the main memory store 1220. Further, the processor is operable to execute operations 1280 stored in the main memory 1220, such as data loading 110, query compilation 120, and query execution 130. In one embodiment, the processor 1210 is a single-threaded processor. In another embodiment, the processor 1210 is a multi-threaded processor. The processor 1210 is capable of processing instructions stored either in main memory 1220 or on the storage device 1230, to display graphical information for a user interface on the input/output device 1240.
The main memory 1220 stores information within the system 1200. In one implementation, the main memory 1220 is a machine-readable medium. In an embodiment, the main memory 1220 stores order-preserved compressed data in a column-oriented format. Main memory 1220 stores a dictionary 1260. Dictionary 1260 is used for encoding and decoding of the compressed data, as represented by index 1270. The encode index and decode index contain shared-leaves data structures that hold data in sorted order in their leaves.
The storage device 1230 is capable of providing mass storage for the system 1200. In one implementation, the storage device 1230 is a computer-readable medium. In alternative embodiments, the storage device 1230 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device.
The input/output device 1240 is used to trigger or initiate input/output operations 1280 for the system 1200. In one implementation, the input/output device 1240 includes a keyboard and/or pointing device. In another implementation, input/output device 1240 includes a display unit for displaying graphical user interfaces.
Elements of embodiments may also be provided as a tangible machine-readable medium (e.g., computer-readable medium) for tangibly storing the machine-executable instructions. The tangible machine-readable medium may include, but is not limited to, flash memory, optical disks, CD-ROMs, DVD ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, or other type of machine-readable media suitable for storing electronic instructions. For example, embodiments of the invention may be downloaded as a computer program, which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) via a communication link (e.g., a modem or network connection).
It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the invention.
In the foregoing specification, the invention has been described with reference to the specific embodiments thereof. It will, however, be evident that various modifications and changes can be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims
1. A machine-readable storage medium tangibly storing machine-readable instructions thereon, which when executed by the machine, cause the machine to perform operations comprising:
- propagating a plurality of string values to compressed leaf data of a shared-leaves structure of a dictionary via an encode index;
- obtaining a plurality of order-preserving integer codes for the plurality of string values via a lookup operation;
- if a subset of the plurality of integer codes was not found during the obtaining, inserting a subset of the plurality of string values for which the subset of the plurality of integer codes was not found into the shared-leaves structure;
- generating the subset of the plurality of integer codes for the corresponding subset of the plurality of string values; and
- providing a list of the order-preserving plurality of integer codes, the list including the generated subset of the plurality of integer codes.
2. The machine-readable storage medium of claim 1 wherein the operations further comprise:
- propagating the plurality of integer codes to the shared-leaves structure of the dictionary via a decode index; and
- updating the encode index and the decode index.
3. The machine-readable storage medium of claim 1, wherein the operations further comprise:
- rewriting a string value from the plurality of string values in an equality-predicate or in a range-predicate with a corresponding integer code from the plurality of order-preserving integer codes; and
- rewriting a string prefix of a prefix-predicate with corresponding integer code ranges from the plurality of order-preserving integer codes.
4. The machine-readable storage medium of claim 1, wherein obtaining the plurality of order-preserving integer codes further comprises executing a sequential search operation over the compressed leaf data without decompression of the compressed leaf data.
5. The machine-readable storage medium of claim 1, wherein the encode index comprises a cache-sensitive array trie index or a cache-sensitive prefix tree index.
6. The machine-readable storage medium of claim 5, wherein the cache-sensitive array trie comprises:
- storing the plurality of string values in an array;
- propagating in preorder the plurality of string values to the shared-leaves structure via variable buffers at each cache-sensitive array trie node to populate the array only once per bulk; and
- generating the subset of the plurality of integer codes for the corresponding subset of the plurality of string values in parallel.
7. The machine-readable storage medium of claim 5, wherein the cache-sensitive prefix tree comprises:
- calculating a first shortest prefix to distinguish a largest value of a first leaf and a smallest value of a second leaf of the cache-sensitive prefix tree;
- calculating a second shortest prefix to distinguish the largest value of the second leaf and the smallest value of a third leaf;
- if there is more than one node in a level of the cache-sensitive prefix tree, adding a second level on top with a node storing the calculated first and second prefixes, wherein the node is a root of the cache-sensitive prefix tree.
8. A computer implemented method comprising:
- propagating a plurality of string values to compressed leaf data of a shared-leaves structure of a dictionary via an encode index;
- obtaining a plurality of order-preserving integer codes for the plurality of string values via a lookup operation;
- if a subset of the plurality of integer codes was not found during obtaining, inserting a subset of the plurality of string values for which the subset of the plurality of integer codes was not found into the shared-leaves structure;
- generating the subset of the plurality of integer codes for the corresponding subset of the plurality of string values; and
- providing a list of the order-preserving plurality of integer codes, the list including the generated subset of the plurality of integer codes.
9. The method of claim 8 further comprising:
- propagating the plurality of integer codes to the shared-leaves structure of the dictionary via a decode index; and
- updating the encode index and the decode index.
10. The method of claim 8 further comprising:
- rewriting a string value from the plurality of string values in an equality-predicate or in a range-predicate with a corresponding integer code from the plurality of order-preserving integer codes; and
- rewriting a string prefix of a prefix-predicate with corresponding integer code ranges from the plurality of order-preserving integer codes.
11. The method of claim 8, wherein obtaining a plurality of order-preserving integer codes further comprises executing a sequential search operation over the compressed leaf data without decompression of the compressed leaf data.
12. The method of claim 8, wherein the encode index comprises a cache-sensitive array trie index or a cache-sensitive prefix tree index.
13. The method of claim 12, wherein the cache-sensitive array trie comprises:
- storing the plurality of string values in an array;
- propagating in preorder the plurality of string values to the shared-leaves structure via variable buffers at each cache-sensitive array trie node to populate the array only once per bulk; and
- generating the subset of the plurality of integer codes for the corresponding subset of the plurality of string values in parallel.
14. The method of claim 12, wherein the cache-sensitive prefix tree comprises:
- calculating a first shortest prefix to distinguish a largest value of a first leaf and a smallest value of a second leaf of the cache-sensitive prefix tree;
- calculating a second shortest prefix to distinguish the largest value of the second leaf and the smallest value of a third leaf;
- if there is more than one node in a level of the cache-sensitive prefix tree, adding a second level on top with a node storing the calculated first and second prefixes, wherein the node is a root of the cache-sensitive prefix tree.
15. A computing system comprising:
- a column-oriented database system;
- a dictionary-based storage unit specifying a mapping between a plurality of variable-length string values and a plurality of integer codes in the column-oriented database system;
- shared-leaves data structures that hold data of the dictionary storage unit in sorted order in their leaves;
- a processor in communication with the dictionary-based storage unit, the processor operable to encode the plurality of variable-length string values to the plurality of integer codes and decode the plurality of integer codes to the plurality of variable-length string values using the shared-leaves data structures.
16. The system of claim 15, wherein the shared-leaves data structures include an encode index to encode the plurality of variable-length string values and a decode index to decode the plurality of integer codes.
17. The system of claim 16, wherein the encode index supports propagation of the plurality of variable-length string values to the shared leaves, lookup of the plurality of integer codes, and generation of a second plurality of integer codes if a subset of the plurality of integer codes is not found during lookup.
18. The system of claim 16, further comprising a cache-sensitive array trie index or a cache-sensitive prefix tree index used on top of the shared-leaves data structures as the encode index.
19. The system of claim 15, wherein the cache-sensitive array trie index comprises an array to store the plurality of variable-length string values.
20. The system of claim 15, wherein the cache-sensitive prefix tree index comprises a node that includes shortest prefixes that enable propagation of the plurality of variable-length string values to child nodes of the cache-sensitive prefix tree.
Type: Application
Filed: Jun 28, 2009
Publication Date: Dec 30, 2010
Inventors: Carsten Binnig (Elztal), Franz Faerber (Walldorf), Stefan Hildenbrand (Altdorf)
Application Number: 12/493,210