Interdistinct Operator
A computer-implemented system and method for performing distinct operations on multiple tables of shared memory of parallel computing environments are disclosed. A distinct operation is executed on each table of a plurality of tables, each distinct operation eliminating duplicate data from each table, the executing creating a hierarchy of table pairs and distinct results, the distinct results comprising a reduced row set for each table. Duplicates on each reduced row set are detected to complete the distinct operation on the plurality of tables.
This application claims the benefit of priority under 35 U.S.C. §119 to U.S. Provisional Patent Application Ser. No. 61/363,304, filed on Jul. 12, 2010, entitled “Hash-Map, Aggregation, Distinct and Join in Parallel Computation Environments with Shared Memory”, the entire disclosure of which is incorporated by reference herein.
BACKGROUND
This disclosure relates generally to parallel computing environments and distinct operations performed on multiple tables of shared memory of parallel computing environments.
Computer processor design has become increasingly influenced by certain physical limits, like heat production, signal propagation delay, transistor size, and bandwidth of communication channels. Roughly since 2006, processor frequency (a measure for the computation power of a processor) has not significantly increased. Therefore, as an alternative to increase computation power, chip vendors began to put multiple computation units (so-called “cores”) on a single chip, in what is known as “multi-core processors”. Further, multiple chips are switched together on a single computer. On such a computer, all of the cores on these processors can access the main memory (known as “shared memory” or “shared memory architecture”).
As a result of these new hardware developments, software vendors can no longer rely on frequency-based performance improvements. Instead, they have to parallelize their software to scale with the number of available processor cores on a computer. Parallelization is difficult, however, especially for operations that were originally designed for single core or single chip computing systems.
One such operation is called “relational distinct.” The relational distinct operation eliminates duplicates from a table. Duplicates are defined based on the values of a set of columns. Two rows with the same values on these columns are duplicates. For example, as illustrated in
In general, this document discloses a computer-implemented system and method for performing distinct operations on multiple tables of shared memory of parallel computing environments.
In one aspect, a computer-implemented method includes executing a distinct operation on each table of a plurality of tables. Each distinct operation eliminates duplicate data from each table, creating a hierarchy of table pairs and distinct results. The distinct results include a reduced row set for each table. The method further includes detecting duplicates on each reduced row set to complete the distinct operation on the plurality of tables.
Articles are also described that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations described herein. Similarly, computer systems are also described that may include a processor and a memory coupled to the processor. The memory may include one or more programs that cause the processor to perform one or more of the operations described herein.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.
These and other aspects will now be described in detail with reference to the following drawings.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
This document describes relational distinct operations performed on multiple tables in a parallel computing environment.
In some implementations, a parallel hash table, or hash map, is used. In general, hash tables are used as index structures for data storage to enable fast retrieval (index lookup). The parallel hash table can be used in a parallel computation environment, where multiple concurrently running threads insert and retrieve data. Distinct operations are computed on multiple tables. The semantics of this operation is defined as: concatenate the distinct columns of all tables on which the distinct operation shall be executed, resulting in an intermediate table. On this table, the distinct operation is then applied.
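The semantics defined above can be illustrated with a minimal reference sketch in Python. It is only a naive baseline that states what the result must be, not the optimized method described in this document. The row representation (a list of dictionaries per table) and the function name `interdistinct_reference` are assumptions made for illustration.

```python
def interdistinct_reference(tables, distinct_columns):
    """Naive reference semantics for a distinct over several tables.

    Assumption: each table is a list of dict rows, and every table
    contains all of the given distinct columns.
    """
    # Step 1: concatenate the distinct columns of all tables into one
    # intermediate table.
    intermediate = []
    for table in tables:
        for row in table:
            intermediate.append(tuple(row[c] for c in distinct_columns))

    # Step 2: apply an ordinary distinct on the intermediate table,
    # keeping the first occurrence of each value combination.
    seen = set()
    result = []
    for key in intermediate:
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


table_1 = [
    {"city": "Walldorf", "year": 2010, "note": "a"},
    {"city": "Worms", "year": 2010, "note": "b"},
]
table_2 = [
    {"city": "Walldorf", "year": 2010, "note": "c"},
    {"city": "Heidelberg", "year": 2011, "note": "d"},
]
print(interdistinct_reference([table_1, table_2], ["city", "year"]))
```

The `note` column is not part of the distinct columns, so the two Walldorf rows count as duplicates even though their notes differ.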
Distinct can be computed by the SQL “group by” operator, which takes all the given attributes as group columns and uses COUNT on an aggregation column. All result rows with a counter>1 are duplicates. Basically, distinct is aggregation without any columns to aggregate. Therefore, the same algorithms apply. These algorithms are used in current solutions to compute the distinct operation.
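As a small illustration of the relationship between distinct and aggregation, the following Python sketch emulates a "group by" with a COUNT aggregate using `collections.Counter`. The function names and the dictionary-based row format are hypothetical and serve only to show the idea.

```python
from collections import Counter


def group_by_count(rows, group_columns):
    """Emulate GROUP BY <group_columns> with a COUNT aggregate."""
    return Counter(tuple(row[c] for c in group_columns) for row in rows)


def distinct_via_group_by(rows, group_columns):
    """Return (distinct keys, keys that occurred more than once)."""
    counts = group_by_count(rows, group_columns)
    distinct_keys = list(counts)  # one entry per group, in first-seen order
    duplicated_keys = [key for key, n in counts.items() if n > 1]
    return distinct_keys, duplicated_keys


rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
]
distinct_keys, duplicated_keys = distinct_via_group_by(rows, ["a", "b"])
print(distinct_keys, duplicated_keys)
```

The group keys alone already form the distinct result; the counter is only needed to tell which groups contained duplicates.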
A system and method are presented for calculating the distinct operation on a columnar store database. In such a database, tables are stored by column. The system and method described herein can be used for sequential and parallel distinct calculation on one or multiple tables. The algorithm differs from standard distinct algorithms in its performance characteristics: in a first step, the candidate set of rows that could be duplicates is reduced, and a standard duplicate check is then performed on the reduced candidate set.
At 506, the rows of the “exactly one row” buckets are removed from the candidate set. At 508, the buckets with “no row” and “exactly one row” are removed from the bucket list. At 510, a perfect hash map without empty buckets is calculated for the corresponding bucket. At 512, a corresponding bucket is calculated for all rows in the candidate row set, and at 514, an attribute set with a high discrimination value is selected out of the remaining attributes. At 516, for each bucket with fewer than X rows per bucket, a hash table with a COUNT aggregate column is calculated. At 518, the rows in buckets with a counter of 1 are removed. At 520, the process repeats from 504 until a threshold is reached.
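The core pruning idea behind these steps is that a row which sits alone in its bucket cannot be a duplicate of any other row. This can be sketched in Python as follows. It is a simplified illustration under several assumptions, not the numbered flow itself: rows are tuples of the distinct columns, one column is used per round in a caller-supplied order (standing in for the selection of attributes with a high discrimination value), the perfect hash map is replaced by an ordinary dictionary, and the threshold is modeled by the hypothetical parameter `min_reduction`.

```python
from collections import defaultdict


def reduce_candidates(rows, column_order, min_reduction=1):
    """Shrink the set of row indices that could still be duplicates."""
    candidates = set(range(len(rows)))
    for col in column_order:
        # Bucket the remaining candidate rows by their value in this column.
        buckets = defaultdict(list)
        for i in candidates:
            buckets[rows[i][col]].append(i)
        # A row alone in its bucket has no duplicate: drop it.
        singles = {ids[0] for ids in buckets.values() if len(ids) == 1}
        candidates -= singles
        # Simplified threshold: stop when a round removes too few rows.
        if len(singles) < min_reduction or not candidates:
            break
    return candidates


def distinct_with_pruning(rows, column_order):
    """Distinct that runs the standard check only on the reduced set."""
    candidates = reduce_candidates(rows, column_order)
    seen = set()
    result = []
    for i, row in enumerate(rows):
        if i not in candidates:
            result.append(row)  # proven unique during pruning
        elif row not in seen:
            seen.add(row)       # standard duplicate check
            result.append(row)
    return result


rows = [("a", 1), ("b", 2), ("a", 1), ("c", 1), ("a", 3)]
print(reduce_candidates(rows, [0, 1]))
print(distinct_with_pruning(rows, [0, 1]))
```

In this example the first round removes the rows with the values "b" and "c", the second round removes ("a", 3), and only the two ("a", 1) rows reach the standard duplicate check.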
Executing a distinct operation on multiple tables is known as an “interdistinct operation.”
At 610, for both tables, the same reduced set of attributes with a high discrimination value and low dependency is selected, such that it can be hashed into a collision-free hash table. At 612, a hash table of the candidate row set is created by using the new dictionary positions as hash values. Each bucket of the hash table is a single bit that indicates whether the bucket is used. At 614, the bucket sets (bit vectors) of the two tables are compared to identify the buckets that are used in only one of the two tables. At 616, the candidate row set of each table is reduced by filtering with the identified buckets. At 618, steps 610 through 616 are re-executed, reusing the calculated buckets, until a threshold is reached. The threshold is either a low level of row set reduction or a small candidate row set. At 620, a duplicate detection on the reduced row set is performed, as described above.
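The bit-vector comparison for one table pair can be sketched in Python as shown below. This is an illustrative simplification under stated assumptions: each table has already been made distinct on its own, rows are tuples of the distinct columns, a shared dictionary built per round stands in for the "new dictionaries", a Python integer serves as the bit vector, and the attribute sets are supplied by the caller rather than chosen by discrimination value. All function names are hypothetical.

```python
def used_buckets(rows, candidates, cols, positions):
    """Bit vector with one bit per dictionary position (bucket)."""
    bits = 0
    for i in candidates:
        bits |= 1 << positions[tuple(rows[i][c] for c in cols)]
    return bits


def filter_pair(rows_a, rows_b, attribute_sets):
    """Reduce the candidate row sets of two tables by comparing buckets."""
    cand_a = set(range(len(rows_a)))
    cand_b = set(range(len(rows_b)))
    for cols in attribute_sets:
        def key(row):
            return tuple(row[c] for c in cols)
        # Shared dictionary: every value combination gets its own position,
        # so the resulting hash table is collision-free.
        values = {key(rows_a[i]) for i in cand_a} | {key(rows_b[i]) for i in cand_b}
        positions = {value: pos for pos, value in enumerate(values)}
        bits_a = used_buckets(rows_a, cand_a, cols, positions)
        bits_b = used_buckets(rows_b, cand_b, cols, positions)
        # Buckets used in only one table cannot hold cross-table duplicates.
        shared = bits_a & bits_b
        cand_a = {i for i in cand_a if (shared >> positions[key(rows_a[i])]) & 1}
        cand_b = {i for i in cand_b if (shared >> positions[key(rows_b[i])]) & 1}
        if not cand_a or not cand_b:
            break  # simplified threshold: nothing left to compare
    return cand_a, cand_b


def interdistinct_pair(rows_a, rows_b, attribute_sets):
    """Distinct over two tables; full check only on the reduced row sets."""
    cand_a, cand_b = filter_pair(rows_a, rows_b, attribute_sets)
    seen = {rows_a[i] for i in cand_a}
    result = list(rows_a)
    result += [row for i, row in enumerate(rows_b)
               if i not in cand_b or row not in seen]
    return result


rows_a = [("x", 1), ("y", 2), ("z", 3)]
rows_b = [("x", 1), ("y", 9), ("w", 3)]
print(filter_pair(rows_a, rows_b, [(0,), (1,)]))
print(interdistinct_pair(rows_a, rows_b, [(0,), (1,)]))
```

After two rounds only the row ("x", 1) remains a candidate in each table, so the final duplicate detection compares a single pair of rows instead of all nine combinations.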
The methods and systems described herein can be parallelized by using a parallel hash map for the hash tables mentioned above, by dynamic vertical partitioning, or by handling table pairs in parallel in the multiple-tables algorithm.
Some or all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of them. Embodiments of the invention can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer readable medium, e.g., a machine readable storage device, a machine readable storage medium, a memory device, or a machine-readable propagated signal, for execution by, or to control the operation of, data processing apparatus.
The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.
A computer program (also referred to as a program, software, an application, a software application, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, a communication interface to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Information carriers suitable for embodying computer program instructions and data include all forms of non volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the invention can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Embodiments of the invention can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention, or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Certain features which, for clarity, are described in this specification in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features which, for brevity, are described in the context of a single embodiment, may also be provided in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the steps recited in the claims can be performed in a different order and still achieve desirable results. In addition, embodiments of the invention are not limited to database architectures that are relational; for example, the invention can be implemented to provide indexing and archiving methods and systems for databases built on models other than the relational model, e.g., navigational databases or object oriented databases, and for databases having records with complex attribute structures, e.g., object oriented programming objects or markup language documents. The processes described may be implemented by applications specifically performing archiving and retrieval functions or embedded within other applications.
Claims
1. A computer-implemented method comprising:
- executing a distinct operation on each table of a plurality of tables, each distinct operation eliminating duplicate data from each table, the executing creating a hierarchy of table pairs and distinct results, the distinct results comprising a reduced row set for each table; and
- detecting duplicates on each reduced row set to complete the distinct operation on the plurality of tables,
- the executing and detecting being performed by one or more processors.
2. The computer-implemented method in accordance with claim 1, further comprising executing, by the one or more processors and for each table pair in the hierarchy of table pairs:
- removing values that exist in one or the other table to generate a reduced value set;
- generating new dictionaries for each column in the table pair with the reduced value set;
- defining a candidate row set from the reduced value;
- generating a hash table of the candidate row set based on the new dictionaries as hash values; and
- reducing the candidate row set by filtering bit vectors of the hash table to generate the reduced row set.
3. The computer-implemented method in accordance with claim 2, wherein reducing the candidate row set includes filtering with the new dictionaries.
4. The computer-implemented method in accordance with claim 2, wherein the reduced value set comprises a reduced set of attributes with high discrimination value and low dependency between tables of the table pair.
5. The computer-implemented method in accordance with claim 4, wherein the reduced set of attributes are hashed into a collision-free hash table.
6. A computer program product comprising a machine-readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising:
- execute a distinct operation on each table of a plurality of tables, each distinct operation eliminating duplicate data from each table, the executing creating a hierarchy of table pairs and distinct results, the distinct results comprising a reduced row set for each table;
- detect duplicates on each reduced row set to complete the distinct operation on the plurality of tables.
7. The computer program product in accordance with claim 6, wherein the instructions further comprise, for each table pair in the hierarchy of table pairs, operations to:
- remove values that exist in one or the other table to generate a reduced value set;
- generate new dictionaries for each column in the table pair with the reduced value set;
- define a candidate row set from the reduced value;
- generate a hash table of the candidate row set based on the new dictionaries as hash values; and
- reduce the candidate row set by filtering bit vectors of the hash table to generate the reduced row set.
8. The computer program product in accordance with claim 7, wherein the operation to reduce the candidate row set includes an operation to filter with the new dictionaries.
9. The computer program product in accordance with claim 7, wherein the reduced value set comprises a reduced set of attributes with high discrimination value and low dependency between tables of the table pair.
10. The computer program product in accordance with claim 9, wherein the reduced set of attributes are hashed into a collision-free hash table.
11. A system comprising:
- at least one programmable processor; and
- a machine-readable medium storing instructions that, when executed by the at least one processor, cause the at least one programmable processor to perform operations comprising:
- executing a distinct operation on each table of a plurality of tables, each distinct operation eliminating duplicate data from each table, the executing creating a hierarchy of table pairs and distinct results, the distinct results comprising a reduced row set for each table; and
- detecting duplicates on each reduced row set to complete the distinct operation on the plurality of tables.
12. The system in accordance with claim 11, wherein the operations further comprise, for each table pair in the hierarchy of table pairs:
- removing values that exist in one or the other table to generate a reduced value set;
- generating new dictionaries for each column in the table pair with the reduced value set;
- defining a candidate row set from the reduced value;
- generating a hash table of the candidate row set based on the new dictionaries as hash values; and
- reducing the candidate row set by filtering bit vectors of the hash table to generate the reduced row set.
13. The system in accordance with claim 12, wherein reducing the candidate row set includes filtering with the new dictionaries.
14. The system in accordance with claim 12, wherein the reduced value set comprises a reduced set of attributes with high discrimination value and low dependency between tables of the table pair.
15. The system in accordance with claim 14, wherein the reduced set of attributes are hashed into a collision-free hash table.
Type: Application
Filed: Dec 30, 2010
Publication Date: Jan 12, 2012
Patent Grant number: 9223829
Inventors: Franz Faerber (Walldorf), Christian Bensberg (Heidelberg), Lars Fricke (Worms)
Application Number: 12/982,767
International Classification: G06F 17/30 (20060101); G06F 7/00 (20060101);