Patents by Inventor Peter Jay Haas
Peter Jay Haas has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20090192980
Abstract: The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. The present invention provides synopses for DV estimation in the setting of a partitioned dataset, as well as corresponding DV estimators that exploit these synopses. Whenever an output compound data partition is created via a multiset operation on a pair of (possibly compound) input partitions, the synopsis for the output partition can be obtained by combining the synopses of the input partitions. If the input partitions are compound partitions, it is not necessary to access the synopses for all the base partitions that were used to construct the input partitions. Superior (in certain cases near-optimal) accuracy in DV estimates is maintained, especially when the synopsis size is small. The synopses can be created in parallel, and can also handle deletions of individual partition elements.
Type: Application
Filed: January 30, 2008
Publication date: July 30, 2009
Applicant: International Business Machines Corporation
Inventors: Kevin Scott Beyer, Rainer Gemulla, Peter Jay Haas, Berthold Reinwald, John Sismanis
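The idea of combining per-partition synopses, as described in the abstract above, can be illustrated with a basic k-minimum-values (KMV) synopsis. This is a minimal sketch of a well-known textbook technique, not the synopsis or estimator claimed in the application; the function names, the choice of SHA-256 as the hash, and the parameter `k` are assumptions made for illustration, and only the union operation is shown.

```python
import hashlib


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (hash choice is an assumption)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(values, k):
    """Keep the k smallest distinct hash values seen in one partition."""
    return sorted({hash01(v) for v in values})[:k]


def kmv_union(syn_a, syn_b, k):
    """Synopsis of the union, built only from the two input synopses."""
    return sorted(set(syn_a) | set(syn_b))[:k]


def kmv_estimate(synopsis, k):
    """Basic estimator (k - 1) / U_(k); an exact count when fewer than k hashes exist."""
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


# Two overlapping partitions; their union holds 10,000 distinct values.
part_a = kmv_synopsis(range(0, 6000), k=256)
part_b = kmv_synopsis(range(4000, 10000), k=256)
combined = kmv_union(part_a, part_b, k=256)
print(round(kmv_estimate(combined, k=256)))
```

The union synopsis is identical to the one that would be built by scanning both partitions together, which is why the base data does not need to be revisited.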
-
Publication number: 20090150421
Abstract: A system, an article, and a computer program product for estimating a cardinality value for a set of data values. In one embodiment, the system includes means for initializing a data structure for representing an array of counts; means for obtaining a data value from said set of data values; means for transforming said data value into a transformed string; means for modifying said data structure with said transformed string; means for obtaining a summary statistic value from said modified data structure, wherein the summary statistic value is based on the array of counts; and means for generating said estimated cardinality value using said summary statistic value.
Type: Application
Filed: November 26, 2008
Publication date: June 11, 2009
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Walid Rjaibi, Peter Jay Haas
-
Patent number: 7543006
Abstract: A sampling infrastructure/scheme that supports flexible, efficient, scalable and uniform sampling is disclosed. A sample is maintained in a compact histogram form while the sample footprint stays below a specified upper bound. If, at any point, the sample footprint exceeds the upper bound, then the compact representation is abandoned, the sample purged to obtain a subsample. The histogram of the purged subsample is expanded to a bag of values while sampling remaining data values of the partitioned subset. The expanded purged subsample is converted to a histogram and uniform random samples are yielded. The sampling scheme retains the bounded footprint property and to a partial degree the compact representation of the Concise Sampling scheme, while ensuring statistical uniformity. Samples from at least two partitioned subsets are merged on demand to yield uniform merged samples of combined partitions wherein the merged samples also maintain the histogram representation and bounded footprint property.
Type: Grant
Filed: August 31, 2006
Date of Patent: June 2, 2009
Assignee: International Business Machines Corporation
Inventors: Paul Geoffrey Brown, Peter Jay Haas
-
Patent number: 7512574
Abstract: A novel method is employed for collecting optimizer statistics for optimizing database queries by gathering feedback from the query execution engine about the observed cardinality of predicates and constructing and maintaining multidimensional histograms. This makes use of the correlation between data columns without employing an inefficient data scan. The maximum entropy principle is used to approximate the true data distribution by a histogram distribution that is as “simple” as possible while being consistent with the observed predicate cardinalities. Changes in the underlying data are readily adapted to, automatically detecting and eliminating inconsistent feedback information in an efficient manner. The size of the histogram is controlled by retaining only the most “important” feedback.
Type: Grant
Filed: September 30, 2005
Date of Patent: March 31, 2009
Assignee: International Business Machines Corporation
Inventors: Peter Jay Haas, Volker Gerhard Markl, Nimrod Megiddo, Utkarsh Srivastava
-
Patent number: 7512629
Abstract: The present invention provides a method of selectivity estimation in which preprocessing steps improve the feasibility and efficiency of the estimation. The preprocessing steps are partitioning (to make iterative scaling estimation terminate in a reasonable time for even large sets of predicates), forced partitioning (to enable partitioning in case there are no “natural” partitions, by finding the subsets of predicates to create partitions that least impact the overall solution); inconsistency resolution (in order to ensure that there always is a correct and feasible solution), and implied zero elimination (to ensure convergence of the iterative scaling computation under all circumstances). All of these preprocessing steps make a maximum entropy method of selectivity estimation produce a correct cardinality model, for any kind of query with conjuncts of predicates. In addition, the preprocessing steps can also be used in conjunction with prior art methods for building a cardinality model.
Type: Grant
Filed: July 13, 2006
Date of Patent: March 31, 2009
Assignee: International Business Machines Corporation
Inventors: Peter Jay Haas, Marcel Kutsch, Volker Gerhard Markl, Nimrod Megiddo
-
Patent number: 7496584
Abstract: A method for incrementally maintaining column cardinality estimates in database management systems. In one embodiment, the system includes system catalog table containing a cardinality estimate for a column that is extended to include an appropriate data structure. A modified linear counting technique is used in a first embodiment of a method for column cardinality estimation. The cardinality estimate is produced by an initial scan of the data but is then further maintained without requiring a full scan of the data. Data changes are reflected incrementally in modifications to the initial cardinality estimate, keeping the cardinality statistics more current with respect to the database condition. The technique of the invention typically provides a capability for a database management system to produce more efficient search plans providing more effective responses to user queries through the use of improved cardinality statistics.
Type: Grant
Filed: August 8, 2006
Date of Patent: February 24, 2009
Assignee: International Business Machines Corporation
Inventors: Walid Rjaibi, Peter Jay Haas
-
Publication number: 20090006349
Abstract: A method is disclosed for conducting a query to transform data in a pre-existing database, the method comprising: collecting database information from the pre-existing database, the database information including inconsistent dimensional tables and fact tables; running an entity discovery process on the inconsistent dimensional tables and the fact tables to produce entity mapping tables; using the entity mapping tables to resolve the inconsistent dimensional tables into resolved dimensional tables; and running the query on a resolved database to obtain a query result, the resolved database including the resolved dimensional table.
Type: Application
Filed: June 5, 2008
Publication date: January 1, 2009
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Ariel Fuxman, Peter Jay Haas, Berthold Reinwald, Yannis Sismanis, Ling Wang
-
Publication number: 20090006331
Abstract: A method is disclosed for conducting a query to transform data in a pre-existing database, the method comprising: collecting database information from the pre-existing database, the database information including inconsistent dimensional tables and fact tables; running an entity discovery process on the inconsistent dimensional tables and the fact tables to produce entity mapping tables; using the entity mapping tables to resolve the inconsistent dimensional tables into resolved dimensional tables; and running the query on a resolved database to obtain a query result, the resolved database including the resolved dimensional table.
Type: Application
Filed: June 29, 2007
Publication date: January 1, 2009
Inventors: Ariel Fuxman, Peter Jay Haas, Berthold Reinwald, Yannis Sismanis, Ling Wang
-
Publication number: 20080228831
Abstract: There is disclosed a data processing system implemented method, a data processing system, and an article of manufacture for directing a data processing system to maintain a database table associated with an initial maintenance scheduling interval. The data processing system implemented method includes selecting a randomizing factor, and selecting a new maintenance scheduling interval for the database table based on the initial maintenance scheduling interval and the selected randomizing factor.
Type: Application
Filed: March 31, 2008
Publication date: September 18, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES
Inventors: Ashraf Ismail Aboulnaga, Peter Jay Haas, Sam Sampson Lightstone, Volker Gerhard Markl, Ivan Popivanov, Vijayshankar Raman
-
Publication number: 20080133454
Abstract: An autonomic tool that supervises the collection and maintenance of database statistics for query optimization by transparently deciding what statistics to gather, when and in what detail to gather them. Feedback from data-driven statistics collection is simultaneously combined with feedback from query-driven learning-based statistics collection, to better process both rapidly changing data and data that is queried frequently. The invention monitors table activity and decides if the data in a table has changed sufficiently to require a refresh of invalid statistics. The invention determines if the invalidity is due to correlation between purportedly independent data, outdated statistics, or statistics that have too few frequent values. Tables and column groups are ranked in order of statistical invalidity, and a limited computational budget is prioritized by ranking subsequent gathering of improved statistics.
Type: Application
Filed: October 29, 2004
Publication date: June 5, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: VOLKER G. MARKL, PETER JAY HAAS, ASHRAF ISMAIL ABOULNAGA, VIJAYSHANKAR RAMAN, FELIX ENDRES
-
Patent number: 7363324
Abstract: There is disclosed a data processing system implemented method, a data processing system, and an article of manufacture for directing a data processing system to maintain a database table associated with an initial maintenance scheduling interval. The data processing system implemented method includes selecting a randomizing factor, and selecting a new maintenance scheduling interval for the database table based on the initial maintenance scheduling interval and the selected randomizing factor.
Type: Grant
Filed: December 17, 2004
Date of Patent: April 22, 2008
Assignee: International Business Machines Corporation
Inventors: Ashraf Ismail Aboulnaga, Peter Jay Haas, Sam Sampson Lightstone, Volker Gerhard Markl, Ivan Popivanov, Vijayshankar Raman
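A rough illustration of deriving a new maintenance interval from an initial interval and a randomizing factor, as the abstract above describes. The abstract does not say how the factor is chosen or combined, so the uniform multiplicative jitter, the function name, and the `max_jitter` parameter here are all assumptions.

```python
import random


def next_maintenance_interval(initial_interval, max_jitter=0.25, rng=random):
    """Scale the initial interval by a random factor in [1 - max_jitter, 1 + max_jitter].

    The multiplicative, uniformly distributed factor is an assumption of this sketch.
    """
    factor = rng.uniform(1.0 - max_jitter, 1.0 + max_jitter)
    return initial_interval * factor


# Hypothetical usage: a table nominally maintained every 60 minutes.
rng = random.Random(0)
print(next_maintenance_interval(60.0, max_jitter=0.25, rng=rng))
```

One plausible motivation for randomizing is to keep many tables that share the same initial interval from all coming due at the same moment.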
-
Publication number: 20080059540
Abstract: A sampling infrastructure/scheme that supports flexible, efficient, scalable and uniform sampling is disclosed. A sample is maintained in a compact histogram form while the sample footprint stays below a specified upper bound. If, at any point, the sample footprint exceeds the upper bound, then the compact representation is abandoned, the sample purged to obtain a subsample. The histogram of the purged subsample is expanded to a bag of values while sampling remaining data values of the partitioned subset. The expanded purged subsample is converted to a histogram and uniform random samples are yielded. The sampling scheme retains the bounded footprint property and to a partial degree the compact representation of the Concise Sampling scheme, while ensuring statistical uniformity. Samples from at least two partitioned subsets are merged on demand to yield uniform merged samples of combined partitions wherein the merged samples also maintain the histogram representation and bounded footprint property.
Type: Application
Filed: August 31, 2006
Publication date: March 6, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: PAUL GEOFFREY BROWN, PETER JAY HAAS
-
Publication number: 20080046455
Abstract: A method is disclosed for automatically configuring database statistics by: collecting information from a database system, the database information including data query feedback; consolidating and formatting the database information into a plurality of intervals; converting the plurality of intervals into a plurality of non-overlapping buckets; computing frequencies for the buckets by solving a constrained maximum entropy problem to create a proxy data distribution function; and using the proxy data distribution function to determine a set of statistics to maintain for the database information.
Type: Application
Filed: August 16, 2006
Publication date: February 21, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: ALEXANDER BEHM, PETER JAY HAAS, VOLKER GERHARD MARKL
-
Publication number: 20080016097
Abstract: A method of selectivity estimation is disclosed in which preprocessing steps improve the feasibility and efficiency of the estimation. The preprocessing steps are: partitioning (to make iterative scaling estimation terminate in a reasonable time for even large sets of predicates); forced partitioning (to enable partitioning in case there are no “natural” partitions, by finding the subsets of predicates to create partitions that least impact the overall solution); inconsistency resolution (in order to ensure that there always is a correct and feasible solution); and implied zero elimination (to ensure convergence of the iterative scaling computation under all circumstances). All of these preprocessing steps make a maximum entropy method of selectivity estimation produce a correct cardinality model, for any kind of query with conjuncts of predicates. In addition, the preprocessing steps can also be used in conjunction with prior art methods for building a cardinality model.
Type: Application
Filed: July 13, 2006
Publication date: January 17, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: PETER JAY HAAS, MARCEL KUTSCH, VOLKER GERHARD MARKL, NIMROD MEGIDDO
-
Patent number: 7277873
Abstract: A scheme is used to automatically discover algebraic constraints between pairs of columns in relational data. The constraints may be “fuzzy” in that they hold for most, but not all, of the records, and the columns may be in the same table or different tables. The scheme first identifies candidate sets of column value pairs that are likely to satisfy an algebraic constraint. For each candidate, the scheme constructs algebraic constraints by applying statistical histogramming, segmentation, or clustering techniques to samples of column values. In query-optimization mode, the scheme automatically partitions the data into normal and exception records. During subsequent query processing, queries can be modified to incorporate the constraints; the optimizer uses the constraints to identify new, more efficient access paths. The results are then combined with the results of executing the original query against the (small) set of exception records.
Type: Grant
Filed: October 31, 2003
Date of Patent: October 2, 2007
Assignee: International Business Machines Corporation
Inventors: Paul Geoffrey Brown, Peter Jay Haas
-
Patent number: 7124146
Abstract: A technique is provided for incrementally maintaining column cardinality estimates in database management systems. The system catalog table containing a cardinality estimate for a column is extended to include an appropriate data structure. A modified linear counting technique is used in a first embodiment of a method for column cardinality estimation. Moreover, a modified logarithmic counting technique is used in a second, preferred embodiment of a column cardinality estimation method to reduce storage requirements for the data structure. The cardinality estimate is produced by an initial scan of the data but is then further maintained without requiring a full scan of the data. Data changes are reflected incrementally in modifications to the initial cardinality estimate, keeping the cardinality statistics more current with respect to the database condition.
Type: Grant
Filed: April 30, 2003
Date of Patent: October 17, 2006
Assignee: International Business Machines Corporation
Inventors: Walid Rjaibi, Peter Jay Haas
-
Publication number: 20060136499
Abstract: There is disclosed a data processing system implemented method, a data processing system, and an article of manufacture for directing a data processing system to maintain a database table associated with an initial maintenance scheduling interval. The data processing system implemented method includes selecting a randomizing factor, and selecting a new maintenance scheduling interval for the database table based on the initial maintenance scheduling interval and the selected randomizing factor.
Type: Application
Filed: December 17, 2004
Publication date: June 22, 2006
Inventors: Ashraf Ismail Aboulnaga, Peter Jay Haas, Sam Sampson Lightstone, Volker Gerhard Markl, Ivan Popivanov, Vijayshankar Raman
-
Patent number: 6993516
Abstract: A system, method and computer readable medium for sampling data from a relational database are disclosed, where an information processing system chooses rows from a table in a relational database for sampling, wherein data values are arranged into rows, rows are arranged into pages, and pages are arranged into tables. Pages are chosen for sampling according to a probability P and rows in a selected page are chosen for sampling according to a probability R, so that the overall probability of choosing a row for sampling is Q=PR. The probabilities P and R are based on the desired precision of estimates computed from a sample, as well as processing speed. The probabilities P and R are further based on either catalog statistics of the relational database or a pilot sample of rows from the relational database.
Type: Grant
Filed: December 26, 2002
Date of Patent: January 31, 2006
Assignee: International Business Machines Corporation
Inventors: Peter Jay Haas, Guy Maring Lohman, Mir Hamid Pirahesh, David Everett Simmen, Ashutosh Vir Vikram Singh, Michael Jeffrey Winer, Markos Zaharioudakis
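The two-stage selection in the abstract above (pages with probability P, then rows within a selected page with probability R, giving Q=PR per row) can be sketched as follows. Modeling a table as a list of pages, each a list of rows, is an assumption of this sketch, as are the function and parameter names; how P and R are chosen from precision targets, catalog statistics, or a pilot sample is not shown.

```python
import random


def sample_rows(pages, p, r, rng):
    """Two-stage Bernoulli sample: keep each page with probability p,
    then keep each row of a kept page with probability r.
    Any given row is therefore included with probability q = p * r."""
    sample = []
    for page in pages:
        if rng.random() < p:          # page-level coin flip
            for row in page:
                if rng.random() < r:  # row-level coin flip
                    sample.append(row)
    return sample


# Hypothetical table: 1,000 pages of 100 rows each.
table = [[(page_no, slot) for slot in range(100)] for page_no in range(1000)]
picked = sample_rows(table, p=0.5, r=0.2, rng=random.Random(42))
print(len(picked) / 100_000)  # observed sampling fraction, expected near q = 0.1
```

For a fixed Q, a smaller P reads fewer pages, while a larger P with smaller R spreads the sample across more pages; this is the kind of speed versus precision trade-off the abstract refers to.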
-
Publication number: 20040128290
Abstract: A system, method and computer readable medium for sampling data from a relational database are disclosed, where an information processing system chooses rows from a table in a relational database for sampling, wherein data values are arranged into rows, rows are arranged into pages, and pages are arranged into tables. Pages are chosen for sampling according to a probability P and rows in a selected page are chosen for sampling according to a probability R, so that the overall probability of choosing a row for sampling is Q=PR. The probabilities P and R are based on the desired precision of estimates computed from a sample, as well as processing speed. The probabilities P and R are further based on either catalog statistics of the relational database or a pilot sample of rows from the relational database.
Type: Application
Filed: December 26, 2002
Publication date: July 1, 2004
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Peter Jay Haas, Guy Maring Lohman, Mir Hamid Pirahesh, David Everett Simmen, Ashutosh Vir Vikram Singh, Michael Jeffrey Winer, Markos Zaharioudakis
-
Patent number: 6732110
Abstract: The present invention is directed to a system, method and computer readable medium for estimating a column cardinality value for a column in a partitioned table stored in a plurality of nodes in a relational database. According to one embodiment of the present invention, a plurality of column values for the partitioned table stored in each node are hashed, and a hash data set for each node is generated. Each of the hash data sets from each node is transferred to a coordinator node designated from the plurality of nodes. The hash data sets are merged into a merged data set, and an estimated column cardinality value for the table is calculated from the merged data set.
Type: Grant
Filed: June 27, 2001
Date of Patent: May 4, 2004
Assignee: International Business Machines Corporation
Inventors: Walid Rjaibi, Guy Maring Lohman, Peter Jay Haas
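The hash, transfer, merge, and estimate flow in the abstract above can be sketched with standard linear counting, where each node's hash data set is a bitmap and the coordinator merges bitmaps with a bitwise OR. Treating the hash data set as a bitmap, the SHA-256 hash, the bitmap size `m`, and all function names are assumptions of this sketch rather than details taken from the patent.

```python
import hashlib
import math


def node_bitmap(values, m):
    """Per-node hash data set: an m-bit bitmap held in a Python int."""
    bits = 0
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def merge_bitmaps(bitmaps):
    """Coordinator step: OR the per-node bitmaps into one merged bitmap."""
    merged = 0
    for b in bitmaps:
        merged |= b
    return merged


def estimate_cardinality(bitmap, m):
    """Linear counting estimate: -m * ln(fraction of bits still zero)."""
    zeros = m - bin(bitmap).count("1")
    if zeros == 0:
        raise ValueError("bitmap is full; use a larger m")
    return -m * math.log(zeros / m)


# Three hypothetical nodes whose partitions overlap; 1,000 distinct values overall.
m = 8192
nodes = [range(0, 400), range(300, 700), range(600, 1000)]
merged = merge_bitmaps(node_bitmap(part, m) for part in nodes)
print(round(estimate_cardinality(merged, m)))
```

Because OR is idempotent, a value stored on several nodes sets the same bit once, so the merged bitmap equals the one a single full scan would produce.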