Patents by Inventor Surajit Chaudhuri

Surajit Chaudhuri has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20050223019
    Abstract: System and apparatus for using block-level sampling for histogram construction as well as distinct-value estimation. For histogram construction, the system implements a two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first-phase sample. This method is significantly faster than previous iterative block-level sampling methods proposed for the same problem. For distinct-value estimation, it is shown that existing estimators designed for uniform-random samples may perform very poorly with block-level samples. An exemplary system computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators.
    Type: Application
    Filed: March 31, 2004
    Publication date: October 6, 2005
    Inventors: Gautam Das, Surajit Chaudhuri, Utkarsh Srivastava
  • Publication number: 20050222965
    Abstract: A query progress indicator that provides an indication to a user of the progress of a query being executed on a database. The indication of the progress of the query allows the user to decide whether the query should be allowed to complete or should be aborted. One method that may be used to estimate the progress of a query that is being executed on a database defines a model of work performed during execution of a query. The total amount of work that will be performed during execution of the query is estimated according to the model. The amount of work performed at a given point during execution of the query is estimated according to the model. The progress of the query is estimated using the estimated amount of work at the given point in time and the estimated total amount of work. This estimated progress of query execution may be provided to the user.
    Type: Application
    Filed: March 31, 2004
    Publication date: October 6, 2005
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Ravishankar Ramamurthy
  • Patent number: 6947927
    Abstract: A method for evaluating a user query on a relational database having records stored therein, a workload made up of a set of queries that have been executed on the database, and a query optimizer that generates a query execution plan for the user query. Each query plan includes a plurality of intermediate query plan components that verify a subset of records from the database meeting query criteria. The method accesses the query plan and a set of stored intermediate statistics for records verified by query components, such as histograms that summarize the cardinality of the records that verify the query component. The method forms a transformed query plan based on the selected intermediate statistics (possibly by rewriting the query plan) and estimates the cardinality of the transformed query plan to arrive at a more accurate cardinality estimate for the query.
    Type: Grant
    Filed: July 9, 2002
    Date of Patent: September 20, 2005
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Nicolas Bruno
  • Publication number: 20050203933
    Abstract: An XML transformation tool that constructs a relational database with associated physical structures that can be populated with shredded XML data. A mapping transformation enumerator examines queries in the workload and enumerates mapping transformations that use XSD-specific constraints and statistics on XML data and can be used to generate mappings from XSD to relational database schema that may lead to better performance in the presence of physical design. A design tuner searches mappings generated from a default mapping using enumerated transformations, together with physical design structures associated with those mappings, and selects a preferred mapping and the physical design structures. Cost estimates for performing queries in the workload are determined for the relational database implementing the mapping and associated physical design structures.
    Type: Application
    Filed: March 9, 2004
    Publication date: September 15, 2005
    Inventors: Surajit Chaudhuri, Zhiyuan Chen, Kyuseok Shim, Yuqing Yu
  • Publication number: 20050192921
    Abstract: A framework is provided within a database system for specifying database monitoring rules that will be evaluated as part of the execution code path of database events being monitored. The occurrence of a selected database event triggers a rule that evaluates some parameter of an object related to the event against a condition in the rule. If the condition is met, a specified action is taken that can alter the execution of the database event or database system performance. Lightweight aggregation tables are utilized to enable aggregation of object parameter values so that presently occurring events can be compared to a summary of the object parameter values from previously occurring database events. Signatures are assigned to queries based on the structure of the query plan so that information in the lightweight aggregation tables can be grouped according to query signature.
    Type: Application
    Filed: February 26, 2004
    Publication date: September 1, 2005
    Inventors: Surajit Chaudhuri, Arnd Konig, Vivek Narasayya
  • Patent number: 6912547
    Abstract: Relational database applications such as index selection, histogram tuning, approximate query processing, and statistics selection have recognized the importance of leveraging workloads. Often these applications are presented with large workloads, i.e., a set of SQL DML statements, as input. A key factor affecting the scalability of such applications is the size of the workload. The invention concerns workload compression which helps improve the scalability of such applications. The exemplary embodiment is broadly applicable to a variety of workload-driven applications, while allowing for incorporation of application specific knowledge. The process is described in detail in the context of two workload-driven applications: index selection and approximate query processing.
    Type: Grant
    Filed: June 26, 2002
    Date of Patent: June 28, 2005
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Ashish Kumar Gupta, Vivek Narasayya, Sanjay Agrawal
  • Publication number: 20050102305
    Abstract: Relational database applications such as index selection, histogram tuning, approximate query processing, and statistics selection have recognized the importance of leveraging workloads. Often these applications are presented with large workloads, i.e., a set of SQL DML statements, as input. A key factor affecting the scalability of such applications is the size of the workload. The invention concerns workload compression which helps improve the scalability of such applications. The exemplary embodiment is broadly applicable to a variety of workload-driven applications, while allowing for incorporation of application specific knowledge. The process is described in detail in the context of two workload-driven applications: index selection and approximate query processing.
    Type: Application
    Filed: December 8, 2004
    Publication date: May 12, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Ashish Gupta, Vivek Narasayya, Sanjay Agrawal
  • Publication number: 20050033759
    Abstract: A method for estimating the result of a query on a database having data records arranged in tables. The database has an expected workload that includes a set of queries that can be executed on the database. An expected workload is derived comprising a set of queries that can be executed on the database. A sample is constructed by selecting data records for inclusion in the sample in a manner that minimizes an estimation error when the data records are acted upon by a query in the expected workload to provide an expected result. The query accesses the sample and is executed on the sample, returning an estimated query result. The expected workload can be constructed by specifying a degree of overlap between records selected by queries in the given workload and records selected by queries in the expected workload.
    Type: Application
    Filed: September 8, 2004
    Publication date: February 10, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Gautam Das
  • Publication number: 20050033730
    Abstract: Database system query optimizers use several techniques such as histograms and sampling to estimate the result sizes of operators and sub-plans (operator trees) and the number of distinct values in their outputs. Instead of estimates, the invention uses the exact actual values of the result sizes and the number of distinct values in the outputs of sub-plans encountered by the optimizer. This is achieved by optimizing the query in phases. In each phase, newly encountered sub-plans are recorded for which result size and/or distinct value estimates are required. These sub-plans are executed at the end of the phase to determine their actual result sizes and the actual number of distinct values in their outputs. In subsequent phases, the optimizer uses these actual values when it encounters the same sub-plan again.
    Type: Application
    Filed: September 15, 2004
    Publication date: February 10, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Ashraf Aboulnaga
  • Publication number: 20050033739
    Abstract: A method for estimating the result of a query on a database having data records arranged in tables. The database has an expected workload that includes a set of queries that can be executed on the database. An expected workload is derived comprising a set of queries that can be executed on the database. A sample is constructed by selecting data records for inclusion in the sample in a manner that minimizes an estimation error when the data records are acted upon by a query in the expected workload to provide an expected result. The query accesses the sample and is executed on the sample, returning an estimated query result. The expected workload can be constructed by specifying a degree of overlap between records selected by queries in the given workload and records selected by queries in the expected workload.
    Type: Application
    Filed: September 8, 2004
    Publication date: February 10, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Gautam Das
  • Patent number: 6850925
    Abstract: Database system query optimizers use several techniques such as histograms and sampling to estimate the result sizes of operators and sub-plans (operator trees) and the number of distinct values in their outputs. Instead of estimates, the invention uses the exact actual values of the result sizes and the number of distinct values in the outputs of sub-plans encountered by the optimizer. This is achieved by optimizing the query in phases. In each phase, newly encountered sub-plans are recorded for which result size and/or distinct value estimates are required. These sub-plans are executed at the end of the phase to determine their actual result sizes and the actual number of distinct values in their outputs. In subsequent phases, the optimizer uses these actual values when it encounters the same sub-plan again.
    Type: Grant
    Filed: May 15, 2001
    Date of Patent: February 1, 2005
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Ashraf I. Aboulnaga
  • Patent number: 6842753
    Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by.
    Type: Grant
    Filed: January 12, 2001
    Date of Patent: January 11, 2005
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek R. Narasayya, Rajeev Motwani, Mayur D. Datar
  • Publication number: 20050004907
    Abstract: By transforming a query into a product of conditional selectivity expressions, an existing set of statistics on query expressions can be used more effectively to estimate cardinality values. Conditional selectivity values are progressively separated according to rules of conditional probability to yield a set of non-separable decompositions that can be matched with the stored statistics on query expressions. The stored statistics are used to estimate the selectivity of the query and the estimated selectivity can be multiplied by the Cartesian product of referenced tables to yield a cardinality value.
    Type: Application
    Filed: June 27, 2003
    Publication date: January 6, 2005
    Inventors: Nicolas Bruno, Surajit Chaudhuri
  • Publication number: 20040267713
    Abstract: A method of estimating selectivity of a given string predicate in a database query. In the method, selectivities of substrings of various substring lengths are estimated. For example, the selectivity of substrings between length l (or some constant q) to the length of the given string predicate may be estimated. The method then selects a candidate substring for each substring length based on estimated selectivities of the substrings. The estimated selectivities of the candidate substrings are combined. The combined estimated selectivity of the candidate substrings is returned as the estimated selectivity of the given string predicate.
    Type: Application
    Filed: June 24, 2003
    Publication date: December 30, 2004
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Venkatesh Ganti, Luis Gravano
  • Publication number: 20040260675
    Abstract: A method of estimating cardinality of a join of tables using multi-column density values and additionally using coarser density values of a subset of the multi-column density attributes. In one embodiment, the subset of attributes for the coarser densities is a prefix of the set of multi-column density attributes. A number of tuples from each table that participate in the join may be estimated using densities of the subsets. The cardinality of the join can be estimated using the multi-column density for each table and the estimated number of tuples that participate in the join from each table.
    Type: Application
    Filed: June 19, 2003
    Publication date: December 23, 2004
    Applicant: Microsoft Corporation
    Inventors: Nicolas Bruno, Murali Krishna, Ming-Chuan Wu, Surajit Chaudhuri
  • Publication number: 20040260694
    Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
    Type: Application
    Filed: June 20, 2003
    Publication date: December 23, 2004
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
  • Publication number: 20040260684
    Abstract: Integrating the partitioning of physical design structures with the physical design process can result in more efficient query execution. When candidate structures are evaluated for their relative benefit, one or more partitioning methods is associated with each structure so that the benefits of various partitioning methods are taken into consideration when the structures are selected for use by the database. A pool of partitioned candidate structures is formed by proposing and evaluating the benefit of candidate structures with associated partitioning on a per query basis. The selected partitioned candidates are then used to construct generalized structures with associated partitioning methods that are evaluated for their benefit over the workload. Those generalized structures are added to the pool of partitioned candidate structures. From this augmented pool of partitioned candidate structures, an optimal set of partitioned structures is enumerated for use by the database system.
    Type: Application
    Filed: June 23, 2003
    Publication date: December 23, 2004
    Applicant: Microsoft Corporation
    Inventors: Sanjay Agrawal, Surajit Chaudhuri, Vivek Narasayya
  • Publication number: 20040254938
    Abstract: A search of an index database or another search method is conducted to identify as a preliminary results listing one or more selected computer objects having selected identifying information stored in an index database. In addition, one or more selected computer objects of the preliminary search results are correlated with one or more other computer objects that have associations with the selected computer objects of the preliminary search results. Integrated search results are then returned and include the preliminary search results and one or more other computer objects that have associations with the selected computer objects of the preliminary search results. The associations may be determined by an association system and represent relationships between computer files based upon user or other interactions between the objects. The associations between the objects may include similarities between them and their importance.
    Type: Application
    Filed: March 31, 2003
    Publication date: December 16, 2004
    Inventors: Cezary Marcjan, Ryszard Kott, Surajit Chaudhuri, Lili Cheng
  • Publication number: 20040249789
    Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
    Type: Application
    Filed: June 4, 2003
    Publication date: December 9, 2004
    Applicant: Microsoft Corporation
    Inventors: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
  • Publication number: 20040249810
    Abstract: In decision support applications, the ability to provide fast approximate answers to aggregation queries is desirable. A disclosed technique for approximate query answering is sampling. For many aggregation queries, appropriately constructed biased (non-uniform) samples can provide more accurate approximations than a uniform sample. The optimal type of bias, however, varies from query to query. An approximate query processing technique is used that dynamically constructs an appropriately biased sample for each query by combining samples selected from a family of non-uniform samples that are constructed during a pre-processing phase. Dynamic selection of appropriate portions of previously constructed samples can provide more accurate approximate answers than static, non-adaptive usage of uniform or non-uniform samples.
    Type: Application
    Filed: June 3, 2003
    Publication date: December 9, 2004
    Applicant: Microsoft Corporation
    Inventors: Gautam Das, Brian Babcock, Surajit Chaudhuri
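
Several abstracts in this listing, including publication 20040249810, rely on the idea that a biased (non-uniform) sample can answer an aggregation query more accurately than a uniform one. The sketch below illustrates only that general idea and is not the method claimed in any of the listed applications. It assumes a simple scheme in which each row is kept with its own inclusion probability and a SUM is estimated by scaling each sampled value by the inverse of that probability (a Horvitz-Thompson style estimator, chosen here for illustration). The function names, the sales data, and the probability rule are all hypothetical.

```python
import random


def draw_biased_sample(rows, inclusion_prob, rng):
    """Keep each row with its own probability and remember that probability."""
    sample = []
    for row in rows:
        p = inclusion_prob(row)
        # rng.random() is in [0, 1), so a row with p == 1.0 is always kept.
        if rng.random() < p:
            sample.append((row, p))
    return sample


def estimate_sum(sample, value, predicate=lambda row: True):
    """Estimate SUM(value) over rows matching predicate by weighting with 1/p."""
    return sum(value(row) / p for row, p in sample if predicate(row))


# Hypothetical data: many small sales and a few very large ones.
sales = [{"region": "east", "amount": 10.0} for _ in range(1000)]
sales += [{"region": "west", "amount": 5000.0} for _ in range(5)]

# Assumed bias rule: always keep large rows, keep about 10% of the rest.
rng = random.Random(0)
biased = draw_biased_sample(
    sales, lambda r: 1.0 if r["amount"] >= 1000 else 0.1, rng
)

approx_total = estimate_sum(biased, lambda r: r["amount"])
exact_total = sum(r["amount"] for r in sales)
print(f"exact={exact_total:.0f} approx={approx_total:.0f} rows={len(biased)}")
```

Because the five large rows are always kept, their contribution to the total is exact, and only the small rows contribute sampling error. A uniform 10% sample of the same data could miss most of the large rows and badly underestimate the sum.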