Patents by Inventor Surajit Chaudhuri

Surajit Chaudhuri has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20060242102
    Abstract: A system that facilitates automatic selection of a physical configuration of a database comprises an optimizer component that determines simulated physical structures and creates a hypothetical configuration based thereon. A reduction component progressively reduces size of the configuration until the hypothetical configuration is associated with a size below a threshold. For example, the simulated physical structures can be based at least in part upon a workload.
    Type: Application
    Filed: April 21, 2005
    Publication date: October 26, 2006
    Applicant: Microsoft Corporation
    Inventors: Nicolas Bruno, Surajit Chaudhuri
  • Patent number: 7120624
    Abstract: A method for estimating the result of a query on a database having data records arranged in tables. The database has an expected workload that includes a set of queries that can be executed on the database. A sample is constructed by selecting data records for inclusion in the sample in a manner that minimizes an estimation error when the data records are acted upon by a query in the expected workload to provide an estimated result. The query accesses the sample and is executed on the sample, returning an estimated query result. The expected workload can be constructed by specifying a degree of overlap between records selected by queries in the given workload and records selected by queries in the expected workload.
    Type: Grant
    Filed: May 21, 2001
    Date of Patent: October 10, 2006
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Gautam Das
  • Patent number: 7120623
    Abstract: Methods of optimizing access to a relation queried through a number of predicates. The methods identify one or more candidate predicates of the selection condition that can be used to factorize the selection condition. A gain from using one or more of the candidate predicates to factorize the selection condition is computed. One or more of the candidate predicates that result in a positive gain are factored from the selection condition to produce a rewritten selection condition. The candidate predicates can be predicates that appear exactly in the selection condition more than once and/or merged predicates that may be predicates in the selection condition that overlap.
    Type: Grant
    Filed: August 29, 2002
    Date of Patent: October 10, 2006
    Assignee: Microsoft Corporation
    Inventors: Prasanna Ganesan, Surajit Chaudhuri
  • Publication number: 20060123009
    Abstract: A flexible, easy to use, and scalable framework for database generation and mappings of synthetic distributions to the framework. The framework discloses a specification language, database primitives, aspects of a runtime system, and an extension to CREATE TABLE SQL statements, to generate databases with complex synthetic distributions and inter-table correlations. The framework facilitates generation of a data generator which can output the synthetic data distribution. The data distribution includes at least one of a complex intra-table correlation and a complex inter-table correlation. The framework further comprises an annotations component that facilitates annotation of a relational database statement (e.g., a CREATE TABLE statement) which specifies concisely how a table will be populated. The framework further comprises a language component (e.g., a Data Generation Language (DGL)) that specifies the data distribution.
    Type: Application
    Filed: December 7, 2004
    Publication date: June 8, 2006
    Applicant: Microsoft Corporation
    Inventors: Nicolas Bruno, Surajit Chaudhuri
  • Publication number: 20060085463
    Abstract: An outlier index for a database and a given workload is generated by identifying sub-relations of tuples in the database induced by selection and group by conditions in queries in the workload. A variance is then generated for values in each sub-relation. Sub-relations having higher variances are selected, and outliers from such sub-relations having higher variances are generated.
    Type: Application
    Filed: December 7, 2005
    Publication date: April 20, 2006
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
  • Publication number: 20060085410
    Abstract: Results of a database query are estimated by performing a sampling of weighted tuples in a database based on a probability of usage of tuples required in executing a workload. A probability is associated with each tuple sampled. An aggregate is computed over values in each sampled tuple while multiplying by the inverses of the probabilities associated with each tuple sampled.
    Type: Application
    Filed: December 7, 2005
    Publication date: April 20, 2006
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
  • Publication number: 20060085484
    Abstract: An automated physical database design tool may provide an integrated physical design recommendation for horizontal partitioning, indexes and indexed views, all three features being tuned together (in concert). Manageability requirements may be specified when optimizing for performance. User-specified configuration may enable the specification of a partial physical design without materialization of the physical design. The tuning process may be performed for a production server but may be conducted substantially on a test server. Secondary indexes may be suggested for XML columns. Tuning of a database may be invoked by any owner of a database. Usage of objects may be evaluated and a recommendation for dropping unused objects may be issued. Reports may be provided concerning the count and percentage of queries in the workload that reference a particular database, and/or the count and percentage of queries in the workload that reference a particular table or column.
    Type: Application
    Filed: October 15, 2004
    Publication date: April 20, 2006
    Applicant: Microsoft Corporation
    Inventors: Alexander Raizman, Arunprasad Marathe, Djana Milton, Dmitry Sonkin, Lubor Kollar, Maciej Sarnowicz, Manoj Syamala, Raja Duddupudi, Sanjay Agrawal, Surajit Chaudhuri, Vivek Narasayya
  • Publication number: 20060085378
    Abstract: Internal communications within components of an automated physical database design tool may be conducted in a data description language such as XML. Inputs to and outputs from the automated physical database design tool may also be presented in the data description language (e.g., XML). The communications, inputs and outputs may comply with a schema for the data description language. The schema may be written in a schema language such as XSD. Inputs presented in the data description language may comprise tuning options. Outputs may comprise a proposed physical design for a database and reports.
    Type: Application
    Filed: October 15, 2004
    Publication date: April 20, 2006
    Applicant: Microsoft Corporation
    Inventors: Alexander Raizman, Arunprasad Marathe, Djana Ophelia Milton, Dmitry Sonkin, Lubor Kollar, Maciej Sarnowicz, Manoj Syamala, Raja Duddupudi, Sanjay Agrawal, Surajit Chaudhuri, Vivek Narasayya
  • Publication number: 20060053103
    Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
    Type: Application
    Filed: October 7, 2005
    Publication date: March 9, 2006
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
  • Publication number: 20060053129
    Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
    Type: Application
    Filed: August 30, 2004
    Publication date: March 9, 2006
    Applicant: Microsoft Corporation
    Inventors: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
  • Patent number: 7007039
    Abstract: In a database system, a method of maintaining a self-tuning histogram having a plurality of existing rectangular shaped buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency. At least one new bucket is created in response to a query on the database. Each new bucket is contained within at least one existing bucket and the new bucket becomes a child bucket and the existing bucket containing it becomes a parent bucket. The boundaries of each new bucket correspond to a region of the database accessed by the query and the frequency of the new bucket is a number of data records returned by the query. Buckets may be merged based on a merge criterion such as similar bucket density when the total number of buckets exceeds the predetermined budget.
    Type: Grant
    Filed: June 14, 2001
    Date of Patent: February 28, 2006
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Nicolas Bruno, Luis Gravano
  • Publication number: 20060036600
    Abstract: Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data.
    Type: Application
    Filed: October 7, 2005
    Publication date: February 16, 2006
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Rajeev Motwani, Mayur Datar
  • Publication number: 20060036989
    Abstract: A monitoring component of a database server collects a subset of a query workload along with related statistics. A remote index tuning component uses the workload subset and related statistics to determine a physical design that minimizes the cost of executing queries in the workload subset while ensuring that queries omitted from the subset do not degrade in performance.
    Type: Application
    Filed: August 10, 2004
    Publication date: February 16, 2006
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Arnd Konig, Vivek Narasayya
  • Publication number: 20060036581
    Abstract: At least one implementation of database management technology, described herein, utilizes categorization of query results when querying a relational database in order to reduce information overload. To reduce information overload even further, another implementation, described herein, utilizes both categorization and ranking of query results when searching a relational database.
    Type: Application
    Filed: August 13, 2004
    Publication date: February 16, 2006
    Applicant: Microsoft Corporation
    Inventors: Kaushik Chakrabarti, Seung-won Hwang, Surajit Chaudhuri
  • Publication number: 20050289102
    Abstract: A system and methods rank results of database queries. An automated approach for ranking database query results is disclosed that leverages data and workload statistics and associations. Ranking functions are based upon the principles of probabilistic models from Information Retrieval that are adapted for structured data. The ranking functions are encoded into an intermediate knowledge representation layer. The system is generic, as the ranking functions can be further customized for different applications. Benefits of the disclosed system and methods include the use of adapted probabilistic information retrieval (PIR) techniques that leverage relational/structured data, such as columns, to provide natural groupings of data values. This permits the inference and use of pair-wise associations between data values across columns, which are usually not possible with text data.
    Type: Application
    Filed: June 29, 2004
    Publication date: December 29, 2005
    Applicant: Microsoft Corporation
    Inventors: Gautam Das, Surajit Chaudhuri, Vagelis Hristidis, Gerhard Weikum
  • Publication number: 20050267877
    Abstract: A method for evaluating a user query on a relational database having records stored therein, a workload made up of a set of queries that have been executed on the database, and a query optimizer that generates a query execution plan for the user query. Each query plan includes a plurality of intermediate query plan components that verify a subset of records from the database meeting query criteria. The method accesses the query plan and a set of stored intermediate statistics for records verified by query components, such as histograms that summarize the cardinality of the records that verify the query component. The method forms a transformed query plan based on the selected intermediate statistics (possibly by rewriting the query plan) and estimates the cardinality of the transformed query plan to arrive at a more accurate cardinality estimate for the query.
    Type: Application
    Filed: July 7, 2005
    Publication date: December 1, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Nicolas Bruno
  • Publication number: 20050262044
    Abstract: The invention concerns a detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
    Type: Application
    Filed: July 14, 2005
    Publication date: November 24, 2005
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
  • Patent number: 6961721
    Abstract: The invention concerns a detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
    Type: Grant
    Filed: June 28, 2002
    Date of Patent: November 1, 2005
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
  • Publication number: 20050228779
    Abstract: Selectivity estimates are produced that meet a desired confidence threshold. To determine the confidence level of a given selectivity estimate for a query expression, the query expression is evaluated on a sample of tuples. A probability density function is derived based on the number of tuples in the sample that satisfy the query expression. The cumulative distribution for the probability density function is solved for the given threshold to determine a selectivity estimate at the given confidence value.
    Type: Application
    Filed: April 6, 2004
    Publication date: October 13, 2005
    Inventors: Surajit Chaudhuri, Brian Babcock
  • Publication number: 20050223026
    Abstract: A database object summarization tool is provided that selects a subset of database objects subject to filtering constraints such as a partial order or optimization of some attribute. A dominance primitive filters out tuples that are dominated according to a partial order constraint by another tuple. A representation primitive selects a representative subset of tuples such that an optimization criterion is met.
    Type: Application
    Filed: March 31, 2004
    Publication date: October 6, 2005
    Inventors: Surajit Chaudhuri, Vivek Narasayya, Prasanna Ganesan
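
The dominance primitive described in the last abstract above (publication 20050223026) can be pictured with a small sketch. This is not the patented implementation. It is a naive illustration that assumes tuples are numeric, that larger values are preferred on every attribute, and that one tuple dominates another when it is at least as good on every attribute and strictly better on at least one. The names `dominates` and `dominance_filter` are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` dominates `b` under the assumed partial order:
    `a` is >= `b` on every attribute and > `b` on at least one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def dominance_filter(rows: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the rows that no other row dominates.

    Naive O(n^2) pairwise comparison, shown for clarity rather than speed.
    """
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


# Hypothetical two-attribute data, for illustration only.
rows = [(1, 5), (3, 3), (2, 2), (5, 1), (1, 1)]
survivors = dominance_filter(rows)
```

In this sketch `(2, 2)` and `(1, 1)` are dropped because `(3, 3)` dominates both, while the remaining three tuples are mutually incomparable and are kept.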