Patents by Inventor Venkatesh Ganti

Venkatesh Ganti has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20090282012
    Abstract: Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
    Type: Application
    Filed: May 5, 2008
    Publication date: November 12, 2009
    Applicant: Microsoft Corporation
    Inventors: Arnd Christian Konig, Venkatesh Ganti
  • Patent number: 7610283
    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
    Type: Grant
    Filed: June 12, 2007
    Date of Patent: October 27, 2009
    Assignee: Microsoft Corporation
    Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
  • Patent number: 7562067
    Abstract: A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.
    Type: Grant
    Filed: May 6, 2005
    Date of Patent: July 14, 2009
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
  • Patent number: 7558780
    Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
    Type: Grant
    Filed: November 30, 2006
    Date of Patent: July 7, 2009
    Assignee: Microsoft Corporation
    Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
  • Patent number: 7516149
    Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
    Type: Grant
    Filed: August 30, 2004
    Date of Patent: April 7, 2009
    Assignee: Microsoft Corporation
    Inventors: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
  • Publication number: 20090006392
    Abstract: Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys as well as approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.
    Type: Application
    Filed: June 27, 2007
    Publication date: January 1, 2009
    Applicant: MICROSOFT CORPORATION
    Inventors: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
  • Publication number: 20080313128
    Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
    Type: Application
    Filed: June 12, 2007
    Publication date: December 18, 2008
    Applicant: MICROSOFT CORPORATION
    Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
  • Publication number: 20080306908
    Abstract: Architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities such as names of people, organizations, locations, and products in the document, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score of each entity and selects the entities to return to the user.
    Type: Application
    Filed: June 5, 2007
    Publication date: December 11, 2008
    Applicant: MICROSOFT CORPORATION
    Inventors: Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti
  • Publication number: 20080306945
    Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.
    Type: Application
    Filed: June 5, 2007
    Publication date: December 11, 2008
    Applicant: MICROSOFT CORPORATION
    Inventors: Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
  • Publication number: 20080288482
    Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is accomplished by satisfying as many of these constraints as possible, rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.
    Type: Application
    Filed: May 18, 2007
    Publication date: November 20, 2008
    Applicant: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
  • Publication number: 20080183693
    Abstract: A machine-implemented system and method that efficiently facilitate and effectuate exact similarity joins between collections of sets. The system and method obtain a collection of sets and a threshold value from an interface, and, based at least in part on an identifiable similarity, such as an overlap or intersection, between the collection of sets, an analysis component generates and outputs a candidate pair that equals or exceeds the threshold value.
    Type: Application
    Filed: January 30, 2007
    Publication date: July 31, 2008
    Applicant: MICROSOFT CORPORATION
    Inventors: Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
  • Patent number: 7406479
    Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
    Type: Grant
    Filed: February 10, 2006
    Date of Patent: July 29, 2008
    Assignee: Microsoft Corporation
    Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
  • Publication number: 20070294221
    Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning . . . ), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.
    Type: Application
    Filed: June 14, 2006
    Publication date: December 20, 2007
    Applicant: MICROSOFT CORPORATION
    Inventors: Bee-Chung Chen, Venkatesh Ganti, Kaushik Shriraghav
  • Publication number: 20070288421
    Abstract: The subject disclosure pertains to a class of object finder queries that return the best target objects that match a set of given keywords. Mechanisms are provided that facilitate identification of target objects related to search objects that match a set of query keywords. Scoring mechanisms/functions are also disclosed that compute relevance scores of target objects. Further, efficient early termination techniques are provided to compute the top K target objects based on a scoring function.
    Type: Application
    Filed: June 9, 2006
    Publication date: December 13, 2007
    Applicant: MICROSOFT CORPORATION
    Inventors: Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin
  • Patent number: 7296011
    Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
    Type: Grant
    Filed: June 20, 2003
    Date of Patent: November 13, 2007
    Assignee: Microsoft Corporation
    Inventors: Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
  • Patent number: 7287019
    Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
    Type: Grant
    Filed: June 4, 2003
    Date of Patent: October 23, 2007
    Assignee: Microsoft Corporation
    Inventors: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
  • Publication number: 20070198469
    Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
    Type: Application
    Filed: November 30, 2006
    Publication date: August 23, 2007
    Applicant: MICROSOFT CORPORATION
    Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
  • Publication number: 20070192342
    Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
    Type: Application
    Filed: February 10, 2006
    Publication date: August 16, 2007
    Applicant: Microsoft Corporation
    Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
  • Publication number: 20070192282
    Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
    Type: Application
    Filed: February 13, 2006
    Publication date: August 16, 2007
    Applicant: Microsoft Corporation
    Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
  • Publication number: 20070192297
    Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
    Type: Application
    Filed: November 9, 2006
    Publication date: August 16, 2007
    Applicant: MICROSOFT CORPORATION
    Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
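
Illustrative note: several abstracts in this listing (for example Patent numbers 7610283, 7406479 and 7296011) describe treating values as sets, scoring a pair of sets with a similarity function, and calling the pair similar when the score clears a threshold. The Python sketch below is only a reader's illustration of that general idea and is not taken from any of the patents. It assumes character q-grams as the set elements and Jaccard similarity as the scoring function, and it uses a plain full scan rather than any index structure the patents disclose. The function names (qgrams, jaccard, similar_records) and the default threshold are hypothetical.

```python
def qgrams(text, q=3):
    """Return the set of overlapping q-character substrings of text (lowercased)."""
    s = text.lower()
    if len(s) < q:
        # Assumption: strings shorter than q are treated as a single token.
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_records(query, records, threshold=0.5, q=3):
    """Return (record, score) pairs whose score meets the threshold, best first.

    This is a brute-force scan over every record; the indexing techniques
    described in the abstracts exist precisely to avoid such a scan.
    """
    query_set = qgrams(query, q)
    hits = []
    for record in records:
        score = jaccard(query_set, qgrams(record, q))
        if score >= threshold:  # assumption: "at or above" the threshold counts
            hits.append((record, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    reference = ["microsoft corp.", "oracle"]
    for record, score in similar_records("microsoft corp", reference):
        print(record, round(score, 3))
```

In this sketch the near-duplicate spelling shares almost all of its q-grams with the query and is returned, while the unrelated string shares none and is filtered out. Other similarity functions named in the abstracts (edit similarity, Hamming distance) could be substituted for jaccard without changing the surrounding structure.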