Patents by Inventor Venkatesh Ganti
Venkatesh Ganti has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20090282012
Abstract: Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
Type: Application
Filed: May 5, 2008
Publication date: November 12, 2009
Applicant: Microsoft Corporation
Inventors: Arnd Christian Konig, Venkatesh Ganti
-
Patent number: 7610283
Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
Type: Grant
Filed: June 12, 2007
Date of Patent: October 27, 2009
Assignee: Microsoft Corporation
Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
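A threshold-based set-similarity lookup of the kind this abstract describes can be illustrated with a minimal sketch. It assumes Jaccard similarity as the similarity function and a plain inverted index as the lookup structure; neither is stated in the abstract, and this is not the index structure the patent claims. The names `jaccard`, `build_index`, and `lookup` are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_index(input_sets):
    """Map each element to the ids of the input sets that contain it."""
    index = defaultdict(set)
    for set_id, elements in enumerate(input_sets):
        for element in elements:
            index[element].add(set_id)
    return index


def lookup(query, input_sets, index, threshold):
    """Return ids of input sets whose similarity to `query` is at least `threshold`.

    Assumption: threshold > 0, so any qualifying set shares at least one
    element with the query and the index yields every candidate without
    scanning the whole input.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    candidates = set()
    for element in query:
        candidates |= index.get(element, set())
    return sorted(i for i in candidates if jaccard(query, input_sets[i]) >= threshold)
```

In this sketch, `len(index[element])` is the element frequency mentioned in the abstract (the number of input sets in which an element occurs); a more careful index could probe rare elements first.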
-
Patent number: 7562067
Abstract: A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.
Type: Grant
Filed: May 6, 2005
Date of Patent: July 14, 2009
Assignee: Microsoft Corporation
Inventors: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
-
Patent number: 7558780
Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially further the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
Type: Grant
Filed: November 30, 2006
Date of Patent: July 7, 2009
Assignee: Microsoft Corporation
Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
-
Patent number: 7516149
Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.
Type: Grant
Filed: August 30, 2004
Date of Patent: April 7, 2009
Assignee: Microsoft Corporation
Inventors: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
-
Publication number: 20090006392
Abstract: Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys as well as approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.
Type: Application
Filed: June 27, 2007
Publication date: January 1, 2009
Applicant: MICROSOFT CORPORATION
Inventors: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
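The key-strength idea in this abstract can be sketched as follows. The abstract does not give a formula, so the sketch assumes that key strength is the fraction of rows whose values on the chosen columns are not shared with any other row; the function names and the `max_size` limit on column-set size are hypothetical.

```python
from collections import Counter
from itertools import combinations


def key_strength(rows, columns):
    """Fraction of rows whose projection onto `columns` is unique (assumed definition)."""
    if not rows:
        return 1.0
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    # Rows whose projected value occurs more than once are "duplicated".
    duplicated = sum(n for n in counts.values() if n > 1)
    return 1.0 - duplicated / len(rows)


def approximate_keys(rows, column_names, threshold, max_size=2):
    """Return (column set, strength) pairs whose strength meets `threshold`."""
    result = []
    for size in range(1, max_size + 1):
        for cols in combinations(column_names, size):
            strength = key_strength(rows, cols)
            if strength >= threshold:
                result.append((cols, strength))
    return result
```

Under this assumed definition a strength of 1.0 corresponds to the exact key (100%) of the abstract, and anything lower to an approximate key.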
-
Publication number: 20080313128
Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
Type: Application
Filed: June 12, 2007
Publication date: December 18, 2008
Applicant: MICROSOFT CORPORATION
Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
-
Publication number: 20080306908
Abstract: Architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities such as names of people, organizations, locations, and products in the document, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score of each entity and selects the entities to return to the user.
Type: Application
Filed: June 5, 2007
Publication date: December 11, 2008
Applicant: MICROSOFT CORPORATION
Inventors: Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti
-
Publication number: 20080306945
Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.
Type: Application
Filed: June 5, 2007
Publication date: December 11, 2008
Applicant: MICROSOFT CORPORATION
Inventors: Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
-
Publication number: 20080288482
Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication is accomplished by satisfying as many of these constraints as possible rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.
Type: Application
Filed: May 18, 2007
Publication date: November 20, 2008
Applicant: Microsoft Corporation
Inventors: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
-
Publication number: 20080183693
Abstract: A machine-implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtain a collection of sets and a threshold value from an interface and, based at least in part on an identifiable similarity, such as an overlap or intersection, between the collection of sets, an analysis component generates and outputs a candidate pair that equals or exceeds the threshold value.
Type: Application
Filed: January 30, 2007
Publication date: July 31, 2008
Applicant: MICROSOFT CORPORATION
Inventors: Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
-
Patent number: 7406479
Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
Type: Grant
Filed: February 10, 2006
Date of Patent: July 29, 2008
Assignee: Microsoft Corporation
Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
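The observation that set overlap can support a join on similarity can be sketched roughly as below. This is an illustration only, not the SSJoin operator as claimed: it assumes each value has already been turned into a set, that the join predicate is a minimum overlap count, and that an inverted index on one side is an acceptable way to count overlaps. The name `overlap_join` is hypothetical.

```python
from collections import Counter, defaultdict


def overlap_join(left, right, min_overlap):
    """Join two {id: set} mappings on set overlap.

    Returns sorted (left_id, right_id, overlap) triples for every pair whose
    sets share at least `min_overlap` elements (assumed to be >= 1).
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Inverted index over the right side: element -> ids of sets containing it.
    index = defaultdict(list)
    for right_id, elements in right.items():
        for element in elements:
            index[element].append(right_id)
    result = []
    for left_id, elements in left.items():
        overlap = Counter()
        for element in elements:
            overlap.update(index.get(element, ()))
        result.extend((left_id, right_id, n) for right_id, n in overlap.items() if n >= min_overlap)
    return sorted(result)
```

A predicate such as Jaccard similarity could then be checked on the returned pairs, since each triple carries the overlap size.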
-
Publication number: 20070294221
Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning, ...), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.
Type: Application
Filed: June 14, 2006
Publication date: December 20, 2007
Applicant: MICROSOFT CORPORATION
Inventors: Bee-Chung Chen, Venkatesh Ganti, Kaushik Shriraghav
-
Publication number: 20070288421
Abstract: The subject disclosure pertains to a class of object finder queries that return the best target objects that match a set of given keywords. Mechanisms are provided that facilitate identification of target objects related to search objects that match a set of query keywords. Scoring mechanisms/functions are also disclosed that compute relevance scores of target objects. Further, efficient early termination techniques are provided to compute the top K target objects based on a scoring function.
Type: Application
Filed: June 9, 2006
Publication date: December 13, 2007
Applicant: MICROSOFT CORPORATION
Inventors: Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin
-
Patent number: 7296011
Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
Type: Grant
Filed: June 20, 2003
Date of Patent: November 13, 2007
Assignee: Microsoft Corporation
Inventors: Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
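To make the q-gram idea concrete, here is a minimal sketch of matching an incoming value against a reference list. It swaps in plain Jaccard similarity over character q-grams, which is a simplification and not the similarity function disclosed in the patent; the padding character, the default `q=3`, the `min_similarity` cutoff, and all function names are assumptions made for illustration.

```python
def qgrams(text, q=3):
    """Set of overlapping length-q substrings of `text`, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=3):
    """Jaccard similarity of the two q-gram sets (assumed similarity measure)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def fuzzy_match(value, reference, min_similarity=0.5, q=3):
    """Return the most similar reference value, or None if none is close enough."""
    best = max(reference, key=lambda r: qgram_similarity(value, r, q), default=None)
    if best is None or qgram_similarity(value, best, q) < min_similarity:
        return None
    return best
```

This version scans the whole reference list for every lookup, so it only shows what is being computed; an efficient system would need an index over the reference q-grams.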
-
Patent number: 7287019
Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
Type: Grant
Filed: June 4, 2003
Date of Patent: October 23, 2007
Assignee: Microsoft Corporation
Inventors: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
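The grouping and canonical-record steps described in this abstract can be sketched under several stated assumptions: similarity is Jaccard overlap of whitespace-separated tokens, groups are the connected components of the graph whose edges link records with similarity above the threshold, and the canonical record is the one with the highest total similarity to the rest of its group. The abstract specifies none of these choices, and the function names are hypothetical.

```python
from itertools import combinations


def token_similarity(a, b):
    """Jaccard similarity of the lowercase token sets of two records."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def group_records(records, threshold):
    """Group record indices into connected components of the similarity graph."""
    neighbors = {i: set() for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        if token_similarity(records[i], records[j]) > threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    groups, seen = [], set()
    for start in range(len(records)):
        if start in seen:
            continue
        # Depth-first traversal collects every record reachable from `start`.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                stack.append(other)
        groups.append(sorted(group))
    return groups


def canonical_record(records, group):
    """Index of the group member most similar, in total, to the other members."""
    return max(
        group,
        key=lambda i: sum(token_similarity(records[i], records[j]) for j in group if j != i),
    )
```

The pairwise comparison is quadratic in the number of records, which is acceptable only for an illustration of the grouping logic.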
-
Publication number: 20070198469
Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially further the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
Type: Application
Filed: November 30, 2006
Publication date: August 23, 2007
Applicant: MICROSOFT CORPORATION
Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
-
Publication number: 20070192342
Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
Type: Application
Filed: February 10, 2006
Publication date: August 16, 2007
Applicant: Microsoft Corporation
Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
-
Publication number: 20070192282
Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially further the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
Type: Application
Filed: February 13, 2006
Publication date: August 16, 2007
Applicant: Microsoft Corporation
Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
-
Publication number: 20070192297
Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially further the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.
Type: Application
Filed: November 9, 2006
Publication date: August 16, 2007
Applicant: MICROSOFT CORPORATION
Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong