Patents by Inventor Venkatesh Ganti

Venkatesh Ganti has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

LEVERAGING CROSS-DOCUMENT CONTEXT TO LABEL ENTITY

Publication number: 20090282012

Abstract: Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.

Type: Application

Filed: May 5, 2008

Publication date: November 12, 2009

Applicant: Microsoft Corporation

Inventors: Arnd Christian Konig, Venkatesh Ganti
Disk-based probabilistic set-similarity indexes

Patent number: 7610283

Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.

Type: Grant

Filed: June 12, 2007

Date of Patent: October 27, 2009

Assignee: Microsoft Corporation

Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
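
The threshold-based lookup described in the abstract above can be illustrated with a minimal Python sketch. It assumes Jaccard similarity as the similarity function and uses a plain linear scan in place of the patented index structure; the function names and sample data are hypothetical and do not come from the patent.

```python
def jaccard(a: set, b: set) -> float:
    """Map two sets to a numeric similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_lookup(query: set, input_sets: list, threshold: float) -> list:
    """Return every input set whose similarity to the query is above the threshold.

    This scans all input sets; an index such as the one in the abstract
    exists precisely to avoid this full scan.
    """
    return [s for s in input_sets if jaccard(query, s) > threshold]


# Hypothetical sample data.
input_sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
matches = threshold_lookup({"a", "b", "c"}, input_sets, threshold=0.4)
```

Here the first two input sets are returned (scores 1.0 and 0.5), while the third shares no element with the query and is skipped.
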
Systems and methods for estimating functional relationships in a database

Patent number: 7562067

Abstract: A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.

Type: Grant

Filed: May 6, 2005

Date of Patent: July 14, 2009

Assignee: Microsoft Corporation

Inventors: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
Minimal difference query and view matching

Patent number: 7558780

Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.

Type: Grant

Filed: November 30, 2006

Date of Patent: July 7, 2009

Assignee: Microsoft Corporation

Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
Robust detector of fuzzy duplicates

Patent number: 7516149

Abstract: At least one implementation, described herein, detects fuzzy duplicates and eliminates such duplicates. Fuzzy duplicates are multiple, seemingly distinct tuples (i.e., records) in a database that represent the same real-world entity or phenomenon.

Type: Grant

Filed: August 30, 2004

Date of Patent: April 7, 2009

Assignee: Microsoft Corporation

Inventors: Rajeev Motwani, Surajit Chaudhuri, Venkatesh Ganti
DATA PROFILE COMPUTATION

Publication number: 20090006392

Abstract: Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys as well as approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.

Type: Application

Filed: June 27, 2007

Publication date: January 1, 2009

Applicant: MICROSOFT CORPORATION

Inventors: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
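
A rough Python sketch of the key profile computation in the abstract above follows. The abstract does not give a formula, so the sketch assumes that key strength is the percentage of rows whose values on a column set are not duplicated, and it computes that value exactly rather than estimating it. The function names, the `max_size` limit, and the sample table are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def key_strength(rows: list, columns: tuple) -> float:
    """Percentage of rows whose values on `columns` are not duplicated.

    Assumed definition for illustration; 100.0 corresponds to an exact key.
    """
    if not rows:
        return 100.0
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    unique_rows = sum(1 for n in counts.values() if n == 1)
    return 100.0 * unique_rows / len(rows)


def candidate_keys(rows: list, threshold: float, max_size: int = 2) -> dict:
    """Return the column sets whose key strength exceeds the threshold."""
    columns = sorted(rows[0]) if rows else []
    result = {}
    for size in range(1, max_size + 1):
        for cols in combinations(columns, size):
            strength = key_strength(rows, cols)
            if strength > threshold:
                result[cols] = strength
    return result


# Hypothetical sample table.
rows = [
    {"id": 1, "city": "Seattle", "zip": "98101"},
    {"id": 2, "city": "Seattle", "zip": "98101"},
    {"id": 3, "city": "Redmond", "zip": "98052"},
    {"id": 4, "city": "Bellevue", "zip": "98004"},
]
keys = candidate_keys(rows, threshold=90.0)
```

In this sample, `id` alone is an exact key, while `city` and `zip` each have a strength of 50% because two of the four rows share the same value.
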
Disk-Based Probabilistic Set-Similarity Indexes

Publication number: 20080313128

Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.

Type: Application

Filed: June 12, 2007

Publication date: December 18, 2008

Applicant: MICROSOFT CORPORATION

Inventors: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
Finding Related Entities For Search Queries

Publication number: 20080306908

Abstract: Architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities such as names of people, organizations, locations, and products in the document, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score of each entity and selects the entities to return to the user.

Type: Application

Filed: June 5, 2007

Publication date: December 11, 2008

Applicant: MICROSOFT CORPORATION

Inventors: Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti
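
The component pipeline in the abstract above can be sketched in Python as follows. Every piece is a toy stand-in: extraction is a dictionary lookup against a hypothetical entity list, document search ranks by keyword hits, and the aggregate score is assumed to be the number of top documents that mention an entity. None of these choices, names, or sample documents come from the application itself.

```python
from collections import Counter

KNOWN_ENTITIES = ["Microsoft", "Seattle", "Redmond"]  # hypothetical entity list


def extract_mentions(text: str) -> list:
    """Toy extraction component: dictionary lookup of known entity names."""
    return [e for e in KNOWN_ENTITIES if e in text]


def build_index(docs: dict) -> dict:
    """Indexing component: map each docID to its named-entity mentions."""
    return {doc_id: extract_mentions(text) for doc_id, text in docs.items()}


def search(docs: dict, query: str, top_n: int) -> list:
    """Toy document search: rank docIDs by number of query keywords present."""
    words = query.lower().split()
    scored = [(sum(w in text.lower() for w in words), doc_id)
              for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:top_n]


def related_entities(docs, index, query, top_n=10, min_score=2):
    """Aggregate mentions over the top documents, then apply a score threshold."""
    totals = Counter()
    for doc_id in search(docs, query, top_n):
        totals.update(index[doc_id])  # retrieval component: docID -> mentions
    return [e for e, score in totals.most_common() if score >= min_score]


# Hypothetical sample documents.
docs = {
    "d1": "Microsoft opened a new office in Redmond.",
    "d2": "The Redmond office of Microsoft is hiring.",
    "d3": "Seattle weather stays mild near the office.",
}
index = build_index(docs)
entities = related_entities(docs, index, "office hiring")
```

With this data, "Microsoft" and "Redmond" each appear in two of the matching documents and pass the threshold, while "Seattle" appears in only one and is dropped.
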
EXAMPLE-DRIVEN DESIGN OF EFFICIENT RECORD MATCHING QUERIES

Publication number: 20080306945

Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.

Type: Application

Filed: June 5, 2007

Publication date: December 11, 2008

Applicant: MICROSOFT CORPORATION

Inventors: Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
Leveraging constraints for deduplication

Publication number: 20080288482

Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.

Type: Application

Filed: May 18, 2007

Publication date: November 20, 2008

Applicant: Microsoft Corporation

Inventors: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
EFFICIENT EXACT SET SIMILARITY JOINS

Publication number: 20080183693

Abstract: A machine implemented system and method that efficiently facilitates and effectuates exact similarity joins between collections of sets. The system and method obtains a collection of sets and a threshold value from an interface, and based at least in part on an identifiable similarity, such as an overlap or intersection, between the collection of sets, the analysis component generates and outputs a candidate pair whose similarity equals or exceeds the threshold value.

Type: Application

Filed: January 30, 2007

Publication date: July 31, 2008

Applicant: MICROSOFT CORPORATION

Inventors: Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
Primitive operator for similarity joins in data cleaning

Patent number: 7406479

Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.

Type: Grant

Filed: February 10, 2006

Date of Patent: July 29, 2008

Assignee: Microsoft Corporation

Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
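
The idea that set overlap can support similarity joins, as stated in the abstract above, can be sketched in Python. This is only an illustration under stated assumptions: each string value is turned into a set of 3-grams, and pairs are joined when they share at least `min_overlap` grams. The function names, the inverted-index strategy, and the sample values are hypothetical and are not taken from the patent's SSJoin operator.

```python
from collections import defaultdict


def qgrams(value: str, q: int = 3) -> set:
    """Build the set of length-q substrings explicitly constructed for a value."""
    text = value.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ssjoin(left: list, right: list, min_overlap: int, q: int = 3) -> list:
    """Return (left value, right value, overlap) for pairs sharing enough q-grams."""
    # Inverted index from q-gram to the right-hand values that contain it.
    inverted = defaultdict(set)
    for rv in right:
        for gram in qgrams(rv, q):
            inverted[gram].add(rv)

    results = []
    for lv in left:
        overlap = defaultdict(int)
        for gram in qgrams(lv, q):
            for rv in inverted.get(gram, ()):
                overlap[rv] += 1  # count q-grams shared by lv and rv
        results.extend((lv, rv, n) for rv, n in sorted(overlap.items())
                       if n >= min_overlap)
    return results


# Hypothetical sample values.
pairs = ssjoin(["Microsoft Corp"],
               ["Microsoft Corporation", "Micro Focus"],
               min_overlap=8)
```

Here "Microsoft Corp" shares all 12 of its 3-grams with "Microsoft Corporation" but only 3 with "Micro Focus", so only the first pair is returned. Other similarity notions can be layered on top by converting their thresholds into a required overlap.
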
DESIGNING RECORD MATCHING QUERIES UTILIZING EXAMPLES

Publication number: 20070294221

Abstract: The subject disclosure pertains to a powerful and flexible framework for record matching. The framework facilitates design of a record matching query or package composed of a set of well-defined primitive operators (e.g., relational, data cleaning . . . ), which can ultimately be executed to match records. To assist design of such packages, a learning technique based on examples is provided. More specifically, a set of matching and non-matching record pairs can be input and employed to facilitate automatic package generation. A generated package can subsequently be transformed manually and/or automatically into a semantically equivalent form optimized for execution.

Type: Application

Filed: June 14, 2006

Publication date: December 20, 2007

Applicant: MICROSOFT CORPORATION

Inventors: Bee-Chung Chen, Venkatesh Ganti, Kaushik Shriraghav
EFFICIENT EVALUATION OF OBJECT FINDER QUERIES

Publication number: 20070288421

Abstract: The subject disclosure pertains to a class of object finder queries that return the best target objects that match a set of given keywords. Mechanisms are provided that facilitate identification of target objects related to search objects that match a set of query keywords. Scoring mechanisms/functions are also disclosed that compute relevance scores of target objects. Further, efficient early termination techniques are provided to compute the top K target objects based on a scoring function.

Type: Application

Filed: June 9, 2006

Publication date: December 13, 2007

Applicant: MICROSOFT CORPORATION

Inventors: Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin
Efficient fuzzy match for evaluating data records

Patent number: 7296011

Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.

Type: Grant

Filed: June 20, 2003

Date of Patent: November 13, 2007

Assignee: Microsoft Corporation

Inventors: Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
Duplicate data elimination system

Patent number: 7287019

Abstract: A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.

Type: Grant

Filed: June 4, 2003

Date of Patent: October 23, 2007

Assignee: Microsoft Corporation

Inventors: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
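
The grouping and canonical-record steps in the abstract above can be sketched in Python. The sketch makes several assumptions that are not in the patent: similarity is token-set Jaccard, tokens are not classified by attribute field, groups are the connected components of the similarity graph, and the canonical record is the one with the highest total similarity to the rest of its group. All names and sample records are hypothetical.

```python
def tokens(record: str) -> set:
    """Split a record into lowercase whitespace-separated tokens."""
    return set(record.lower().split())


def similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for the patent's score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def group_duplicates(records: list, threshold: float) -> list:
    """Group records connected by edges whose similarity exceeds the threshold."""
    n = len(records)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(records[i], records[j]) > threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)

    groups, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        groups.append([records[i] for i in sorted(component)])
    return groups


def canonical(group: list) -> str:
    """Pick the record with the highest total similarity to the rest of its group."""
    def total(i: int) -> float:
        return sum(similarity(group[i], group[j])
                   for j in range(len(group)) if j != i)
    return group[max(range(len(group)), key=total)]


# Hypothetical sample records.
records = [
    "Acme Corp Seattle WA",
    "Acme Corp Seattle",
    "Acme Corporation Seattle WA",
    "Globex Inc Portland",
]
groups = group_duplicates(records, threshold=0.5)
representatives = [canonical(g) for g in groups]
```

The three Acme records end up in one group because the first is similar to each of the other two, and that first record is chosen as canonical since it is the closest to the rest of the group.
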
MINIMAL DIFFERENCE QUERY AND VIEW MATCHING

Publication number: 20070198469

Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.

Type: Application

Filed: November 30, 2006

Publication date: August 23, 2007

Applicant: MICROSOFT CORPORATION

Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
Primitive operator for similarity joins in data cleaning

Publication number: 20070192342

Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.

Type: Application

Filed: February 10, 2006

Publication date: August 16, 2007

Applicant: Microsoft Corporation

Inventors: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
MINIMAL DIFFERENCE QUERY AND VIEW MATCHING

Publication number: 20070192282

Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.

Type: Application

Filed: February 13, 2006

Publication date: August 16, 2007

Applicant: Microsoft Corporation

Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong
MINIMAL DIFFERENCE QUERY AND VIEW MATCHING

Publication number: 20070192297

Abstract: The subject disclosure pertains to efficient computation of the difference between queries by exploiting commonality between them. A minimal difference query (MDQ) is generated that roughly corresponds to removal of as many joins as possible while still accurately representing the query difference. The minimal difference can be employed to substantially extend the scope of view matching where a query is not wholly subsumed by a view. Additionally, the minimal difference query can be employed as an analytical tool in various contexts.

Type: Application

Filed: November 9, 2006

Publication date: August 16, 2007

Applicant: MICROSOFT CORPORATION

Inventors: Kaushik Shriraghav, Venkatesh Ganti, Xin Dong