Patents by Inventor Rupesh R. Mehta

Rupesh R. Mehta has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Methods and apparatuses for clustering electronic documents based on structural features and static content features

Patent number: 8832102

Abstract: Exemplary methods and apparatuses are provided which may be implemented using one or more computing devices to allow for super clustering of clusters of electronic documents based, at least in part, on structural and static content features.

Type: Grant

Filed: January 12, 2010

Date of Patent: September 9, 2014

Assignee: Yahoo! Inc.

Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi

Structural clustering and template identification for electronic documents

Patent number: 8239387

Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

Type: Grant

Filed: February 22, 2008

Date of Patent: August 7, 2012

Assignee: Yahoo! Inc.

Inventors: Amit Madaan, V. G. Vinod Vydiswaran, Rupesh R. Mehta

Techniques for inducing high quality structural templates for electronic documents

Patent number: 8046681

Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.

Type: Grant

Filed: November 27, 2007

Date of Patent: October 25, 2011

Assignee: Yahoo! Inc.

Inventors: V. G. Vinod Vydiswaran, Rupesh R. Mehta, Amit Madaan
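
As a rough illustration of the "OR" operator described in the abstract above, the following Python sketch checks whether a document tree fits a template in which an OR node permits one of its alternative sub-trees at the corresponding position. The tuple-based tree encoding, the `OR` marker, and the function name `matches` are assumptions made for this example; they are not taken from the patent, and the sketch covers only matching, not the learning or generalization steps.

```python
# Trees are (tag, [children]) tuples. This encoding is hypothetical.
OR = "OR"  # assumed marker for the "OR" generalization operator


def matches(template, doc):
    """Return True if the document tree `doc` fits `template`."""
    t_tag, t_children = template
    if t_tag == OR:
        # One of the alternative sub-trees must fit at this position.
        return any(matches(alt, doc) for alt in t_children)
    d_tag, d_children = doc
    if t_tag != d_tag or len(t_children) != len(d_children):
        return False
    return all(matches(t, d) for t, d in zip(t_children, d_children))


# Template: a <div> holding an <h1> followed by either a <p> or a <ul><li>.
template = ("div", [("h1", []), (OR, [("p", []), ("ul", [("li", [])])])])

page_a = ("div", [("h1", []), ("p", [])])
page_b = ("div", [("h1", []), ("ul", [("li", [])])])
page_c = ("div", [("h1", []), ("table", [])])

print(matches(template, page_a))  # True
print(matches(template, page_b))  # True
print(matches(template, page_c))  # False
```

Here `page_c` is rejected because `table` is not among the alternatives listed under the OR node.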

METHODS AND APPARATUSES FOR CLUSTERING ELECTRONIC DOCUMENTS BASED ON STRUCTURAL FEATURES AND STATIC CONTENT FEATURES

Publication number: 20110173197

Abstract: Exemplary methods and apparatuses are provided which may be implemented using one or more computing devices to allow for super clustering of clusters of electronic documents based, at least in part, on structural and static content features.

Type: Application

Filed: January 12, 2010

Publication date: July 14, 2011

Applicant: Yahoo! Inc.

Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi

ROBUST XPATHS FOR WEB INFORMATION EXTRACTION

Publication number: 20110040770

Abstract: An example of a method includes generating an attributed extensible markup language path (XPath) for an annotated entity in a web page. The method further includes determining a first node that satisfies the attributed XPath in the web page and is annotated. The method also includes identifying an attribute property that satisfies predefined criteria in the web page while traversing from the first node to a root node, the attribute property comprising an attribute value and an attribute name. Moreover, the method includes populating the attributed XPath with the attribute property that satisfies predefined criteria. The method also includes filtering the attributed XPath to generate a robust XPath, and extracting content from multiple web pages based on the robust XPath.

Type: Application

Filed: August 13, 2009

Publication date: February 17, 2011

Applicant: Yahoo! Inc.

Inventors: Amit Madaan, Charu Tiwari, Rupesh R. Mehta
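
The traversal described in the abstract above, from an annotated node up to the root while collecting attribute properties, can be sketched in Python with the standard `xml.etree.ElementTree` module. This is a minimal sketch under stated assumptions: the choice of `id` and `class` as the attributes that satisfy the "predefined criteria", the helper name `attributed_xpath`, and the sample pages are all hypothetical, and the patent's filtering step is not reproduced.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the "predefined criteria": prefer id, then class.
STABLE_ATTRS = ("id", "class")


def attributed_xpath(root, target):
    """Build a path from `root` to `target` that uses attribute predicates."""
    parent = {child: p for p in root.iter() for child in p}
    steps = []
    node = target
    while node is not root:
        step = node.tag
        for name in STABLE_ATTRS:
            if name in node.attrib:
                # Assumes attribute values contain no single quotes.
                step += "[@%s='%s']" % (name, node.attrib[name])
                break
        steps.append(step)
        node = parent[node]
    return "./" + "/".join(reversed(steps))


page1 = ET.fromstring(
    "<html><body><div class='ad'>x</div>"
    "<div class='item'><span class='price'>$10</span></div></body></html>"
)
page2 = ET.fromstring(
    "<html><body>"
    "<div class='item'><span class='price'>$25</span></div></body></html>"
)

target = page1.find(".//span")  # stands in for the annotated entity
xpath = attributed_xpath(page1, target)
print(xpath)  # ./body/div[@class='item']/span[@class='price']

prices = [e.text for page in (page1, page2) for e in page.findall(xpath)]
print(prices)  # ['$10', '$25']
```

A purely positional path (second `div` under `body`) would miss the price on `page2`, where the advertisement block is absent; the attribute predicates keep the path valid on both pages.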

ADAPTIVE DOCUMENT SAMPLING FOR INFORMATION EXTRACTION

Publication number: 20100228738

Abstract: A method and apparatus for improved sampling of documents for training sets input to information extraction systems is provided, which improves the recall and robustness of wrapper extraction. A passive sampling technique provides a list of documents to present for human annotation ordered by representativeness of the document based on structural and content statistics. Thus, the document with the most interesting attributes and which is most representative of the cluster of structurally similar documents to which the document pertains is presented for annotation first. The problem is mapped to the classical ‘Set-Cover’ problem and solved using a greedy approach. An active sampling technique refines and reorders the sample list produced by the passive sampling technique after initial annotations, based on the human annotation, spatial boundaries of the documents, and structural and content statistics.

Type: Application

Filed: March 4, 2009

Publication date: September 9, 2010

Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu
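
The abstract above says the passive sampling problem is mapped to Set-Cover and solved greedily. The Python sketch below shows the standard greedy heuristic for Set-Cover applied to documents and the features they contain. How the patent defines features and representativeness is not stated in this listing, so the feature sets, file names, and the function name `greedy_order` are assumptions made for illustration.

```python
def greedy_order(doc_features):
    """Order documents so each pick covers the most still-uncovered features.

    `doc_features` maps a document name to the set of features it contains.
    Ties are broken by document name so the result is deterministic.
    """
    remaining = dict(doc_features)
    uncovered = set().union(*remaining.values())
    order = []
    while uncovered:
        best = max(sorted(remaining),
                   key=lambda d: len(remaining[d] & uncovered))
        order.append(best)
        uncovered -= remaining.pop(best)
    return order


# Hypothetical cluster of structurally similar pages and their features.
docs = {
    "a.html": {"title", "price"},
    "b.html": {"title", "price", "rating"},
    "c.html": {"title", "image"},
    "d.html": {"price"},
}

print(greedy_order(docs))  # ['b.html', 'c.html']
```

In this example `b.html` is chosen first because it covers three features, and `c.html` follows because it is the only page that contributes the remaining `image` feature; the other pages add nothing new and are left out of the annotation list.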

WEB PAGE LAYOUT OPTIMIZATION USING SECTION IMPORTANCE

Publication number: 20090265611

Abstract: Methods and apparatus are described which enable the efficient adaptation of web pages to mobile displays. The more important or relevant sections of a web page are identified and configured into a more compact form. Both layout preserving and high compaction techniques are described.

Type: Application

Filed: May 7, 2008

Publication date: October 22, 2009

Applicant: Yahoo! Inc.

Inventors: Srinivasan H. Sengamedu, Rupesh R. Mehta

SITE-SPECIFIC INFORMATION-TYPE DETECTION METHODS AND SYSTEMS

Publication number: 20090248707

Abstract: Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.

Type: Application

Filed: March 25, 2008

Publication date: October 1, 2009

Applicant: Yahoo! Inc.

Inventors: Rupesh R. Mehta, Amit Madaan

STRUCTURAL CLUSTERING AND TEMPLATE IDENTIFICATION FOR ELECTRONIC DOCUMENTS

Publication number: 20090216708

Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

Type: Application

Filed: February 22, 2008

Publication date: August 27, 2009

Applicant: Yahoo! Inc.

Inventors: Amit Madaan, V. G. Vydiswaran, Rupesh R. Mehta

ADAPTIVE SAMPLING OF WEB PAGES FOR EXTRACTION

Publication number: 20090204889

Abstract: Techniques are provided for improving the recall rate of an information extraction system by automatically selecting pages to surface to a user for annotation based on variation data. Techniques are provided for generating the variation data during the construction of the template that is to be used for extraction. During template construction, data is stored to indicate which template-construction pages saw or made changes to nodes in the template. After interesting nodes have been identified in the template, the data stored during template construction is used to determine which pages made changes to interesting-variation nodes. Techniques are also provided for generating the variation data during the extraction phase, when the template is being used to extract information from pages. During the extraction phase, variation data is generated in response to detecting that extraction for a given page resulted in one or more empty attributes.

Type: Application

Filed: February 13, 2008

Publication date: August 13, 2009

Inventors: Rupesh R. Mehta, V. G. Vinod Vydiswaran