Patents by Inventor Sandeepkumar Bhuramal Satpal

Sandeepkumar Bhuramal Satpal has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Techniques for categorizing web pages

Patent number: 8768926

Abstract: Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.

Type: Grant

Filed: January 5, 2010

Date of Patent: July 1, 2014

Assignee: Yahoo! Inc.

Inventors: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H Sengamedu, Sandeepkumar Bhuramal Satpal
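
The abstract of this record describes producing one URL-regex per cluster and then categorizing pages by URL alone. The following is a minimal sketch of that idea, not the patented procedure: the candidate generation (generalizing a URL by path depth and replacing numeric segments with `\d+`), the function names, the `min_coverage` default, and the `shop.example` sample URLs are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def candidate_patterns(url):
    """Yield regexes that generalize one URL at increasing path depth (hypothetical heuristic)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    for depth in range(len(segments) + 1):
        pieces = [re.escape(parts.netloc)]
        for seg in segments[:depth]:
            # Purely numeric segments are generalized; others are kept literally.
            pieces.append(r"\d+" if seg.isdigit() else re.escape(seg))
        yield r"^https?://" + "/".join(pieces) + r"(?:[/?#].*)?$"


def learn_url_regex(cluster, others, min_coverage=0.8):
    """Pick a pattern that matches no URL in `others` and covers at least
    `min_coverage` of `cluster`; return None if no candidate qualifies."""
    best = None
    for url in cluster:
        for pattern in candidate_patterns(url):
            rx = re.compile(pattern)
            if any(rx.match(u) for u in others):
                continue  # does not differentiate this cluster from the others
            coverage = sum(1 for u in cluster if rx.match(u)) / len(cluster)
            if coverage < min_coverage:
                continue
            # Prefer higher coverage, then the shorter (more general) pattern.
            key = (-coverage, len(pattern), pattern)
            if best is None or key < best[0]:
                best = (key, pattern)
    return best[1] if best else None


def learn_all(clusters, min_coverage=0.8):
    """Repeat the regex search once per cluster."""
    patterns = {}
    for name, urls in clusters.items():
        others = [u for n, us in clusters.items() if n != name for u in us]
        patterns[name] = learn_url_regex(urls, others, min_coverage)
    return patterns


def categorize(url, patterns):
    """Return the first cluster whose URL-regex matches, without fetching the page."""
    for name, pattern in patterns.items():
        if pattern and re.match(pattern, url):
            return name
    return None


# Hypothetical sample URLs, already grouped into clusters.
CLUSTERS = {
    "product": [
        "http://shop.example/item/101",
        "http://shop.example/item/202/reviews",
        "http://shop.example/item/303",
    ],
    "category": [
        "http://shop.example/browse/shoes",
        "http://shop.example/browse/hats",
        "http://shop.example/browse/bags/leather",
    ],
}

patterns = learn_all(CLUSTERS)
print(patterns)
print(categorize("http://shop.example/item/999", patterns))
```

In this sketch a URL that matches none of the learned patterns is left uncategorized (`None`); how unmatched or multiply matched URLs are handled in practice is not stated in the abstract.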

TECHNIQUES FOR CATEGORIZING WEB PAGES

Publication number: 20110167063

Abstract: Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.

Type: Application

Filed: January 5, 2010

Publication date: July 7, 2011

Inventors: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H. Sengamedu, Sandeepkumar Bhuramal Satpal

HIGH PRECISION WEB EXTRACTION USING SITE KNOWLEDGE

Publication number: 20100257440

Abstract: Techniques for high precision web extraction using site knowledge are provided. Portions of repeating text are identified in unlabeled web pages from a particular web site. Based on the portions of repeating text, the unlabeled web pages are partitioned into a set of segments. Multiple labels are assigned to respectively corresponding multiple attributes in the set of segments, where assigning the multiple labels comprises applying a classification model to each separate segment in the set of segments. First one or more labels are identified that were erroneously assigned to one or more attributes in the set of segments. Second one or more correct labels for the one or more attributes are determined. The first one or more labels in the set of segments are corrected by assigning the second one or more labels to the one or more attributes.

Type: Application

Filed: April 1, 2009

Publication date: October 7, 2010

Inventors: Meghana Kshirsagar, Rajeev Rastogi, Sandeepkumar Bhuramal Satpal, Srinivasan H. Sengamedu, Venu Satuluri
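
The steps in this abstract (find repeating text, segment the pages, label each segment with a classifier, then detect and correct wrong labels) can be sketched roughly as follows. Everything concrete here is an assumption: pages are modeled as lists of text lines, `toy_classifier` is a hypothetical rule-based stand-in for the classification model, and the correction step uses a site-wide majority vote per repeating anchor, which the abstract does not specify.

```python
import re
from collections import Counter


def repeating_text(pages, min_fraction=0.8):
    """Lines that occur on at least `min_fraction` of the pages of one site."""
    counts = Counter(line for page in pages for line in set(page))
    needed = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= needed}


def segment(page, anchors):
    """Partition a page into (anchor, text) segments: each non-repeating line
    is attached to the most recent repeating line above it."""
    segments, current = [], None
    for line in page:
        if line in anchors:
            current = line
        elif current is not None:
            segments.append((current, line))
    return segments


def toy_classifier(text):
    """Hypothetical stand-in for the classification model; deliberately imperfect."""
    if re.fullmatch(r"[\d\-\s()+]{7,}", text):
        return "phone"
    if re.search(r"\d", text):
        return "address"
    return "name"


def correct_labels(labeled_pages):
    """Assumed correction rule: within one site, segments under the same anchor
    should share a label, so minority labels are replaced by the majority label."""
    votes = {}
    for page in labeled_pages:
        for anchor, _text, label in page:
            votes.setdefault(anchor, Counter())[label] += 1
    majority = {anchor: c.most_common(1)[0][0] for anchor, c in votes.items()}
    return [[(a, t, majority[a]) for a, t, _ in page] for page in labeled_pages]


# Hypothetical unlabeled pages from one web site.
PAGES = [
    ["Name", "Luigi's Pizza", "Address", "12 Main St", "Phone", "555-0100"],
    ["Name", "Cafe 21", "Address", "9 Oak Ave", "Phone", "555-0101"],
    ["Name", "Noodle Bar", "Address", "4 Elm Rd", "Phone", "555-0102"],
]

anchors = repeating_text(PAGES)
initial = [[(a, t, toy_classifier(t)) for a, t in segment(p, anchors)] for p in PAGES]
corrected = correct_labels(initial)
print(initial[1][0])    # the classifier's label for "Cafe 21"
print(corrected[1][0])  # the label after site-level correction
```

The point of the example is that "Cafe 21" looks like an address to a classifier working on one segment in isolation, while knowledge of the site's repeated layout supplies the evidence needed to fix it.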

AUTOMATIC EXTRACTION USING MACHINE LEARNING BASED ROBUST STRUCTURAL EXTRACTORS

Publication number: 20100223214

Abstract: A method and apparatus for automatically extracting information from a large number of documents through applying machine learning techniques and exploiting structural similarities among documents. A machine learning model is trained to have at least 50% accuracy. The trained machine learning model is used to identify information attributes in a sample of pages from a cluster of structurally similar documents. A structure-specific model of the cluster is created by compiling a list of top-K locations for each attribute identified by the trained machine learning model in the sample. These top-K lists are used to extract information from the pages of the cluster from which the sample of pages was taken.

Type: Application

Filed: February 27, 2009

Publication date: September 2, 2010

Inventors: Alok S. Kirpal, Sandeepkumar Bhuramal Satpal, Meghana Kshirsagar, Srinivasan H. Sengamedu
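
The top-K idea in this abstract can be illustrated with a small sketch. It assumes each page is represented as a mapping from a location string (written here in an XPath-like form) to the text found there, and it uses a hypothetical rule-based `toy_model` in place of the trained machine learning model; the names, the sample pages, and `k=2` are illustrative choices, not details taken from the application.

```python
from collections import Counter, defaultdict


def toy_model(text):
    """Hypothetical stand-in for the trained model: guess an attribute or None."""
    if text.startswith("$"):
        return "price"
    if text.startswith("Acme"):
        return "title"
    return None


def build_structure_model(sample_pages, predict, k=2):
    """For each attribute, record the k locations where `predict` found it most often."""
    counts = defaultdict(Counter)
    for page in sample_pages:
        for location, text in page.items():
            attribute = predict(text)
            if attribute is not None:
                counts[attribute][location] += 1
    return {attr: [loc for loc, _ in c.most_common(k)] for attr, c in counts.items()}


def extract(page, structure_model):
    """Extract by location only: take the first top-K location present on the page."""
    result = {}
    for attribute, locations in structure_model.items():
        for location in locations:
            if location in page:
                result[attribute] = page[location]
                break
    return result


# Hypothetical sample of pages from one cluster of structurally similar documents.
SAMPLE = [
    {"/html/body/h1": "Acme Kettle", "/html/body/div[2]/span": "$24.99",
     "/html/body/p": "Ships in 2 days"},
    {"/html/body/h1": "Acme Toaster", "/html/body/div[2]/span": "$39.00",
     "/html/body/p": "Ships in 5 days"},
    {"/html/body/h1": "Acme Mixer", "/html/body/div[3]/span": "$89.50",
     "/html/body/p": "In stock"},
]

model = build_structure_model(SAMPLE, toy_model, k=2)

# A page from the same cluster on which the toy model itself would find nothing.
new_page = {"/html/body/h1": "Bolt Blender", "/html/body/div[2]/span": "USD 59.00"}
print(extract(new_page, model))
```

Once the per-cluster location lists exist, extraction no longer depends on the model recognizing each value, which is why a model that is only moderately accurate on individual pages can still be useful in this setup.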
