Patents by Inventor Piyoosh Jalan

Piyoosh Jalan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

System and method for administering data ingesters using taxonomy based filtering rules

Patent number: 8489578

Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.

Type: Grant

Filed: October 20, 2008

Date of Patent: July 16, 2013

Assignee: International Business Machines Corporation

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

Microhubs and its applications

Patent number: 8041705

Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
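As a rough illustration of the scoring described in this abstract, the following Python sketch ranks pages for recrawling. The two weights, the `Page` class, and the reading of "new relative URLs" as links that are absent from the lookup structure are assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass

# Hypothetical weights: the abstract only says "a percentage" of each term.
NEW_URL_WEIGHT = 0.7
PREVIOUS_SCORE_WEIGHT = 0.3


@dataclass
class Page:
    url: str
    links: frozenset              # URLs found on the page when it was last fetched
    previous_hub_score: float = 0.0


def hub_score(page, known_urls):
    """Sum of a weighted count of new URLs and a weighted previous score."""
    # Assumption: a link is "new" if it is not in the lookup structure.
    new_links = page.links - known_urls
    first_value = NEW_URL_WEIGHT * len(new_links)
    second_value = PREVIOUS_SCORE_WEIGHT * page.previous_hub_score
    return first_value + second_value


def recrawl_order(pages, known_urls):
    """Return pages sorted from highest hub score to lowest."""
    return sorted(pages, key=lambda p: hub_score(p, known_urls), reverse=True)


if __name__ == "__main__":
    known = {"/a", "/b"}  # lookup structure of URLs already known on the site
    pages = [
        Page("/about", frozenset({"/a"}), previous_hub_score=4.0),
        Page("/index", frozenset({"/a", "/b", "/c", "/d"}), previous_hub_score=1.0),
    ]
    for page in recrawl_order(pages, known):
        print(page.url, round(hub_score(page, known), 2))
```

Carrying part of the previous score forward means a page that has historically linked to fresh content keeps some priority even when a single fetch finds nothing new.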

Type: Grant

Filed: January 5, 2009

Date of Patent: October 18, 2011

Assignee: International Business Machines Corporation

Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins

System and method for crawl policy management utilizing IP address and IP address range

Patent number: 8024774

Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
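The flow in this abstract (take an IP address or range, find the hostnames associated with it, then crawl those hostnames) might be sketched as follows. Expressing a range in CIDR notation, the function names, and the injected `reverse_lookup` callable are all assumptions made for illustration; the abstract does not say how ranges are written or how hostnames are discovered.

```python
import ipaddress


def addresses_in_web_space(spec):
    """Expand a single IP address or a CIDR range into individual addresses.

    Assumption: ranges are given in CIDR form, e.g. "192.0.2.0/30".
    """
    network = ipaddress.ip_network(spec, strict=False)
    return [str(ip) for ip in network]


def hostnames_to_crawl(spec, reverse_lookup):
    """Collect the hostnames associated with every address in the web space.

    `reverse_lookup` is a hypothetical callable that maps an IP string to an
    iterable of hostnames. A real crawler might wrap a reverse DNS query
    here; a stub keeps this sketch runnable without network access.
    """
    hostnames = set()
    for ip in addresses_in_web_space(spec):
        hostnames.update(reverse_lookup(ip))
    return sorted(hostnames)


if __name__ == "__main__":
    # Stub data standing in for real lookups (documentation address block).
    fake_dns = {
        "192.0.2.1": ["www.example.com", "example.com"],
        "192.0.2.2": ["blog.example.com"],
    }
    hosts = hostnames_to_crawl("192.0.2.0/30", lambda ip: fake_dns.get(ip, []))
    for host in hosts:
        print("crawl", host)
```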

Type: Grant

Filed: May 30, 2008

Date of Patent: September 20, 2011

Assignee: International Business Machines Corporation

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

System and Method for Administering Data Ingesters Using Taxonomy Based Filtering Rules

Publication number: 20100114895

Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.

Type: Application

Filed: October 20, 2008

Publication date: May 6, 2010

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

System and method for crawl policy management utilizing IP address and IP address range

Patent number: 7701944

Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.

Type: Grant

Filed: January 19, 2007

Date of Patent: April 20, 2010

Assignee: International Business Machines Corporation

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

MICROHUBS AND ITS APPLICATIONS

Publication number: 20090119291

Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.

Type: Application

Filed: January 5, 2009

Publication date: May 7, 2009

Applicant: International Business Machines Corporation

Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins

Microhubs and its applications

Patent number: 7496557

Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.

Type: Grant

Filed: September 30, 2005

Date of Patent: February 24, 2009

Assignee: International Business Machines Corporation

Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins

System And Method For Crawl Policy Management Utilizing IP Address and IP Address Range

Publication number: 20080295148

Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.

Type: Application

Filed: May 30, 2008

Publication date: November 27, 2008

Applicant: International Business Machines Corporation

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

SYSTEM AND METHOD FOR ONLINE DUPLICATE DETECTION AND ELIMINATION IN A WEB CRAWLER

Publication number: 20080235163

Abstract: As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure consisting of the host hash (hash of the host portion of the URL) and the fingerprint of the page is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, a lot of duplicates are eliminated at the crawler itself, saving CPU and disk cycles which would otherwise be needed during current duplicate elimination processes.
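The check described in this abstract can be sketched in Python as below. The choice of SHA-1, the regular-expression tag stripping, and the `DedupStore` class are assumptions made for illustration; the abstract names only the host hash, the de-tagged fingerprint, and the lookup before writing.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Crude tag stripper, assumed here for brevity; a real crawler would use
# the output of its HTML parser instead.
TAG_RE = re.compile(r"<[^>]+>")


def fingerprint(html):
    """Hash of the page content with tags removed and whitespace collapsed."""
    text = " ".join(TAG_RE.sub(" ", html).split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def host_hash(url):
    """Hash of the host portion of the URL."""
    host = urlsplit(url).hostname or ""
    return hashlib.sha1(host.encode("utf-8")).hexdigest()


class DedupStore:
    """Page store guarded by a (host hash, fingerprint) lookup structure."""

    def __init__(self):
        self.seen = set()   # lookup structure of (host hash, fingerprint) tuples
        self.pages = {}     # stand-in for the real page store

    def write(self, url, html):
        """Write the page unless the same content was already stored for this host."""
        key = (host_hash(url), fingerprint(html))
        if key in self.seen:
            return False    # duplicate: skip the write
        self.seen.add(key)
        self.pages[url] = html
        return True


if __name__ == "__main__":
    store = DedupStore()
    print(store.write("http://example.com/a", "<p>Hello</p>"))
    print(store.write("http://example.com/b", "<div>Hello</div>"))
    print(store.write("http://other.example/a", "<p>Hello</p>"))
```

Because the host hash is part of the key, identical content on two different hosts is stored twice; only same-host duplicates are dropped at crawl time.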

Type: Application

Filed: March 22, 2007

Publication date: September 25, 2008

Inventors: Srinivasan Balasubramanian, Rajesh M. Desai, Piyoosh Jalan

SYSTEM AND METHOD FOR CRAWL POLICY MANAGEMENT UTILIZING IP ADDRESS AND IP ADDRESS RANGE

Publication number: 20080175243

Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.

Type: Application

Filed: January 19, 2007

Publication date: July 24, 2008

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan

METHOD FOR TRACKING SYNTACTIC PROPERTIES OF A URL

Publication number: 20080162448

Abstract: A method of classifying URLs by analyzing each URL discovered by a crawler and matching against a set of words corresponding to each class such as pornography, archive, obituary, business news, politics, terrorism, etc. A count of the prefix of the URL to the class is updated and an action is performed with respect to electronic documents on the computer system based on the count. The action performed could be blocking the computer system from being crawled, or adjusting the frequency with which the computer system should be crawled.
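A minimal Python sketch of this kind of URL classification follows. The word lists, the use of the host name as the URL "prefix", the threshold, and the `UrlClassTracker` class are all assumptions made for illustration; the abstract does not define them.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical word sets per class; the abstract gives class names only.
CLASS_WORDS = {
    "archive": {"archive", "archives"},
    "obituary": {"obituary", "obituaries"},
    "politics": {"politics", "election"},
}

# Hypothetical threshold at which a prefix is blocked for a class.
BLOCK_THRESHOLD = 3


class UrlClassTracker:
    def __init__(self):
        self.counts = Counter()  # (prefix, class) -> number of matching URLs

    def observe(self, url):
        """Match the URL's tokens against each class and update the counts."""
        prefix = urlsplit(url).netloc  # assumption: the prefix is the host
        tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
        matched = [cls for cls, words in CLASS_WORDS.items() if tokens & words]
        for cls in matched:
            self.counts[(prefix, cls)] += 1
        return matched

    def action(self, prefix, cls):
        """Choose an action for the prefix based on the count for the class."""
        if self.counts[(prefix, cls)] >= BLOCK_THRESHOLD:
            return "block"
        return "crawl"


if __name__ == "__main__":
    tracker = UrlClassTracker()
    for year in (2004, 2005, 2006):
        tracker.observe(f"http://news.example.com/archive/{year}/story.html")
    print(tracker.action("news.example.com", "archive"))
    print(tracker.action("news.example.com", "politics"))
```

The same counts could drive a crawl-frequency adjustment instead of a hard block, which is the other action the abstract mentions.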

Type: Application

Filed: December 28, 2006

Publication date: July 3, 2008

Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION

Inventor: Piyoosh Jalan

Microhubs and its applications

Publication number: 20070078811

Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.

Type: Application

Filed: September 30, 2005

Publication date: April 5, 2007

Applicant: International Business Machines Corporation

Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish Penmetsa, Andrew Tomkins