Patents by Inventor Piyoosh Jalan
Piyoosh Jalan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 8489578
Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.
Type: Grant
Filed: October 20, 2008
Date of Patent: July 16, 2013
Assignee: International Business Machines Corporation
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
-
Patent number: 8041705
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
Type: Grant
Filed: January 5, 2009
Date of Patent: October 18, 2011
Assignee: International Business Machines Corporation
Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
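The hub-score calculation in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the two weights, the `Page` class, and the function names are hypothetical, and "new" is assumed to mean "not yet in the lookup structure of known URLs".

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumed weights: the abstract only says each term is "a percentage".
NEW_URL_WEIGHT = 0.5
PREVIOUS_SCORE_WEIGHT = 0.5


@dataclass
class Page:
    """Hypothetical record for a page that is due to be recrawled."""
    url: str
    outlinks: List[str]
    previous_hub_score: float = 0.0


def hub_score(page: Page, known_urls: Set[str]) -> float:
    """Sum of a share of the new-URL count and a share of the previous score."""
    new_urls = {link for link in page.outlinks if link not in known_urls}
    first_value = NEW_URL_WEIGHT * len(new_urls)
    second_value = PREVIOUS_SCORE_WEIGHT * page.previous_hub_score
    return first_value + second_value


def recrawl_order(pages: Iterable[Page], known_urls: Set[str]) -> List[Page]:
    """Sort pages from highest to lowest hub score."""
    return sorted(pages, key=lambda p: hub_score(p, known_urls), reverse=True)


known = {"/a", "/b"}
pages = [
    Page("/archive", ["/a", "/b"], previous_hub_score=1.0),
    Page("/index", ["/a", "/c", "/d"], previous_hub_score=2.0),
    Page("/news", ["/e", "/f", "/g"], previous_hub_score=4.0),
]
ordered_urls = [page.url for page in recrawl_order(pages, known)]
```

Carrying part of the previous score forward means a page that has often linked to fresh content keeps some priority even on a crawl where it shows few new links.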
-
Patent number: 8024774
Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
Type: Grant
Filed: May 30, 2008
Date of Patent: September 20, 2011
Assignee: International Business Machines Corporation
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
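One way to picture the hostname-discovery step is the Python sketch below. It assumes the web space is given as IP addresses or CIDR ranges and that hostnames are found by reverse DNS; the abstract does not say how hostnames are determined, so `reverse_dns`, `hostnames_for_web_space`, and the injectable `resolve` parameter are all hypothetical.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def reverse_dns(ip: str) -> List[str]:
    """Assumed lookup strategy: reverse DNS via the standard library."""
    try:
        primary, aliases, _ = socket.gethostbyaddr(ip)
    except OSError:
        # No PTR record or no network access: treat as "no hostnames".
        return []
    return [primary, *aliases]


def hostnames_for_web_space(
    web_space: Iterable[str],
    resolve: Callable[[str], List[str]] = reverse_dns,
) -> List[str]:
    """Expand each address or CIDR range and collect associated hostnames."""
    hostnames = set()
    for entry in web_space:
        # A bare address such as "198.51.100.7" becomes a one-address network.
        network = ipaddress.ip_network(entry, strict=False)
        # Iterating a network yields every address in it, including the
        # network and broadcast addresses.
        for address in network:
            hostnames.update(resolve(str(address)))
    return sorted(hostnames)


# Usage with a stand-in resolver so the example needs no network access.
fake_table = {
    "192.0.2.1": ["www.example.com", "shop.example.com"],
    "198.51.100.7": ["blog.example.org"],
}
found = hostnames_for_web_space(
    ["192.0.2.0/30", "198.51.100.7"],
    resolve=lambda ip: fake_table.get(ip, []),
)
```

A crawler configured this way would then add each discovered hostname to its crawl frontier, which is the final step the abstract describes.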
-
Publication number: 20100114895
Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.
Type: Application
Filed: October 20, 2008
Publication date: May 6, 2010
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
-
Patent number: 7701944
Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/ range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/ range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
Type: Grant
Filed: January 19, 2007
Date of Patent: April 20, 2010
Assignee: International Business Machines Corporation
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
-
Publication number: 20090119291
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
Type: Application
Filed: January 5, 2009
Publication date: May 7, 2009
Applicant: International Business Machines Corporation
Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
-
Patent number: 7496557
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
Type: Grant
Filed: September 30, 2005
Date of Patent: February 24, 2009
Assignee: International Business Machines Corporation
Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
-
Publication number: 20080295148
Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
Type: Application
Filed: May 30, 2008
Publication date: November 27, 2008
Applicant: International Business Machines Corporation
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
-
Publication number: 20080235163
Abstract: As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure consisting of the host hash (hash of the host portion of the URL) and the fingerprint of the page is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, a lot of duplicates are eliminated at the crawler itself, saving CPU and disk cycles which would otherwise be needed during current duplicate elimination processes.
Type: Application
Filed: March 22, 2007
Publication date: September 25, 2008
Inventors: Srinivasan Balasubramanian, Rajesh M. Desai, Piyoosh Jalan
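The duplicate check in this abstract can be sketched in Python as follows. The choice of SHA-1, the regular-expression tag stripping, the whitespace normalization, and the `DedupingStore` class are assumptions made for illustration; the abstract does not name a hash function or a de-tagging method.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Assumed de-tagging: drop anything that looks like an HTML tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def fingerprint(html: str) -> str:
    """Hash of the page content with tags removed and whitespace collapsed."""
    text = TAG_PATTERN.sub(" ", html)
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def host_hash(url: str) -> str:
    """Hash of the host portion of the URL."""
    host = urlsplit(url).hostname or ""
    return hashlib.sha1(host.encode("utf-8")).hexdigest()


class DedupingStore:
    """Hypothetical page store that consults the lookup structure before writing."""

    def __init__(self) -> None:
        self.seen = set()   # lookup structure of (host hash, fingerprint) tuples
        self.pages = {}     # stand-in for the real store

    def write(self, url: str, html: str) -> bool:
        key = (host_hash(url), fingerprint(html))
        if key in self.seen:
            return False    # duplicate on this host: skip the write
        self.seen.add(key)
        self.pages[url] = html
        return True


store = DedupingStore()
first = store.write("http://example.com/a", "<p>Hello world</p>")
duplicate = store.write("http://example.com/b", "<div>Hello   world</div>")
other_host = store.write("http://example.org/a", "<p>Hello world</p>")
```

Because the key includes the host hash, identical content is only treated as a duplicate when it appears on the same host, which matches the tuple described in the abstract.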
-
Publication number: 20080175243
Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/ range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/ range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
Type: Application
Filed: January 19, 2007
Publication date: July 24, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
-
Publication number: 20080162448
Abstract: A method of classifying URLs by analyzing each URL discovered by a crawler and matching against a set of words corresponding to each class such as pornography, archive, obituary, business news, archive, politics, terrorism, etc. A count of the prefix of the URL to the class is updated and an action is performed with respect to electronic documents on the computer system based on the count. The action performed could be blocking the computer system from the crawling, or adjusting the frequency with which the computer system should be crawled.
Type: Application
Filed: December 28, 2006
Publication date: July 3, 2008
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventor: Piyoosh Jalan
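A rough Python sketch of this URL classification scheme follows. The word lists, the threshold, the action names, and the choice of scheme plus host as the "prefix" are all assumptions for illustration; the abstract does not specify any of them.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical word sets per class.
CLASS_WORDS = {
    "obituary": {"obituary", "obituaries"},
    "archive": {"archive", "archives"},
    "politics": {"politics", "election"},
}
# Hypothetical policy: which action a class triggers once its count is reached.
CLASS_ACTIONS = {"obituary": "block", "archive": "crawl_less_often"}
THRESHOLD = 2


def url_prefix(url: str) -> str:
    """Assumed prefix: scheme and host of the URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def classify(url: str) -> set:
    """Return every class whose word set matches a token in the URL."""
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    return {name for name, words in CLASS_WORDS.items() if tokens & words}


class PrefixClassCounter:
    """Counts (prefix, class) matches and maps the counts to a crawl action."""

    def __init__(self) -> None:
        self.counts = Counter()

    def observe(self, url: str) -> None:
        prefix = url_prefix(url)
        for name in classify(url):
            self.counts[(prefix, name)] += 1

    def action(self, prefix: str) -> str:
        for name, act in CLASS_ACTIONS.items():
            if self.counts[(prefix, name)] >= THRESHOLD:
                return act
        return "crawl_normally"


counter = PrefixClassCounter()
for discovered in (
    "http://news.example.com/archive/2006/story1.html",
    "http://news.example.com/archives/2005/",
    "http://news.example.com/politics/today.html",
):
    counter.observe(discovered)
decision = counter.action("http://news.example.com")
```

Counting per prefix rather than per URL lets the crawler make one decision for a whole site once enough of its URLs fall into a class.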
-
Publication number: 20070078811
Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
Type: Application
Filed: September 30, 2005
Publication date: April 5, 2007
Applicant: International Business Machines Corporation
Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish Penmetsa, Andrew Tomkins