Patents by Inventor Piyoosh Jalan

Piyoosh Jalan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 8489578
    Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.
    Type: Grant
    Filed: October 20, 2008
    Date of Patent: July 16, 2013
    Assignee: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Patent number: 8041705
    Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
    Type: Grant
    Filed: January 5, 2009
    Date of Patent: October 18, 2011
    Assignee: International Business Machines Corporation
    Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
  • Patent number: 8024774
    Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
    Type: Grant
    Filed: May 30, 2008
    Date of Patent: September 20, 2011
    Assignee: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Publication number: 20100114895
    Abstract: A method, system, and article are provided for management of a data ingester and associated content collected by the data ingester. The computer system is configured with a taxonomy together with rules and policies for ingesting and classifying the collected data. Based upon the classification of the collected data with respect to the taxonomy, the data is assigned to a location in the taxonomy.
    Type: Application
    Filed: October 20, 2008
    Publication date: May 6, 2010
    Applicant: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Patent number: 7701944
    Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
    Type: Grant
    Filed: January 19, 2007
    Date of Patent: April 20, 2010
    Assignee: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Publication number: 20090119291
    Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
    Type: Application
    Filed: January 5, 2009
    Publication date: May 7, 2009
    Applicant: International Business Machines Corporation
    Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
  • Patent number: 7496557
    Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
    Type: Grant
    Filed: September 30, 2005
    Date of Patent: February 24, 2009
    Assignee: International Business Machines Corporation
    Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish C. Penmetsa, Andrew S. Tomkins
  • Publication number: 20080295148
    Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
    Type: Application
    Filed: May 30, 2008
    Publication date: November 27, 2008
    Applicant: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Publication number: 20080235163
    Abstract: As part of the normal crawling process, a crawler parses a page and computes a de-tagged hash, called a fingerprint, of the page content. A lookup structure consisting of the host hash (hash of the host portion of the URL) and the fingerprint of the page is maintained. Before the crawler writes a page to a store, this lookup structure is consulted. If the lookup structure already contains the tuple (i.e., host hash and fingerprint), then the page is not written to the store. Thus, a lot of duplicates are eliminated at the crawler itself, saving CPU and disk cycles which would otherwise be needed during current duplicate elimination processes.
    Type: Application
    Filed: March 22, 2007
    Publication date: September 25, 2008
    Inventors: Srinivasan Balasubramanian, Rajesh M. Desai, Piyoosh Jalan
  • Publication number: 20080175243
    Abstract: The present invention relates to a method for configuring a policy management protocol for a web crawler, the method further comprising the steps of determining a web space that is to be crawled by a web crawler, wherein the web space is comprised of an IP address and/or a range of IP addresses, and determining additional hostnames that are associated with the IP address and/or range of IP addresses. The method further comprises the steps of configuring the web crawler to crawl the IP address and/or range of IP addresses, and determine additional hostnames that are associated with the IP address or range of IP addresses, and performing a web crawling function upon the determined additional hostnames by the web crawler.
    Type: Application
    Filed: January 19, 2007
    Publication date: July 24, 2008
    Applicant: International Business Machines Corporation
    Inventors: Varun Bhagwan, Rajesh M. Desai, Piyoosh Jalan
  • Publication number: 20080162448
    Abstract: A method of classifying URLs by analyzing each URL discovered by a crawler and matching against a set of words corresponding to each class such as pornography, archive, obituary, business news, politics, terrorism, etc. A count of the prefix of the URL to the class is updated and an action is performed with respect to electronic documents on the computer system based on the count. The action performed could be blocking the computer system from the crawling, or adjusting the frequency with which the computer system should be crawled.
    Type: Application
    Filed: December 28, 2006
    Publication date: July 3, 2008
    Applicant: International Business Machines Corporation
    Inventor: Piyoosh Jalan
  • Publication number: 20070078811
    Abstract: A system and method of crawling at least one website comprising at least one URL includes maintaining a lookup structure comprising all of the URLs known to be on a website; calculating a hub score for each webpage of the website to be recrawled, wherein the hub score measures how likely the to be recrawled webpage includes links to fresh content published on the website; sorting all the to be recrawled pages by their hub scores; and crawling the to be recrawled pages in order from highest hub scores to lowest hub scores. The calculating comprises computing a first value equaling a percentage of a number of new relative URLs on the to be recrawled page; computing a second value equaling a percentage of a previous hub score of the to be recrawled page; and computing the hub score as a sum of the first and the second values.
    Type: Application
    Filed: September 30, 2005
    Publication date: April 5, 2007
    Applicant: International Business Machines Corporation
    Inventors: Srinivasan Balasubramanian, Michael Ching, Piyoosh Jalan, Satish Penmetsa, Andrew Tomkins
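
The hub-score recrawl ordering described in the abstracts for 8041705, 7496557, 20090119291, and 20070078811 can be illustrated with a minimal Python sketch. This is not the patented implementation. The abstract says only that the score is the sum of "a percentage" of the number of new URLs and "a percentage" of the previous hub score, so the two weights below, the `Page` class, and the function names are all hypothetical, and the sketch does not distinguish relative from absolute URLs.

```python
from dataclasses import dataclass, field

# Hypothetical weights: the abstract does not state the actual percentages.
NEW_URL_WEIGHT = 0.5
PREVIOUS_SCORE_WEIGHT = 0.5


@dataclass
class Page:
    """A page queued for recrawl (illustrative structure, not from the patent)."""
    url: str
    links: set = field(default_factory=set)  # URLs found on the page
    previous_hub_score: float = 0.0


def hub_score(page, known_urls):
    """Sum of a weighted count of new URLs and a weighted previous score."""
    new_urls = page.links - known_urls  # links not yet in the lookup structure
    first_value = NEW_URL_WEIGHT * len(new_urls)
    second_value = PREVIOUS_SCORE_WEIGHT * page.previous_hub_score
    return first_value + second_value


def recrawl_order(pages, known_urls):
    """Sort pages from highest hub score to lowest."""
    return sorted(pages, key=lambda p: hub_score(p, known_urls), reverse=True)


if __name__ == "__main__":
    known = {"/a1", "/b1"}
    pages = [
        Page("/a", {"/a1", "/a2"}, previous_hub_score=0.0),
        Page("/b", {"/b1"}, previous_hub_score=4.0),
    ]
    for page in recrawl_order(pages, known):
        print(page.url, hub_score(page, known))
```

Carrying part of the previous score forward means a page that has often linked to fresh content keeps some priority even on a recrawl that finds nothing new.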
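
The crawler-side duplicate elimination described in the abstract for 20080235163 can be sketched in a similar way. The abstract specifies only a lookup structure of (host hash, fingerprint) tuples that is consulted before a page is written to the store. The choice of SHA-1, the regex used to strip tags, and the `DedupingStore` class are assumptions made for illustration; a real crawler would use a proper HTML parser rather than a regex.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Crude tag stripper, assumed for illustration only.
TAG_RE = re.compile(r"<[^>]+>")


def _digest(text):
    """Hex digest of a string (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def fingerprint(html):
    """Hash of the page content with tags removed and whitespace collapsed."""
    text = TAG_RE.sub(" ", html)
    return _digest(" ".join(text.split()))


def host_hash(url):
    """Hash of the host portion of the URL."""
    return _digest(urlsplit(url).hostname or "")


class DedupingStore:
    """In-memory stand-in for the page store plus its lookup structure."""

    def __init__(self):
        self.seen = set()   # (host hash, fingerprint) tuples
        self.pages = {}     # url -> raw HTML

    def write(self, url, html):
        """Store the page unless the same host already served this content."""
        key = (host_hash(url), fingerprint(html))
        if key in self.seen:
            return False  # duplicate: skip the write
        self.seen.add(key)
        self.pages[url] = html
        return True


if __name__ == "__main__":
    store = DedupingStore()
    print(store.write("http://example.com/a", "<p>Hello</p>"))
    print(store.write("http://example.com/b", "<div>Hello</div>"))
```

Keying on the host hash as well as the fingerprint limits deduplication to pages served by the same host, so identical content on a different host is still stored.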