Patents by Inventor Marc Alexander Najork

Marc Alexander Najork has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Using content analysis to detect spam web pages

Patent number: 7962510

Abstract: Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content for web spam using content-based identification techniques.

Type: Grant

Filed: February 11, 2005

Date of Patent: June 14, 2011

Assignee: Microsoft Corporation

Inventors: Marc Alexander Najork, Dennis Craig Fetterly, Mark Steven Manasse, Alexandros Ntoulas

Systems and methods for inferring uniform resource locator (URL) normalization rules

Patent number: 7680785

Abstract: Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules.

Type: Grant

Filed: March 25, 2005

Date of Patent: March 16, 2010

Assignee: Microsoft Corporation

Inventor: Marc Alexander Najork
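
The abstract above describes finding the URL portions that do not matter within each equivalence class and then generalizing across classes. A minimal Python sketch of that idea follows. It assumes the only "portions" considered are query parameters, and that a parameter is irrelevant when its value varies inside a class of identical pages; the function names and the `min_support` threshold are hypothetical and not taken from the patent.

```python
from urllib.parse import urlsplit, parse_qsl


def irrelevant_params(equivalence_class):
    """Return query parameter names whose values vary within one class.

    Assumption: every URL in the class returned substantially identical
    content, so a parameter that changes across the class cannot matter.
    """
    seen = {}
    for url in equivalence_class:
        query = urlsplit(url).query
        for name, value in parse_qsl(query, keep_blank_values=True):
            seen.setdefault(name, set()).add(value)
    return {name for name, values in seen.items() if len(values) > 1}


def generalize(classes, min_support=2):
    """Promote a per-class finding to a cross-class rule.

    A parameter is reported only if it was irrelevant in at least
    `min_support` classes (a hypothetical threshold).
    """
    counts = {}
    for cls in classes:
        for name in irrelevant_params(cls):
            counts[name] = counts.get(name, 0) + 1
    return {name for name, n in counts.items() if n >= min_support}
```

Given two classes that each differ only in a `sid` parameter, `generalize` would report `sid` as safe to strip before downloading.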

Fault tolerance scheme for distributed hyperlink database

Patent number: 7627777

Abstract: Fault tolerance is provided for a database of hyperlinks distributed across multiple machines, such as a scalable hyperlink store. The fault tolerance enables the distributed database to continue operating, with brief interruptions, even when some of the machines in the cluster have failed. A primary database is provided for normal operation, and a secondary database is provided for operation in the presence of failures.

Type: Grant

Filed: March 17, 2006

Date of Patent: December 1, 2009

Assignee: Microsoft Corporation

Inventor: Marc Alexander Najork

System and method for distributed web crawling

Patent number: 7139747

Abstract: The present invention provides for the efficient downloading of data set addresses from among a plurality of host computers, using a plurality of web crawlers. Each web crawler identifies URL's in data sets downloaded by that web crawler, and identifies the host computer identifier within each such URL. The host computer identifier for each URL is mapped to the web crawler identifier of one of the web crawlers. If the URL is mapped to the web crawler identifier of a different web crawler, the URL is sent to that web crawler for processing, and otherwise the URL is processed by the web crawler that identified the URL. Each web crawler sends URL's to the other web crawlers for processing, and each web crawler receives URL's from the other web crawlers for processing. In a preferred embodiment, each web crawler processes only the URL's assigned to it, which are the URL's whose host identifier is mapped to the web crawler identifier for that web crawler.

Type: Grant

Filed: November 3, 2000

Date of Patent: November 21, 2006

Assignee: Hewlett-Packard Development Company, L.P.

Inventor: Marc Alexander Najork
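
The host-to-crawler mapping in the abstract above can be illustrated with a short Python sketch. It assumes the mapping is a hash of the host name taken modulo the number of crawlers; the abstract does not say which mapping is used, and `crawler_for` and `route` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url, num_crawlers):
    """Map a URL's host component to a crawler identifier (assumed: hash mod N)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def route(urls, my_id, num_crawlers):
    """Split discovered URLs into those kept locally and those sent to peers."""
    local = []
    outbound = {}  # crawler id -> URLs to forward to that crawler
    for url in urls:
        owner = crawler_for(url, num_crawlers)
        if owner == my_id:
            local.append(url)
        else:
            outbound.setdefault(owner, []).append(url)
    return local, outbound
```

Because the mapping depends only on the host, every URL for a given host ends up at the same crawler, no matter which crawler discovered it.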

System and method for efficient filtering of data set addresses in a web crawler

Patent number: 6952730

Abstract: A web crawler stores fixed length representations of document addresses in a buffer and a disk file, and optionally in a cache. When the web crawler downloads a document from a host computer, it identifies URL's (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation may optionally be systematically compared to the contents of a cache containing web sites which are likely to be found during the web crawl, for example previously visited web sites. The numerical representation is then systematically compared to numerical representations in the buffer, which stores numerical representations of recently-identified URL's. If the representation is not found in the buffer, it is stored in the buffer. When the buffer is full, it is ordered and then merged with numerical representations stored, in order, in the disk file.

Type: Grant

Filed: June 30, 2000

Date of Patent: October 4, 2005

Assignee: Hewlett-Packard Development Company, L.P.

Inventors: Marc Alexander Najork, Clark Allan Heydon
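
The buffer-and-merge scheme in the abstract above can be sketched in Python as follows. This is an illustration under assumptions: a sorted in-memory list stands in for the ordered disk file, the fingerprint is the first 8 bytes of SHA-1, the optional cache is omitted, and URLs are reported as new only when a merge shows they were absent from the "disk" data. None of these details come from the patent.

```python
import hashlib


def fingerprint(url):
    """Fixed-size numerical representation of a URL (assumed: 64 bits of SHA-1)."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")


class UrlFilter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = {}  # fingerprint -> URL, recently identified
        self.disk = []    # sorted fingerprints, stand-in for the disk file

    def add(self, url):
        """Buffer a URL; return newly confirmed URLs if this triggers a merge."""
        fp = fingerprint(url)
        if fp in self.buffer:
            return []
        self.buffer[fp] = url
        if len(self.buffer) >= self.capacity:
            return self.flush()
        return []

    def flush(self):
        """Order the buffer and merge it with the sorted 'disk' data."""
        pending = sorted(self.buffer.items())
        merged, new_urls = [], []
        i = j = 0
        while i < len(self.disk) or j < len(pending):
            if j == len(pending) or (i < len(self.disk) and self.disk[i] < pending[j][0]):
                merged.append(self.disk[i])
                i += 1
            elif i < len(self.disk) and self.disk[i] == pending[j][0]:
                merged.append(self.disk[i])  # already known, drop the buffered copy
                i += 1
                j += 1
            else:
                merged.append(pending[j][0])  # not on disk, so it is new
                new_urls.append(pending[j][1])
                j += 1
        self.disk = merged
        self.buffer.clear()
        return new_urls
```

Keeping both sides sorted lets the merge run as one sequential pass, which is the property that makes a disk-resident file practical here.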

System and method for near-uniform sampling of web page addresses

Patent number: 6594694

Abstract: A system generates a list of near-uniform samples of data sets (e.g., web pages) from among a plurality of host computers. The system performs a random walk so as to generate a set of visited addresses. For each address in the set, a reachability measure is computed. Then, samples are selected from the set, such that the probability of selecting a given address is inversely proportional to the reachability measure for the address. The selected samples form the list of near-uniform samples.

Type: Grant

Filed: May 12, 2000

Date of Patent: July 15, 2003

Assignee: Hewlett-Packard Development Company, L.P.

Inventors: Marc Alexander Najork, Clark Allan Heydon, Michael Mitzenmacher, Monika H. Henzinger
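
The sampling procedure in the abstract above can be sketched in Python. The sketch assumes the reachability measure is simply the number of times the walk visited an address, and that a dead end restarts the walk at the start page; the abstract does not define either detail, and the function names are hypothetical.

```python
import random


def random_walk(graph, start, steps, rng):
    """Walk the link graph and count visits per address.

    `graph` maps an address to the list of addresses it links to.
    """
    visits = {}
    node = start
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        neighbors = graph.get(node)
        # Assumption: restart at `start` when a page has no outgoing links.
        node = rng.choice(neighbors) if neighbors else start
    return visits


def near_uniform_sample(visits, k, rng):
    """Pick k addresses with probability inversely proportional to visit count."""
    pages = list(visits)
    weights = [1.0 / visits[page] for page in pages]
    # random.choices samples with replacement.
    return rng.choices(pages, weights=weights, k=k)
```

The inverse weighting counteracts the walk's bias toward heavily linked pages, which is what pushes the result toward a uniform sample.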

Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue

Patent number: 6377984

Abstract: A method and system for scheduling downloads in a web crawler. A web crawler may use multiple threads to download documents from the world wide web. Both threads and queues are identified by numerical ID's. Each thread in the web crawler is assigned to dequeue from a queue until the assigned queue is empty. Each thread enqueues URL's as new URL's are discovered in the course of downloading web pages. In one embodiment, when a thread discovers a new URL, a numerical function is performed on the URL's host component to determine the queue in which to enqueue the new URL. In another embodiment, each queue in a web crawler may be dynamically assigned to a host computer so that URL's enqueued into the same queue all have the same host component. When a queue becomes empty, a new host may be dynamically assigned to it.

Type: Grant

Filed: November 2, 1999

Date of Patent: April 23, 2002

Assignee: Alta Vista Company

Inventors: Marc Alexander Najork, Clark Allan Heydon
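
The second embodiment in the abstract above, where a queue is dynamically assigned to a host and reassigned once it empties, can be sketched in Python. This is a single-threaded illustration; the `waiting` table for hosts that do not yet own a queue and all class and method names are assumptions, not details from the patent.

```python
from collections import deque
from urllib.parse import urlsplit


class DynamicQueues:
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_of = [None] * num_queues  # host currently assigned to each queue
        self.queue_of = {}                  # host -> queue index
        self.waiting = {}                   # host -> URLs for hosts without a queue

    def _assign(self, idx, host, urls):
        self.host_of[idx] = host
        self.queue_of[host] = idx
        self.queues[idx] = urls

    def enqueue(self, url):
        host = urlsplit(url).hostname or ""
        if host in self.queue_of:
            self.queues[self.queue_of[host]].append(url)
        elif None in self.host_of:
            # A queue is free: assign it to this host.
            self._assign(self.host_of.index(None), host, deque([url]))
        else:
            self.waiting.setdefault(host, deque()).append(url)

    def dequeue(self, idx):
        """Take the next URL from queue `idx`, or None if it is empty."""
        queue = self.queues[idx]
        if not queue:
            return None
        url = queue.popleft()
        if not queue:
            # The queue emptied: release its host and assign a waiting host, if any.
            del self.queue_of[self.host_of[idx]]
            self.host_of[idx] = None
            if self.waiting:
                host = next(iter(self.waiting))
                self._assign(idx, host, self.waiting.pop(host))
        return url
```

At any moment each queue holds URLs for one host only, so a thread that drains one queue never issues parallel requests to the same host.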

System and method for associating an extensible set of data with documents downloaded by a web crawler

Patent number: 6351755

Abstract: A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The web crawler includes a set of tools for storing an extensible set of data with each document address in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes an extensible set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each document, when to delete records of information, and so on.

Type: Grant

Filed: November 2, 1999

Date of Patent: February 26, 2002

Assignee: Alta Vista Company

Inventors: Marc Alexander Najork, Clark Allan Heydon

System and method for enforcing politeness while scheduling downloads in a web crawler

Patent number: 6321265

Abstract: A web crawler downloads data sets from among a plurality of host computers. The web crawler enqueues data set addresses in a set of queues, with all data set addresses sharing a respective common host address being stored in a respective common one of the queues. Each non-empty queue is assigned a next download time. Multiple threads substantially concurrently process the data set addresses in the queues. The number of queues is at least as great as the number of threads, and the threads are dynamically assigned to the queues. In particular, each thread selects a queue not being serviced by any of the other threads. The queue is selected in accordance with the next download times assigned to the queues. The data set corresponding to a data set address in the selected queue is downloaded and processed, and the data set address is dequeued from the selected queue. When the selected queue is not empty after the dequeuing step, it is assigned an updated download time.

Type: Grant

Filed: November 2, 1999

Date of Patent: November 20, 2001

Assignee: AltaVista Company

Inventors: Marc Alexander Najork, Clark Allan Heydon
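
The queue selection in the abstract above can be sketched in Python with a heap keyed by next download time. The fixed `delay` between downloads from one host, the explicit `acquire`/`release` pair, and the lack of real threads are simplifying assumptions; the abstract does not say how the updated download time is computed.

```python
import heapq
from collections import deque


class PoliteScheduler:
    def __init__(self, delay):
        self.delay = delay  # assumed fixed gap between downloads from one host
        self.queues = {}    # host -> deque of URLs (non-empty or in service)
        self.ready = []     # heap of (next_download_time, host), not in service

    def enqueue(self, host, url, now):
        queue = self.queues.get(host)
        if queue is None:
            queue = self.queues[host] = deque()
            heapq.heappush(self.ready, (now, host))
        queue.append(url)

    def acquire(self):
        """Claim the queue with the earliest next download time.

        Returns (when, host, url) or None. The caller waits until `when`
        before downloading. While claimed, the host is off the heap, so
        no other worker can select it.
        """
        if not self.ready:
            return None
        when, host = heapq.heappop(self.ready)
        return when, host, self.queues[host].popleft()

    def release(self, host, finished_at):
        """Return a queue after a download and assign its updated time."""
        if self.queues[host]:
            heapq.heappush(self.ready, (finished_at + self.delay, host))
        else:
            del self.queues[host]
```

A worker loop would call `acquire`, sleep until the returned time, download the URL, and then call `release` with the completion time.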

System and method for efficient representation of data set addresses in a web crawler

Patent number: 6301614

Abstract: A web crawler stores fixed length representations of document addresses in first and second caches and a disk file. When the web crawler downloads a document from a host computer, it identifies URL's (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation is systematically compared to numerical representations in the caches and disk file. If the representation is not found in the caches and disk file, the document corresponding to the representation is scheduled for downloading, and the representation is stored in the second cache. If the representation is not found in the caches but is found in the disk file, the representation is added to the first cache. When the second cache is full, it is merged with the disk file and the second cache is reset to an initial state. When the first cache is full, one or more representations are evicted in accordance with an eviction policy.

Type: Grant

Filed: November 2, 1999

Date of Patent: October 9, 2001

Assignee: Alta Vista Company

Inventors: Marc Alexander Najork, Clark Allan Heydon
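
The lookup order in the abstract above (first cache, second cache, disk file) can be sketched in Python. The sketch assumes a 64-bit SHA-1 prefix as the fingerprint, least-recently-used eviction for the first cache, and an in-memory set standing in for the disk file; the abstract leaves the fingerprint function and eviction policy open.

```python
import hashlib
from collections import OrderedDict


def fingerprint(url):
    """Fixed-size numerical representation of a URL (assumed: 64 bits of SHA-1)."""
    return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")


class SeenUrls:
    def __init__(self, cache1_size, cache2_size):
        self.cache1 = OrderedDict()  # fingerprints recently found on disk (LRU)
        self.cache1_size = cache1_size
        self.cache2 = set()          # fingerprints not yet merged to disk
        self.cache2_size = cache2_size
        self.disk = set()            # stand-in for the disk file

    def is_new(self, url):
        """Return True if the URL should be scheduled for download."""
        fp = fingerprint(url)
        if fp in self.cache1:
            self.cache1.move_to_end(fp)
            return False
        if fp in self.cache2:
            return False
        if fp in self.disk:
            # Found only on disk: remember it in the first cache.
            self.cache1[fp] = True
            if len(self.cache1) > self.cache1_size:
                self.cache1.popitem(last=False)  # evict least recently used
            return False
        # Not found anywhere: record it in the second cache.
        self.cache2.add(fp)
        if len(self.cache2) >= self.cache2_size:
            self.disk |= self.cache2  # merge with the disk file
            self.cache2.clear()       # reset the second cache
        return True
```

The first cache saves repeated disk lookups for popular URLs, while the second batches new fingerprints so that the disk file is updated in bulk.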

Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness

Patent number: 6263364

Abstract: A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The Frontier includes a set of parallel “priority queues,” each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.

Type: Grant

Filed: November 2, 1999

Date of Patent: July 17, 2001

Assignee: Alta Vista Company

Inventors: Marc Alexander Najork, Clark Allan Heydon, Janet Lynn Wiener
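
The two-tier Frontier in the abstract above can be sketched in Python. The distribution policy shown here, which drains strictly from the highest priority level first up to a fixed budget, is an assumption; the abstract only says elements are distributed "in accordance with their relative priorities". Class and method names are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class PriorityFrontier:
    def __init__(self, levels):
        # Index 0 is treated as the highest priority level (an assumption).
        self.priority_queues = [deque() for _ in range(levels)]
        self.host_queues = {}  # underlying queues, one per host component

    def add(self, url, level):
        """Store a queue element in the priority queue for its level."""
        self.priority_queues[level].append(url)

    def distribute(self, budget):
        """Move up to `budget` elements into the underlying per-host queues."""
        moved = 0
        for queue in self.priority_queues:
            while queue and moved < budget:
                url = queue.popleft()
                host = urlsplit(url).hostname or ""
                self.host_queues.setdefault(host, deque()).append(url)
                moved += 1
        return moved
```

Download threads would then work only on `host_queues`, so prioritization and per-host ordering stay separate concerns.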
