Patents by Inventor Marc Alexander Najork

Marc Alexander Najork has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 7962510
    Abstract: Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content for web spam using content-based identification techniques.
    Type: Grant
    Filed: February 11, 2005
    Date of Patent: June 14, 2011
    Assignee: Microsoft Corporation
    Inventors: Marc Alexander Najork, Dennis Craig Fetterly, Mark Steven Manasse, Alexandros Ntoulas
  • Patent number: 7680785
    Abstract: Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules.
    Type: Grant
    Filed: March 25, 2005
    Date of Patent: March 16, 2010
    Assignee: Microsoft Corporation
    Inventor: Marc Alexander Najork
  • Patent number: 7627777
    Abstract: Fault tolerance is provided for a database of hyperlinks distributed across multiple machines, such as a scalable hyperlink store. The fault tolerance enables the distributed database to continue operating, with brief interruptions, even when some of the machines in the cluster have failed. A primary database is provided for normal operation, and a secondary database is provided for operation in the presence of failures.
    Type: Grant
    Filed: March 17, 2006
    Date of Patent: December 1, 2009
    Assignee: Microsoft Corporation
    Inventor: Marc Alexander Najork
  • Patent number: 7139747
    Abstract: The present invention provides for the efficient downloading of data set addresses from among a plurality of host computers, using a plurality of web crawlers. Each web crawler identifies URL's in data sets downloaded by that web crawler, and identifies the host computer identifier within each such URL. The host computer identifier for each URL is mapped to the web crawler identifier of one of the web crawlers. If the URL is mapped to the web crawler identifier of a different web crawler, the URL is sent to that web crawler for processing, and otherwise the URL is processed by the web crawler that identified the URL. Each web crawler sends URL's to the other web crawlers for processing, and each web crawler receives URL's from the other web crawlers for processing. In a preferred embodiment, each web crawler processes only the URL's assigned to it, which are the URL's whose host identifier is mapped to the web crawler identifier for that web crawler.
    Type: Grant
    Filed: November 3, 2000
    Date of Patent: November 21, 2006
    Assignee: Hewlett-Packard Development Company, L.P.
    Inventor: Marc Alexander Najork
  • Patent number: 6952730
    Abstract: A web crawler stores fixed length representations of document addresses in a buffer and a disk file, and optionally in a cache. When the web crawler downloads a document from a host computer, it identifies URL's (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation may optionally be systematically compared to the contents of a cache containing web sites which are likely to be found during the web crawl, for example previously visited web sites. The numerical representation is then systematically compared to numerical representations in the buffer, which stores numerical representations of recently-identified URL's. If the representation is not found in the buffer, it is stored in the buffer. When the buffer is full, it is ordered and then merged with numerical representations stored, in order, in the disk file.
    Type: Grant
    Filed: June 30, 2000
    Date of Patent: October 4, 2005
    Assignee: Hewlett-Packard Development Company, L.P.
    Inventors: Marc Alexander Najork, Clark Allan Heydon
  • Patent number: 6594694
    Abstract: A system generates a list of near-uniform samples of data sets (e.g., web pages) from among a plurality of host computers. The system performs a random walk so as to generate a set of visited addresses. For each address in the set, a reachability measure is computed. Then, samples are selected from the set, such that the probability of selecting a given address is inversely proportional to the reachability measure for the address. The selected samples form the list of near-uniform samples.
    Type: Grant
    Filed: May 12, 2000
    Date of Patent: July 15, 2003
    Assignee: Hewlett-Packard Development Company, LP.
    Inventors: Marc Alexander Najork, Clark Allan Heydon, Michael Mitzenmacher, Monika H. Henzinger
  • Patent number: 6377984
    Abstract: A method and system for scheduling downloads in a web crawler. A web crawler may use multiple threads to download documents from the world wide web. Both threads and queues are identified by numerical ID's. Each thread in the web crawler is assigned to dequeue from a queue until the assigned queue is empty. Each thread enqueues URL's as new URL's are discovered in the course of downloading web pages. In one embodiment, when a thread discovers a new URL, a numerical function is performed on the URL's host component to determine the queue in which to enqueue the new URL. In another embodiment, each queue in a web crawler may be dynamically assigned to a host computer so that URL's enqueued into the same queue all have the same host component. When a queue becomes empty, a new host may be dynamically assigned to it.
    Type: Grant
    Filed: November 2, 1999
    Date of Patent: April 23, 2002
    Assignee: Alta Vista Company
    Inventors: Marc Alexander Najork, Clark Allan Heydon
  • Patent number: 6351755
    Abstract: A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The web crawler includes a set of tools for storing an extensible set of data with each document address in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes an extensible set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each document, when to delete records of information, and so on.
    Type: Grant
    Filed: November 2, 1999
    Date of Patent: February 26, 2002
    Assignee: Alta Vista Company
    Inventors: Marc Alexander Najork, Clark Allan Heydon
  • Patent number: 6321265
    Abstract: A web crawler downloads data sets from among a plurality of host computers. The web crawler enqueues data set addresses in a set of queues, with all data set addresses sharing a respective common host address being stored in a respective common one of the queues. Each non-empty queue is assigned a next download time. Multiple threads substantially concurrently process the data set addresses in the queues. The number of queues is at least as great as the number of threads, and the threads are dynamically assigned to the queues. In particular, each thread selects a queue not being serviced by any of the other threads. The queue is selected in accordance with the next download times assigned to the queues. The data set corresponding to a data set address in the selected queue is downloaded and processed, and the data set address is dequeued from the selected queue. When the selected queue is not empty after the dequeuing step, it is assigned an updated download time.
    Type: Grant
    Filed: November 2, 1999
    Date of Patent: November 20, 2001
    Assignee: AltaVista Company
    Inventors: Marc Alexander Najork, Clark Allan Heydon
  • Patent number: 6301614
    Abstract: A web crawler stores fixed length representations of document addresses in first and second caches and a disk file. When the web crawler downloads a document from a host computer, it identifies URL's (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation is systematically compared to numerical representations in the caches and disk file. If the representation is not found in the caches and disk file, the document corresponding to the representation is scheduled for downloading, and the representation is stored in the second cache. If the representation is not found in the caches but is found in the disk file, the representation is added to the first cache. When the second cache is full, it is merged with the disk file and the second cache is reset to an initial state. When the first cache is full, one or more representations are evicted in accordance with an eviction policy.
    Type: Grant
    Filed: November 2, 1999
    Date of Patent: October 9, 2001
    Assignee: Alta Vista Company
    Inventors: Marc Alexander Najork, Clark Allan Heydon
  • Patent number: 6263364
    Abstract: A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The Frontier includes a set of parallel “priority queues,” each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.
    Type: Grant
    Filed: November 2, 1999
    Date of Patent: July 17, 2001
    Assignee: Alta Vista Company
    Inventors: Marc Alexander Najork, Clark Allan Heydon, Janet Lynn Wiener