SYSTEM, METHOD AND COMPUTER READABLE MEDIUM FOR WEB CRAWLING
In a web crawler, a URL selection module selects URLs for pages to be downloaded. The URL selection module accesses an interaction data store that stores interaction data for web pages, including interaction data that indicates human interactions with the pages. To reduce the effects of link farms, the URL selection module filters the URLs to select only those URLs that have human interaction histories and provides the selected URLs to a download module for web page downloading.
This invention relates to a system, method and computer medium for crawling the web to find relevant internet content.
BACKGROUND OF THE INVENTION
In internet technology, web crawlers are used to find new web pages by collecting and following URLs (Uniform Resource Locators). By following a URL and downloading the corresponding web page, the links within that web page can be added to the web crawler's URL collection. The web pages are stored for indexing and ranking by internet search engines. Internet search engines use web page ranking algorithms that relate the links within a web page to the relevance of the web page.
The use of link popularity algorithms to rank web pages has led to the problem of “link farms”. In order to manipulate a web page's ranking, a large sub-web of interlinked web pages is created and linked to a web page so that the page receives a high search engine ranking. In addition to distorting web page rankings, link farms cause a web crawler to spend considerable resources following links and collecting web pages for eventual indexing into a search engine, even though many of these pages are created only for page ranking and are neither used by nor useful to humans.
What is required is a system, method and computer readable medium that provides enhanced web crawling.
SUMMARY OF THE INVENTION
In one aspect of the disclosure, there is provided a method for web crawling comprising determining a plurality of Uniform Resource Locators (URLs), determining a subset of the plurality of URLs that have associated interaction data, selecting at least one URL of the subset, and downloading a web page corresponding to the at least one selected URL.
In one aspect of the disclosure, there is provided a web crawler comprising at least one Uniform Resource Locator (URL) data store that stores a plurality of URLs, at least one interaction data store that stores interaction data for a plurality of web pages, at least one download module that downloads web page content corresponding to a URL, and at least one URL selection module in communication with the at least one URL data store and the at least one interaction data store. The interaction data indicates an interaction between a human and a web page corresponding to a URL. The at least one URL selection module selects at least one URL from the at least one URL data store that has interaction data in the at least one interaction data store. The at least one URL selection module provides the at least one selected URL to the at least one download module.
In one aspect of the disclosure, there is provided a computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to select a Uniform Resource Locator (URL) from a URL data store, look up the selected URL in an interaction data store to determine if interaction data exists for the selected URL in the interaction data store, and if interaction data exists for the selected URL, provide the selected URL to a download module.
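The selection step summarized above can be sketched in a few lines of Python. This is a minimal illustration only, not the disclosed implementation: it assumes the URL data store is an iterable of strings, the interaction data store is a dictionary keyed by URL, and the download module is any callable. The names `select_urls` and `crawl_step` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def select_urls(url_store: Iterable[str],
                interaction_store: Dict[str, dict]) -> List[str]:
    """Return the URLs that have an entry in the interaction data store."""
    return [url for url in url_store if url in interaction_store]


def crawl_step(url_store: Iterable[str],
               interaction_store: Dict[str, dict],
               download: Callable[[str], None]) -> List[str]:
    """Look up each URL and pass only those with interaction data to `download`."""
    selected = select_urls(url_store, interaction_store)
    for url in selected:
        download(url)  # stand-in for the download module (assumption)
    return selected
```

In this sketch, a URL that was harvested from a link farm but never visited by a person has no entry in the dictionary and is therefore never handed to the download callable.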
Reference will now be made, by way of example only, to specific embodiments and to the accompanying drawings in which:
A system 10 for providing web crawling in accordance with an embodiment of the disclosure is illustrated in
A web crawling method using the system 10 of
The download module 14 downloads web pages 22 from the internet 20 and extracts linked URLs 13 from the downloaded pages. The operation of the download module in accordance with an embodiment of the disclosure is illustrated in the flowchart 200 of
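As an illustration of the link extraction performed by a download module, the following sketch uses Python's standard `html.parser` and `urllib.parse` modules to collect the `href` targets of anchor tags and resolve them against the page URL. The class and function names are hypothetical, and the disclosure does not specify how links are parsed; fetching the page itself is left out so the example stays self-contained.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the downloaded page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The returned list corresponds to the linked URLs that would be added to the URL data store for later selection.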
The operation of the URL selection module 18 in accordance with an embodiment of the disclosure is shown in the flowchart 300 of
The interaction data in the interaction data store 19 may be derived from interactions between users and the web page at client browsers, for example as described in any of the Applicant's co-pending patent applications Attorney Docket Nos. HAUSER001, HAUSER002, HAUSER006, HAUSER007, HAUSER007B, HAUSER008, HAUSER009, HAUSER010, the entire contents of each of which are explicitly incorporated herein by reference. In particular, event recorders provided within the web pages may record event data during these interactions and provide event streams to an event server. An example of an event data processing system is illustrated in
The web server 114 may be modified such that the web page content provided to the client 118 includes an event observer module 126 which may be provided as appropriate code or scripts that run in the background of the client's browser 115. In one embodiment, code for providing the event observer module 126 is provided to the web server 114 by a third party service, such as provided from an event server 112, described in greater detail below.
The event observer module 126 observes events generated in a user interaction with the web page 111 at the client 118. The event observer module 126 records events generated within the web browser 115, such as mouse clicks, mouse moves, text entries etc., and generates event streams 121 including an event header message 122 and one or more event stream messages 123. It will be apparent to a person skilled in the art that terms used to describe mouse movements are to be considered broadly and to encompass all such cursor manipulation devices and will include a plug-in mouse, on board mouse, touch pad, pixel pen, eye-tracker, etc.
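The recording behavior described above can be sketched as follows. This is a hedged illustration in Python rather than browser script: the message fields (`stream_id`, `url`, and the `(event type, timestamp)` event layout) are assumptions, since the actual message formats are defined in the drawings. The sketch also assumes the header message is emitted once, ahead of the first event stream message.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class EventHeaderMessage:
    stream_id: str  # hypothetical field
    url: str        # hypothetical field


@dataclass
class EventStreamMessage:
    stream_id: str
    events: List[Tuple[str, float]]  # (event type, timestamp); hypothetical layout


class EventObserver:
    """Buffers observed events and batches them into event stream messages."""

    def __init__(self, stream_id: str, url: str) -> None:
        self.stream_id = stream_id
        self.url = url
        self._pending: List[Tuple[str, float]] = []
        self._header_sent = False

    def record(self, event_type: str, timestamp: float) -> None:
        self._pending.append((event_type, timestamp))

    def flush(self) -> List[Union[EventHeaderMessage, EventStreamMessage]]:
        """Return the messages to send now; a header precedes the first batch only."""
        if not self._pending:
            return []
        messages: List[Union[EventHeaderMessage, EventStreamMessage]] = []
        if not self._header_sent:
            messages.append(EventHeaderMessage(self.stream_id, self.url))
            self._header_sent = True
        messages.append(EventStreamMessage(self.stream_id, self._pending))
        self._pending = []
        return messages
```

Calling `flush()` periodically yields a header followed by a stream message on the first call and a stream message alone on later calls.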
The event observer module 126 provides the event streams 121 to the event server 112. An example of an event header message 30 is illustrated in
During an interaction with the web page 111, a user navigates the web page 111 and may enter content where appropriate, such as in the HTML form elements. During this interaction events are generated and recorded by the event observer module 126. Periodically, the event observer module 126 formulates an event stream message 123 preceded by an event header message 122 if one has not yet been sent. The event observer module 126 passes the event stream messages 123 to an event module 125 of the event server 112. In the embodiment illustrated in
The event server 112 processes the event stream 121 in the event module 125 or an equivalent component, to analyze the event stream data. Analyzed data may be stored with the raw event stream messages in a content data store 128. Additional modules of the event server may include an attention analysis module 139 as described in the Applicant's co-pending application HAUSER008 referenced above, and a content interest processing module 138 as described in the Applicant's co-pending application HAUSER009 referenced above. In one embodiment, the event stream data can be analyzed to determine the probability that the interaction that created the event stream at the client is a human dependent interaction, for example as described in the Applicant's co-pending patent application Attorney Docket No. HAUSER001 referenced above. In the present embodiment, the existence of any human interaction within the content areas of the web page, such as hints, lingers or clicks within the content areas, may be used to indicate the validity of a URL, and such statistics may be loaded into the interaction data store 19. In one embodiment, the web crawler 12 may include the event server 112 such that the web crawler is self-contained. In an alternative embodiment, human interaction data may be provided to the interaction data store as a third party service by an event server operator. Alternatively, the event server 112 may maintain its own interaction data store and provide access to the interaction data store as a service.
The interaction data store 19 may store raw event streams with processing of the event streams being performed by the URL selection module 18, for example to rank the URLs accordingly. Alternatively, the interaction data store may have an associated processing module (not shown) that pre-processes the interaction data so that the interaction data store stores the URLs in a ranked form. For example, a processing module may process the event streams to determine an event generator type (e.g. human, non-human, computer assisted human, etc.) as described in the Applicant's co-pending patent applications HAUSER001 and HAUSER006 referenced above. Once an interaction with a webpage has been classified as a human interaction, the data may be further processed to rank the particular behavior of the interactions. For example, the event streams may be processed to select those event streams containing out-click events, i.e. events that a user produces to exit a web page. The event streams and/or the page content may also be analyzed to determine additional preferred behavior, such as a breadth-first traversal of the web site, backlink count, partial page-rank calculations, page-rank calculations using a link graph with URLs only if those URLs have sufficient human interaction, etc. In one embodiment, the interaction pattern for parked pages, link farms, auto-generated “spam” pages (that use random snippets from a variety of authentic web pages just to get high search engine ranking based on the keywords in the snippets) may be identified and used to remove these URLs from the crawl graph (not pursue the links) and/or remove such URLs from page-rank calculations.
A summary of the event statistics including any data used to rank the web pages may be stored in the interaction data store 19.
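One way such a stored summary might be used for ranking is sketched below. The record layout (`generator_type`, `out_clicks`) and the scoring rule (more out-click events ranks higher) are assumptions made for illustration; the disclosure lists several other possible ranking inputs, such as backlink counts and partial page-rank calculations, that are not modeled here.

```python
from typing import Dict, List, NamedTuple


class InteractionSummary(NamedTuple):
    generator_type: str  # e.g. "human", "non-human", "computer assisted human"
    out_clicks: int      # number of out-click events recorded for the page


def rank_urls(summaries: Dict[str, InteractionSummary]) -> List[str]:
    """Keep URLs classified as human interactions, ordered by out-click count."""
    human = [(url, s) for url, s in summaries.items()
             if s.generator_type == "human"]
    # Stable sort: URLs with equal counts keep their original relative order.
    human.sort(key=lambda item: item[1].out_clicks, reverse=True)
    return [url for url, _ in human]
```

In this sketch, a page with many events generated by a non-human source is dropped before ranking, regardless of its event count.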
An alternative embodiment is illustrated in
An operation of the download module 214 is illustrated in the flowchart 400 of
In a further embodiment, the modified web crawler 212 of
An alternative URL selection policy may specify that URLs (or human out-click URLs) will only be followed if there is some form of human area of interest within the page where the URL was found, e.g. a content element with a high enough content interest score. A further alternative URL selection policy may specify that URLs (or human out-click URLs) will only be followed if they are found within a content element with high enough content interest.
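The second policy above, following a link only when it sits inside a content element of sufficient interest, might be expressed as the following filter. The `FoundLink` record, the element identifiers, the score dictionary and the threshold value are all hypothetical; how content interest scores are computed is described in the co-pending applications and is not reproduced here.

```python
from typing import Dict, Iterable, List, NamedTuple


class FoundLink(NamedTuple):
    url: str
    element_id: str  # hypothetical identifier of the content element holding the link


def links_to_follow(links: Iterable[FoundLink],
                    interest_scores: Dict[str, float],
                    threshold: float) -> List[str]:
    """Return URLs found inside content elements scoring at or above `threshold`."""
    # Elements with no recorded score are treated as zero interest (assumption).
    return [link.url for link in links
            if interest_scores.get(link.element_id, 0.0) >= threshold]
```

With a threshold of 0.5, for example, a link inside an article body scored 0.8 would be followed, while a link in a sidebar scored 0.1 would not.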
The URL selection policies followed by the URL selection module focus the web crawler's resources towards those web pages that are actively used by humans and thus generate particular attention events. Using the selection policies may significantly increase the efficiency of the web crawler and assist in providing higher quality page ranking statistics. Furthermore, as described above, common human browsing patterns can be recognized via attention analysis for link farm pages, parked pages where the most interesting content is advertisements, and auto-generated “spam” pages. Human out-clicks on pages that have no content of interest other than ads can be ignored by the URL selection module.
The embodiments described herein provide an enhanced system and method for web crawling that avoids spending resources collecting web pages that are not useful to humans. The effect of these embodiments is to reduce or eliminate the advantages of a link farm and to remove search engine spam. At current internet growth rates, the requirement to crawl less of the internet can provide large resource savings as well as making page ranking of web pages more efficient and useful for humans. By focusing crawling on the web pages relevant to and used by humans, the ability to artificially manipulate search engine rankings is reduced.
The web crawler 12 may be embodied in hardware, software, firmware or a combination of hardware, software and/or firmware. In a hardware embodiment, components of the web crawler 12 may be embodied in a device, such as server hardware, computer, etc. For example, the URL selection module 18 may include a processor 61 operatively associated with a memory 62 as shown in
Although embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous rearrangements, modifications, and substitutions without departing from the spirit of the invention as set forth and defined by the following claims. For example, the capabilities of the invention can be performed fully and/or partially by one or more of the blocks, modules, processors or memories. Also, these capabilities may be performed in the current manner or in a distributed manner and on, or via, any device able to provide and/or receive information. Further, although depicted in a particular manner, various modules or blocks may be repositioned without departing from the scope of the current invention. Still further, although depicted in a particular manner, a greater or lesser number of modules and connections can be utilized with the present invention in order to accomplish the present invention, to provide additional known features to the present invention, and/or to make the present invention more efficient. Also, the information sent between various modules can be sent between the modules via at least one of a data network, the Internet, an Internet Protocol network, a wireless source, and a wired source and via a plurality of protocols.
Claims
1. A method for web crawling comprising:
- determining a plurality of Uniform Resource Locators (URLs);
- determining a subset of the plurality of URLs that have associated interaction data;
- selecting at least one URL of the subset; and
- downloading a web page corresponding to the at least one selected URL.
2. The method according to claim 1 wherein determining the subset comprises:
- accessing an interaction data store that stores interaction data that associates a URL with an interaction with a web page corresponding to the respective URL; and
- selecting a URL into the subset if a web page corresponding to the URL has interaction data.
3. The method according to claim 2 comprising ranking the subset of URLs using the interaction data.
4. The method according to claim 3 wherein the interaction data indicates one or more out-click events from the corresponding web page of an associated URL and wherein ranking the subset of URLs comprises ranking the URLs dependent on the one or more out-click events.
5. The method according to claim 4 wherein ranking a URL is dependent on a source content element of the web page prior to an out-click event.
6. The method according to claim 5 wherein ranking a URL is dependent on an attention analysis ranking of the source content element.
7. The method according to claim 3 wherein selecting at least one URL comprises selecting a highest ranked URL.
8. The method according to claim 2 comprising selecting a URL into the subset if a web page corresponding to the URL has interaction data that indicates at least one human dependent interaction with a web page associated with the URL.
9. The method according to claim 1 wherein downloading a web page comprises determining content interest of one or more content elements of the web page.
10. The method according to claim 9 comprising storing content elements that satisfy a threshold content interest requirement.
11. The method according to claim 1 wherein said downloading comprises providing the subset of URLs to a download module.
12. The method according to claim 1 further comprising storing the interaction data comprising:
- receiving an event stream from an interaction between a user and a web page;
- analyzing the event stream; and
- storing the analyzed event stream in association with a URL for the respective web page.
13. A web crawler comprising:
- at least one Uniform Resource Locator (URL) data store that stores a plurality of URLs;
- at least one interaction data store that stores interaction data for a plurality of web pages, the interaction data indicating an interaction between a human and a web page corresponding to a URL;
- at least one download module that downloads web page content corresponding to a URL; and
- at least one URL selection module in communication with the at least one URL data store and the at least one interaction data store;
- wherein the at least one URL selection module selects at least one URL from the at least one URL data store that has interaction data in the at least one interaction data store; and
- wherein the at least one URL selection module provides the at least one selected URL to the at least one download module.
14. The web crawler according to claim 13 further comprising:
- at least one content interest data store that stores an attention ranking of one or more content elements of a web page; and
- at least one web page data store;
- wherein the download module is configured to: utilize the at least one content interest data store to determine a content interest score of one or more content elements of a downloaded web page; and store the one or more content elements of the downloaded web page in the at least one web page data store dependent on the respective content interest score.
15. The web crawler according to claim 14 wherein the at least one web page data store is configured to group content elements of a plurality of web pages according to their content interest score.
16. The web crawler according to claim 13 comprising an event server that:
- receives at least one event stream generated during an interaction with a web page on a client browser;
- analyzes the event stream; and
- stores the analyzed event stream in the interaction data store.
17. The web crawler according to claim 16 wherein the event server analyzes the at least one event stream to determine an event generator type of the event stream.
18. The web crawler according to claim 13 wherein the at least one interaction data store receives interaction data from an event server.
19. A computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to:
- select a Uniform Resource Locator (URL) from a URL data store;
- look up the selected URL in an interaction data store to determine if interaction data exists for the selected URL in the interaction data store; and
- if interaction data exists for the selected URL, provide the selected URL to a download module.
20. The computer readable medium according to claim 19 comprising computer-executable instructions for execution by the processor, that, when executed, cause the processor to:
- select a plurality of URLs from the URL data store that have corresponding interaction data in the interaction data store;
- rank the plurality of URLs according to out-click event data of the interaction data;
- provide at least one of the plurality of URLs to the download module; wherein the at least one URL provided to the download module is provided depending on the rank.
Type: Application
Filed: May 5, 2009
Publication Date: Nov 11, 2010
Applicants: (Frisco, TX), SUBOTI, LLC (Frisco, TX)
Inventor: Robert R. Hauser (Frisco, TX)
Application Number: 12/435,774
International Classification: G06F 17/30 (20060101);