AVOIDING MASKED WEB PAGE CONTENT INDEXING ERRORS FOR SEARCH ENGINES
Multiple non-host client sites provide cached user copies of web pages and/or web content, or summaries thereof, to a server. Obtaining data from non-host sources for indexing purposes avoids masked web page content indexing errors for search engines. The server aggregates, summarizes and indexes the web pages and/or web content in an index of cached content, in conjunction with updating, generating and storing a search index using an indexing agent such as a web crawler or spider. In response to receiving search requests from end users, the search engine uses comparisons between the index of cached content and the index of crawled content to identify potential page masking errors for specific search results, and to appropriately rank or omit results with a high risk of masking errors in a search result list.
This patent application is a continuation of U.S. application Ser. No. 12/425,269, filed Apr. 16, 2009, now U.S. Pat. No. 9,405,831, which claims priority pursuant to 35 U.S.C. §119(e) to U.S. provisional application Ser. No. 61/045,491, filed Apr. 16, 2008, both of which are hereby incorporated by reference in their entireties.
BACKGROUND

1. Field
This application relates to computer search engines, and more particularly to avoiding masked web page content indexing errors.
2. Description of Related Art
Obtaining useful data parameters for generating search indexes has become increasingly important for designers of search engines. Search engines are used by computer users of all ages and abilities, and endeavor to provide information correctly matched to users' search requests.
Generally, search engines use corresponding search indexes to obtain search results for these computer users. In turn, search engines use a variety of techniques to obtain data for these search indexes. For example, some search engines automatically generate their listings using software known as “crawlers” or “bots” or “spiders”. Generally speaking, crawlers find and interact with web pages, request each web page from its host, read the web page, and follow links on each web page to other pages within the web site. The read information may consist of words, terms, network addresses, or other parameters useful for obtaining search results desired by computer users. After obtaining these parameters, crawlers provide their results for indexing in a search index available to the search engine. The search index may include the web pages themselves or summaries of the web pages' content. Finally, search engine software may process the web pages or the summarized content in the search index to retrieve search results and rank the pages according to a specific algorithm.
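By way of illustration only, the crawler behavior described above can be captured in a short Python sketch that fetches pages, records the words read on each page in a term index, and follows links to further pages. All names here are hypothetical and the logic is deliberately simplified; it does not represent the crawler of any particular search engine.

```python
import collections
import re
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url; returns a term index mapping
    each word to the set of URLs on which it was read."""
    index = collections.defaultdict(set)
    queue, visited = [seed_url], set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable or non-HTTP link; move on
        for word in re.findall(r"[a-z]{3,}", html.lower()):
            index[word].add(url)  # read words into the term index
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:  # follow links to other pages
            queue.append(urllib.parse.urljoin(url, link))
    return index
```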
Other search engines rely upon hosts' descriptions of web pages or web sites to generate listings in the search index. The search engine software searches only for matches in the descriptions submitted by the hosts, which may be prepared by a human operator. In addition, some search engines combine crawler-based search indexes with human-based search indexes to generate hybrid search indices.
All of these methods generate search indexes by reading web pages on the hosts' servers or databases, or by relying upon the hosts' descriptions of the content of their web pages. In either situation, these search engines cannot avoid content errors caused by the hosts themselves. Oftentimes, hosts seek to generate higher ranking scores on popular search engines by responding to a crawler's request with false copies of web pages, or by submitting false descriptions of a web page's content to a human-based search engine. The hosts' actual content is therefore said to be “masked” by misleading information provided in response to a crawler request. Inaccurate indexing caused by hosts providing deliberately inaccurate data about hosted content may be referred to as a masked web page indexing error.
Accordingly, it is desirable to provide methods and systems to avoid these masked web page content indexing errors, thereby generating more useful results for search requests by computer users.
SUMMARY

Masked web page content indexing errors are avoided by obtaining cached user copies of web pages from sources other than the hosts of the web pages. The hosts of the web pages may be indicated by the uniform resource locator (URL), network address associated with each web page or some other identifier. Sources other than the hosts of the web pages may include consumers of the information on each of the web pages or non-indexing sources that do not have an interest in either providing erroneous data to “spiders” or “bots” used in creating indexes for search engines or submitting false descriptions of their web page's content. Generally, such sources do not publish the cached web pages, which are stored on a private file system that is not publicly accessible using a URL or other address. User sources that are not the hosts of the web pages may acquire user copies of the web pages from the hosts and store (cache) the web pages in a non-public file system. By acquiring these cached user copies of web pages from such sources, the method and system avoids erroneous search results caused by hosts of web pages that “mask” their web pages with false content.
It should be understood, however, that use of cached content from non-public sources thwarts the systematic acquisition of content to process for an index that can be accomplished using an indexing agent such as a web crawler, spider, or “bot.” In addition, the private sources that cache web content should not be configured merely as indexing agents in disguise. If the search engine operator desires to operate a disguised indexing agent, it would be simpler to do so directly. In fact, search engine operators generally do not desire to operate disguised indexing agents, because doing so creates a definite risk of adverse technical or legal consequences. Therefore a barrier to use of privately cached content arises, in that such content may include a random or haphazard collection of content collected as a consequence of casual web surfing or other private use not intended for indexing purposes. The non-public cached content is not assembled in a systematic fashion, and will not include all of the content otherwise available to an indexing application. Relying exclusively on such private caches will likely result in a search index substantially smaller than one generated by a traditional indexing agent, for comparable resources expended. For at least these reasons, such private caches of web content are an unexpected and surprising source for use in generating a search index.
According to various embodiments of the invention, systems and methods for indexing web pages on the Internet are provided. The method includes: accessing a web page to create a first index of the web page; receiving a cached copy of the web page from a client; generating a second index of the web page using the cached copy; and ranking the web page based on a comparison between the first and second indexes of the web page; that is, using information obtained from the comparison to rank the web page in query result lists, including omitting any reference to the page from a results list if the first and second indexes for the page are not the same or similar.
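As a minimal sketch of this comparison and ranking step, suppose each index has been reduced to a set of key terms; a similarity measure between the two sets can then decide whether a result is kept, re-ranked, or omitted. The Jaccard measure and the 0.8 threshold below are illustrative assumptions, not values taken from this application.

```python
def rank_with_masking_check(base_score, crawler_index, cached_index,
                            similarity_threshold=0.8):
    """Return a ranking score for a page, or None to omit it.

    crawler_index and cached_index are sets of key terms extracted from
    the host-served page and from the client-cached page, respectively."""
    if not cached_index:
        return base_score  # no cached copy available for comparison
    union = len(crawler_index | cached_index) or 1
    similarity = len(crawler_index & cached_index) / union  # Jaccard
    if similarity < similarity_threshold:
        return None  # indexes are not the same or similar: omit the page
    return base_score
```

Under these assumptions, a page whose crawler index and cached index share few terms is treated as a probable masking error and dropped from the results list.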
The method may also include: generating an updated index based on the comparison between the first and second indexes of the web page; and generating a search result based on the updated index for at least one client. The cached copy of the web page from the client may be identified by a network address, which may be in a secured private file system, that differs from the address of the web page itself. The cached copy may include a uniform resource locator (URL) for the original web page, a network address, and one or more key terms. The URL may be used to correlate data from the second index to data from the first index. The second index may be generated by summarizing the cached copy of the web page. The second index may also be generated by aggregating a plurality of cached copies of the web page from one or more clients.
In one embodiment, the popularity of each of the web pages is measured by counting a number of the cached user copies received. In yet another embodiment, the method also includes distributing an application to the client. The application is configured to operate on the computer system of the client and to periodically transmit the cached copy of a web page on the user's computer system to a server. Each cached copy of the web page may be summarized prior to being sent to the server.
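The counting of cached user copies described in this embodiment might be kept as simply as the following sketch suggests; the names and the in-memory Counter are hypothetical conveniences for illustration.

```python
from collections import Counter

popularity = Counter()

def record_cached_copy(host_url):
    """Increment the page's popularity indicator each time a cached
    user copy identified by that host URL is received."""
    popularity[host_url] += 1

# Example: after three clients send copies of the same page,
# its popularity indicator reads 3.
for _ in range(3):
    record_cached_copy("http://example.com/page")
assert popularity["http://example.com/page"] == 3
```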
In yet another embodiment, software similar to a crawler may be used to obtain cached user copies of web pages from user sources that obtain user copies of web pages from the web pages' hosts and store the user copies in an associated private cache. The software may be executed on a server. This quasi-crawler transmits the cached user copies to a server for indexing in the search engine's database. Instead of, or in addition to, obtaining data parameters for the search index by requesting web pages from host web servers, the quasi-crawler requests cached user copies from sources that are not affiliated with the hosts of the web servers.
In still another embodiment, software may be used that prompts non-host clients (e.g., clients operated by consumers of hosted content) to allow an application to be downloaded onto the non-host client computer systems. In this arrangement, the application periodically sends the cached user copies, or summaries of the cached user copies, for indexing in the search engine's database. Again, this arrangement avoids errors resulting from information requests directed to the hosts themselves.
A more complete understanding of the method and system for avoiding masked web page content indexing errors for search engines will be realized by one of ordinary skill in the art, as well as a realization of additional advantages and objects thereof, by considering the following detailed description. Reference will be made to the appended sheets of drawings, which will first be described briefly.
Like numerals refer to like parts throughout the several views of the drawings.
DETAILED DESCRIPTION OF VARIOUS EMBODIMENTS

The present method and system avoids masked web page content indexing errors for search engines. One of ordinary skill in the art will find that there are a variety of ways to design a client or server architecture. Therefore, the method and system disclosed herein is not limited to a specific client or server architecture. For example, summarizing cached user copies of web pages may be performed at the client or server level. For further example, it may be advantageous to perform calculations and processor commands at the client level, thereby freeing up server capacity and network bandwidth.
The client sites 105 may comprise personal computers, portable computers, compact players, cell phones, or digital assistants. The host web servers 104 store and serve data and software to the client sites 105. The host web servers may store web pages and/or web content on attached databases 110, such as on a local disk cache. The client sites 105 may include software or firmware that is configured to work cooperatively with software or firmware running on the host web servers 104. Generally, the client sites 105 may use web browsers 111 or other access applications to access and display data and software from the host web servers 104, and may store copies of web sites in a cache 112.
The client sites should not operate search indexing agent applications that crawl or locate web content for indexing; in the first instance, therefore, the clients access and cache web content for reasons other than building a search index. Use of the cached content from the client sites for indexing purposes as described herein should be secondary to that initial or primary purpose for accessing and storing the content. For example, the client sites may operate web browsers that access web content through casual web surfing performed by their respective users. Therefore, to provide an adequately large database of privately cached web content, numerous client sites may be required. For example, hundreds, thousands, tens of thousands, or more participating client sites may be desirable, depending on the amount of content that is to be screened for page masking errors. Methods as described herein may be applied to limit the amount of content that is screened for errors, which may help reduce the number of client sites required.
In an example embodiment, the server 101 includes server agents 113, indexing applications 114 and various other applications to ensure communication of data between the server 101 and the network 102, as well as between the server 101 and the database server 107. In another arrangement, the database server 107 may be connected to the network 102 instead of directly connected to the server 101. The server 101 is configured to communicate through the network 102 with client sites 105, a search engine 103 and host web servers 104. To obtain indexing data for use with the search engine 103, the server agents 113 may be used to communicate with the host web servers 104 to locate and request web pages, read the web pages and transmit the web pages back to the server 101. As discussed above, this type of indexing agent may be referred to as a crawler. A special crawler, herein called a quasi-crawler, may be specially configured to obtain cached user copies from non-host client sites. A specific algorithm may be coded as the application to be executed on the server 101. The server 101 may also obtain indexing data by allowing the host web servers 104 to submit summaries of their web pages' content. The server agents 113 may therefore obtain indexing information from crawlers; from quasi-crawlers 115 that obtain data from client sites 105 that have themselves obtained copies of web pages from host web servers 104; and from the hosts of the host web servers 104 themselves.
The search engine 103 may be a specialized application server configured to scan a search index and return a list of ranked uniform resource locators (URLs). Various proprietary search engines have been developed that scan web pages for URLs, network addresses, key terms, number of visits, and numerous other variables. Search engines have further developed complex algorithms that weigh these variables in order to return the most relevant search results 116. Examples of search engines include Google®, MetaCrawler®, Yahoo®, MSN Search®, AltaVista®, Lycos®, Ask® and others.
A search engine administrative site, such as server 101, may automatically identify host sites suspected of providing masked web pages to search indexing agents, thereby reducing the amount of content that needs to be screened for page masking errors. This might be done using a statistical filtering process. For example, certain hosts may be trusted and known never to engage in content masking in response to search index requests. Pages from these trusted hosts may be eliminated from prospective candidate sources of masked page errors. Of the remaining sources, analysis may reveal patterns of masked page errors that regularly occur with certain keywords. For example, it may be learned that over the past 30-day period, the top 100 search results for the terms “mortgage loan” contained 10% masked results, while for the next 100 results, the masked error rate was 2%. Thus, based on results such as these, an automatic quality control agent might check and correct more highly-ranked results for certain popular keywords, using the technology disclosed herein, while ignoring masked errors ranked lower than a selected threshold, for example, lower than the first 100 results, to limit resource allocation to correction of the most highly-ranked masked page errors. A quality control agent may therefore prepare requests identifying host URLs to check for page masking by comparison with cached user copies, and communicate such requests to participating clients.
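The statistical filtering process described above might be sketched as follows. The trusted-host set, the 100-result cutoff, and the 5% error-rate floor are illustrative parameters; the application gives only the 10% and 2% figures as an example of observed error rates.

```python
from urllib.parse import urlparse

def select_urls_to_check(results_by_keyword, trusted_hosts, error_rates,
                         rank_cutoff=100, min_error_rate=0.05):
    """Pick (keyword, URL) pairs worth screening for page masking.

    results_by_keyword maps a keyword to its ranked result URLs, and
    error_rates maps a keyword to its observed masked-error rate among
    top results (e.g., 0.10 for "mortgage loan" in the example above)."""
    to_check = []
    for keyword, urls in results_by_keyword.items():
        if error_rates.get(keyword, 0.0) < min_error_rate:
            continue  # too few masked errors observed for this keyword
        for url in urls[:rank_cutoff]:  # ignore results below the cutoff
            if urlparse(url).hostname in trusted_hosts:
                continue  # trusted hosts are assumed never to mask
            to_check.append((keyword, url))
    return to_check
```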
A program 115, similar to a crawler, may be executed in conjunction with the server agents 113 on the server 101 to request the cached user copies 206 of web pages from the client sites 200. Each cached copy may be identified by a corresponding host URL. The cached user copies 206 may be stored on a hard disk 205 or other memory device on the client sites 200. As such, the cached user copies 206 are not accessible using only the host URL, because the host URL indicates the original host location and not the location of the cached copy. In addition, the cached user copies 206 may be protected by a firewall or may be password protected. Therefore, the cached user copies are generally not publicly accessible using the host URL. This is in contrast to the web pages available from the host web servers 104, which are generally freely available on the World Wide Web using the corresponding host URLs. Where the cached user copies are behind a firewall or are password protected, the application 207 may request access to the cached user copies 206 from the client sites 200. The application may identify specific copies using a host URL, which the client may translate as indicating a specific copy. In the alternative, or in addition, the application may request and receive all cached copies located on a particular client or clients.
In another arrangement, the server agent may prompt the client sites 200 to download an application 207. By agreeing to download the application 207, the client sites 200 allow the application 207 access to the cached user copies 206 stored on the client sites 200. Incentives may be provided to encourage clients to download application 207, such as access to free or discounted on-line services, merchandise coupons, cash payments, or other benefits. The application 207 may be stored on the hard disk 205 or other memory device of the client sites 200. The application 207 is configured to transmit the cached user copies 206 to the server 101 for indexing. Alternatively, the application 207 may be configured to summarize the cached user copies 206 before transmitting the summarized cached user copies to the server 101. The client application 207 may also be used to request and cache a web page hosted at a specific URL designated by the server 101. However, as this may be viewed as operating a disguised indexing agent, such functionality may be designed out or limited to rare circumstances. It is anticipated that, given a sufficient number of participating clients, there should be little or no need to request specific pages. This is because page masking errors should be much more frequent with popular URLs, which are correspondingly likely to be accessed by at least some of the clients without specifically directing the clients to do so. After the cached user copy 206 is obtained, the client application may provide the copy or a summary thereof to the server 101 as previously described.
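One plausible shape for the client application 207 is sketched below in Python. The cache directory, the server endpoint URL, the assumption that cache file names encode host URLs, and the daily transmission interval are all hypothetical; a real deployment would read the browser's actual cache format and use a negotiated upload protocol.

```python
import hashlib
import json
import pathlib
import time
import urllib.request

CACHE_DIR = pathlib.Path.home() / ".browser_cache"    # hypothetical location
SERVER_URL = "http://indexing-server.example/submit"  # hypothetical endpoint

def summarize(page_bytes, host_url):
    """Reduce a cached page to the fields the server indexes:
    host URL, a content digest, and a handful of key terms."""
    text = page_bytes.decode("utf-8", errors="replace").lower()
    key_terms = sorted({w for w in text.split() if w.isalpha()})[:50]
    return {"url": host_url,
            "digest": hashlib.sha256(page_bytes).hexdigest(),
            "key_terms": key_terms}

def transmit_summaries():
    """Summarize every cached copy and post the summaries to the server.
    Assumes, for illustration, that cache file names encode host URLs."""
    for path in CACHE_DIR.glob("*.html"):
        summary = summarize(path.read_bytes(), path.stem)
        request = urllib.request.Request(
            SERVER_URL, data=json.dumps(summary).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)

if __name__ == "__main__":
    while True:                # periodic transmission, as described above
        transmit_summaries()
        time.sleep(24 * 3600)  # once per day (illustrative interval)
```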
Accordingly, the client sites 200 may comprise a secondary source for web pages and web content, having received the web pages and web content from the host web servers 104 and cached them in the memory of the client sites 200. These cached user copies 206 of web pages and/or web content should not comprise masked content, because the host web servers 104 should not be able to identify the client sites 200 as search indexing agents and will therefore not supply masked content to them. The cached copies may therefore be used to lessen masked web page content indexing errors by comparing the cached pages or index data from cached pages to pages or index data from corresponding host pages, or by using the cached pages or index data in lieu of host data to build the search index. An indexing process may therefore use cached pages when available to identify masked content and to correct index data based on masked content that is thereby identified. A search index 107 for a search engine 103 composed of this data may therefore provide search results 116 that contain far fewer errors than existing search indexes for search engines. This in turn may result in more useful, focused search results for the end users.
At 410, optionally in parallel with 400, an application may be distributed to the client sites, prompting the client sites to download the application and agree to allow the application to transmit cached user copies of web pages and/or web content to the server. Either one of these steps for obtaining cached user copies may be repeated periodically, or at any other beneficial time.
At 420, the cached user copies may be received by the server. Once received, the server may store the cached user copies in an attached database or may process the cached user copies prior to storing the processed data. As an option, the server and/or server agent may summarize the cached user copies and provide a shortened version of each cached user copy. The summarized cached user copies may comprise, for example, URLs, network addresses and key terms. As another option, the server and/or server agent may aggregate the cached user copies, wherein the aggregated cached user copies are cached user copies from multiple client sites.
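A sketch of how the server might aggregate per-client summaries of the same page into one record, assuming summaries shaped like those produced by the client application sketch above; the record layout is an assumption for illustration.

```python
from collections import defaultdict

def aggregate_summaries(summaries):
    """Merge per-client summaries of the same page into one record.

    summaries is an iterable of dicts shaped like those produced by the
    client sketch: {"url": ..., "digest": ..., "key_terms": [...]}."""
    merged = defaultdict(lambda: {"digests": set(),
                                  "key_terms": set(),
                                  "copies": 0})
    for summary in summaries:
        record = merged[summary["url"]]
        record["digests"].add(summary["digest"])
        record["key_terms"].update(summary["key_terms"])
        record["copies"] += 1  # also usable as a popularity count
    return dict(merged)
```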
At 430, the cached user copies may be compared to a search index, and at 440, only the resulting changes may be applied to the search index. Alternatively, at 450, the cached user copies may be combined with the search index, and at 460, an updated search index may be generated. The indexing function may also include ranking the cached user copies or summarized data to alleviate the burden of ranking on the search engine. At 470, the updated search index may be stored in the database for use by the search engine in obtaining search results for end users.
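The compare-and-update path (steps 430 and 440) might then be realized as below, applying only the entries that differ from the existing index; the final update call corresponds to merging as in steps 450 and 460. The data shapes follow the aggregation sketch above and are assumptions for illustration.

```python
def update_search_index(search_index, aggregated):
    """Apply only the changes implied by cached copies (steps 430-440).

    search_index maps a URL to its indexed key-term set; aggregated is
    the output of aggregate_summaries above."""
    changes = {url: record["key_terms"]
               for url, record in aggregated.items()
               if search_index.get(url) != record["key_terms"]}
    search_index.update(changes)  # merge, as in steps 450-460
    return changes
```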
As another option, the application downloaded on the client sites may perform the summarizing and/or aggregating functions rather than the server. This could be beneficial for the server, freeing up network resources and processing capacity.
At 520, the search engine may receive the search index from either the server or the network database server. As stated above, the search index received by the search engine may already be ranked according to relevant criteria. For example, the server 101 or server agent 113 may rank the cached user copies or summarized data before providing the search index to the search engine.
Methods disclosed herein, for example, methods 400 or 500, may be encoded as program instructions on a computer-readable medium. For example, suitable instructions for performing these methods may be encoded on media such as a hard drive or other memory of server 101.
An index or index information may also be created by hashing a portion of, or the entire content of, the web page. Various hashing algorithms that are well known in the industry may be used to generate the index information. For example, index information may be created using information embedded in the <META> tags of web pages. The <META> tag may contain information about the web page such as its subject, content, author, date, and keywords. However, when a crawler identifies itself to a web site as a crawler or spider, web site administrators may manipulate the web page's <META> tag to provide misleading <META> tags or other misleading information that differs from what would be provided in response to non-indexing requests. Thus, as described below, another index or index information may be generated using a cached copy of the web page. Cached copies of web pages as described herein should be collected by ordinary users of the Internet and not by an indexing agent; thus such pages will not contain <META> tags or other information that differs from what is provided in response to non-indexing requests.
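A minimal sketch of hash-based index information and its use in flagging masking appears below. SHA-256 is an illustrative choice, and in practice the digest would be computed over a normalized portion of the content, since pages can legitimately differ between fetches (timestamps, rotating advertisements).

```python
import hashlib

def page_digest(page_bytes):
    """Hash the page content to produce compact index information."""
    return hashlib.sha256(page_bytes).hexdigest()

def looks_masked(crawler_copy, cached_copy):
    """A digest mismatch between the host-served copy and a user-cached
    copy suggests the host served different content to the crawler."""
    return page_digest(crawler_copy) != page_digest(cached_copy)
```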
At 615, a cached copy of the web page is received from a remote client. The remote client may comprise any computer on the network with the capability to view web pages on the Internet, operating a browser application and not operating a web crawler or the like. The remote client may be used to view web pages using a conventional web browser such as Internet Explorer® or Firefox®. In one embodiment, instead of receiving the entire cached copy of the web page, a summary of the cached copy is received from the client. The summary may contain index information such as relevant terms, frequently repeated terms, subject headings, internet address, URL, etc.
At 620, a second index of the web page may be generated using the received cached copy of the web page. By using the user's cached copy, the second index of the web page may index the actual content of the web page more accurately than an index generated by the web crawler. Typically, web site administrators have the capability of sending false pages to web crawlers in order to manipulate the search results. Thus, by using cached copies of the web page from one or more actual users, the administrators' ability to manipulate the search results is substantially reduced or eliminated.
At 625, the web page may be ranked based on a comparison between the index generated using web page information obtained by the crawler (the crawler index) and the index generated using web page information obtained from the cached copy or copies of the web page (the cached index) from one or more clients operating a web browser. If there are appreciable differences between the crawler and cached indexes, then the rank may be based entirely on the cached index, where a cached index is available. In circumstances where a cached copy of a web page is not available, the rank may be based entirely on the crawler index. If there are differences between the two indexes, the rank may also be based on a combination of both indexes or on their differences. Once the rank is generated, a search result is produced and displayed to a client or user.
In one embodiment, the cached index may be generated using a summary of the cached copy of the web page. Alternatively, the cached index is generated using an aggregated summary of multiple versions of the cached copy of the web page from one or more clients. An accuracy factor may be generated based on the number of available cached versions and the number of available clients that provided the cached copy of the web page. The accuracy factor is high if the numbers of available cached versions and cached page providers are high. The inverse would yield a low accuracy factor. In one embodiment, the rank of the web page is additionally based on the accuracy factor.
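The application does not specify a functional form for the accuracy factor, so the following sketch is one assumption: a value that grows toward 1.0 as the counts of cached versions and providing clients grow, which is then used to weight the page's rank. All names and the scale constant are hypothetical.

```python
def accuracy_factor(num_cached_versions, num_providing_clients, scale=10.0):
    """Map the two counts onto a 0-1 accuracy factor.

    Grows toward 1.0 as both counts grow; low counts yield a low factor.
    The functional form and the scale constant are illustrative only."""
    raw = (num_cached_versions * num_providing_clients) / scale
    return raw / (1.0 + raw)

def adjusted_rank(base_rank_score, num_cached_versions, num_providing_clients):
    """Weight a page's rank score by the accuracy of its cached-copy evidence."""
    return base_rank_score * accuracy_factor(num_cached_versions,
                                             num_providing_clients)
```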
At 640, an application may be distributed or uploaded to a client system for gathering cached copies of web pages on the user's computer system. The application may be configured to transmit the cached copies and/or versions of the web pages to the server for indexing. Alternatively, the application may be configured to summarize the cached copies before transmitting the summary to the server. The application may also be used to request and cache a web page hosted at a specific URL designated by the server. After the cached user copy is obtained, it may be provided to the server as previously described.
The first index information and the second index information may comprise a portion of, or an entire, index normally generated in the process of indexing a web page. It should be noted that the entire index need not be generated to implement the present invention. Additionally, system 700 may also include a means 765 for ranking the web page based on a comparison between the first index information and the second index information, comprising a search engine application operating on a server, using both the first index information and the second index information as input to provide ranked search results in response to user queries. This application may be configured to operate as described hereinabove.
System 700 may further comprise a means 765 for distributing an application that permits the server to receive cached copies of web pages from a client system, comprising an interactive interface application operating on a server and configured to transmit the application to requesting clients in response to client requests.
Processor 730 may effect initiation and scheduling of the processes or functions performed by means 750-765, and components thereof, and may be considered as a component of such means.
In related aspects, system 700 may include a software module 720 which may house search engine 103, hosting software, application 207, and other software for implementing steps 750-765.
In further related aspects, system 700 may optionally include a means for storing information, such as, for example, a memory device/module 740. A computer-readable medium or memory device/module 740 may be operatively coupled to the other components of system 700 via bus 710 or the like. The computer readable medium or memory device 740 may be adapted to store computer readable instructions and data for implementing the processes and functions of means 750-765, and components thereof, or processor 730 (in the case of system 700 configured as a computing device) or the methods disclosed herein.
In yet further related aspects, the memory module 740 may optionally include executable code for the processor module 730 to perform processes 750 through 765. One or more of 750-765 may be performed by processor module 730 in lieu of or in conjunction with the means 750-765 described above.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the invention, which is done to aid in understanding the features and functionality that can be included in the invention. The invention is not restricted to the illustrated example architectures or configurations, but the desired features can be implemented using a variety of alternative architectures and configurations. Additionally, with regard to flow diagrams, operational descriptions and method claims, the order in which the operations are presented herein shall not mandate that various embodiments be implemented to perform the recited functionality in the same order unless the context dictates otherwise. The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments.
Claims
1. A system for measuring the popularity of a web page, the system comprising:
- a server configured to receive cached copies of the web page and count the cached copies received;
- a popularity indicator, wherein the popularity indicator is adjusted by the server based on a number of cached copies received.
2. The system of claim 1, wherein the cached copies are received from an application operating on a client computer.
3. The system of claim 2, wherein the application is configured to periodically transmit the cached copies to the server.
4. The system of claim 2, wherein each cached copy is summarized by the application prior to transmitting the cached copies to the server.
5. The system of claim 1, wherein the server receives the cached copies from a private cache that stores user copies of web pages.
6. The system of claim 5, wherein the user copies of web pages are obtained from one or more hosts of the web pages.
7. The system of claim 5, wherein the user copies of web pages are obtained from one or more sources not affiliated with hosts of the web pages.
8. The system of claim 1, wherein the cached copies are indexed in a search engine database.
9. A method for measuring the popularity of a web page, the method comprising:
- receiving cached copies of the web page at a server;
- counting the cached copies received;
- adjusting a popularity indicator based on a number of cached copies received.
10. The method of claim 9, further comprising transmitting the cached copies to an indexing server for indexing in a search engine database.
11. The method of claim 9, wherein the cached copies are received from an application operating on a client computer.
12. The method of claim 11, wherein the application is configured to periodically transmit the cached copies to the server.
13. The method of claim 9, further comprising prompting non-host clients to download an application that periodically sends cached user copies of web pages or summaries of cached user copies of web pages for indexing in a search engine database.
14. The method of claim 9, wherein the cached copies are indexed by an indexing server in a search engine database.
15. The method of claim 9, wherein the server receives the cached copies from a private cache that stores user copies of web pages.
16. The method of claim 15, wherein the user copies of web pages are obtained from sources not affiliated with hosts of the web pages.
17. The method of claim 15, wherein the user copies of web pages are obtained from hosts of the web pages.
18. The method of claim 17, further comprising obtaining data parameters for a search index from the hosts of the web pages.
19. A system for measuring the popularity of a web page, the system comprising:
- a server configured to receive and count cached copies of the web page, wherein each of the cached copies is summarized by an application operating on a client's computer prior to being received by the server;
- a popularity indicator, wherein the popularity indicator is adjusted by the server based on a number of cached copies received.
20. The system of claim 19, wherein the application periodically transmits the cached copies to the server.
Type: Application
Filed: Aug 1, 2016
Publication Date: Nov 24, 2016
Inventor: Gary Stephen SHUSTER (Fresno, CA)
Application Number: 15/225,736