DETECTION OF MALWARE FEATURES IN A CONTENT ITEM
Intrusion features of a landing page associated with sponsored content are identified. A feature score for the landing page based on the identified intrusion features is generated, and if the feature score for the landing page exceeds a feature threshold, the landing page is classified as a candidate landing page. A sponsor account associated with the candidate landing page can be suspended, or sponsored content associated with the candidate landing page can be suspended.
This application is a continuation of U.S. patent application Ser. No. 15/392,414 filed Dec. 28, 2016, which application is a continuation of U.S. patent application Ser. No. 13/587,025, now U.S. Pat. No. 9,563,776, filed Aug. 16, 2012, which is a continuation of U.S. patent application Ser. No. 13/230,544, now U.S. Pat. No. 8,515,896, filed Sep. 12, 2011, which is a continuation of U.S. patent application Ser. No. 11/868,321, now U.S. Pat. No. 8,019,700, filed Oct. 5, 2007. The disclosure of each of the foregoing applications is incorporated herein by reference.
TECHNICAL FIELD
The document relates to management of intrusive software.
BACKGROUND
Interactive media (e.g., the Internet) has great potential for improving the targeting of sponsored content, e.g., advertisements (“ads”), to receptive audiences. For example, some websites provide information search functionality that is based on keywords entered by the user seeking information. This user query can be an indicator of the type of information of interest to the user. By comparing the user query to a list of keywords specified by an advertiser, it is possible to provide targeted ads to the user.
Another form of online advertising is ad syndication, which allows advertisers to extend their marketing reach by distributing ads to additional partners. For example, third party online publishers can place an advertiser's text or image ads on web properties with desirable content to drive online customers to the advertiser's website.
The ads, such as creatives that include several lines of text, images, or video clips, include links to landing pages. These landing pages are pages on advertiser websites or on syndicated publisher websites that users are directed to when the users click on the ads. Some of these landing pages, however, may include intrusive software, e.g., software, scripts, or any other entities that are deceptively, surreptitiously and/or automatically installed. Such software entities that are intrusively installed can be generally characterized as “malware,” a portmanteau of the words “malicious” and “software.” The software, however, need not take malicious action to be malware; any software that is intrusively installed can be considered malware, regardless of whether the actions taken by the software are malicious. Thus, in addition to Trojan Horses, viruses, and browser exploits, other software such as monitoring software can be considered malware. The malware can be present in the landing page intentionally or unintentionally. For example, an advertiser's site can be hacked and malware inserted directly onto the landing page; a malicious advertiser can insert malware into a landing page; a click-tracker can insert malware through a chain of redirects that lead to the final uniform resource locator (URL) of the landing page; an advertiser may place ads or gadgets on a page populated by third parties that insert malware onto the landing page; etc.
Once a landing page is known to have malware, an advertisement publisher can preclude the serving of the landing page. However, an advertisement publisher, e.g., Google, Inc., may have access to hundreds of millions of advertisements and corresponding landing pages associated with the advertisements. As could be understood, it may be difficult to check and re-check each landing page in depth for the presence of malware.
SUMMARY
Disclosed herein are apparatus, methods and systems for the detection and processing of malware in sponsored content. In an implementation, intrusion features of a landing page associated with sponsored content are identified. A feature score for the landing page based on the identified intrusion features is generated, and if the feature score for the landing page exceeds a feature threshold, the landing page is classified as a candidate landing page. The candidate landing page can be provided to a malware detector to determine whether malware is present in the landing page. In some implementations, a sponsor account associated with the candidate landing page can be suspended. In some implementations, an advertisement associated with the candidate landing page can be suspended.
In another implementation, a method includes partitioning landing pages associated with advertisements into training landing pages and testing landing pages. A classification model is iteratively trained on intrusion features of the training landing pages, and is iteratively tested on the intrusion features of the testing landing pages. The training and testing continues until the occurrence of a cessation event. An association of feature weights and intrusion features that are derived from the iterative training and testing are stored in the classification model in response to the cessation event.
In another implementation, a system includes a scoring engine including software instructions stored in computer readable medium and executable by a processing system. Upon execution, the processing system identifies a landing page associated with sponsored content and identifies intrusion features of the landing page. A feature score for the landing page is generated based on the identified intrusion features, and if the feature score for the landing page exceeds a feature threshold, the landing page is classified as a candidate landing page. The candidate landing page can be provided to a malware detector to determine whether malware is present in the landing page. In some implementations, a sponsor account associated with the candidate landing page can be suspended. In some implementations, an advertisement associated with the candidate landing page can be suspended.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Other entities, such as users 108 and the advertisers 102, can provide usage information to the system 104, such as, for example, whether or not a conversion or click-through related to an ad has occurred. This usage information can include measured or observed user behavior related to ads that have been served. The system 104 performs financial transactions, such as crediting the publishers 106 and charging the advertisers 102 based on the usage information.
A computer network 110, such as a local area network (LAN), wide area network (WAN), the Internet, or a combination thereof, connects the advertisers 102, the system 104, the publishers 106, and the users 108.
One example of a publisher 106 is a general content server that receives requests for content (e.g., articles, discussion threads, music, video, graphics, search results, web page listings, information feeds, etc.), and retrieves the requested content in response to the request. The content server may submit a request for ads to an ad server in the system 104. The ad request may include a number of ads desired. The ad request may also include content request information. This information can include the content itself (e.g., page or other content document), a category corresponding to the content or the content request (e.g., arts, business, computers, arts-movies, arts-music, etc.), part or all of the content request, content age, content type (e.g., text, graphics, video, audio, mixed media, etc.), geo-location information, etc.
In some implementations, the content server can combine the requested content with one or more of the ads provided by the system 104. This combined content and ads can be sent to the user 108 that requested the content for presentation in a viewer (e.g., a browser or other content display system). The content server can transmit information about the ads back to the ad server, including information describing how, when, and/or where the ads are to be rendered (e.g., in HTML or JavaScript™).
Another example publisher 106 is a search service. A search service can receive queries for search results. In response, the search service can retrieve relevant search results from an index of documents (e.g., from an index of web pages). An exemplary search service is described in the article S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” Seventh International World Wide Web Conference, Brisbane, Australia and in U.S. Pat. No. 6,285,999. Search results can include, for example, lists of web page titles, snippets of text extracted from those web pages, and hypertext links to those web pages, and may be grouped into a predetermined number of (e.g., ten) search results.
The search service can submit a request for ads to the system 104. The request may include a number of ads desired. This number may depend on the search results, the amount of screen or page space occupied by the search results, the size and shape of the ads, etc. In some implementations, the number of desired ads will be from one to ten, or from three to five. The request for ads may also include the query (as entered or parsed), information based on the query (such as geo-location information, whether the query came from an affiliate and an identifier of such an affiliate), and/or information associated with, or based on, the search results. Such information may include, for example, identifiers related to the search results (e.g., document identifiers or “docIDs”), scores related to the search results (e.g., information retrieval (“IR”) scores), snippets of text extracted from identified documents (e.g., web pages), full text of identified documents, feature vectors of identified documents, etc. In some implementations, IR scores can be computed from, for example, dot products of feature vectors corresponding to a query and a document, page rank scores, and/or combinations of IR scores and page rank scores, etc.
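The dot-product form of an IR score mentioned above can be sketched as follows. This is only a minimal illustration, assuming the query and the document are already represented as equal-length numeric feature vectors; the function name `ir_score` is hypothetical and not part of the described system.

```python
from typing import Sequence


def ir_score(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    """Return the dot product of a query feature vector and a document feature vector."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("feature vectors must have the same length")
    return sum(q * d for q, d in zip(query_vec, doc_vec))
```

Combining such a score with a page rank score, as the text suggests, would require a weighting scheme that the source does not specify.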
The search service can combine the search results with one or more of the ads provided by the system 104. This combined information can then be forwarded to the user 108 that requested the content. The search results can be maintained as distinct from the ads, so as not to confuse the user between paid advertisements and presumably neutral search results. Finally, the search service can transmit information about the ad and when, where, and/or how the ad was to be rendered back to the system 104.
As can be appreciated from the foregoing, the advertising management system 104 can serve publishers 106, such as content servers and search services. The system 104 permits serving of ads targeted to documents served by content servers. For example, a network or inter-network may include an ad server serving targeted ads in response to requests from a search service with ad spots for sale. Suppose that the inter-network is the World Wide Web. The search service crawls much or all of the content. Some of this content will include ad spots (also referred to as “inventory”) available. More specifically, one or more content servers may include one or more documents. Documents may include web pages, email, content, embedded information (e.g., embedded media), meta-information and machine executable instructions, and ad spots available. The ads inserted into ad spots in a document can vary each time the document is served or, alternatively, can have a static association with a given document.
In one implementation, the advertisement management system 104 may include an auction process to select advertisements from the advertisers 102. The advertisers 102 may be permitted to select, or bid, an amount the advertisers are willing to pay for each click of an advertisement, e.g., a cost-per-click amount an advertiser pays when, for example, a user clicks on an advertisement. The cost-per-click can include a maximum cost-per-click, e.g., the maximum amount the advertiser is willing to pay for each click of advertisement based on a keyword, e.g., a word or words in a query. Other bid types, however, can also be used. Based on these bids, advertisements can be selected and ranked for presentation.
In some implementations, the system 104 includes an ad page malware detection system that can determine the likelihood that sponsored content (e.g., an ad's landing page) contains malware. Malware may include any type of computer contaminant, such as dishonest adware, computer viruses, spyware, Trojan horses, computer worms, or other such malicious, unelected and/or unwanted software. Specifically, malware can include any suspicious software installation that happens automatically upon landing on a webpage, such as an ad's landing page. In some implementations, the ad page malware detection system may cover the case where a user must click a link on the page (such as “free download”) for the malware to be installed. The software, however, need not take malicious action to be malware; any software that is intrusively installed can be considered malware, regardless of whether the actions taken by the software are malicious. Thus, in addition to Trojan Horses, viruses, worms, and browser exploits, other software that does not necessarily harm a computer system, such as monitoring software, start page hijacks, etc., can be considered malware.
The malware detection system can, for example, automatically test landing pages (e.g., a web page defined by a URL embedded or associated with sponsored content) for malware and take appropriate action when malware is detected. Such actions may follow pre-determined policies, such as to suspend an advertiser's account (e.g., an advertiser's account with Google AdSense or AdWords), “flag” the ad or ads associated with the landing page as malware-related, and help the end-user avoid the negative effects of such ads in the future. The malware detection system can provide a process for an advertiser to have its “flagged” ads re-checked and its accounts unsuspended. Moreover, if the malware detection system re-checks the landing pages of an advertiser's flagged ad or ads and determines that the associated landing pages are clean (e.g., free from malware), the advertiser's account can be reinstated (or cleared). In some implementations, the ads associated with the landing page can be suppressed completely, e.g., serving of the ad can be precluded.
In some implementations, the malware detection system may have the flexibility to suspend groups of ads, such as all ads in an ad group or ad campaign, or all ads with a common URL. For example, the malware detection system may determine that only a subset of an advertiser's ads contain malware, and thus suspend only those ads. Such determination may be based on common features shared by the ads' landing page.
Malware may be encountered in an ad's landing page or redirect chain, or may originate in various ways. Specifically, the redirect chain can include the series of URLs that include the clicked ad (or destination URL), URLs that are instantiated by scripts, etc., as a result of the click on the ad, and the final URL of the ad's landing page. In some cases, an advertiser's site can be hacked and malware inserted directly onto the landing page. In another example, a malicious advertiser may purposely install or enable malware on its ad landing page. In a third example, a click-tracker can insert malware through the chain of redirects before the final URL is reached. In a fourth example, an advertiser may install ads and/or gadgets on its landing page that may be populated by third parties who insert malware. In these and other examples of malware, when a user clicks such an ad, the user's computer can be compromised by the installation of intrusive software.
For example,
When trying to retrieve the iFrame, the browser may be redirected, such as via a Location header, towards an IP address of an exploit server. The IP address may be, for example, of the form xx.xx.xx.xx/<exploit server>/, such as the IP address of an exploit server 212. The content served from the IP address can include encrypted JavaScript, which may enable the exploit server 212 to attempt multiple exploits against the user's browser. As a result, several malware binaries may be installed on the user's computer. The malware encountered and/or installed in this scenario may be unknown to the initial ads company 206a. However, each redirection from the destination URL (e.g., ads company 206a) to the landing page associated with sponsored content (e.g., on the exploit server 212) can give another party control over the content on the original web page. In this way, the sub-syndication of sponsored content, characterized here by several URL redirects, can lead a user to an undesired encounter with malware.
Detecting malware may include the use of commercially available malware detection software or other such virus scanning software or systems. Malware may also be detected by monitoring system behaviors, such as monitoring the use of registry and system files after visiting a URL. For example, an intrusion detection engine may monitor the behavior of a browser on a virtual machine to determine whether malware is present.
The system 300 includes a malware evaluator 304 that can be used to detect malware in a landing page associated with an ad, or in the ad itself. For example, the malware evaluator 304 may initially evaluate an ad's landing page for its likelihood to include malware, and if the landing page is considered likely to include malware, the malware evaluator 304 can submit the ad to a more thorough evaluation process. Such a two-step evaluation process can lead to efficiencies gained by using the more thorough malware evaluation process only on the candidate ads considered most likely to include malware.
The initial evaluation performed by the malware evaluator 304 may identify intrusion features of the ad's landing page or URLs in the redirect chain. The evaluation may inspect the ad for iFrame features, URL features, script features, etc. and compare such features against a repository of features that are known to be associated with landing pages that include malware. As a result of initial evaluation of the ad's landing page features, the malware evaluator 304 may generate a feature score that indicates the likelihood that the ad's landing page includes malware. For example, a higher score may mean that the ad's landing page is more likely to include malware. Any ads' landing pages having a feature score that exceeds a feature threshold can be classified as candidates for a more thorough malware evaluation process. In this way, the identification of features can facilitate reduction heuristics, allowing the system to significantly reduce the number of landing pages to a smaller set of candidate landing pages that may be subsequently evaluated by the more thorough malware evaluation process.
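The initial, inexpensive evaluation described above can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the feature names, the weights, the threshold value, and the function names `feature_score` and `select_candidates` are all hypothetical, and the intrusion features of each landing page are assumed to have been identified already.

```python
from typing import Dict, Iterable, List

# Hypothetical feature weights and threshold, chosen only for illustration.
FEATURE_WEIGHTS: Dict[str, float] = {
    "small_iframe": 2.0,
    "known_bad_url": 5.0,
    "suspicious_script": 2.5,
}
FEATURE_THRESHOLD = 4.0


def feature_score(features: Iterable[str]) -> float:
    """Sum the weights of the identified intrusion features; unknown features add 0."""
    return sum(FEATURE_WEIGHTS.get(name, 0.0) for name in features)


def select_candidates(pages: Dict[str, List[str]]) -> List[str]:
    """Return the landing-page URLs whose feature score exceeds the feature threshold."""
    return [
        url
        for url, features in pages.items()
        if feature_score(features) > FEATURE_THRESHOLD
    ]
```

Only the URLs returned by `select_candidates` would be passed on to the more thorough evaluation, which is how the reduction in workload arises.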
In some implementations, the malware evaluator 304 can use an intrusion detection engine 305 that implements a more thorough malware evaluation process. For example, the malware evaluator 304 can provide the intrusion detection engine 305 with a web page (e.g., the landing page of an ad) and receive an intrusion score for the web page. In other implementations, the malware evaluator 304 can include the intrusion detection engine 305.
The more thorough process can be initiated by the malware evaluator 304 when the malware evaluator submits the candidate landing page to the intrusion detection engine 305. The intrusion detection engine 305 may include, for example, a virtual machine via which the system 300 can load the ad in a browser, navigate to the ad's landing page (e.g., via one or more URL redirects), and execute one or more malware detection systems, such as commercially available computer malware and virus detection systems. During the process, the virtual machine also can, for example, monitor the use of system files and the creation of unauthorized processes. The intrusion detection engine 305 can generate an intrusion score and provide the intrusion score to the malware evaluator 304. The intrusion score can indicate the level of malware in the ad's landing page. If the intrusion score is sufficiently high, such as above a pre-defined intrusion threshold, the system 300 can flag the ad (e.g., in the ads data base 302) as being likely to contain malware in its landing page.
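The second stage can be sketched as follows. This is a minimal sketch, not the described engine: the intrusion detection engine is represented by a caller-supplied function `detect` that returns an intrusion score for a URL, and the threshold value and the name `flag_ads` are hypothetical. The virtual machine, browser, and scanning software are deliberately left outside the sketch.

```python
from typing import Callable, Dict, List

# Hypothetical intrusion threshold; the source only says it is pre-defined.
INTRUSION_THRESHOLD = 0.8


def flag_ads(
    candidate_urls: List[str],
    detect: Callable[[str], float],
    flagged: Dict[str, float],
) -> None:
    """Run the thorough evaluation on each candidate and record those above the threshold.

    `flagged` stands in for the ads database and maps a URL to its intrusion score.
    """
    for url in candidate_urls:
        intrusion_score = detect(url)
        if intrusion_score > INTRUSION_THRESHOLD:
            flagged[url] = intrusion_score
```

Passing the detector in as a function keeps the sketch independent of any particular scanning product.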
Ads that are flagged in the ads data base 302 may be precluded from being served to users, or the ads may be annotated in some way to indicate the likelihood that their landing pages include malware. In some implementations, the annotations may include an intrusion score that rates the likelihood that each ad is malware-related. As the result of determining that any part of an advertiser's sponsored content (e.g., a single ad's landing page) includes malware, the system 300 may flag some or all of the advertiser's ads. The system 300 may also suspend the account of the advertiser, such as to prevent the advertiser from submitting new ads. The system 300 may perform some actions automatically, such as when it is clear that ads are malware-related, e.g., when the intrusion score is relatively high. Other actions may be based on user decisions, such as after reviewing the results of malware evaluations.
An account manager 306 can receive the results of malware evaluations from the malware evaluator 304. The evaluations may include, for example, the sponsor's account information, the URLs of the destination and landing pages and any pages in the redirect chain. The evaluations can also include information identifying the reasons that the malware evaluator 304 identified the ad as malware-related. A user of the account manager 306 may be able to facilitate manual disposition of ads and/or accounts based on the evaluation. For example, a user may be able to suspend the account for an advertiser if one or more of the advertiser's ad landing pages are discovered to include malware. In another example, a user may decide to flag one or more ads in an advertiser's ad campaign.
A customer service representative (CSR) front end 308 can exist within the system 300 that allows advertisers to initiate an appeal process for flagged ads. For example, a customer (e.g., an advertiser) may have one or more landing pages corresponding to sponsored content that the malware evaluator 304 has determined include malware. After cleaning such sites from malware, for example, the advertiser may initiate an appeal of the ad. Such an appeal may be, for example, in a communication between the CSR front end 308 and the malware evaluator 304 and/or the account manager 306. The communication can include, for example, the advertiser's name and the URLs of the landing pages to be re-evaluated by the malware evaluator 304. If an advertiser's appeal of a flagged ad is successful, the system 300 can un-flag the ad. In some implementations, the system 300 may also reinstate the advertiser's account as the result of a successful appeal. In some implementations, when an advertiser appeals an ad, the system 300 can check all of the ads for the advertiser and only reinstate the advertiser's account (and un-flag the ad) if all of the advertiser's ads are clean.
In some implementations, the system 300 can include a tiered suspension account model. For example, based on the likelihood of the presence of malware in a landing page, the landing page can be categorized in various categories, or levels, of malware infection. Such categories may include, for example, “OK” (e.g., determined likely to be malware-free), “suspect” (e.g., may contain malware) or “confirmed” (e.g., very likely or certain to contain malware). The suspect category may be further categorized, such as with a rating based on an intrusion score.
In some implementations, malware detection scores may be accumulated with respect to an account, and an account itself can be “tiered” into risk categories, each of which is handled differently, with handling ranging from automatic review to manual review to automatic suspension. For example, the system 300 may automatically suspend an account when one or more ads are in the “confirmed” malware category, or may suspend an account when 5% or more of the ads are “suspect,” etc.
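The two example suspension rules above can be sketched as follows. This is a minimal sketch, assuming each ad has already been assigned one of the categories “OK”, “suspect”, or “confirmed”; the function name `account_action` and the return values are hypothetical.

```python
from typing import List


def account_action(ad_categories: List[str]) -> str:
    """Return "suspend" or "no_action" for an account, given its ads' categories."""
    if not ad_categories:
        return "no_action"
    # Rule 1: any ad in the "confirmed" category suspends the account.
    if "confirmed" in ad_categories:
        return "suspend"
    # Rule 2: 5% or more "suspect" ads suspends the account.
    # Integer arithmetic avoids floating-point rounding at the 5% boundary.
    suspect_count = ad_categories.count("suspect")
    if suspect_count * 100 >= 5 * len(ad_categories):
        return "suspend"
    return "no_action"
```

For example, an account with 19 “OK” ads and 1 “suspect” ad sits exactly at 5% and would be suspended under this sketch, while 20 “OK” ads and 1 “suspect” ad would not.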
In one implementation, for example, the malware evaluator 304 can identify landing pages associated with a sponsor account having feature scores that exceed a feature threshold. The feature scores for these landing pages can be accumulated to obtain an account score, and a risk category can be assigned to the sponsor account based on the account score. One of several account remediation processes for the sponsor account can be selected based on the risk category, e.g., automatic review, manual review, automatic suspension, partial suspension of only candidate landing pages, etc.
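Accumulating feature scores into an account score and mapping that score to a remediation process can be sketched as follows. This is a sketch under stated assumptions: the threshold, the tier boundaries, the category names, and the pairing of each category with a remediation process are hypothetical, since the source does not give concrete values.

```python
from typing import List, Tuple

FEATURE_THRESHOLD = 4.0  # hypothetical

# Hypothetical tiers: (minimum account score, risk category, remediation process).
# Ordered from highest to lowest minimum so the first match wins.
TIERS: List[Tuple[float, str, str]] = [
    (20.0, "high", "automatic suspension"),
    (10.0, "medium", "manual review"),
    (0.0, "low", "automatic review"),
]


def account_remediation(page_scores: List[float]) -> Tuple[float, str, str]:
    """Accumulate above-threshold feature scores and select a remediation process."""
    account_score = sum(s for s in page_scores if s > FEATURE_THRESHOLD)
    for minimum, category, remediation in TIERS:
        if account_score >= minimum:
            return account_score, category, remediation
    # Unreachable while the last tier has a minimum of 0.0 and scores are positive.
    raise ValueError("no tier matches the account score")
```

With these illustrative values, page scores of 1.0, 5.0, and 6.5 give an account score of 11.5, because only the two above-threshold pages count.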
Detection of potential malware can occur continuously, periodically, or aperiodically. For example, the ads database 302 can be continuously checked by the malware evaluator 304. In another example, the ads database 302 can be periodically checked by the malware evaluator, e.g., monthly or weekly. In yet another example, each advertisement that is added to the ads database 302 can be checked when the advertisement is added to the ads database 302. Other detection schedules can also be used.
The training process 400 can be used to iteratively train the classification model 402 using intrusion features of the “training” landing pages content. At the same time, the process 400 can iteratively test the classification model 402 using intrusion features of the “testing” landing pages content. The iterative process 400 can continue until the occurrence of a testing cessation event, such as a determination that associations between the feature weights and intrusion features are stabilizing. Such a determination may be made, for example, by implementing a linear regression based model.
In an example general flow of the training process 400 for producing the classification model 402, processing can begin with the use of the ads 302. Information used for the training process 400 can be identified from the landing pages and URLs 404. The process 400 can further partition the landing pages and URLs 404 into “training” landing pages and “testing” landing pages. For example, a larger number of landing pages (e.g., 10,000) may be used as training examples to train the classification model 402, while a smaller number (e.g., 1,000) may be used to test the classification model 402.
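The partitioning step can be sketched as follows. This is a minimal sketch, assuming a random split is acceptable; the source does not say how pages are assigned to each set, and the function name `partition_pages` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def partition_pages(
    urls: Sequence[str], n_test: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the landing-page URLs and split them into (training, testing) lists."""
    if not 0 <= n_test <= len(urls):
        raise ValueError("n_test must be between 0 and the number of URLs")
    shuffled = list(urls)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```

With 11,000 URLs and `n_test=1000`, this yields the 10,000 training and 1,000 testing pages used as the example in the text.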
A feature extraction engine 406 can extract features from the landing pages and URLs 404. The features can, for example, be indicative of the likelihood that a landing page associated with an ad includes malware. For example, one or more malware-related (or intrusion) features can correspond to small iFrames that may be indicative of an attempt to embed other HTML documents (e.g., malware-related) inside a main document. Another example of an intrusion feature is a bad or suspicious URL, such as a URL that matches a URL on a known list of malware-infected domains. A third example of an intrusion feature is suspicious script language. For example, JavaScript or other scripting languages may have certain function calls or language elements that are known to be used in serving malware. Several other types of intrusion features may exist, such as the existence of multiple frames, scripts or iFrames appearing in unusual places (e.g., after the end of the HTML), or any other features that the training process 400 determines over time are markers for likely malware infections.
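The three example intrusion features above can be sketched with the Python standard library HTML parser. This is a minimal sketch, not the feature extraction engine itself: the bad-domain list, the set of suspicious calls, the pixel limit for a “small” iFrame, and all names below are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple
from urllib.parse import urlparse

BAD_DOMAINS: Set[str] = {"malware.example"}            # hypothetical known-bad list
SUSPICIOUS_CALLS: Tuple[str, ...] = ("eval(", "unescape(")  # hypothetical markers
SMALL_IFRAME_PX = 2                                     # hypothetical size limit


def _is_small(value: Optional[str]) -> bool:
    """True when a width/height attribute is a number at or below the size limit."""
    return value is not None and value.isdigit() and int(value) <= SMALL_IFRAME_PX


class _FeatureParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.features: List[str] = []
        self._in_script = False
        self._script_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "iframe":
            if _is_small(attr.get("width")) and _is_small(attr.get("height")):
                self.features.append("small_iframe")
            if urlparse(attr.get("src") or "").hostname in BAD_DOMAINS:
                self.features.append("known_bad_url")
        elif tag == "script":
            self._in_script = True

    def handle_data(self, data):
        if self._in_script:
            self._script_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._script_parts)
            if any(call in body for call in SUSPICIOUS_CALLS):
                self.features.append("suspicious_script")
            self._in_script = False
            self._script_parts = []


def extract_features(html: str) -> List[str]:
    """Return the intrusion features found in a landing page's HTML, in document order."""
    parser = _FeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features
```

A production extractor would also need to handle scripts loaded from external files and iFrames sized through style sheets, which this sketch ignores.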
In some implementations, the feature extraction engine 406 can include a list of features that are weighted. For example, a particular intrusion feature for a URL that is a known malware site may receive a higher weight than an intrusion feature that is less likely to be associated with malware. The weights of features may be adjusted over time as the classification model 402 is used to classify landing pages as to their likelihood of including malware.
Weights may be cumulative, so that the overall likeliness that a landing page includes malware may be determined by adding, or otherwise combining the weights corresponding to the features detected. In some implementations, a feature's weight can be included in the sum for each occurrence of the corresponding feature that may be detected in a landing page. In other implementations, a feature's weight may be added to the total score once, regardless of the number of occurrences of the feature in the ad. Other evaluations based on feature weights can also be used.
While many features may have a corresponding positive weight, other features may have a negative weight. For example, feature A (e.g., corresponding to a likely malware-related function call) may have a weight of 2.5. At the same time, the presence of feature X may partially negate the likelihood that feature A is malicious, prompting the training process 400 to assign a negative weight to feature X.
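The two counting conventions and the effect of a negative weight can be sketched as follows. This is a minimal sketch: the weight of 2.5 for feature A comes from the example above, while the weight of -1.0 for feature X, the feature names, and the function name `combined_score` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

WEIGHTS: Dict[str, float] = {
    "feature_a": 2.5,   # weight taken from the example in the text
    "feature_x": -1.0,  # hypothetical negative weight
}


def combined_score(features: Iterable[str], per_occurrence: bool) -> float:
    """Add feature weights, counting each feature per occurrence or only once."""
    total = 0.0
    for name, count in Counter(features).items():
        multiplier = count if per_occurrence else 1
        total += WEIGHTS.get(name, 0.0) * multiplier
    return total
```

For a page on which feature A occurs twice and feature X once, the per-occurrence score is 2.5 + 2.5 - 1.0 = 4.0 and the count-once score is 2.5 - 1.0 = 1.5.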
A control evaluation 408 can be used in the training phase of the training process 400. The control evaluation 408 can include a human evaluation of ad landing pages. For example, the human review of the landing page for a particular ad may include an examination of the ad's features. The review may also provide an overall rating of the landing page's likelihood of including malware, such as extremely malware infected, semi-malware infected, etc.
The information generated by the control evaluation 408 can be referenced during a training phase that assigns feature weights to the features extracted by the feature extraction engine 406. For example, a machine learning engine 410 can assign feature weights to the features to test the results of the control evaluation 408, for example, by examining similar features in other URLs (e.g., URLs from the “testing” landing pages). Specifically, the machine learning engine 410 can use features from the testing landing pages to iteratively refine the associations of feature weights and intrusion features.
Such refinement can be realized, for example, by a linear-regression based model. For example, the machine learning engine 410 may use training and testing landing pages partitioned in the landing pages and URLs 404. The machine learning engine 410 may, for example, adjust the feature weights based on the training and testing landing pages to generate feature scores for the testing landing pages. If the feature scores yield malware detection results that are close to the control evaluation results, the classification model can be considered trained. Conversely, if the feature scores yield malware detection results that are substantially different than the control evaluation results, the machine learning engine 410 can readjust the feature weights. For example, over several iterations the machine learning engine 410 may determine that feature X is weighted too high, and may thus decrease the feature weight associated with feature X.
The iterative training and testing of the classification model 402 on intrusion features of the training and testing landing pages can continue until the occurrence of a testing cessation event, e.g., a convergence of test results to the control evaluation 408, or until an iteration limit is reached. After the cessation event, the association of feature weights and intrusion features can be persisted in the classification model 402.
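The iterate-until-cessation loop can be sketched as follows. This is a sketch under stated assumptions rather than the described engine: it fits a linear model by stochastic gradient descent, which is one common way to realize a linear-regression based model but is not named in the source. Each example pairs a page's feature counts with a numeric control-evaluation rating, and the learning rate, tolerance, iteration limit, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# (feature name -> count on the page, control-evaluation rating for the page)
Example = Tuple[Dict[str, float], float]


def predict(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Feature score of a page: weighted sum of its feature counts."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def mean_squared_error(weights: Dict[str, float], examples: List[Example]) -> float:
    """Average squared gap between feature scores and control ratings."""
    return sum((predict(weights, f) - y) ** 2 for f, y in examples) / len(examples)


def train(
    training: List[Example],
    testing: List[Example],
    learning_rate: float = 0.1,
    tolerance: float = 1e-4,
    max_iterations: int = 1000,
) -> Tuple[Dict[str, float], int]:
    """Train until test error converges to the control ratings or the limit is hit.

    Returns the feature weights to persist and the number of iterations used.
    """
    weights: Dict[str, float] = {}
    for iteration in range(1, max_iterations + 1):
        # Training pass: nudge each weight against the prediction error.
        for features, rating in training:
            error = predict(weights, features) - rating
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) - learning_rate * error * value
        # Testing pass: a small enough error is the cessation event.
        if mean_squared_error(weights, testing) < tolerance:
            return weights, iteration
    return weights, max_iterations
```

The returned dictionary corresponds to the association of feature weights and intrusion features that would be stored in the classification model after the cessation event.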
Other processes to train the classification model 402 can also be used.
The candidate URLs 506 can include information associated with the ad that may be needed for a thorough examination by a malware evaluator 508. For example, the candidate URLs 506 can include the ad's URL and account information of the advertiser that supplies the sponsored content. The ad's URL (or some other identifier for the ad) may be used, for example, to identify additional information for the ad in the ad data base 504 that may be needed by the malware evaluator 508. The ad's URL may also be used by the malware evaluator 508 to simulate selection of the ad in a user's browser. For example, the malware evaluator 508 can provide the landing page to the intrusion detection engine 305, which may load the URL into a virtual machine that includes virus detection software and that monitors the use of system files and the creation of unauthorized processes.
In some implementations, when the malware evaluator 508 determines that a candidate URL is infected with malware (e.g., based on a high intrusion score received from the intrusion detection engine 305), other related candidate URLs 506 may be assigned a similar score. For example, it may be clear that candidate URLs 506 having the same domain name are also just as likely to be infected. Such determination may be partially based on geographical factors, e.g., if the domain is from Russia, China or any other country statistically known to have higher rates of infected domains.
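As a non-limiting sketch, the assignment of a similar score to related candidate URLs could look like the following. It assumes that "same domain name" is approximated by the URL's host name; determining the registrable domain would require additional data not shown here. The function names are hypothetical.

```python
from typing import Dict, Iterable
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Lower-cased host name of a URL, or an empty string if none is present."""
    return (urlsplit(url).hostname or "").lower()


def propagate_scores(scores: Dict[str, float],
                     candidate_urls: Iterable[str]) -> Dict[str, float]:
    """Give unscored candidates the highest score already seen for their host."""
    by_host: Dict[str, float] = {}
    for url, value in scores.items():
        host = hostname(url)
        by_host[host] = max(by_host.get(host, 0.0), value)

    result = dict(scores)
    for url in candidate_urls:
        host = hostname(url)
        if url not in result and host in by_host:
            result[url] = by_host[host]
    return result
```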
In an implementation, information from the ads database 604 can be provided to an adgroup criteria features data base 606 and a URL features database 608. For example, the information in the databases 606 and 608 can include pertinent information from the ads, such as the URLs, keywords from the ads, the names of the associated advertisers, the account information of the advertisers, and the like. Provisioning of this information can, for example, obviate the need to store images, video, audio or other such ad-related information. Having the ad information local to the adgroup criteria features data base 606 and the URL features data base 608 can also provide the advantage of organizing and/or indexing the data for more efficient use within the ad malware detection system 602. Such information stored in the databases 606 and 608 can be sufficient to determine malware feature-based associations with a particular ad without having to crawl the ad's landing page. In another implementation, the system 600 can crawl the landing pages of ads and use the information available from the landing pages instead of (or in addition to) using the databases 606 and 608.
The adgroup criteria features data base 606 can contain information for one or more adgroups for an advertiser, keywords associated with the ads, product categorization information, account information for the advertiser, and other ad-related information used by the ad malware detection system 602. The URL features database 608 can contain the URL (e.g., the landing page URL) of each individual ad, the name of the advertiser, and any other information or indexes that may allow associated data in the adgroup criteria features data base 606 to be accessed.
The ad malware detection system 602 includes a sampler 610 that can serve as a first filter in identifying ads that may contain malware. Specifically, the sampler 610 can identify ads for which malware detection is recommended. The identification process can use ad-related information stored in the adgroup criteria features data base 606 and the URL features database 608. For example, the sampler 610 may search an ad for any of a set of pre-determined ad content features identified in the adgroup criteria features data base 606.
The sampler 610 may use the classification model 402 described in reference to
In some implementations, the URL features database 608 may include obfuscation information that an obfuscation detector in the sampler 610 may use to screen HTML pages for obfuscated scripts, such as scripts written in JavaScript, VBScript, and the like. Such scripts can often contain an apparently gibberish collection of characters that, when the ad is clicked by the user, will rewrite itself to another URL string, then again to yet another string, and so on until the exploit code is written or downloaded onto a computer device. This level of re-writing that can occur along the redirect chain can make it difficult to identify the malicious HTML code.
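One possible screening heuristic, offered only as an assumption and not as the obfuscation detector of this description, flags an inline script when it both calls a decoding function and consists largely of escaped characters. The regular expressions, the list of function names, and the density threshold below are hypothetical.

```python
import re

# Calls commonly used to decode and execute a rewritten string (assumed list).
SUSPICIOUS_CALLS = re.compile(r"\b(eval|unescape|fromCharCode)\b")
# Bodies of inline <script> elements.
SCRIPT_BODY = re.compile(r"<script[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
# Percent-encoded bytes and \xNN or \uNNNN escapes.
ESCAPE = re.compile(r"%[0-9a-fA-F]{2}|\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}")


def escape_density(script: str) -> float:
    """Fraction of the script's characters that belong to escape sequences."""
    if not script:
        return 0.0
    escaped = sum(len(match.group(0)) for match in ESCAPE.finditer(script))
    return escaped / len(script)


def looks_obfuscated(html: str, density_threshold: float = 0.3) -> bool:
    """True if any inline script decodes a mostly-escaped payload."""
    for body in SCRIPT_BODY.findall(html):
        if SUSPICIOUS_CALLS.search(body) and escape_density(body) >= density_threshold:
            return True
    return False
```

A static check of this kind sees only the first layer of re-writing; later layers along the redirect chain would still call for the more detailed evaluation described elsewhere in this document.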
In some implementations, the URL features database 608 may include geo-location information. Such information may be used, for example, to geographically categorize the URLs used for ads. Often malware may be provided from certain countries, and thus analyzing the location information of embedded links may help in identifying a potential malware site. For example, a US-based .com domain having an iFrame to a site in a geographically remote location known for a high incidence of malware may provide a strong signal of potential malware.
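For illustration, the comparison of a page's location with the locations of its embedded iFrames could be sketched as below. The mapping from host names to regions is assumed to be supplied by some geo-location source; that mapping, the region codes, and the function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urlsplit


class IframeCollector(HTMLParser):
    """Collects the host names referenced by iframe src attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.hosts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src") or ""
            host = urlsplit(src).hostname
            if host:
                self.hosts.append(host.lower())


def cross_region_iframes(html: str, page_region: str,
                         host_regions: Dict[str, str]) -> List[str]:
    """Hosts of iframes whose known region differs from the page's region.

    Hosts with no known region are treated as providing no signal.
    """
    collector = IframeCollector()
    collector.feed(html)
    return [host for host in collector.hosts
            if host_regions.get(host, page_region) != page_region]
```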
When the sampler 610 has identified candidate ads that are suspected to contain malware, the sampler 610 can send the candidate URLs and account information to a malware hub 612. The malware hub 612 can serve as a central interface for receiving ads to be more thoroughly checked for malware, and as will be described below, for receiving appeals for ads flagged as containing malware. For any ad that the malware hub 612 is requested by the sampler 610 to review, the malware hub 612 can update a status database 614 with the ad's URL and corresponding tracking information, such as the account information of the advertiser associated with the ad. In some implementations, the information stored in the status database 614 can include the reason that the sampler 610 considered more advanced malware detection to be warranted. In some implementations, the reasons may be used to group ad statuses in the status database 614 for more efficient processing.
In some implementations, the sampler 610 can also evaluate the relative age of domains and URLs for (or links to) those domains. The age of a domain can be used to identify suspected malware sites, as malware is often distributed from new sites. For example, new distribution sites are constantly being created and may exist for only several weeks before the sites are taken down. To determine the age of domains, the sampler 610 may use public or private lists of recently-activated domain names that may be available, for example, from domain registry clearing houses.
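A minimal sketch of the domain-age check follows, assuming that activation dates have already been obtained from a list of recently-activated domain names. The dictionary format, the function name, and the 60-day cutoff are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def is_recent_domain(domain: str, activation_dates: Dict[str, date],
                     today: date, max_age_days: int = 60) -> bool:
    """True if the domain appears in the list and was activated recently.

    Domains absent from the list are assumed to be older than the list
    covers and are not treated as recent.
    """
    activated = activation_dates.get(domain.lower())
    if activated is None:
        return False
    return (today - activated) <= timedelta(days=max_age_days)
```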
In some implementations, the malware hub 612 may serve as a central interface for receiving ad malware detection requests from other advertising management systems 104. For example, while the ad malware detection system 602 may be a component of Google's AdSense system, competing advertising management systems 104 may pay a fee to have ads under their control screened for malware. As such, the ad malware detection system 602 may serve as a clearinghouse for malware detection for several advertising management systems 104.
A malware detector 616 can process the ads represented by entries in the status database 614. For example, the malware detector 616 may process one or more ads, using the URL and the account ID for each ad. If additional information for an ad is needed (e.g., that is not stored in the status database 614), the malware detector 616 can pull additional information for the ad from the ads database 604. Such information may include, for example, account information, or portions of the ad itself that may not have been provided to the sampler 610 for the initial first-filter screening.
The malware detector 616 can then cause a more thorough screening to be performed. In addition, the malware detector 616 can submit the URL to an intrusion detection engine, e.g., intrusion detection engine 305, that performs a more detailed malware evaluation, such as closely examining the “destination” URL, “final” URL, URLs in the redirect chain, and the ad's landing page (e.g., identified by the final URL).
The malware detector 616 may receive an intrusion score for the destination URL from the intrusion detection engine 305. For each landing page with an intrusion score above a pre-defined threshold, the ad malware detection system 602 can take one or more predefined actions, such as automatically flagging ads as malware-related and suspending the account for an advertiser, or providing such information to a user who may manually suspend the accounts of malicious advertisers and/or block their ads. The intrusion score threshold that the malware detector 616 may apply may be set conservatively high so as not to produce significant false positives.
In some implementations, an intrusion detection engine can be implemented or integrated with the malware detector 616.
Actions that occur when ad malware is detected can follow a pre-defined policy. For example, the advertiser's account may be suspended manually, and the advertiser may be notified. The ad associated with malware can be flagged to avoid serving the ad to users. The malware detector 616 may provide information regarding flagged ads, suspended accounts and the like to the status database 614. In some implementations, a process may run on a regular basis to use such information in the status database 614 to update the ads database 604.
A customer front end 618 can serve as a graphical user interface (GUI) for a user, e.g., a customer service representative, to review any results of ad malware detections performed by the malware detector 616. For example, the results may list instances of specific landing pages and the reasons they are determined to contain malware. The instances may be grouped or sorted in various ways, such as by advertiser account, URL, etc.
An appeal process can allow the advertiser having a flagged ad to have the ad re-checked by the ad malware detection system 602. For example, the advertiser may rid the ad's final URL, or all URLs in the redirect chain, of malware after being notified that the ad's landing page contains malware, and then contact a customer service representative as part of the appeal process. The customer service representative can utilize the customer front end 618 to send appeal requests to the malware hub 612. Each appeal request can represent one or more ads that the advertiser requests the ad malware detection system 602 to re-evaluate for malware content. For example, if the ad malware detection system 602 has previously flagged the advertiser's ad as malware-related, and the advertiser has cleaned the landing page URL(s) associated with the ad, the request may be to re-evaluate that specific ad.
The malware hub 612 can receive the appeal request and update an appeals data base 620. Specifically, pending and completed appeal requests may be stored in the appeals data base 620. The information for each ad stored in the appeals data base 620 may include, for example, the advertiser name, the advertiser's account information, the URL(s) associated with the ad's landing pages and URLs in the redirect chain, and any other information that may be used to process the appeal.
To process an appeal, the malware detector 616 may use a process similar to the process described above to initially evaluate an ad's landing page for malware. In some implementations, the appeal process may also automatically include the re-evaluation of the landing pages of all ads for the advertiser, all ads in an ad group, or any other such grouping that may be used to search for other malware-related ads that the advertiser may have.
When processing an appeal, the malware detector 616 may use information for each ad that is stored in the appeals database 620. The malware detector 616 may use a similar process as described above to evaluate an ad's landing page, generate an intrusion score, and apply a threshold to determine if the ad's landing page is likely to have been cleared of malware. The results of ad landing page re-evaluations can be stored in the appeals data base 620. In some implementations, a process may run on a regular basis to use such information in the appeals data base 620 to update the ads database 604.
In one example scenario of a malware appeal, a customer may receive a notification, such as an email, stating that the customer's account has been suspended for malware. The notification may include details of where malware was found (e.g., destination URL, account information, etc.). The notification may also provide advice on how to remove the malware, and may direct follow-ups, for example, to malware customer support representatives. The customer may then clean their landing page and/or other URLs associated with the malware, and use the customer front end 618 to initiate the appeal process. If the malware detector 616 determines that the ad's landing page is now free of malware, the customer may receive a notification that the appeal was successful and that the account is now reinstated. However, if the malware detector 616 determines that the ad's landing page still includes malware, the customer may receive a notification that the appeal was denied, including detailed information about the malware detected. In some implementations, the notification process for malware detections and appeal results may be accomplished in groups, for example, so as not to overwhelm the customer with a high number of email notifications.
In some implementations, ads associated with a sponsor account are precluded on a per-ad basis, e.g., only ads having an intrusion score that exceeds an intrusion threshold are precluded from being served. Upon an appeal, the candidate landing page is re-submitted to the intrusion detection engine, and another intrusion score for the candidate landing page is received from the intrusion detection engine. The ad remains suspended or is reinstated depending on the intrusion score received during the appeal.
In some implementations, ads associated with a sponsor account are precluded on a per-account basis if any one ad in the account is determined to have an intrusion score that exceeds the intrusion threshold. Upon an appeal, all ads in the sponsor account are identified and checked for malware. The account remains suspended if any one of the landing pages associated with the sponsor account is determined to have an intrusion score that exceeds the intrusion threshold.
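The per-ad and per-account policies described above can be contrasted in a short sketch. The function name and the representation of an account as a mapping from ad identifiers to intrusion scores are hypothetical.

```python
from typing import Dict, Set


def precluded_ads(account_scores: Dict[str, float], intrusion_threshold: float,
                  per_account: bool) -> Set[str]:
    """Ads in one sponsor account that are precluded from being served.

    Per-ad policy: only ads whose score exceeds the threshold are precluded.
    Per-account policy: one such ad causes every ad in the account to be
    precluded.
    """
    over = {ad for ad, value in account_scores.items() if value > intrusion_threshold}
    if per_account and over:
        return set(account_scores)
    return over
```

Under either policy an account with no ad above the threshold yields an empty set, so the same function can be applied again to the scores received during an appeal.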
Stage 702 identifies a landing page associated with sponsored content. For example, the landing page may be the landing page for an ad that a user may see in a web browser after clicking on an ad. In general, the context of “landing pages” can include any content or headers, including redirects that may be encountered or seen by the user of a web browser following an ad click.
Stage 704 identifies intrusion features of the landing page. For example, the process 700 may use the scoring engine 502 in
Stage 706 generates a feature score for the landing page based on the identified intrusion features. For example, the scoring engine 502 (see
Stage 708 determines if the feature score for the landing page exceeds a feature threshold. For example, the scoring engine 502 may determine if the feature score generated for the ad's landing page exceeds a pre-defined feature threshold. In another example, the sampler 610 may determine if the feature score generated for the ad exceeds a pre-defined feature threshold. Feature thresholds may be numeric, for example. In some implementations, different feature thresholds may exist for different tiers of advertisers, such as tiers based on malware risk. For example, advertisers who are known to have few or no malware-related ads may have a higher threshold; or advertisers may request to have a lower threshold established in order to identify potentially infected ads more easily to guard against a poor customer experience; etc.
Stage 710 classifies the landing page as a candidate landing page if the feature score for the landing page exceeds the feature threshold. For example, if the scoring engine 502 determines that the feature score generated for the ad's landing page exceeds the pre-defined feature threshold, the scoring engine 502 can output the corresponding candidate URLs 506. In another example, if the sampler 610 determines that the feature score generated for the ad's landing page exceeds the pre-defined feature threshold, the sampler 610 can provide the candidate URL to the malware hub 612.
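Stages 706 through 710 could be sketched as follows, assuming that the identified intrusion features are represented as a set of names and that each name has a weight in the classification model. The tier names and threshold values are hypothetical placeholders.

```python
from typing import Dict, Set

# Hypothetical tiers: lower-risk advertisers get a higher feature threshold.
TIER_THRESHOLDS: Dict[str, float] = {
    "low_risk": 0.8,
    "default": 0.5,
    "cautious": 0.3,
}


def feature_score(features: Set[str], weights: Dict[str, float]) -> float:
    """Stage 706: sum the weights of the identified intrusion features."""
    return sum(weights.get(name, 0.0) for name in features)


def is_candidate(features: Set[str], weights: Dict[str, float],
                 tier: str = "default") -> bool:
    """Stages 708 and 710: candidate if the score exceeds the tier's threshold."""
    return feature_score(features, weights) > TIER_THRESHOLDS[tier]
```

A landing page classified as a candidate here would then be passed on for the more detailed intrusion evaluation.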
Stage 802 submits the candidate landing page to an intrusion detection engine. For example, referring to
Stage 804 receives an intrusion score for the candidate landing page from the intrusion detection engine. For example, referring to
Stage 806 precludes the serving of the advertisement associated with the candidate landing page if the intrusion score exceeds an intrusion threshold. For example, if the intrusion score of the ad's landing page processed by the malware detector 616 exceeds an intrusion threshold, the malware detector 616 may update the status data base 614 with information that the corresponding ad is to be flagged. Such information in the status data base 614 may be used later to update the ads data base 604. Ads that are flagged in the ads data base 604 may be precluded in various ways, such as by marking the served ads (e.g., in a user's browser) as containing potential malware or by preventing the ads from being served. Preclusion in stage 806 may also include suspending the advertiser's account, or in a tiered account system, raising the malware risk rating for the advertiser.
Stage 902 receives an appeal request for the sponsor account. For example, the appeal may originate from the customer front end 618 of
Stage 904 re-submits the candidate landing page to an intrusion detection engine. For example, the system 600 may use information corresponding to the appeal that is stored in the appeals data base 620 to re-submit the candidate landing page to the malware detector 616, which can include or communicate with an intrusion detection engine.
Stage 906 receives another intrusion score of the candidate landing page from the intrusion detection engine. For example, as a result of the re-submission of stage 904, a new intrusion score for the ad can be generated and received. In general, this intrusion score may be lower for the ad's landing page, for example, if the advertiser who appealed the ad has since rid the ad's landing page of malware or provided a new landing page for the ad, e.g., by engaging a new publisher.
Stage 908 determines if the intrusion score received in stage 906 exceeds an intrusion threshold. If the intrusion score exceeds the intrusion threshold, stage 910 precludes the serving of the advertisement associated with the candidate landing page. For example, if the new intrusion score of the ad's landing page processed by the malware detector 616 exceeds the intrusion threshold, the malware detector 616 may update the appeals data base 620 with information that the corresponding ad is still associated with malware.
If the intrusion score does not exceed the intrusion threshold, then stage 912 allows the serving of the advertisement associated with the candidate landing page. For example, if the new intrusion score of the ad's landing page processed by the malware detector 616 does not exceed the intrusion threshold, the malware detector 616 may update the appeals data base 620 with information that the corresponding ad is now clean and may be served without restriction.
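The appeal flow of stages 904 through 912 reduces to a re-scoring followed by a threshold test, as in the sketch below. The `rescore` callable stands in for the intrusion detection engine; it and the returned status strings are hypothetical.

```python
from typing import Callable


def process_appeal(landing_page_url: str,
                   rescore: Callable[[str], float],
                   intrusion_threshold: float) -> str:
    """Re-evaluate an appealed landing page and decide whether to serve the ad."""
    new_score = rescore(landing_page_url)   # stages 904 and 906
    if new_score > intrusion_threshold:     # stage 908
        return "precluded"                  # stage 910
    return "allowed"                        # stage 912
```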
Stage 1002 identifies a sponsor account associated with the advertisement, the sponsor account including additional advertisements. For example, referring to
Stage 1004 precludes the serving of the additional advertisements associated with the sponsor account if the intrusion score of the candidate landing page exceeds the intrusion threshold. For example, using the sponsor's account information identified in stage 1002, the malware detector 616 can preclude the serving of the advertiser's additional ads. In particular, under the business policy represented by the process 1000, once one ad for an advertiser is determined to be associated with malware, that ad and all others for the advertiser can be flagged (and precluded).
Stage 1006 receives an appeal request for the sponsor account. For example, the appeal may originate from a user executing the customer front end 618 (see
Stage 1008 submits the candidate landing page and additional landing pages associated with the additional advertisements to the intrusion detection engine. For example, the system 600 may use information corresponding to the appeal that is stored in the appeals data base 620 to submit all of the advertiser's candidate landing pages to the malware detector 616, which can include an intrusion detection engine or provide the landing page information to an intrusion detection engine. As part of the process, the account information corresponding to the candidate landing page may be used to identify other ads in the ads data base 604 that correspond to the advertiser's account. Specifically, the candidate landing pages can include the original candidate landing page and additional landing pages associated with the additional advertisements for the advertiser.
Stage 1010 receives another intrusion score of the candidate landing page and additional intrusion scores for the additional landing pages from the intrusion detection engine. For example, as a result of the malware detector 616 evaluating all of the candidate landing pages for the advertiser, intrusion scores corresponding to the landing pages can be generated. In particular, the intrusion scores may be stored in (or received by) the appeals data base 620. In some implementations, the intrusion scores of the additional landing pages may be stored in the status data base 614.
Stage 1012 determines if the intrusion scores for the landing pages exceed the intrusion threshold. For example, the malware detector 616 can determine which, if any, of the landing pages' intrusion scores received in stage 1010 exceed the intrusion threshold.
Stage 1014 precludes the serving of advertisements associated with the sponsor account if an intrusion score for any of the landing pages exceeds the intrusion threshold. For example, if any of the intrusion scores are determined by the malware detector 616 to exceed the intrusion threshold, the malware detector 616 may update the appeals data base 620 with information that the sponsor's ads (as a whole) still include malware and can be precluded from being served.
Stage 1102 partitions landing pages associated with advertisements into training landing pages and testing landing pages. For example, referring to
Stage 1104 iteratively trains a classification model on intrusion features of the training landing pages. For example, using features extracted by the feature extraction engine 406 from the training landing pages obtained from the landing pages and URLs 404, the system 400 can iteratively train the classification model 402. The training may be performed by a combination of the control evaluation 408 and the machine learning engine 410.
Stage 1106 iteratively tests the classification model on the intrusion features of the testing landing pages until the occurrence of a testing cessation event. For example, using features extracted by the feature extraction engine 406 from the testing landing pages obtained from the landing pages and URLs 404, the system 400 can iteratively test the classification model 402. The testing may be performed by the machine learning engine 410. During testing, associations between feature weights and intrusion features can be adjusted, such as by using a linear regression model. Stages 1104 and 1106 can be repeated iteratively, for example, until the occurrence of a testing cessation event, such as the determination that the feature weights are good enough.
Stage 1108 stores an association of feature weights and intrusion features in the classification model, the association of feature weights and intrusion features derived from the iterative training and testing. For example, the associations between feature weights and intrusion features that are iteratively generated by stages 1104 and 1106 can be stored in the classification model 402.
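The partitioning of stage 1102 could, for example, be made deterministic by hashing each landing page URL, so that a given page always falls in the same partition across runs. This hashing approach, the function name, and the 20 percent testing fraction are assumptions and are not specified in this description.

```python
import hashlib
from typing import Iterable, List, Tuple


def partition(urls: Iterable[str],
              test_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Split landing page URLs into (training, testing) lists.

    Each URL is mapped to a number in [0, 1) derived from its SHA-256
    digest; URLs below test_fraction become testing landing pages.
    """
    training: List[str] = []
    testing: List[str] = []
    for url in urls:
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        if bucket < test_fraction:
            testing.append(url)
        else:
            training.append(url)
    return training, testing
```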
The apparatus, methods, flow diagrams, and structure block diagrams described in this patent document may be implemented in computer processing systems including program code comprising program instructions that are executable by the computer processing system. Other implementations may also be used. Additionally, the flow diagrams and structure block diagrams described in this patent document, which describe particular methods and/or corresponding acts in support of steps and corresponding functions in support of disclosed structural means, may also be utilized to implement corresponding software structures and algorithms, and equivalents thereof.
This written description sets forth the best mode of the invention and provides examples to describe the invention and to enable a person of ordinary skill in the art to make and use the invention. This written description does not limit the invention to the precise terms set forth. Thus, while the invention has been described in detail with reference to the examples set forth above, those of ordinary skill in the art may effect alterations, modifications and variations to the examples without departing from the scope of the invention.
Claims
1. (canceled)
2. A computer-implemented method comprising:
- simulating, by one or more processors, selection of a content item that is linked to a landing page;
- evaluating, by the one or more processors, one or more additional pages that differ from the landing page and that are in a redirect chain followed in response to the simulated selection of the content item for characteristics of malware;
- flagging, by the one or more processors, the content item as malware when the evaluating detects malware in the one or more additional pages that are in the redirect chain; and
- preventing, by the one or more processors, the content item from being served while the content item is flagged as malware.
3. The method of claim 2, wherein evaluating the one or more additional pages comprises evaluating one or more of iFrame features, URL features, or script features of the one or more pages.
4. The method of claim 2, wherein simulating selection of a content item comprises simulating the selection using a virtual machine.
5. The method of claim 4, wherein the virtual machine is configured to:
- monitor usage of system files by the sponsored content item; and
- monitor processes created by the sponsored content item.
6. The method of claim 2, further comprising:
- evaluating features of the landing page using a first malware detection process; and
- determining that the landing page is a malware candidate based on the evaluation of the features of the landing page.
7. The method of claim 2, wherein evaluating the one or more additional pages comprises evaluating the one or more additional pages using an intrusion detection engine that generates intrusion scores for the one or more additional pages.
8. The method of claim 7, further comprising aggregating intrusion scores for one or more features of the one or more additional pages, wherein flagging the content item as malware is based on the aggregate intrusion scores for the one or more features of the one or more additional pages.
9. A system comprising:
- one or more computers; and
- one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: simulating selection of a content item that is linked to a landing page; evaluating one or more additional pages that differ from the landing page and that are in a redirect chain followed in response to the simulated selection of the content item for characteristics of malware; flagging the content item as malware when the evaluating detects malware in the one or more additional pages that are in the redirect chain; and preventing the content item from being served while the content item is flagged as malware.
10. The system of claim 9, wherein evaluating the one or more additional pages comprises evaluating one or more of iFrame features, URL features, or script features of the one or more pages.
11. The system of claim 9, wherein simulating selection of a content item comprises simulating the selection using a virtual machine.
12. The system of claim 11, wherein the virtual machine is configured to:
- monitor usage of system files by the sponsored content item; and
- monitor processes created by the sponsored content item.
13. The system of claim 9, wherein the instructions cause the one or more computers to perform operations further comprising:
- evaluating features of the landing page using a first malware detection process; and
- determining that the landing page is a malware candidate based on the evaluation of the features of the landing page.
14. The system of claim 9, wherein evaluating the one or more additional pages comprises evaluating the one or more additional pages using an intrusion detection engine that generates intrusion scores for the one or more additional pages.
15. The system of claim 14, wherein the instructions cause the one or more computers to perform operations further comprising aggregating intrusion scores for one or more features of the one or more additional pages, and wherein flagging the content item as malware is based on the aggregate intrusion scores for the one or more features of the one or more additional pages.
16. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- simulating selection of a content item that is linked to a landing page;
- evaluating one or more additional pages that differ from the landing page and that are in a redirect chain followed in response to the simulated selection of the content item for characteristics of malware;
- flagging the content item as malware when the evaluating detects malware in the one or more additional pages that are in the redirect chain; and
- preventing the content item from being served while the content item is flagged as malware.
17. The non-transitory computer-readable medium of claim 16, wherein evaluating the one or more additional pages comprises evaluating one or more of iFrame features, URL features, or script features of the one or more pages.
18. The non-transitory computer-readable medium of claim 16, wherein simulating selection of a content item comprises simulating the selection using a virtual machine.
19. The non-transitory computer-readable medium of claim 18, wherein the virtual machine is configured to:
- monitor usage of system files by the sponsored content item; and
- monitor processes created by the sponsored content item.
20. The non-transitory computer-readable medium of claim 16, wherein the instructions cause the one or more computers to perform operations further comprising:
- evaluating features of the landing page using a first malware detection process; and
- determining that the landing page is a malware candidate based on the evaluation of the features of the landing page.
21. The non-transitory computer-readable medium of claim 16, wherein evaluating the one or more additional pages comprises evaluating the one or more additional pages using an intrusion detection engine that generates intrusion scores for the one or more additional pages.
Type: Application
Filed: Apr 23, 2020
Publication Date: Sep 17, 2020
Inventors: Niels Provos (Los Altos, CA), Yunkai Zhou (Los Altos, CA), Clayton W. Bavor, Jr. (Palo Alto, CA), Eric L. Davis (Menlo Park, CA), Mark Palatucci (Pittsburgh, PA), Kamal P. Nigam (Pittsburgh, PA), Christopher K. Monson (Swissvale, PA), Panayiotis Mavrommatis (Mountain View, CA), Rachel Nakauchi (Yelm, WA)
Application Number: 16/857,018