IMAGE SPAM FILTERING BASED ON SENDERS' INTENTION ANALYSIS
Systems and methods for an anti-spam detection module that can detect image spam are provided. According to one embodiment, an image spam detection process involves determining and measuring various characteristics of images that may be embedded within or otherwise associated with an electronic mail (email) message. An approximate display location of the embedded images is determined. The existence of one or more abnormal factors associated with the embedded images is identified. A quantity of text included in the one or more embedded images is determined and measured by analyzing one or more blocks of binarized representations of the one or more embedded images. Finally, the likelihood that the email message is spam is determined based on one or more of the approximate display location, the existence of one or more abnormal factors and the quantity and location of text measured.
Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever. Copyright© 2007, Fortinet, Inc.
BACKGROUND
1. Field
Embodiments of the present invention generally relate to the field of spam filtering and anti-spam techniques. In particular, various embodiments relate to image analysis and methods for combating spam in which spammers use images to carry the advertising text.
2. Description of the Related Art
Image spam was originally created in order to get past heuristic filters, which block messages containing words and phrases commonly found in spam. Since image files have different formats than the text found in the message body of an electronic mail (email) message, conventional heuristic filters, which analyze such text, do not detect the content of the message, which may be partly or wholly conveyed by embedded text within the image. As a result, heuristic filters were easily defeated by image spam techniques.
To address this spamming technique, fuzzy signature technologies, which flag both known and similar messages as spam, were deployed by anti-spam vendors. Such fuzzy signature technologies allowed message attachments to be targeted, thereby recognizing as spam messages with different content but the same attachment.
Spammers now alter the images to make the email message appear different to signature-based filtering approaches while maintaining readability of the embedded text message to human viewers. The content of images lies in two levels: (i) the pixel matrix and (ii) the text or graphics these pixel matrices represent. At present, the notion of pixel-based matching does not make sense, as the same text could be represented by countless pixel matrices by simply changing various attributes, such as the font, size or color, or by adding noise. Therefore, hash matching and other signature-based approaches have essentially been rendered useless to block image spam as they fail as a result of even minor changes to the background of the image.
Some vendors have attempted to catch image spam by employing Optical Character Recognition (OCR) techniques; however, such approaches have only limited success in view of spammers' use of techniques to obscure the embedded text messages with a variety of noise.
Systems and methods are described for an anti-spam detection module that can detect image spam. According to one embodiment, one or more of the quantity and position of text within an image associated with an electronic message are measured or estimated. Then, based at least in part on the results of the measuring or estimating, the likelihood that the electronic message is spam is determined.
According to another embodiment, an embedded image of an electronic mail (email) message is converted to a binarized representation by performing thresholding on a grayscale representation of the embedded image. A quantity of text included in the embedded image is then determined and measured by analyzing one or more blocks of the binarized representations. Finally, the email message is classified as spam or clean based at least in part on the quantity of text measured.
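The binarization step described above can be illustrated with a minimal Python sketch. It assumes an image already decoded into rows of (R, G, B) tuples, uses the common ITU-R BT.601 luminance weights for the grayscale conversion, and applies a fixed threshold of 128; none of these choices is stated in this description. The per-block dark-pixel ratio is only a hypothetical stand-in for whatever block analysis a given embodiment performs.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_grayscale(pixels: List[List[RGB]]) -> List[List[int]]:
    """Convert RGB rows to 0-255 luminance (BT.601 weights, an assumption)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in pixels]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Threshold the grayscale image: 1 = dark pixel, 0 = light pixel."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def dark_ratio_per_block(binary: List[List[int]], block: int = 8) -> List[List[float]]:
    """Hypothetical block measure: fraction of dark pixels in each block."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    ratios = []
    for top in range(0, height, block):
        row_ratios = []
        for left in range(0, width, block):
            cells = [binary[y][x]
                     for y in range(top, min(top + block, height))
                     for x in range(left, min(left + block, width))]
            row_ratios.append(sum(cells) / len(cells))
        ratios.append(row_ratios)
    return ratios


if __name__ == "__main__":
    image = [[(0, 0, 0), (255, 255, 255)],
             [(255, 255, 255), (0, 0, 0)]]
    print(dark_ratio_per_block(binarize(to_grayscale(image)), block=2))
```

A real implementation would more likely pick the threshold adaptively for each image; the fixed value here only keeps the sketch short.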
In one embodiment, the embedded image may be formatted in accordance with the Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) or Portable Network Graphics (PNG) formats/standards.
In one embodiment, the embedded image may be an image contained within a file attached to the email message.
In one embodiment, the method also includes determining an approximate display location of an embedded image within the email message and identifying existence of one or more abnormal factors associated with the embedded image. Then, the classification can be based upon the approximate display location, the existence of one or more abnormal factors as well as the quantity of text measured.
Other features of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
Systems and methods are described for an anti-spam detection module that can detect various forms of image spam. According to one embodiment, images attached to or embedded within email messages are analyzed to determine the senders' intention. Empirical analysis reveals legitimate emails may contain embedded images, but valid images sent through email rarely contain a substantial quantity of text. Additionally, when legitimate images are included within email messages, the senders of such email messages do not painstakingly adjust the location of such included images to assure such images appear in the preview window of an email client. Furthermore, legitimate senders do not intentionally inject noise into the embedded images. In contrast, spammers usually compose email messages in different ways. For example, in the context of image spam, spammers insert text into images to avoid filtering by traditional text filters and employ techniques to randomize images and/or obscure text embedded within images. Spammers also typically make great efforts to draw attention to their image spam by carefully placing the image in such a manner as to make it visible to the recipient in the preview window/pane of an email client that supports HTML email, such as Netscape Messenger or Microsoft Outlook. Consequently, various indicators of image spam include, but are not limited to, inclusion of one or more images in the front part of an email message, inclusion of one or more images containing text meeting a certain threshold and/or inclusion of one or more images into which noise appears to have been injected to obfuscate embedded text.
According to one embodiment, various image analysis techniques are employed to more accurately detect image spam based on senders' intention analysis. The goal of senders' intention analysis is to discover the email message sender's intent by examining various characteristics of the email message and the embedded or attached images. If it appears, for example, after performing image analysis that one or more images associated with an email message have had one or more obfuscation techniques applied, the intent is to draw attention to the one or more images and/or the one or more images include suspicious quantities of text, then the sender's intention analysis anti-spam processing may flag the email message at issue as spam. In one embodiment, the image scanning spam detection method is based on a combination of email header analysis, email body analysis and image processing on image attachments.
Importantly, although various embodiments of the anti-spam detection module and image scanning methodologies are discussed in the context of an email security system, they are equally applicable to network gateways, email appliances, client workstations, servers and other virtual or physical network devices or appliances that may be logically interposed between client workstations and servers, such as firewalls, network security appliances, email security appliances, virtual private network (VPN) gateways, switches, bridges, routers and like devices through which email messages flow. Furthermore, the anti-spam techniques described herein are equally applicable to instant messages, Multimedia Message Service (MMS) messages and other forms of electronic communications in the event that such messages become vulnerable to image spam in the future.
Additionally, various embodiments of the present invention are described with reference to filtering of incoming email messages. However, it is to be understood, that the filtering methodologies described herein are equally applicable to email messages originated within an enterprise and circulated internally or outgoing email messages intended for recipients outside of the enterprise. Therefore, the specific examples presented herein are not intended to be limiting and are merely representative of exemplary functionality.
Furthermore, while, for convenience, various embodiments of the present invention may be described with reference to detecting image spam in the graphic/image file formats currently most prevalent (i.e., Graphic Interchange Format (GIF), Joint Photographic Experts Group (JPEG) and Portable Network Graphics (PNG) graphic/image file formats), embodiments of the present invention are not so limited and are equally applicable to various other current and future graphic/image file formats, including, but not limited to, Progressive Graphics File (PGF), Tagged Image File Format (TIFF), bit mapped format (BMP), HDP, WDP, XPM, MacOS-PICT, Irix-RGB, Multiresolution Seamless Image Database (MrSID), RAW formats used by various digital cameras, various vector formats, such as Scalable Vector Graphics (SVG), as well as other file formats of attachments which may themselves contain embedded images, such as Portable Document Format (PDF), Encapsulated PostScript, SWF, Windows Metafile, AutoCAD DXF and CorelDraw CDR.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form.
Embodiments of the present invention include various steps, which will be described below. The steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, firmware and/or by human operators.
Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
Terminology
Brief definitions of terms used throughout this application are given below.
The term “client” generally refers to an application, program, process or device in a client/server relationship that requests information or services from another program, process or device (a server) on a network. Importantly, the terms “client” and “server” are relative since an application may be a client to one application but a server to another. The term “client” also encompasses software that makes the connection between a requesting application, program, process or device to a server possible, such as an FTP client.
The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed there between, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.
The phrase “embedded image” generally refers to an image that is displayed or rendered inline within a styled or formatted electronic message, such as a HyperText Markup Language (HTML)-based or formatted email message. As used herein, the phrase “embedded image” is intended to encompass scenarios in which the image data is sent with the email message and linked images in which a reference to the image is sent with the email message and the image data is retrieved once the recipient views the email message. The phrase “embedded image” also includes an image that is embedded in other file formats of attachments, such as Portable Document Format (PDF) attachments, in which the image data is displayed to the email recipient when the attachment is viewed.
The phrase “image spam” generally refers to spam in which the “call to action” of the message is partially or completely contained within an embedded file attachment, such as a .gif, .jpeg or .pdf file, rather than in the body of the email message. Such images are typically automatically displayed to the email recipients and typically some form of obfuscation is implemented in an attempt to hide the true content of the image from spam filters.
The phrases “in one embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. Importantly, such phrases do not necessarily refer to the same embodiment.
The phrase “network gateway” generally refers to an internetworking system, a system that joins two networks together. A “network gateway” can be implemented completely in software, completely in hardware, or as a combination of the two. Depending on the particular implementation, network gateways can operate at any level of the OSI model from application protocols to low-level signaling.
If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.
The term “proxy” generally refers to an intermediary device, program or agent, which acts as both a server and a client for the purpose of making or forwarding requests on behalf of other clients.
The term “responsive” includes completely or partially responsive.
The term “server” generally refers to an application, program, process or device in a client/server relationship that responds to requests for information or services by another program, process or device (a client) on a network. The term “server” also encompasses software that makes the act of serving information or providing services possible.
The term “spam” generally refers to electronic junk mail, typically bulk electronic mail (email) messages in the form of commercial advertising. Often, email message content may be irrelevant in determining whether an email message is spam, though most spam is commercial in nature. There is spam that fraudulently promotes penny stocks in the classic pump-and-dump scheme. There is spam that promotes religious beliefs. From the recipient's perspective, spam typically represents unsolicited, unwanted, irrelevant, and/or inappropriate email messages, often unsolicited commercial email (UCE). In addition to UCE, spam includes, but is not limited to, email messages regarding or associated with fraudulent business schemes, chain letters, and/or offensive sexual or political messages.
According to one embodiment, “spam” comprises Unsolicited Bulk Email (UBE). Unsolicited generally means the recipient of the email message has not granted verifiable permission for the email message to be sent and the sender has no discernible relationship with all or some of the recipients. Bulk generally refers to the fact that the email message is sent as part of a larger collection of email messages, all having substantively identical content. In embodiments in which spam is equated with UBE, an email message is considered spam if it is both unsolicited and bulk. Unsolicited email can be normal email, such as first contact enquiries, job enquiries, and sales enquiries. Bulk email can be normal email, such as subscriber newsletters, customer communications, discussion lists, etc. Consequently, in such embodiments, an email message would be considered spam if (i) the recipient's personal identity and context are irrelevant because the email message is equally applicable to many other potential recipients; and (ii) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for the email message to be sent.
The phrase “transparent proxy” generally refers to a specialized form of proxy that only implements a subset of a given protocol and allows unknown or uninteresting protocol commands to pass unaltered. Advantageously, as compared to a full proxy in which use by a client typically requires editing of the client's configuration file(s) to point to the proxy, it is not necessary to perform such extra configuration in order to use a transparent proxy.
In the present example, the email security system 220 is logically interposed between spammers and the email server 230 to perform spam filtering on incoming email messages from the public Internet 200 prior to receipt and storage on the email server 230 from which and through which client workstations 260 residing on the LAN 240 may retrieve and send email correspondence.
In the exemplary network architecture of
In one embodiment, the network gateway 215 acts as an interface between the LAN 240 and the public Internet 200. The network gateway 215 may, for example, translate between dissimilar protocols used internally and externally to the LAN 240. Depending upon the distribution of functionality, the network gateway 215 or the firewall 210 may perform network address translation (NAT) to hide private Internet Protocol (IP) addresses used within LAN 240 and enable multiple client workstations, such as client workstations 260, to access the public Internet 200 using a single public IP address.
According to one embodiment, the email security system 220 performs email filtering to detect, tag, block and/or remove unwanted spam and malicious attachments. In one embodiment, an anti-spam module 225 of the email security system 220 performs one or more spam filtering techniques, including but not limited to, sender IP reputation analysis and content analysis, such as attachment/content filtering, heuristic rules, deep email header inspection, spam URI real-time blacklists (SURBL), banned word filtering, spam checksum blacklist, forged IP checking, greylist checking, Bayesian classification, Bayesian statistical filters, signature reputation, and/or filtering methods such as FortiGuard-Antispam, access policy filtering, global and user black/white list filtering, spam Real-time Blackhole List (RBL), Domain Name Service (DNS) Block List (DNSBL) and per user Bayesian filtering so that individual users can set their own profiles.
The anti-spam module 225 also performs various novel image spam detection methodologies or spam image analysis scanning based on sender's intention analysis in an attempt to detect, tag, block and/or remove spam presented in the form of one or more images. Examples of the image analysis techniques and the sender's intention analysis methodologies are described in more detail below. Existing email security platforms that exemplify various operational characteristics of the email security system 220 according to an embodiment of the present invention include the FortiMail™ family of high-performance, multi-layered email security platforms, including the FortiMail-100 platform, the FortiMail-400 platform, the FortiMail-2000 platform and the FortiMail-4000A platform, all of which are available from Fortinet, Inc. of Sunnyvale, Calif.
While in this simplified example, only a single client workstation, i.e., client workstation 360, and a single email server, i.e., email server 330, are shown interacting with the email security system 320, it should be understood that many local and/or remote client workstations, servers and email servers may interact directly or indirectly with the email security system 320 and directly or indirectly with each other.
According to the present example, the email security system 320, which may be implemented as one or more virtual or physical devices, includes a content processor 321, logically interposed between sources of inbound email 380 and an enterprise's email server 330. In the context of the present example, the content processor 321 performs scanning of inbound email messages 380 originating from sources on the public Internet 200 before allowing such inbound email messages 380 to be stored on the email server 330. In one embodiment, an anti-spam module 325 of the content processor 321 may perform spam filtering and an anti-virus (AV) module 326 implementing AV and other filters potentially performs other traditional anti-virus detection and content filtering on data associated with the email messages.
In the current example, anti-spam module 325 may apply various image analysis methodologies described further below to ascertain email senders' intentions and therefore the likelihood that attached and/or embedded images represent image spam. According to the current example, the anti-spam module 325, responsive to being presented with an inbound email message, determines whether the email message contains embedded or attached images and if so, as described further below with reference to
In one embodiment, the content processor 321 is an integrated FortiASIC™ Content Processor chip developed by Fortinet, Inc. of Sunnyvale, Calif. In alternative embodiments, the content processor 321 may be a dedicated coprocessor or software to help offload content filtering tasks from a host processor (not shown).
In alternative embodiments, the anti-spam module 325 may be associated with or otherwise responsive to a mail transfer protocol proxy (not shown). The mail transfer protocol proxy may be implemented as a transparent proxy that implements handlers for Simple Mail Transfer Protocol (SMTP) or Extended SMTP (ESMTP) commands/replies relevant to the performance of content filtering activities and passes through those not relevant to the performance of content filtering activities. In one embodiment, the mail transfer protocol proxy may subject each of incoming email, outgoing email and internal email to scanning by the anti-spam module 325 and/or the content processor 321.
Notably, filtering of email need not be performed prior to storage of email messages on the email server 330. In alternative embodiments, the content processor 321, the mail transfer protocol proxy (not shown) or some other functional unit logically interposed between a user agent or email client 361 executing on the client workstation 360 and the email server 330 may process email messages at the time they are requested to be transferred from the user agent/email client 361 to the email server 330 or vice versa. Moreover, neither the email messages nor their attachments need be stored locally on the email security system 320 to support the filtering functionality described herein. For example, instead of the anti-spam processing running responsive to a mail transfer protocol proxy, the email security system 320 may open a direct connection between the email client 361 and the email server 330, and filter email in real-time as it passes through.
While in the context of the present example, the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy (not shown) have been described as residing within or as part of the same network device, in alternative embodiments one or more of these functional units may be located remotely from the other functional units. According to one embodiment, the hardware components and/or software modules that implement the content processor 321, the anti-spam module 325 and the mail transfer protocol proxy are generally provided on or distributed among one or more Internet and/or LAN accessible networked devices, such as one or more network gateways, firewalls, network security appliances, email security systems, switches, bridges, routers, data storage devices, computer systems and the like.
In one embodiment, the functionality of one or more of the above-referenced functional units may be merged in various combinations. For example, the content processor 321 may be incorporated within the mail transfer protocol proxy or the anti-spam module 325 may be incorporated within the email server 330 or the email client 361. Moreover, the functional units can be communicatively coupled using any suitable communication method (e.g., message passing, parameter passing, and/or signals through one or more communication paths etc.). Additionally, the functional units can be physically connected according to any suitable interconnection architecture (e.g., fully connected, hypercube, etc.).
According to embodiments of the invention, the functional units can be any suitable type of logic (e.g., digital logic) for executing the operations described herein. Any of the functional units used in conjunction with embodiments of the invention can include machine-readable media including instructions for performing operations described herein. Machine-readable media include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), etc.
The processor(s) 405 may be Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s) or other processors known in the art.
Communication port(s) 410 represent physical and/or logical ports. For example, communication port(s) may be any of an RS-232 port for use with a modem based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber. Communication port(s) 410 may be chosen depending on a network such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system 400 connects.
Communication port(s) 410 may also be the name of the end of a logical connection (e.g., a Transmission Control Protocol (TCP) port or a User Datagram Protocol (UDP) port). For example, communication ports may be one of the Well Known Ports, such as TCP port 25 or UDP port 25 (used for Simple Mail Transfer), assigned by the Internet Assigned Numbers Authority (IANA) for specific uses.
Main memory 415 may be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.
Read only memory 420 may be any static storage device(s) such as Programmable Read Only Memory (PROM) chips for storing static information such as instructions for processors 405.
Mass storage 425 may be used to store information and instructions. For example, hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as RAID, such as the Adaptec family of RAID drives, or any other mass storage devices may be used.
Bus 430 communicatively couples processor(s) 405 with the other memory, storage and communication blocks. Bus 430 may be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.
Optional removable storage media 440 may be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk (DVD)-Read Only Memory (DVD-ROM), Re-Writable DVD and the like.
At block 510, an email message is analyzed to determine if it contains images. For purposes of the present example, the direction of flow of the email message is not pertinent. As indicated above, the email message may be inbound, outbound or an intra-enterprise email message. In various embodiments, however, the anti-spam processing may be enabled in one direction only or various detection thresholds could be configured differently for different flows.
In any event, in one embodiment, the headers, body and attachments, if any, of the email message at issue are parsed and scanned to identify whether the email message is deemed to contain one or more embedded images. If so, processing continues with block 520. Otherwise, no further image spam analysis is required and processing branches to the end.
At block 520, the email message at issue has been determined to contain one or more embedded images. In the current example, the senders' intention analysis anti-spam processing, therefore, proceeds to calculate the location(s) of the embedded image(s). Images may be embedded in a HyperText Markup Language (HTML) part of an HTML formatted email message, within a MIME document or attached separately. In one embodiment, by parsing the HTML, plain text and/or other Multipurpose Internet Mail Extension (MIME) parts, the displaying line just prior to the images can be identified and thus the approximate displaying location of any embedded images can be calculated.
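One rough way to approximate the displaying location is sketched below in Python. It is an illustrative assumption rather than the method of any particular embodiment: the sketch counts line-breaking HTML elements seen before each `<img>` tag and ignores text wrapping, CSS and table layout. The class name `ImageLocator` and the set `LINE_BREAKING_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Tags assumed, for this sketch only, to start a new displayed line.
LINE_BREAKING_TAGS = {"br", "p", "div", "tr", "li",
                      "h1", "h2", "h3", "h4", "h5", "h6"}


class ImageLocator(HTMLParser):
    """Records how many displayed lines precede each embedded image."""

    def __init__(self):
        super().__init__()
        self.lines_seen = 0
        self.image_lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Approximate display location: lines rendered before the image.
            self.image_lines.append(self.lines_seen)
        elif tag in LINE_BREAKING_TAGS:
            self.lines_seen += 1


def approximate_image_lines(html: str):
    locator = ImageLocator()
    locator.feed(html)
    locator.close()
    return locator.image_lines


if __name__ == "__main__":
    body = "<p>Hi</p><p>See below</p><img src='cid:a'><br><br><img src='cid:b'>"
    print(approximate_image_lines(body))
```

A small line count for the first image suggests that it would appear near the top of the message, and therefore in the preview pane of many email clients.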
At block 530, the one or more images are analyzed for indications of one or more abnormal factors. Typically, the abnormal factors are manifestations of a spammer's attempt to obscure text embedded within the one or more images by injecting a variety of noise. In one embodiment, abnormal factors include the presence of one or more of the following characteristics: (i) illegal base64 encoding; (ii) multiple images within one HTML part; (iii) one or more low entropy frames in an animated Graphic Interchange Format (GIF); (iv) illegal chunk data within a Portable Network Graphics (PNG) file; (v) quantities of unsmoothed curves; and (vi) quantities of unsmoothed color blocks.
In one embodiment, illegal base64 encoding can be detected by, among other things, observing illegal characters, such as ‘!’ in the encoded content, such as the HTML formatted message or any part of the MIME document.
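As a minimal Python sketch of this check, the function below reports every character in an encoded part that lies outside the base64 alphabet (A-Z, a-z, 0-9, '+', '/' and the '=' padding character). Treating whitespace and line breaks as harmless is an assumption made here because MIME bodies are folded into lines; the function name is hypothetical.

```python
import string

# The 64-character base64 alphabet plus the padding character.
LEGAL_BASE64 = set(string.ascii_letters + string.digits + "+/=")
# Line folding and stray whitespace are tolerated in this sketch.
IGNORED = set(" \t\r\n")


def illegal_base64_chars(encoded: str):
    """Return the sorted distinct characters that are not legal base64."""
    return sorted({ch for ch in encoded
                   if ch not in LEGAL_BASE64 and ch not in IGNORED})


if __name__ == "__main__":
    print(illegal_base64_chars("R0lGODlh\r\nAQABAAAA"))  # clean input
    print(illegal_base64_chars("R0lG!ODlh"))             # contains '!'
```

A non-empty result would indicate that this abnormal factor is present for the part examined.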
In one embodiment, the inclusion of multiple images within one HTML part can be detected by parsing the HTML formatted email message and observing more than one image within an HTML part. In the exemplary HTML code excerpt below, the existence of three images within a single table row (<tr> . . . </tr>) reveals an intention on the part of the creator of the email message to display a contiguous image to the email recipient based on the three separate embedded images.
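For illustration only, the Python sketch below applies such a check to a hypothetical HTML part; the markup and its `cid:` references are invented for this example and are not taken from any actual message. The sketch simply counts `<img>` tags with the standard library parser.

```python
from html.parser import HTMLParser

# Hypothetical HTML part: three images laid out side by side in one table row.
SAMPLE_PART = """<table><tr>
<td><img src="cid:part1"></td>
<td><img src="cid:part2"></td>
<td><img src="cid:part3"></td>
</tr></table>"""


class ImageCounter(HTMLParser):
    """Collects the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def has_multiple_images(html_part: str) -> bool:
    counter = ImageCounter()
    counter.feed(html_part)
    counter.close()
    return len(counter.sources) > 1


if __name__ == "__main__":
    print(has_multiple_images(SAMPLE_PART))
```

A stricter variant could require the images to share one table row or to be adjacent, which corresponds more closely to the intent of displaying one contiguous image.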
The existence of one or more low entropy frames of an animated GIF may be determined on an absolute and/or relative basis. For example, an animated GIF frame may be determined to be low entropy with reference to observed entropy values of normal GIF files, which vary from approximately 0.1 to 5.0. Therefore, in one embodiment, the existence of one or more low entropy frames is confirmed based on a comparison of the entropy values calculated for the animated GIF at issue to 0.1. If the entropy value calculated for any frame of the animated GIF at issue is less than 0.1, then this abnormal factor is deemed to exist. In other embodiments, one or more frames of the animated GIF file at issue may simply be “low” entropy relative to the other, higher entropy frames. For example, a variation of more than 4.9 between the highest entropy frame and the lowest entropy frame within the animated GIF file at issue may indicate that the lowest entropy frame is low relative to the others.
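The absolute and relative tests can be sketched in Python as follows. The sketch assumes that entropy is the Shannon entropy, in bits, of each frame's pixel value distribution, and that frames have already been decoded into flat sequences of pixel values; both are assumptions of this example. The function names are hypothetical, while the 0.1 and 4.9 defaults are the figures given above.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (bits) of the distribution of pixel values in a frame."""
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def has_low_entropy_frame(frames, absolute_floor=0.1, relative_spread=4.9):
    """Apply the absolute test first, then the relative (spread) test."""
    entropies = [shannon_entropy(frame) for frame in frames]
    if not entropies:
        return False
    if min(entropies) < absolute_floor:
        return True
    return max(entropies) - min(entropies) > relative_spread
```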
Illegal chunk data within a Portable Network Graphics (PNG) file may be observed by evaluating information contained within and/or about the chunks. For example, the length of the chunk and cyclic redundancy checksum (CRC) may be verified against the actual data length and recomputed CRC.
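A sketch of such a verification in Python is shown below. It relies on the standard PNG layout, in which each chunk consists of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC computed over the type and data. Treating a missing PNG signature as illegal is an assumption of this sketch, and the helper make_chunk exists only to build test input.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Build a well-formed chunk (illustration helper only)."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def has_illegal_png_chunks(data: bytes) -> bool:
    """Return True if any chunk length or CRC disagrees with the actual data."""
    if not data.startswith(PNG_SIGNATURE):
        return True
    offset = len(PNG_SIGNATURE)
    while offset < len(data):
        if offset + 8 > len(data):
            return True  # truncated chunk header
        (length,) = struct.unpack(">I", data[offset:offset + 4])
        chunk_type = data[offset + 4:offset + 8]
        body_end = offset + 8 + length
        if body_end + 4 > len(data):
            return True  # declared length exceeds the data actually present
        body = data[offset + 8:body_end]
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            return True  # recomputed CRC does not match the stored CRC
        offset = body_end + 4
        if chunk_type == b"IEND":
            break
    return False
```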
Quantities of unsmoothed curves may be detected by counting the pixels for which the difference between the pixel's color and the average color of the surrounding pixels is greater than a threshold.
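The pixel count described above might be sketched in Python as follows. For simplicity the sketch works on single-channel intensities rather than full colors, uses the eight immediate neighbours as the "surrounding pixels", and picks an arbitrary default threshold of 64; all three choices are assumptions of this example.

```python
def count_unsmoothed_pixels(gray, threshold=64):
    """Count pixels that differ from the mean of their neighbours by more than threshold.

    gray is a list of rows, each row a list of intensity values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            # Collect the up-to-eight neighbours that fall inside the image.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if not neighbours:
                continue
            average = sum(neighbours) / len(neighbours)
            if abs(gray[y][x] - average) > threshold:
                count += 1
    return count
```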
Quantities of unsmoothed color blocks may be detected by counting the color blocks for which the difference between the block's color and the color of the surrounding color blocks is greater than a threshold. Color blocks contain pixels with the same or similar colors.
In one embodiment, rather than simply conveying a binary result (e.g., a zero indicating the absence of the abnormal factor at issue and a one indicating the presence of the abnormal factor at issue), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the abnormal factor is expressed.
At block 540, the quantity of text embedded in the images is measured. In one embodiment, images are converted to a binary representation based on a thresholding technique described in further detail below. In general, thresholding is a simple method of image segmentation. Individual pixels in a grayscale image are marked as “information” pixels if their value is greater than some threshold value, T, (assuming the information content is brighter than the background) and as “background” pixels otherwise. Typically, an information pixel is given a value of “1” while a background pixel is given a value of “0.” Then, a text string measurement algorithm is applied to the binary representation of the portion of the image deemed to contain the information content.
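The basic thresholding step can be illustrated with a short Python sketch. It assumes, as in the description above, that information pixels are brighter than the background; the function name binarize is hypothetical.

```python
def binarize(gray, threshold):
    """Mark pixels brighter than threshold as information (1), others as background (0)."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]
```

For instance, with a threshold of 109, a pixel of value 110 becomes an information pixel while a pixel of value 109 remains background.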
Notably, in one embodiment, rather than considering the quantity of embedded text alone, both the quantity of text and the relative position of such text within an email viewer's preview window, for example, or within the image itself may be taken into consideration. For example, a high spam score could be assigned to a very large image that contains a correspondingly small percentage of text if that text is positioned to occupy the whole preview window.
At block 550, the email message is classified as spam or clean based on the observed characteristics of the embedded image(s), such as image location information, the existence/non-existence of various abnormal factors and the quantity of text determined to exist within the embedded image(s). In one embodiment, the spam/clean classification may be based upon a weighted average of the various observed characteristics.
In one embodiment, each observed characteristic may contribute to the score. Once the score reaches a threshold, the email message may be classified as spam and the further characteristics may not require analysis or observation. The email message is classified as clean if the score is less than the threshold after all the characteristics have been evaluated. In one embodiment, the characteristics may be considered in the following order:
- Image location information
- Presence of continuous images
- Presence of illegal base64 encoding
- Presence of lower entropy frames in an animated GIF
- Presence of illegal chunk data of a PNG encoded image
- Quantities and/or location of text in the images
- Quantities of unsmoothed curves in the images
- Quantities of unsmoothed color blocks in the images
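The ordered, early-exit scoring described above can be sketched in Python as follows. The weights, the threshold and the representation of each characteristic as a callable returning a value are all hypothetical choices for this illustration; the description does not specify them.

```python
def classify(checks, threshold):
    """Evaluate characteristics in order and stop once the score reaches threshold.

    checks is an ordered list of (name, weight, check) tuples, where check is a
    zero-argument callable returning the degree to which the characteristic
    is expressed.
    """
    score = 0.0
    evaluated = []
    for name, weight, check in checks:
        score += weight * check()
        evaluated.append(name)
        if score >= threshold:
            # Remaining characteristics need not be analyzed.
            return "spam", score, evaluated
    return "clean", score, evaluated
```

Placing inexpensive checks first means that costlier ones, such as text measurement, run only when the earlier characteristics have not already pushed the score to the threshold.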
In one embodiment, similar to that described above with reference to abnormal factors, rather than making the ultimate spam/clean decision (because the ultimate decision could be made by another component), a “spaminess” score may be generated. For example, rather than simply conveying a binary result (e.g., spam vs. clean), a value within a range (e.g., 0 to 10) may be returned indicating the degree to which the email message appeared to contain indications of being spam or the likelihood the email message is spam.
If upon completion of the anti-spam processing described above there is not sufficient data to determine the email message is spam (e.g., there is insufficient data to determine the sender's intention), then according to one embodiment, more CPU intensive processes, such as OCR, may be invoked. Advantageously, in this manner, most image spam emails can be detected in real-time without compromising performance and more CPU intensive processes are only performed if and when required.
As mentioned with reference to
At block 610, if the image at issue is color, then it is converted to grayscale to form a grayscale representation, Gi,j. According to one embodiment, color pixels of the image at issue are converted to grayscale by computing an average or weighted average of the red, green and blue color components. While various conversions may be used, examples of suitable conversion equations include the following:
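$G_{i,j} = (0.299\,r_{i,j} + 0.587\,g_{i,j} + 0.114\,b_{i,j})/3, \quad 0 \le i < x_{max},\ 0 \le j < y_{max}$ EQ #1
$G_{i,j} = (0.3\,r_{i,j} + 0.6\,g_{i,j} + 0.1\,b_{i,j})/3, \quad 0 \le i < x_{max},\ 0 \le j < y_{max}$ EQ #2
$G_{i,j} = (r_{i,j} + g_{i,j} + b_{i,j})/3, \quad 0 \le i < x_{max},\ 0 \le j < y_{max}$ EQ #3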
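As an illustration of the simple average in EQ #3, the Python sketch below converts an image given as rows of (red, green, blue) tuples. Using integer division so that the result stays within the 8-bit range is an assumption of this example, as is the function name to_grayscale.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to grayscale using the plain average of EQ #3."""
    return [[(r + g + b) // 3 for (r, g, b) in row] for row in rgb]
```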
At block 620, entropy and threshold values are determined for the grayscale image, Gi,j. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. In connection with calculating the entropy of the grayscale image, an intermediate data structure is built containing an intensity histogram, Cg. In the context of an 8-bit grayscale image, each pixel may have a value of 0 to 255. Thus, the intensity histogram includes 256 bins, each of which maintains a count of the number of pixels in the grayscale image having that value. An example of an intensity histogram is shown in
According to one embodiment, a threshold value within the intensity histogram is selected simply by choosing the mean or median value. The rationale for this simple threshold selection is that if the information pixels are brighter than the background, they should also be brighter than the average. However, to compensate for the existence of noise and variability in the background, a more sophisticated approach is to create a histogram of the image pixel intensities and then use the valley point as the threshold, T. This histogram approach assumes that there is some average value for the background and information pixels, but that the actual pixel values have some variation around these average values. In one embodiment, the threshold, T, is calculated by:
$T = \max(\delta_i), \quad 0 \le i \le 255$ EQ #7
Subject to:
$\delta_i = w_{i1}\, w_{i2}\, (M_{i1} - M_{i2})^2, \quad 0 \le i \le 255$ EQ #8
According to the above example, the gray levels are divided into two groups by i; wi1 and wi2 are the total numbers of pixels in each group, while Mi1 and Mi2 are the average gray levels of each group.
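The threshold selection of EQ #7 and EQ #8 might be sketched in Python as follows. The sketch assumes that the first group holds gray levels up to and including i, that the second group holds the remaining levels, and that T is taken to be the gray level i at which δi is largest; these interpretations, and the function names, are assumptions of this example.

```python
def intensity_histogram(gray):
    """Build a 256-bin histogram of an 8-bit grayscale image (list of rows)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    return hist


def select_threshold(hist):
    """Return the gray level i that maximizes w_i1 * w_i2 * (M_i1 - M_i2) ** 2."""
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))
    best_level, best_delta = 0, -1.0
    w1 = 0    # number of pixels with gray level <= i
    sum1 = 0  # sum of gray levels of those pixels
    for i in range(256):
        w1 += hist[i]
        sum1 += i * hist[i]
        w2 = total - w1
        if w1 == 0 or w2 == 0:
            continue  # one group is empty, so no split exists at this level
        m1 = sum1 / w1
        m2 = (weighted_total - sum1) / w2
        delta = w1 * w2 * (m1 - m2) ** 2
        if delta > best_delta:
            best_level, best_delta = i, delta
    return best_level
```

For an image whose pixels fall into a dark cluster and a bright cluster, the selected level lies between the two clusters, so a subsequent thresholding step separates them.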
Notably, there are many existing methods of performing thresholding. Consequently, any other current or future method of performing thresholding may be used depending upon the needs of a particular implementation.
At block 630, thresholding is performed to form a binary representation, Bi,j, of the grayscale image based on the threshold value selected in block 620. In one embodiment, thresholding is performed in accordance with the following equations:
where, ∂ is an adjustable parameter.
At block 640, the binary image is logically divided into M×N virtual blocks.
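A possible way to perform this logical division is sketched below in Python. Interpreting the grid as a number of block rows by a number of block columns, and letting edge blocks absorb any remainder when the image size is not evenly divisible, are assumptions of this example.

```python
def split_into_blocks(binary, block_rows, block_cols):
    """Divide a binary image (list of rows) into block_rows x block_cols sub-images.

    Blocks are returned in row-major order.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    blocks = []
    for br in range(block_rows):
        y0 = br * height // block_rows
        y1 = (br + 1) * height // block_rows
        for bc in range(block_cols):
            x0 = bc * width // block_cols
            x1 = (bc + 1) * width // block_cols
            blocks.append([row[x0:x1] for row in binary[y0:y1]])
    return blocks
```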
At block 650, the M×N virtual blocks are analyzed to quantify the number of text strings. In one embodiment, the text strings in the binary image are quantified in accordance with the following equations:
where,
∂0 . . . ∂7 are adjustable parameters;
Ty
CBin is the likelihood that the line[i] is a part of text;
Bk,i is the value of pixel[k,i] in the binary image.
Notably, while in the context of the equations presented above, a global thresholding approach is implemented taking into consideration the image as a whole, in alternative embodiments, various forms of local thresholding may be performed that consider groups of blocks or individual blocks to determine the best thresholding approach for such block or blocks.
CONCRETE EXAMPLES
For sake of illustration, two concrete examples of application of the thresholding and text quantification described above will now be provided with reference to
Application of the above-referenced equations also results in a threshold value, T, 910, being calculated for grayscale image 810. According to this example, the threshold value 910 is 109.
In the present example, segmented binary image 1110 contains 28 virtual blocks, examples of which are pointed out with reference numerals 1120 and 1130. According to equations EQ #15, EQ #16, EQ #17 and EQ #18, 23 of the 28 blocks contain a total of 63 text strings. Text strings detected by the algorithm are underlined. Block 1120 is an example of a block that has been determined to contain one or more text strings, i.e., the word “TRADE” 1121. Block 1130 is an example of a block that has been determined not to contain a text string.
Notably, to the extent reverse video or the presentation of light colored (e.g., white) text in the context of a dark (e.g., black) background becomes problematic (see, e.g., the “LEARN MORE” text string embedded within
While embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.
Claims
1. A method comprising:
- measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
- estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
2. The method of claim 1, wherein the electronic message comprises an electronic mail (email) message.
3. The method of claim 1, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
4. The method of claim 3, wherein the image processing includes local thresholding.
5. The method of claim 3, wherein the image processing includes global thresholding.
6. The method of claim 1, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
7. The method of claim 3, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
8. The method of claim 3, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
9. The method of claim 3, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
10-27. (canceled)
28. A computer-readable medium having stored thereon instructions, which when executed by one or more processors cause the one or more processors to perform a method comprising:
- measuring or estimating one or more of the quantity and position of text within an image associated with an electronic message; and
- estimating the likelihood that the electronic message is spam based at least in part on results of the measuring or estimating.
29. The computer-readable medium of claim 28, wherein the electronic message comprises an electronic mail (email) message.
30. The computer-readable medium of claim 28, wherein the image is divided up into a plurality of blocks and image processing is applied to each of the plurality of blocks.
31. The computer-readable medium of claim 30, wherein the image processing includes local thresholding.
32. The computer-readable medium of claim 30, wherein the image processing includes global thresholding.
33. The computer-readable medium of claim 28, wherein filtering is applied to the image to remove noise deliberately added by an originator of the electronic message.
34. The computer-readable medium of claim 30, wherein the image processing comprises converting the image or one or more of the plurality of blocks to grayscale.
35. The computer-readable medium of claim 30, further comprising determining which colors or intensities are likely to represent text within the image or within one or more of the plurality of blocks by calculating an intensity histogram for the image or for the one or more of the plurality of blocks.
36. The computer-readable medium of claim 30, wherein the quantity of text is measured or estimated by summing the number of blocks within a portion of the image visible in a preview pane of an email client.
Type: Application
Filed: Oct 31, 2007
Publication Date: Apr 30, 2009
Applicant:
Inventors: Jun Lu (Ottawa), Jiandong Cheng (Ottawa)
Application Number: 11/932,589
International Classification: G06F 15/16 (20060101);