CLIENT SIDE HUMAN USER INDICATOR

A system and method for preventing web scraping which includes receiving a request between a web client and a web server for the web client to receive web content. A client side language script is injected into a response to be sent to the requesting web client, wherein the client side language script contains an event listener to detect a keystroke and/or a mouse movement at the web client. Information is collected from the client side language script relating to whether the keystroke and/or the mouse movement were detected. The web client is selectively allowed to access the web server to receive the web content based on the collected information.

Description

This application is a continuation of prior U.S. patent application Ser. No. 12/828,237, filed Jun. 30, 2010, which is hereby incorporated by reference in its entirety.

TECHNOLOGICAL FIELD

This technology generally relates to network communication security, and more particularly, to preventing web scraping.

BACKGROUND

With the widespread use of web-based applications, and the Internet in general, concerns have been raised about the security of web servers and the web applications operating on them in view of the array of bots and other potentially malicious web scraping requests. Various security measures have been taken to combat these ever-growing threats, including implementing web application firewalls (“WAFs”), such as the BIG-IP Application Security Manager™ (“ASM™”) product developed by F5 Networks, Inc., of Seattle, Wash., which may be used to analyze network traffic to Web application servers for identifying and filtering out malicious packets or to otherwise thwart malicious attacks.

Web scraping is a computer software technique of extracting information from web pages in which web scrapers, search engines or bots index web content and transform unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Although safeguards against web scraping exist, such as CAPTCHA, bots and other web scrapers have become more complex and are able to bypass these safeguards to take content off of web pages.

SUMMARY

In an aspect, a method for preventing web scraping comprises receiving a request between a web client and a web server for the web client to receive web content. The method comprises injecting a client side language script into a response to be sent to the requesting web client, wherein the client side language script contains an event listener to detect a keystroke and/or a mouse movement at the web client. The method comprises collecting information from the client side language script relating to whether the keystroke and/or the mouse movement were detected. The method comprises selectively allowing the web client to access the web server to receive the web content based on the collected information.

In an aspect, a machine readable medium has stored thereon instructions for preventing web scraping, comprising machine executable code which, when executed by at least one machine, causes the machine to receive a request between a web client and a web server for the web client to receive web content. The code causes the machine to inject a client side language script into a response to be sent to the requesting web client, wherein the client side language script contains an event listener to detect a keystroke and/or a mouse movement at the web client. The code causes the machine to collect information from the client side language script relating to whether the keystroke and/or the mouse movement were detected. The code causes the machine to selectively allow the web client to access the web server to receive the web content based on the collected information.

In an aspect, a network traffic manager for preventing web scraping comprises a server interface coupled to a server and a network interface coupled to a web client via a network. The network interface receives a request from the web client requesting access to the server. A controller is coupled to the server interface and the network interface. The controller is operative to inject a client side language script into a response to be sent to the requesting web client, wherein the client side language script contains an event listener to detect a keystroke and/or a mouse movement at the web client. The controller collects information from the client side language script relating to whether the keystroke and/or the mouse movement were detected. The controller selectively allows the web client to access the web server to receive the web content based on the collected information.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an example system environment that includes a network traffic manager configured to identify and defuse network attacks;

FIG. 2 is a block diagram of the network traffic manager shown in FIG. 1;

FIG. 3 is an example flow chart diagram depicting portions of processes for preventing web scraping.

While these examples are susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail preferred examples with the understanding that the present disclosure is to be considered as an exemplification and is not intended to limit the broad aspect to the embodiments illustrated.

DETAILED DESCRIPTION

Referring now to FIG. 1, an example system environment 100 employs a network traffic management device 110 that is capable of identifying and selectively thwarting unauthorized bots, web scrapers or other malicious content scraping entities from capturing web content. The example system environment 100 includes one or more Web application servers 102, one or more client devices 106 and the traffic management device 110, although the environment 100 could include other numbers and types of devices in other arrangements. The network traffic management device 110 is coupled to the web application servers 102 via local area network (LAN) 104 and client devices 106 via network 108. Generally, requests sent over the network 108 from client devices 106 towards Web application servers 102 are received by traffic management device 110.

Client devices 106 comprise computing devices capable of connecting to other computing devices, such as network traffic management device 110 and Web application servers 102. Such connections are performed over wired and/or wireless networks, such as network 108, to send and receive data, such as for Web-based requests, receiving responses to requests and/or performing other tasks, in accordance with the processes described below in connection with FIG. 3. Non-limiting and non-exhaustive examples of such devices include personal computers (e.g., desktops, laptops), mobile and/or smart phones and the like. In an example, client devices 106 run Web browsers that may provide an interface for operators, such as human users, to interact with for making requests for resources to different web server-based applications or Web pages via the network 108, although other server resources may be requested by clients. One or more Web-based applications may run on the web application server 102 that provide the requested data back to one or more exterior network devices, such as client devices 106.

Network 108 comprises a publicly accessible network, such as the Internet, which includes client devices 106. However, it is contemplated that the network 108 may comprise other types of private and public networks that include other devices. Communications, such as requests from clients 106 and responses from servers 102, take place over the network 108 according to standard network protocols, such as the HTTP and TCP/IP protocols in this example. However, the principles discussed herein are not limited to this example and can include other protocols. Further, it should be appreciated that network 108 may include local area networks (LANs), wide area networks (WANs), direct connections and any combination thereof, as well as other types and numbers of network types. On an interconnected set of LANs or other networks, including those based on differing architectures and protocols, routers, switches, hubs, gateways, bridges, and other intermediate network devices may act as links within and between LANs and other networks to enable messages and other data to be sent from and to network devices. Also, communication links within and between LANs and other networks typically include twisted wire pair (e.g., Ethernet), coaxial cable, analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links and other communications links known to those skilled in the relevant arts. In essence, the network 108 includes any communication method by which data may travel between client devices 106, Web application servers 102 and network traffic management device 110, and the like.

LAN 104 comprises a private local area network that includes the network traffic management device 110 coupled to the one or more servers 102, although the LAN 104 may comprise other types of private and public networks with other devices. Networks, including local area networks, besides being understood by those skilled in the relevant arts, have already been generally described above in connection with network 108 and thus will not be described further.

Web application server 102 comprises one or more server computing machines capable of operating one or more Web-based applications that may be accessed by network devices in the network 108. Such network devices include client devices 106, via the network traffic management device 110, and may provide other data representing requested resources, such as particular Web page(s), image(s) of physical objects, and any other objects, responsive to the requests. It should be noted that the server 102 may perform other tasks and provide other types of resources. It should be noted that while only two Web application servers 102 are shown in the environment 100 depicted in FIG. 1, other numbers and types of servers may be coupled to the network traffic management device 110. It is also contemplated that one or more of the Web application servers 102 may be a cluster of servers managed by the network traffic management device 110.

As per the TCP/IP protocols, requests from the requesting client devices 106 may be sent as one or more streams of data packets over network 108 to the network traffic management device 110 and/or the Web application servers 102. Such protocols can establish connections, send and receive data for existing connections, and the like. It is to be understood that the one or more Web application servers 102 may be hardware and/or software, and/or may represent a system with multiple servers that may include internal or external networks. In this example, the Web application servers 102 may be any version of Microsoft® IIS servers or Apache® servers, although other types of servers may be used. Further, additional servers may be coupled to the network 108 and many different types of applications may be available on servers coupled to the network 108.

Each of the Web application servers 102 and client devices 106 may include one or more central processing units (CPUs), one or more computer readable media (i.e., memory), and interface systems that are coupled together by internal buses or other links as are generally known to those of ordinary skill in the art.

As shown in the example environment 100 depicted in FIG. 1, the network traffic management device 110 is interposed between client devices 106 in network 108 and Web application servers 102 in LAN 104. Again, the environment 100 could be arranged in other manners with other numbers and types of devices. Also, the network traffic management device 110 is coupled to network 108 by one or more network communication links and intermediate network devices (e.g. routers, switches, gateways, hubs and the like) (not shown). It should be understood that the devices and the particular configuration shown in FIG. 1 are provided for exemplary purposes only and thus are not limiting.

Generally, the network traffic management device 110 manages network communications, which may include one or more client requests and server responses, from/to the network 108 between the client devices 106 and one or more of the Web application servers 102 in LAN 104. These requests may be destined for one or more servers 102, and may take the form of one or more TCP/IP data packets originating from the network 108. The requests pass through one or more intermediate network devices and/or intermediate networks, until they ultimately reach the traffic management device 110. In any case, the network traffic management device 110 may manage the network communications by performing several network traffic related functions involving the communications. Such functions include load balancing, access control, and validating HTTP requests using JavaScript code that are sent back to requesting client devices 106 in accordance with the processes described further below in connection with FIG. 3.

Referring now to FIG. 2, an example network traffic management device 110 includes a device processor 200, device I/O interfaces 202, network interface 204 and device memory 218, which are coupled together by bus 208. It should be noted that the device 110 could include other types and numbers of components.

Device processor 200 comprises one or more microprocessors configured to execute computer/machine readable and executable instructions stored in device memory 218. Such instructions implement network traffic management related functions of the network traffic management device 110. In addition, the instructions implement the security module 210 to perform one or more portions of the processes illustrated in FIG. 3 for protecting the system. It is understood that the processor 200 may comprise other types and/or combinations of processors, such as digital signal processors, micro-controllers, application specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”), field programmable logic devices (“FPLDs”), field programmable gate arrays (“FPGAs”), and the like. The processor is programmed or configured according to the teachings as described and illustrated herein with respect to FIG. 3.

Device I/O interfaces 202 comprise one or more user input and output device interface mechanisms. The interface may include a computer keyboard, mouse, display device, and the corresponding physical ports and underlying supporting hardware and software to enable the network traffic management device 110 to communicate with the outside environment. Such communication may include accepting user data input and to provide user output, although other types and numbers of user input and output devices may be used. Additionally or alternatively, as will be described in connection with network interface 204 below, the network traffic management device 110 may communicate with the outside environment for certain types of operations (e.g., configuration) via a network management port.

Network interface 204 comprises one or more mechanisms that enable network traffic management device 110 to engage in TCP/IP communications over LAN 104 and network 108. However, it is contemplated that the network interface 204 may be constructed for use with other communication protocols and types of networks. Network interface 204 is sometimes referred to as a transceiver, transceiving device, or network interface card (NIC), which transmits and receives network data packets to one or more networks, such as LAN 104 and network 108. In an example where the network traffic management device 110 includes more than one device processor 200 (or a processor 200 has more than one core), each processor 200 (and/or core) may use the same single network interface 204 or a plurality of network interfaces 204. Further, the network interface 204 may include one or more physical ports, such as Ethernet ports, to couple the network traffic management device 110 with other network devices, such as Web application servers 102. Moreover, the interface 204 may include certain physical ports dedicated to receiving and/or transmitting certain types of network data, such as device management related data for configuring the network traffic management device 110.

Bus 208 may comprise one or more internal device component communication buses, links, bridges and supporting components, such as bus controllers and/or arbiters. The bus enables the various components of the network traffic management device 110, such as the processor 200, device I/O interfaces 202, network interface 204, and device memory 218, to communicate with one another. However, it is contemplated that the bus may enable one or more components of the network traffic management device 110 to communicate with components in other devices as well. Example buses include HyperTransport, PCI, PCI Express, InfiniBand, USB, Firewire, Serial ATA (SATA), SCSI, IDE and AGP buses. However, it is contemplated that other types and numbers of buses may be used, whereby the particular types and arrangement of buses will depend on the particular configuration of the network traffic management device 110.

Device memory 218 comprises computer readable media, namely computer readable or processor readable storage media, which are examples of machine-readable storage media. Computer readable storage/machine-readable storage media may include volatile, nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information. Such storage media contains computer readable/machine-executable instructions, data structures, program modules, or other data, which may be obtained and/or executed by one or more processors, such as device processor 200. Such instructions allow the processor to perform actions, including implementing an operating system for controlling the general operation of network traffic management device 110 to manage network traffic and implementing security module 210 to perform one or more portions of the process illustrated in FIG. 3.

Examples of computer readable storage media include RAM, BIOS, ROM, EEPROM, flash/firmware memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information. Such desired information includes data and/or computer/machine-executable instructions and which can be accessed by a computing or specially programmed device, such as network traffic management device 110.

Security module 210 is depicted in FIG. 2 as being within memory 218 for exemplary purposes only; it should be appreciated that the module 210 may be alternatively located elsewhere. Generally, when instructions embodying the security module 210 are executed by the device processor 200, the network traffic management device 110 identifies and defuses potential or suspected network attacks on the server 102 by placing additional burdens on the suspected attacking client devices 106. The security module 210 identifies network communications that may include network attacks, or at least suspected network attacks, using information obtained by the security module 210 to analyze collected data regarding particular clients, such as client devices 106, client requests destined for particular servers, such as the server 102, or particular resources made available by particular servers that have been requested, for example.

In addition to preventing web scrapers and other types of bots from accessing one or more servers 102, the security module 210 may use additional information obtained by further analyzing collected data to identify latencies associated with particular servers, server applications or other server resources, page traversal rates, client device fingerprints and access statistics that the security module 210 may analyze to identify anomalies indicative to the module 210 that there may be an attack. The security module 210 also analyzes collected data to obtain information the security module 210 may use to identify particular servers and/or server applications and resources on particular servers, such as Web application server 102, being targeted in network attacks, so the module 210 can handle the attack.

Although an example of the Web application server 102, network traffic device 110, and client devices 106 are described and illustrated herein in connection with FIGS. 1 and 2, each of the computers of the system 100 could be implemented on any suitable computer system or computing device. It is to be understood that the example devices and systems of the system 100 are for exemplary purposes, as many variations of the specific hardware and software used to implement the system 100 are possible, as will be appreciated by those skilled in the relevant art(s).

Furthermore, each of the devices of the system 100 may be conveniently implemented using one or more general purpose computer systems, microprocessors, digital signal processors, micro-controllers, application specific integrated circuits (ASIC), programmable logic devices (PLD), field programmable logic devices (FPLD), field programmable gate arrays (FPGA) and the like. The devices may be programmed according to the teachings as described and illustrated herein, as will be appreciated by those skilled in the computer, software, and networking arts.

In addition, two or more computing systems or devices may be substituted for any one of the devices in the system 100. Accordingly, principles and advantages of distributed processing, such as redundancy, replication, and the like, also can be implemented, as desired, to increase the robustness and performance of the devices and systems of the system 100. The system 100 may also be implemented on a computer system or systems that extend across any network environment using any suitable interface mechanisms and communications technologies including, for example telecommunications in any suitable form (e.g., voice, modem, and the like), Public Switched Telephone Network (PSTNs), Packet Data Networks (PDNs), the Internet, intranets, a combination thereof, and the like.

The security module 210 of the network traffic management device 110 performs a two-step process to mitigate web scraping by a web crawler, bot or other web content collecting entity. In an aspect, the security module 210 subjects the web client 106 to one or more computational challenges to determine whether the web client 106 is JavaScript proper. However, it is contemplated that a computational challenge is not sent to the web client 106 in some cases (e.g., when the security module 210 is not in blocking mode). To be JavaScript proper, the web client 106 has to support JavaScript and cookies, and be able to provide satisfactory results to the computational challenges from the security module 210 of the network traffic management device 110.

In particular, the challenge preferably includes information representing instructions (e.g., JavaScript code) to be executed by the web client 106 to perform the challenge. The challenge preferably includes instructions to generate an HTTP cookie for storing any result(s) obtained by performing the challenge, as well as to recreate and resend the initial request including any cookies that store the challenge results. Although the challenge preferably includes JavaScript code, other types of challenges can be employed and the code could be expressed in other programming, markup or script languages. For example, the challenge may be to compute some code and then auto-submit a form, which contains the original request, enveloped in the modified response which is sent back to the web client 106. The challenge is sent to the web client 106 in a modified response on behalf of the potentially targeted server. The web client 106 thus receives what it would otherwise understand to be a response, from the server, to its initial request.

If the web client 106 is not JavaScript proper, then it would either be unable to execute the challenge or be able to execute the challenge but not generate the correct result. However, if the web client 106 is JavaScript proper, it will execute the challenge and resend its initial request along with the results obtained by executing the challenge in a cookie. It is contemplated that other fingerprints or information may be utilized that the security module 210 will be able to analyze to determine that the associated requestor is legitimate. It should be noted that regardless of whether the bot and/or web scraper is able to correctly solve and update the challenge, the actual time it takes the client to go through the challenge cycle, including computing the challenge, results in the bot/web scraper being slowed down in obtaining information.

The security module 210 in the network traffic management device 110 confirms whether the received challenge results are correct to determine whether the web client 106 is JavaScript proper. The calculation of such a challenge allows a legitimate requestor to access server resources without any interference while non-JavaScript proper web crawlers are filtered out. The security module 210 therefore gains some control over the rate of requests to the server 102. It is to be understood that code in other client side languages such as Flash, Silverlight, VBScript, etc. may be returned with the request.
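By way of a non-limiting illustration, the challenge cycle described above might be sketched as follows. The disclosure does not specify the form of the computational challenge, so the repeated SHA-256 hashing, the cookie name `challenge_result`, and the function names below are all hypothetical choices made for this sketch; `solve_challenge` merely stands in for the work that the injected client side script would perform.

```python
import hashlib
import hmac
import secrets

RESULT_COOKIE = "challenge_result"  # hypothetical cookie name, not from the disclosure


def make_challenge(rounds=1000):
    """Create a fresh challenge; the nonce changes on every call."""
    return {"nonce": secrets.token_hex(8), "rounds": rounds}


def solve_challenge(challenge):
    """Stand-in for the work the injected script would perform on the client."""
    digest = challenge["nonce"].encode("ascii")
    for _ in range(challenge["rounds"]):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def is_javascript_proper(challenge, cookies):
    """Return True only if the resent request carries the correct result."""
    answer = cookies.get(RESULT_COOKIE)
    if answer is None:
        # No cookie: the client could not run the script or does not store cookies.
        return False
    expected = solve_challenge(challenge)
    return hmac.compare_digest(answer.encode("utf-8"), expected.encode("utf-8"))
```

In this sketch a client that cannot execute the script, cannot store cookies, or returns a wrong result is treated alike as not JavaScript proper, and the number of rounds gives a rough handle on how much each challenge cycle slows the requestor down.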

In an aspect, the system may send several computational challenges to the web client to make it difficult for a bot or web scraper to copy the computational challenges from previous intervals to bypass the current challenge. The computational challenge may require the web client to locate a special challenge cookie placed by the security module 210 when the request is sent back to the web client. However, it is contemplated that other types and/or techniques of administering a computational challenge can be utilized.

The system also proactively protects the web application by extending the scope of its data collection in the web client by injecting a client side script into HTTP responses with HTML content. In an aspect, the client side script is a JavaScript enabled technique called a client side human user indicator (CSHUI) code which is effectively used to determine whether the web client is being operated by a human by monitoring one or more keyboard and/or mouse events over a set amount of time and/or within a time period. It is preferred that the CSHUI code is inserted into the HTTP response after the computational challenge has been successfully validated by the web client. However it is contemplated that the validation of the web client may be performed before and/or after the CSHUI code has been injected into the response.

The CSHUI code is configured to hook into and log keyboard and/or mouse events, whereby the CSHUI code causes the web client to send a CSHUI cookie which carries human activity information (or an indication of a lack of human activity) to the web server to allow determination of whether the web client is operated by a human or a bot. However, it is contemplated that the web client may send this information back in another manner instead of using a cookie. For example, it is contemplated that the human activity information (or lack thereof) may be communicated from the web client using a parameter such as a query string or POST parameter, a specialized Ajax request, a special header or any other appropriate means.

The security module 210, when implementing the CSHUI code in a web client, first sends a CSHUI code snippet to the web client, preferably in an obfuscated manner known in the art. The CSHUI code snippet, once executed on the client side, causes the web client to download, preferably in a delayed manner, the actual CSHUI code. The delayed code download has advantages as the code will be downloaded from the network traffic management device 110 to the web client only if the web client was able to previously execute the code snippet. Additionally, bandwidth usage is reduced and latency is minimized in viewing the web content since most or all of other Javascript code will have first been loaded before the CSHUI code is loaded. Further, for delayed code loading, this facilitates caching mechanisms in the client side in intermediate proxies as well as the network traffic management device 110.

In particular, the CSHUI code utilizes an event listener which is configured to monitor activity in the web client while the web client displays an HTML page. The event listener of the CSHUI code determines whether keyboard and/or mouse activity has been detected and then updates a flag value in the CSHUI cookie if the event listener has detected sufficient mouse and/or keyboard activity in a session. In an aspect, the CSHUI cookie will have a unique identifier and a flag value indicating whether human activity has been detected (e.g. flag set to true). The security module 210 keeps track of the CSHUI cookies sent from the web client and monitors whether at least one cookie has had its flag set to true within a predetermined number of transactions received from the web client. The security module 210 may be configured to keep these scores within a session and store them in a message key which effectively allows the approved web client to continue to receive the web content from the web server for that session for a set number of safe intervals. For example, a human user accessing a web page having a continuously updating stock ticker may not move the mouse or press a key on the keyboard throughout the session. However, if at any point during the session the CSHUI code detects keyboard and/or mouse movement (thus concluding that a human is operating the web client), the security module 210 will continue allowing web content from the web server to be sent to the web client.
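By way of a non-limiting illustration, the per-session bookkeeping described above might be sketched as follows. The class name, the default window of five transactions, and the decision to keep serving content while the window has not yet been exhausted are assumptions made for this sketch; the disclosure only states that at least one CSHUI cookie flag should be true within a predetermined number of transactions.

```python
from dataclasses import dataclass, field


@dataclass
class SessionActivity:
    """Tracks CSHUI cookie flags received from one web client session."""

    window: int = 5  # hypothetical "predetermined number of transactions"
    flags: list = field(default_factory=list)
    human_seen: bool = False

    def record(self, cshui_flag):
        """Store the flag value carried by the latest CSHUI cookie."""
        self.flags.append(bool(cshui_flag))
        if cshui_flag:
            # Any detected keyboard or mouse activity marks the session as human.
            self.human_seen = True

    def allow(self):
        """Decide whether web content should continue to be served."""
        if self.human_seen:
            return True
        # Assumption: no human indication yet, so allow only inside the window.
        return len(self.flags) < self.window
```

Under these assumptions, a stock ticker viewer who touches the mouse once early in the session keeps receiving content for the rest of it, whereas a client that never reports activity is cut off once the window is used up.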

In an aspect, the network traffic management device 110 uses a CSHUI_ID to protect the cookie from theft in a valid session, whereby the CSHUI_ID is injected along with the CSHUI cookie for each validation cycle (described below). In an aspect, the network traffic management device 110 keeps the CSHUI_ID current per interval and considers the web client as being operated by a human only if it has a human indication (CSHUI cookie value true) and the CSHUI_ID is current for that interval.
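By way of a non-limiting illustration, the pairing of the human indication with a current CSHUI_ID might be sketched as follows. The registry class, the use of a random hexadecimal token as the CSHUI_ID, and the session identifier key are hypothetical; the disclosure does not state how the CSHUI_ID is generated or stored.

```python
import secrets


class CshuiIdRegistry:
    """Keeps the CSHUI_ID that is current for each session's interval."""

    def __init__(self):
        self._current = {}  # session id -> CSHUI_ID issued for the current interval

    def start_interval(self, session_id):
        """Issue a new CSHUI_ID, replacing the one from the previous interval."""
        new_id = secrets.token_hex(8)  # assumed format: 16 random hex characters
        self._current[session_id] = new_id
        return new_id

    def is_human(self, session_id, cshui_id, flag):
        """Human only if the flag is true AND the ID is the current one."""
        return bool(flag) and self._current.get(session_id) == cshui_id
```

Because a new CSHUI_ID replaces the old one at every interval in this sketch, a CSHUI cookie copied from an earlier interval no longer counts as a human indication even if its flag is true.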

The system may be configured to store and utilize a white-list of approved web scrapers, bots, search engines or other content accessing entities in which the Web server will provide the web content to the web client even though non-human activity has been detected. In an embodiment, the entities are identified on the white-list by IP address, by User Agent, by user credentials, by connection, by session, and the like. It should be stressed that if requests are being issued to the web application server by an entity which can be associated with the white-list, then the requests are approved and are not further monitored for web scraping violations. In other words, the system and method do not perform web scraping detection when the requesting web client is on the white-list. Thus, if a web crawler or bot is identified by the CSHUI cookie, but is on the white-list, the Web server will continue to provide web content to the Web client. Additionally or alternatively, the system may be configured to store and utilize a black-list of non-approved web scrapers, bots, search engines or other content accessing entities. In particular, the system can check whether the requesting entity is on the black-list and automatically refuse access to that entity if it is on the black-list. In an embodiment, the entities are identified on the black-list by IP address, by User Agent, by user credentials, by connection, by session, and the like, although web page addresses or other identifying information are contemplated for use with the black-list.
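By way of a non-limiting illustration, a list lookup of this kind might be sketched as follows. Only IP address and User Agent matching are shown, the User Agent comparison is a simple substring test, and the black-list is consulted before the white-list; each of these is an assumption made for this sketch, since the disclosure does not specify matching rules or an order of precedence.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"      # white-listed: serve content, skip scraping detection
    DENY = "deny"        # black-listed: refuse access
    INSPECT = "inspect"  # on neither list: apply web scraping detection


@dataclass(frozen=True)
class EntityList:
    ips: frozenset = frozenset()
    user_agents: frozenset = frozenset()  # substrings to look for in the User Agent

    def matches(self, ip, user_agent):
        return ip in self.ips or any(tok in user_agent for tok in self.user_agents)


def classify(ip, user_agent, white_list, black_list):
    """Assumed precedence: black-list first, then white-list, else inspect."""
    if black_list.matches(ip, user_agent):
        return Verdict.DENY
    if white_list.matches(ip, user_agent):
        return Verdict.ALLOW
    return Verdict.INSPECT
```

For example, with a white-list containing the User Agent token `ExampleSearchBot` (a made-up name), a request carrying that token would be served without CSHUI checks, while a request from neither list would go on to the detection steps described above.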

It is contemplated that the network traffic management device 110 can store the identity (e.g. IP address, subnet mask, web page address) or other information (e.g. user-agent string) of the web crawler or bot for diagnostic purposes, policy building, and/or preventative measures. This information can also be used for optimization of the web scraping mitigation feature of the system.

The operation of the example use of the web scraping mitigation technique is shown in FIG. 3 which may be run on the example network traffic management device 110, described with reference to FIGS. 1-2. FIG. 3 is a flow diagram of a process performed by the network traffic management device 110 in accordance with an aspect of the present disclosure. As described above, an HTTP request is intercepted by the network traffic management device 110 from a web client running a web browser (e.g. Firefox™, Internet Explorer™, Safari™, Chrome™). However, as stated above, the network traffic management device 110, and in particular the security module 210, performs all or part of the process in FIG. 3 to monitor these requests and prevent them from reaching the server if the requests are determined to originate from a web crawler, potential bot, search engine or other unauthorized web scraping entity.

As shown in FIG. 3, upon a request being received by the network traffic management device 110, a new validation cycle is started. As stated above, the network traffic management device 110 will accordingly respond to initial incoming requests depending on whether the security module 210 is operating in a blocking mode or a non-blocking mode. If the security module 210 is operating in a non-blocking mode, the process proceeds directly to block 308, as shown by arrow 306, in which the network traffic management device 110 assigns a CSHUI_ID and injects a CSHUI code into responses from the server as described below. In the non-blocking mode, the security module 210 allows requests to proceed to the server 102. However, the security module 210 will issue an alarm notification and declare the validation cycle as unsafe if the request is later determined to be received from a web scraping entity.

However, if the security module 210 is already operating in the blocking mode, upon receiving the request, a response will be sent with the original request enveloped in it along with JavaScript code that causes the browser on the client device to execute a computational challenge. Once the browser provides a result to the computational challenge, the client device 106 will auto submit the original request along with the result of the challenge (in a cookie) back to the Web application server 102. The security module 210 intercepts the request sent from the client device and inspects the result of the computational challenge. If the security module 210 determines that the result is proper and correct, then the security module 210 will forward the request to the web application server 102 for that client device and will move on to the next state (block 308 in FIG. 3). It should be noted that detection of page traversals as well as hidden link functionality described below may be incorporated within and/or between blocks 302 and 308 in FIG. 3.

However, if no answer or a wrong answer to the challenge is intercepted by the network traffic management device 110, the device 110 will again send a modified response with another enveloped computational challenge to the web client 106. As shown by arrow 304, this challenge process will repeat up to P times until a proper answer is received by the network traffic management device 110, whereby P may be set to a finite number of intervals or an indefinite number of intervals.
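By way of example only, the repeated challenge exchange indicated by arrow 304 might be sketched as follows. The disclosure does not specify the computational challenge, so the SHA-256 digest of a random nonce used here is purely an assumption, as are the function names and the modelling of the web client as a callable that receives a nonce and returns an answer (or `None` when no answer is provided). Each loop iteration stands in for one response and request round trip.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Optional


def make_challenge() -> str:
    """Create a random nonce to be enveloped in the modified response."""
    return secrets.token_hex(8)


def expected_answer(nonce: str) -> str:
    """Hypothetical challenge: the hex SHA-256 digest of the nonce."""
    return hashlib.sha256(nonce.encode()).hexdigest()


def check_answer(nonce: str, answer: Optional[str]) -> bool:
    """A missing answer and a wrong answer are both treated as failures."""
    if answer is None:
        return False
    return hmac.compare_digest(answer.encode(), expected_answer(nonce).encode())


def challenge_phase(client: Callable[[str], Optional[str]],
                    max_attempts: Optional[int] = None) -> bool:
    """Re-issue challenges until a proper answer arrives or P attempts are used.

    max_attempts=None models an indefinite P; in that case the loop only ends
    when the client answers correctly.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        nonce = make_challenge()  # a fresh challenge for every attempt
        if check_answer(nonce, client(nonce)):
            return True
        attempt += 1
    return False
```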

However, if a proper answer to the computational challenge is intercepted by the network traffic management device 110 within a threshold limit of P intervals (if applicable), the process proceeds to block 308, as indicated by arrow 306. At block 308, the security module 210 of the network traffic management device 110 assigns a CSHUI_ID for the session and injects a CSHUI code into responses from the server which are sent to the client device 106. As stated above, a CSHUI cookie corresponding to the injected CSHUI code is stored on the client device 106, whereby the CSHUI code monitors mouse and/or keyboard actions on the client device 106 and updates one or more values in the CSHUI cookie to indicate whether a human is operating the client device 106. The modified value in the CSHUI cookie is provided in subsequent requests from the client device 106 which are then inspected by the security module 210.

As indicated by arrow 312, if the CSHUI cookie comes back to the network traffic management device 110 with a value which indicates that the requests are coming from a bot or web scraper, the security module 210 considers the validation cycle as unsafe and goes into the prevent mode (block 314). If the security module 210 was previously operating in the blocking mode, then all further requests (up to the M number of unsafe intervals) are blocked from being passed on through to the server 102. However, if the security module 210 was previously in the non-blocking mode, the request is allowed to pass through on to the server 102 but the security module 210 produces an alarm notification for an M number of unsafe intervals, as described below.

However, as stated above, it is possible that the request received by the security module 210 may not even include a CSHUI cookie value. It is also possible that the request may include a CSHUI cookie whose value is unchanged. In either case, as shown by arrow 310, the security module 210 is configured to repeat the step at block 308 for an N number of grace intervals. In particular, the security module 210 will allow a limited number of subsequent requests to be passed on to the server 102 even though the CSHUI cookie does not definitively indicate whether the client device 106 is being operated by a human or a bot. The threshold of a maximum number of grace intervals to be allowed per session can be set to a desired value N on the network traffic management device 110. However, as shown by arrow 312, if the number of requests exceeds the threshold limit N of grace intervals while the security module 210 is in the blocking mode, then the security module 210 refuses to pass the request on to the server (prevent mode) and moves to an unsafe/prevented state, as shown in block 314.

In block 314, it should be noted that the security module 210, in its blocking mode, will run a counter which keeps track of a number of requests received from the client device 106 which are refused to pass on to the server (“unsafe intervals”). However, if the security module 210 is operating in its non-blocking mode, it will allow requests to proceed to the server, and will issue alarm notifications for every request within an unsafe interval, as shown by arrow 316. If the number of refused requests reaches or exceeds the set threshold limit of M number of unsafe intervals, then the validation cycle starts over and the process begins at block 302, as shown by arrow 318.

Referring back to block 308, if the security module 210 finds that the CSHUI cookie has a modified value that indicates heuristics that a human is operating the client device 106 (e.g. keystrokes and/or mouse movements detected), the security module 210 will also inspect the CSHUI_ID present in the request. In particular, the security module 210 will determine whether the received CSHUI cookie has a CSHUI_ID that matches the CSHUI_ID for that particular session, as shown by arrow 320. If both of these conditions are satisfied, the security module 210 will conclude that the client device 106 is not a bot, and switches to an approval mode for the client device 106, whereby subsequent requests from the client device 106 will be allowed to access the server for that session (block 322). It should be noted that if the web client is identified to be one of the approved entities on a white-list, then the process described in FIG. 3 is not performed on the web client 106.

It should be noted that the security module 210, in its approval mode, will run a counter which keeps track of a subsequent Q number of requests received from the client device 106 which are allowed to pass on to the server (referred to herein as “safe intervals”). As shown by arrow 324, if the number of subsequent allowed requests exceeds the set threshold limit of Q safe intervals, then the validation cycle starts over, and the process begins again at block 302, as shown by arrow 326.
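By way of example only, the transitions among blocks 302, 308, 314 and 322 of FIG. 3 might be sketched as a small state machine. The class and method names, the reduction of each request to one of three signals (`"human"`, `"bot"`, `"unknown"`), and the treatment of the request that triggers a transition are assumptions made for illustration. The challenge phase at block 302 is assumed to have been passed, and the counters follow the thresholds N (grace), M (unsafe) and Q (safe) described above.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    START = 302     # new validation cycle (challenge phase, not modelled)
    MONITOR = 308   # CSHUI code injected, awaiting a definitive cookie value
    PREVENT = 314   # unsafe intervals
    APPROVED = 322  # safe intervals


@dataclass
class ValidationCycle:
    n_grace: int
    m_unsafe: int
    q_safe: int
    blocking: bool = True
    state: State = State.MONITOR
    count: int = 0

    def _enter(self, state: State) -> None:
        self.state = state
        self.count = 0

    def on_request(self, signal: str) -> str:
        """Return 'forward', 'block' or 'alarm' (forwarded with an alarm)."""
        if self.state is State.START:
            self._enter(State.MONITOR)  # assumption: challenge already passed
        if self.state is State.MONITOR:
            return self._monitor(signal)
        if self.state is State.PREVENT:
            return self._prevent()
        return self._approved()

    def _monitor(self, signal: str) -> str:
        if signal == "human":           # flag true and CSHUI_ID matches
            self._enter(State.APPROVED)
            return "forward"
        if signal == "unknown":         # cookie missing or unchanged
            self.count += 1
            if self.count <= self.n_grace:
                return "forward"        # still within the N grace intervals
        self._enter(State.PREVENT)      # bot detected or grace exceeded
        return self._prevent()

    def _prevent(self) -> str:
        self.count += 1
        if self.count >= self.m_unsafe:  # reaches or exceeds M: cycle restarts
            self._enter(State.START)
        return "block" if self.blocking else "alarm"

    def _approved(self) -> str:
        self.count += 1
        if self.count > self.q_safe:     # exceeds Q: cycle restarts
            self._enter(State.START)
        return "forward"
```

In this sketch the non-blocking mode differs from the blocking mode only in the action returned during unsafe intervals, which mirrors the alarm-only behavior described for that mode.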

It should be noted that although the security module 210 in the network traffic management device is described herein as sending the computational challenges and processing the CSHUI code/cookies, it is contemplated that the web application or web server which serves the web application can be configured to perform these duties.

Additionally or alternatively, it is possible for the network traffic management device 110 to monitor the number of page traversals (e.g. load, unload and/or refresh requests) which are received from a particular web client over a predetermined amount of time to determine whether the web client is human or a web scraper. In an example, the predetermined number of allowed page traversals per time unit is set in the network traffic management device 110 by an administrator, wherein the setting may be a value that permits no more than 5 page traversals per second. If the network traffic management device 110 detects that more than 5 page traversals have occurred within a second (or less), then the network traffic management device 110 may consider the web client as hostile and drop the connection or may refer to the authorized list in the policy enforcer. This is based on the assumption that a human will not make such a high number of requests within that small period of time. It should be noted that although a value of more than 5 page traversals per second is discussed above, any other value greater or lesser than 5 is contemplated. It should be noted that keyboard and mouse movement and usage can be used as a heuristic for detecting a human user on the client side while the network traffic management device is detecting whether too many page traversals occur per time unit (e.g. identifying the requesting pattern as “non-human activity”).
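By way of example only, the page traversal limit described above might be sketched with a per-client sliding window. The `TraversalMonitor` class, the use of caller-supplied timestamps in seconds, and the decision to treat events exactly one window apart as falling within the same window are assumptions made for illustration.

```python
from collections import defaultdict, deque


class TraversalMonitor:
    """Hypothetical per-client counter of page traversals in a sliding window."""

    def __init__(self, max_traversals: int = 5, window_seconds: float = 1.0):
        self.max_traversals = max_traversals
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # client id -> recent timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one traversal; return True if the client exceeds the limit."""
        events = self._events[client_id]
        events.append(timestamp)
        # Discard traversals that are older than the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_traversals
```

A return value of `True` corresponds to the condition under which the device may consider the web client hostile; what happens next (dropping the connection or consulting the authorized list) is left to the caller.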

It is also contemplated that the network traffic management device randomizes and adds a hidden link to web pages where the CSHUI script is injected into the response. The hidden link cannot be viewed by a human, unless page markup and code are carefully inspected, although it will be able to be selected by a bot or web scraper. As a result, if the hidden link is selected, the CSHUI cookie will be set to false and the network traffic management device will conclude that the web client is hostile. Additionally or alternatively to the page traversal feature described above, keyboard and mouse movement and usage can be used as a heuristic for detecting a human user on the client side while the hidden link functionality is performed to determine whether the client device is being operated by a human or a bot.
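By way of example only, adding a randomized hidden link to a response might be sketched as follows. The `HiddenLinkTrap` class, the inline `display:none` style, and the random path format are assumptions made for illustration. As a simplification, this sketch recognizes a selected hidden link on the device side by its requested path, whereas the disclosure describes the CSHUI cookie being set to false when the link is selected.

```python
import re
import secrets

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


class HiddenLinkTrap:
    """Hypothetical helper that injects hidden links and remembers their paths."""

    def __init__(self) -> None:
        self._trap_paths = set()

    def inject(self, html: str) -> str:
        """Insert a link with a random path just before the last </body> tag."""
        path = "/" + secrets.token_urlsafe(12)
        self._trap_paths.add(path)
        link = (f'<a href="{path}" style="display:none" '
                f'tabindex="-1" aria-hidden="true"></a>')
        matches = list(_BODY_CLOSE.finditer(html))
        if not matches:
            return html + link  # no closing body tag: append at the end
        idx = matches[-1].start()
        return html[:idx] + link + html[idx:]

    def is_trap(self, path: str) -> bool:
        """True if the requested path belongs to an injected hidden link."""
        return path in self._trap_paths
```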

Having thus described the basic concepts, it will be rather apparent to those skilled in the art that the foregoing detailed disclosure is intended to be presented by way of example only, and is not limiting. Various alterations, improvements, and modifications will occur to those skilled in the art and are intended, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested hereby, and are within the spirit and scope of the examples. Additionally, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes to any order except as may be specified in the claims. Accordingly, the invention is limited only by the following claims and equivalents thereto.

Claims

1. A method for preventing web scraping attacks implemented by a network traffic management system comprising one or more network traffic management computing devices, client devices, or server devices, the method comprising:

inserting a script into a response to a request received from one of the client devices, wherein the request is associated with a session and the script comprises an event listener configured to detect human activity at the one of the client devices;
receiving another request associated with the session from the one of the client devices and determining when the another request comprises a predetermined value indicating that human activity was detected at the one of the client devices by the script; and
providing the one of the client devices access to web content corresponding to the another request, when the determining indicates that the another request comprises the predetermined value set by the script.

2. The method of claim 1, further comprising:

determining when the one of the client devices is capable of executing JavaScript code based on a challenge answer included in the request;
inserting the script into the response, when the determining indicates that the one of the client devices is capable of executing JavaScript code; and
dropping a connection with the one of the client devices corresponding to the request, when the determining indicates that the one of the client devices is not capable of executing JavaScript code.

3. The method of claim 1, further comprising assigning a unique identifier to the session and sending the unique identifier to the one of the client devices with the response, wherein the another request comprises a cookie comprising the unique identifier and the predetermined value comprises a flag value set in the cookie.

4. The method of claim 1, further comprising inserting a hidden link into the response prior to sending the response to the one of the client devices, wherein the script is further configured to determine when the hidden link is selected and set the predetermined value, when the determining indicates that the hidden link is selected.

5. The method of claim 1, further comprising:

storing a message key as associated with the session and sending another response to the one of the client devices that includes the web content and the script, when the determining indicates that the another request comprises the predetermined value set by the script;
determining when one or more subsequent requests received from the one of the client devices comprise the predetermined value set by the script;
determining when the message key associated with the session is stored, when the determining indicates that the one or more subsequent requests do not comprise the predetermined value set by the script; and
providing the one of the client devices access to additional web content corresponding to the one or more subsequent requests, when the determining indicates that the message key associated with the session is stored.

6. A network traffic management computing device, comprising memory comprising programmed instructions stored thereon and one or more processors configured to be capable of executing the stored programmed instructions to:

insert a script into a response to a request received from one of the client devices, wherein the request is associated with a session and the script comprises an event listener configured to detect human activity at the one of the client devices;
receive another request associated with the session from the one of the client devices and determining when the another request comprises a predetermined value indicating that human activity was detected at the one of the client devices by the script; and
provide the one of the client devices access to web content corresponding to the another request, when the determining indicates that the another request comprises the predetermined value set by the script.

7. The network traffic management computing device of claim 6, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to:

determine when the one of the client devices is capable of executing JavaScript code based on a challenge answer included in the request;
insert the script into the response, when the determining indicates that the one of the client devices is capable of executing JavaScript code; and
drop a connection with the one of the client devices corresponding to the request, when the determining indicates that the one of the client devices is not capable of executing JavaScript code.

8. The network traffic management computing device of claim 6, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to assign a unique identifier to the session and send the unique identifier to the one of the client devices with the response, wherein the another request comprises a cookie comprising the unique identifier and the predetermined value comprises a flag value set in the cookie.

9. The network traffic management computing device of claim 6, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to insert a hidden link into the response prior to sending the response to the one of the client devices, wherein the script is further configured to determine when the hidden link is selected and set the predetermined value, when the determining indicates that the hidden link is selected.

10. The network traffic management computing device of claim 6, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to:

store a message key as associated with the session and send another response to the one of the client devices that includes the web content and the script, when the determining indicates that the another request comprises the predetermined value set by the script;
determine when one or more subsequent requests received from the one of the client devices comprise the predetermined value set by the script;
determine when the message key associated with the session is stored, when the determining indicates that the one or more subsequent requests do not comprise the predetermined value set by the script; and
provide the one of the client devices access to additional web content corresponding to the one or more subsequent requests, when the determining indicates that the message key associated with the session is stored.

11. A non-transitory computer readable medium having stored thereon instructions for preventing web scraping attacks comprising executable code which when executed by one or more processors, causes the one or more processors to:

insert a script into a response to a request received from one of the client devices, wherein the request is associated with a session and the script comprises an event listener configured to detect human activity at the one of the client devices;
receive another request associated with the session from the one of the client devices and determining when the another request comprises a predetermined value indicating that human activity was detected at the one of the client devices by the script; and
provide the one of the client devices access to web content corresponding to the another request, when the determining indicates that the another request comprises the predetermined value set by the script.

12. The non-transitory computer readable medium of claim 11, wherein the executable code when executed by the one or more processors further causes the one or more processors to:

determine when the one of the client devices is capable of executing JavaScript code based on a challenge answer included in the request;
insert the script into the response, when the determining indicates that the one of the client devices is capable of executing JavaScript code; and
drop a connection with the one of the client devices corresponding to the request, when the determining indicates that the one of the client devices is not capable of executing JavaScript code.

13. The non-transitory computer readable medium of claim 11, wherein the executable code when executed by the one or more processors further causes the one or more processors to assign a unique identifier to the session and send the unique identifier to the one of the client devices with the response, wherein the another request comprises a cookie comprising the unique identifier and the predetermined value comprises a flag value set in the cookie.

14. The non-transitory computer readable medium of claim 11, wherein the executable code when executed by the one or more processors further causes the one or more processors to insert a hidden link into the response prior to sending the response to the one of the client devices, wherein the script is further configured to determine when the hidden link is selected and set the predetermined value, when the determining indicates that the hidden link is selected.

15. The non-transitory computer readable medium of claim 11, wherein the executable code when executed by the one or more processors further causes the one or more processors to:

store a message key as associated with the session and send another response to the one of the client devices that includes the web content and the script, when the determining indicates that the another request comprises the predetermined value set by the script;
determine when one or more subsequent requests received from the one of the client devices comprise the predetermined value set by the script;
determine when the message key associated with the session is stored, when the determining indicates that the one or more subsequent requests do not comprise the predetermined value set by the script; and
provide the one of the client devices access to additional web content corresponding to the one or more subsequent requests, when the determining indicates that the message key associated with the session is stored.

16. A network traffic management system, comprising one or more traffic management devices, client devices, or server devices, the network traffic management system comprising memory comprising programmed instructions stored thereon and one or more processors configured to be capable of executing the stored programmed instructions to:

insert a script into a response to a request received from one of the client devices, wherein the request is associated with a session and the script comprises an event listener configured to detect human activity at the one of the client devices;
receive another request associated with the session from the one of the client devices and determining when the another request comprises a predetermined value indicating that human activity was detected at the one of the client devices by the script; and
provide the one of the client devices access to web content corresponding to the another request, when the determining indicates that the another request comprises the predetermined value set by the script.

17. The network traffic management system of claim 16, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to:

determine when the one of the client devices is capable of executing JavaScript code based on a challenge answer included in the request;
insert the script into the response, when the determining indicates that the one of the client devices is capable of executing JavaScript code; and
drop a connection with the one of the client devices corresponding to the request, when the determining indicates that the one of the client devices is not capable of executing JavaScript code.

18. The network traffic management system of claim 16, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to assign a unique identifier to the session and send the unique identifier to the one of the client devices with the response, wherein the another request comprises a cookie comprising the unique identifier and the predetermined value comprises a flag value set in the cookie.

19. The network traffic management system of claim 16, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to insert a hidden link into the response prior to sending the response to the one of the client devices, wherein the script is further configured to determine when the hidden link is selected and set the predetermined value, when the determining indicates that the hidden link is selected.

20. The network traffic management system of claim 16, wherein the one or more processors are further configured to be capable of executing the stored programmed instructions to:

store a message key as associated with the session and send another response to the one of the client devices that includes the web content and the script, when the determining indicates that the another request comprises the predetermined value set by the script;
determine when one or more subsequent requests received from the one of the client devices comprise the predetermined value set by the script;
determine when the message key associated with the session is stored, when the determining indicates that the one or more subsequent requests do not comprise the predetermined value set by the script; and
provide the one of the client devices access to additional web content corresponding to the one or more subsequent requests, when the determining indicates that the message key associated with the session is stored.
Patent History
Publication number: 20170034210
Type: Application
Filed: May 5, 2016
Publication Date: Feb 2, 2017
Inventors: Ron Talmor (Sunnyvale, CA), Shlomo Yona (Tel Aviv), Orit Margalit (Tel Aviv), Beni Serfaty (Karkor)
Application Number: 15/147,577
Classifications
International Classification: H04L 29/06 (20060101);