IDENTIFYING DYNAMIC CONTENT IN RESPONSES
Identifying dynamic content in responses includes comparing responses to requests for web pages, identifying portions of the responses that are different as dynamic content, and creating a template that designates the dynamic content.
This is a continuation of U.S. application Ser. No. 12/365,771, filed Feb. 4, 2009, which is hereby incorporated by reference.
BACKGROUND
HyperText Markup Language (HTML) is the predominant markup language used for delivering and displaying Web pages. It provides a means to describe the structure of text-based information in a web page by denoting certain text as links, headings, paragraphs, lists, and so on. HTML code can be written by a web designer, and/or can be generated automatically. A web server provides the HTML code in response to a request for a web page in accordance with the HyperText Transfer Protocol (HTTP). The HTML code is then received by a web client and rendered by a web browser as a web page for viewing. HTML code can also supplement markup text with interactive forms, embedded images, and other objects. HTML code is written in the form of tags, surrounded by angle brackets.
HTML comprises components called “elements.” Elements provide the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. An element has a start tag (e.g., <element-name>) and usually an end tag (e.g., </element-name>). The element's attributes are contained in the start tag, and any content is located between the start and the end tags (e.g., <element-name attribute=“value”>Content</element-name>). Some elements, such as <br>, do not have any content and do not have a closing tag.
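The element structure described above can be illustrated with a short sketch. It uses the `html.parser` module of the Python standard library; the class name `ElementLogger` and the sample markup are assumptions made only for illustration and do not come from the disclosure.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Records start tags (with attributes), content, and end tags in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs taken from the start tag
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Content is the text located between a start tag and an end tag
        if data.strip():
            self.events.append(("content", data.strip()))


logger = ElementLogger()
logger.feed('<p class="intro">Hello<br>world</p>')
logger.close()
for event in logger.events:
    print(event)
```

In this sample the `<br>` element produces a start event but no content and no end event, which matches the description of elements that have no closing tag.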
A malicious user can attack the operation or performance of a web server, such as by gaining unauthorized access to the server and changing web page code, operating parameters, or the like, or by taking advantage of web programming weaknesses. One way to detect an attack is to examine the contents of web pages provided by a web server in response to HTTP requests. The process of scanning web pages can be automated to a degree, for example by detecting changes in a web server's HTTP responses to requests for a web page, including changes in the response time or changes in the web pages provided in response to identical requests. One challenge faced by automatic web scanners is that many things besides an attack can cause a web server response to change, such as changing ad banners, time stamps, page hit counters, and the like. Typically, such content can change even if web page requests are identical. As used herein, the term “dynamic” indicates web page content that changes in the responses to identical web page requests. The term “static” indicates web page content that does not change in the responses to identical web page requests.
Differential analysis is a technique used to compare two or more HTTP responses to determine if there are differences between them, and if so, identify the differences. One of the challenges of using differential analysis to analyze web pages is unknown web application behavior. For example, HTTP responses may contain dynamically generated content such as ad banners, page request counts, time stamps, and other elements that are independent of the request parameters. Two responses to identical requests can be the same with regard to the matter that was requested, but the responses may be different because their dynamic content has changed.
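As a minimal sketch of differential analysis, the following compares two responses line by line and reports only the regions that differ. It relies on `difflib.SequenceMatcher` from the Python standard library, which is one possible comparison technique and is not named in the disclosure; the function name `diff_lines` and the two sample responses are hypothetical.

```python
import difflib


def diff_lines(first, second):
    """Return (opcode, lines_from_first, lines_from_second) for each differing region."""
    a = first.splitlines()
    b = second.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only regions that are not identical
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


# Two hypothetical responses to identical requests; only the hit counter differs.
response_1 = "<html>\n<p>Welcome</p>\n<p>Hits: 101</p>\n</html>"
response_2 = "<html>\n<p>Welcome</p>\n<p>Hits: 102</p>\n</html>"

changes = diff_lines(response_1, response_2)
print(changes)
```

Here the requested matter (the welcome text) is unchanged, and the only reported difference is the page hit counter, which is the kind of dynamic content discussed above.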
The accompanying drawings are included to illustrate and provide a further understanding of the disclosed embodiments. In the drawings:
Reference will now be made in detail to the disclosed embodiments, examples of which are illustrated in the accompanying drawings.
It is appreciated that, although exemplary computing system 100 is shown to comprise a single CPU 110, such description is merely illustrative, as computing system 100 may comprise a plurality of CPUs 110. Additionally, computing system 100 may exploit the resources of remote CPUs (not shown) through communications network 170 or some other data communications means (not shown).
In operation, CPU 110 fetches, decodes, and executes instructions from a computer readable storage medium such as HDD 115. Such instructions can include an operating system (OS), executable programs, and the like. Information, such as computer instructions and other computer readable data, is transferred between components of computer system 100 via the computer's main data-transfer path. The main data-transfer path may use a system bus architecture 105, although other computer architectures (not shown) can be used, such as architectures using serializers and deserializers (serdes) to communicate data between devices over serial communication paths. System bus 105 typically includes data lines for sending data, address lines for sending addresses, and control lines for sending interrupts and for operating the system bus. An example of a system bus is the PCI (Peripheral Component Interconnect) bus. Some busses provide bus arbitration that regulates access to the bus by extension cards, controllers, and CPU 110. Devices that attach to the busses and arbitrate access to the bus are called bus masters. Bus master support also allows multiprocessor configurations of the busses to be created by the addition of bus master adapters containing a processor and its support chips.
Memory devices coupled to system bus 105 include random access memory (RAM) 125 and read only memory (ROM) 130. Such memories include circuitry that allows information to be stored and retrieved. ROMs 130 generally contain stored data that cannot be modified. Data stored in RAM 125 can be read or changed by CPU 110 or other hardware devices. Access to RAM 125 and/or ROM 130 may be controlled by memory controller 120. Memory controller 120 may provide an address translation function that translates virtual addresses into physical addresses as instructions are executed. Memory controller 120 may also provide a memory protection function that isolates processes within the system and isolates system processes from user processes. Thus, a program running in user mode can normally access only memory mapped by its own process virtual address space; it cannot access memory within another process's virtual address space unless memory sharing between the processes has been set up.
In addition, computing system 100 may contain peripherals controller 135 responsible for communicating instructions from CPU 110 to peripherals, such as printer 140, keyboard 145, and mouse 150.
Display 160, which is controlled by display controller 155, is used to display visual output generated by computing system 100. Such visual output may include text, graphics, animated graphics, and video. Display 160 may be implemented with a CRT-based video display, an LCD-based flat-panel display, gas plasma-based flat-panel display, touch-panel, or the like. Display controller 155 includes electronic components required to generate a video signal that is sent to display 160.
Further, computing system 100 may contain network adapter 165 which may be used to couple computing system 100 to an external communication network 170, which may include or provide access to the Internet. Communications network 170 may provide computer users with means of communicating and transferring software and information electronically. Additionally, communications network 170 may provide for distributed processing, which involves several computers and the sharing of workloads or cooperative efforts in performing a task. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
It is appreciated that exemplary computer system 100 is merely illustrative of a computing environment in which the herein described systems and methods may operate and does not limit the implementation of the herein described systems and methods in computing environments having differing components and configurations, as the inventive concepts described herein may be implemented in various computing environments having various components and configurations.
Computing system 100, described above, can be deployed as part of a computer network. In general, the above description for computing environments applies to both server computers and client computers deployed in a networked environment.
In operation, a user (not shown) may interact with a computing application running on a client computing environment to obtain desired data and/or computing applications. The data and/or computing applications may be stored on server computing environment 205 and communicated to cooperating users through client computing environments 100, 210, 215, 220, and 225, over the exemplary communications network. A participating user may request access to specific data and applications housed in whole or in part on server computing environment 205. Such data may be communicated between server 205 and client computing environments 100, 210, 215, 220, and 225 for processing and/or storage. Server 205 may host computing applications, processes and applets for the generation, authentication, encryption, and communication of data and applications and may cooperate with other server computing environments (not shown), third party service providers (not shown), network attached storage (NAS) and storage area networks (SAN) to realize application/data transactions.
In particular, a computing system 100 can send HTTP requests to server 205, and server 205 can respond by sending HTTP responses containing HTML code to the requesting computer system 100. The requesting computer system 100 may be located on a LAN with server 205, such as through router 230. Alternatively, computer system 100 may access server 205 via communications network 235, such as over the Internet. In either case, identical HTTP requests may result in different HTTP responses. Those different HTTP responses may be the result of normal operation, or they may be the result of a problem with the server 205, or a result of the activities of a hostile user.
The contents of the HTTP responses provided by server 205 can be scanned for changes, which can be helpful to an analyst in determining whether or not the changes are the result of normal operation. The process of scanning web pages can be automated to a degree, and it is helpful to the analyst to have the changes identified automatically, so that they can be more readily analyzed, such as by performing a differential analysis. Examples of changes due to normal web server operation include HTTP responses that contain dynamically generated content such as ad banners, page request counts, time stamps, and other elements that are independent of the request parameters. Two responses to identical requests can be the same with regard to the matter that was requested, but the responses may be different because their dynamic content has changed. It is helpful in performing a differential analysis comparing HTTP responses to properly identify the static and dynamic portions of the responses. Using differential analysis, one or more baselines can be established as bases for the comparison of HTTP responses.
In an exemplary embodiment, the portions of the HTTP responses that are identical to corresponding portions of the first response are identified as static content. The static content can also be included in the template, or indicators of the static content can be included in the template.
Requesting a page only twice or only a few times may not capture dynamic areas that change after a certain number of requests, such as on every 10th request, or after a varying number of requests. Thus, in an embodiment the additional requests after the first request can be based on a select number of requests, such as a number of requests selected by a user.
Furthermore, web pages may have features that change regularly or frequently, such as a quote of the day, or a weather forecast. Those features may change between the time (or date) the template is created and the time (or date) it is used, for example, to investigate a possible attack on the web server. Thus, in an embodiment, the additional requests after the first request can additionally, or alternatively, be based on the passage of a select amount of time, such as an amount of time selected by a user.
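The two stopping criteria described above, a select number of requests and a select amount of time, can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: `collect_responses`, its parameters, and the stand-in server `make_fake_server` are all hypothetical, and `fetch` represents whatever routine sends the identical request and returns the response body.

```python
import time
from typing import Callable, List, Optional


def collect_responses(fetch: Callable[[], str],
                      max_requests: int = 10,
                      max_seconds: Optional[float] = None,
                      delay_seconds: float = 0.0) -> List[str]:
    """Send identical requests until a request count or a time budget is reached."""
    responses = [fetch()]  # the first response serves as the basis for comparison
    started = time.monotonic()
    while len(responses) < max_requests:
        if max_seconds is not None and time.monotonic() - started >= max_seconds:
            break  # the selected amount of time has passed
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # optional spacing between requests
        responses.append(fetch())
    return responses


def make_fake_server() -> Callable[[], str]:
    """Hypothetical stand-in for a server whose banner changes on every 10th request."""
    count = 0

    def fetch() -> str:
        nonlocal count
        count += 1
        banner = "B" if count % 10 == 0 else "A"
        return f"<p>banner {banner}</p>"

    return fetch


few = collect_responses(make_fake_server(), max_requests=3)
many = collect_responses(make_fake_server(), max_requests=12)
print(len(set(few)), len(set(many)))
```

With only three requests the stand-in server appears entirely static, while twelve requests reach the tenth response and expose the changing banner.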
In an embodiment, the dynamic content can be made more easily recognizable by an analyst. That can be done in many ways, such as by tagging each identified instance of dynamic content. Such tagging can include surrounding each instance of dynamic content with easily recognizable characters that set it off from other content. Alternatively or in addition, the dynamic content can be highlighted, for example, by changing the formatting attributes such as using a different font, different color font, different size font, using bold, italics, underlining, indenting, and/or adding space before and/or after each identified instance or section of dynamic content. In this way, the dynamic content can be more easily recognized by the analyst.
In an embodiment, in identifying the dynamic and/or the static content, the HTML tags of the responses can be identified and optionally characterized. Tags of the HTML can be identified by their angle brackets or other formatting attributes in accordance with HTML standards, guidelines, rules, and/or definitions. Information of the HTML tags can be used to characterize them as associated with content that may not change between responses, i.e., as unchangeable HTML tags. Other HTML tags can be characterized as associated with content that may change between responses. Changeable content can be further characterized according to whether it does or does not actually change between the first response and any of the subsequent responses. For example, HTML tags associated with content that may, but does not, change between the first response and all subsequent responses can be characterized as static HTML tags. HTML tags associated with content that may change and in fact does change between the first response and any subsequent response can be characterized as dynamic HTML tags. The content that changes can be identified as dynamic content. Identifying and characterizing the HTML tags can be accomplished by consulting HTML standards, guidelines, rules, and/or definitions (collectively, “rules”), such as HTML rules stored in a storage device, to identify the tags associated with content that may change and the tags associated with content that may not change.
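The three-way characterization of tags can be sketched as shown below. Several simplifications are assumed: the stored HTML rules are represented by a small hypothetical set `UNCHANGEABLE_TAGS`, every response is assumed to contain the same sequence of tags so that tags can be compared by position, and only text directly inside an element is treated as its content. The names `TagTextCollector`, `collect`, and `characterize` are illustrative and do not appear in the disclosure.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for HTML rules held in a storage device
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
UNCHANGEABLE_TAGS = {"html", "head", "body"} | VOID_TAGS


class TagTextCollector(HTMLParser):
    """Collects [tag, direct_text] entries in document order."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._open = []  # indices into entries for elements that are still open

    def handle_starttag(self, tag, attrs):
        self.entries.append([tag, ""])
        if tag not in VOID_TAGS:  # void elements never receive content
            self._open.append(len(self.entries) - 1)

    def handle_endtag(self, tag):
        if self._open and self.entries[self._open[-1]][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self.entries[self._open[-1]][1] += data.strip()


def collect(html):
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.entries


def characterize(responses):
    """Label each tag of the first response as unchangeable, static, or dynamic."""
    first = collect(responses[0])
    others = [collect(r) for r in responses[1:]]
    labels = []
    for i, (tag, text) in enumerate(first):
        if tag in UNCHANGEABLE_TAGS:
            label = "unchangeable"
        elif any(i >= len(o) or o[i] != [tag, text] for o in others):
            label = "dynamic"  # content may change and does change
        else:
            label = "static"   # content may change but does not
        labels.append((tag, label))
    return labels


pages = [
    "<html><body><h1>News</h1><p>Hits: 1</p><br></body></html>",
    "<html><body><h1>News</h1><p>Hits: 2</p><br></body></html>",
]
labels = characterize(pages)
print(labels)
```

Comparing by position keeps the sketch short; a fuller implementation would need to align tags when the number or order of elements differs between responses.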
The template can be created in any of various ways. For example, the template may include information of both the static and the dynamic content. Alternatively, the template may include only information of the dynamic content, such as by removing the static content from the template, or by not including the static content in the template.
Template 450 is created based on all four responses. Template 450 is formed by identifying the dynamic content and including it in the template. Thus, the changed text “C” and “D” are both included in the template. In this example template, the dynamic content is tagged by setting it off from other content by preceding each instance of dynamic content with the line “===DIFF TEXT===”, and following each instance of dynamic content with the line “===END DIFF TEXT===”, so that it would be easily recognizable by an analyst. In addition, the formatting of the dynamic content and added lines is made bold-underline-italic, so that it would be even more easily recognizable by an analyst. The template also contains information of the HTML tags in the responses. Each HTML “TAG” is identified and characterized as: not being associated with content that could change (Defined Lower); being associated with content that could change but does not change (Static); or being associated with content that could change and in fact does change (Dynamic). Although this example includes particular formatting and content, other formatting and content could also be used, provided that the dynamic content is identified and included. An analyst would then use the template to further analyze the responses.
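A template of this general kind can be sketched in a few lines. The marker strings are taken from the description above; everything else is an assumption made for illustration, including the function name `build_template`, the four sample responses, and the simplification that all responses have the same number of lines so that they can be compared line by line. Bold-underline-italic formatting is omitted because plain text cannot carry it.

```python
DIFF_START = "===DIFF TEXT==="
DIFF_END = "===END DIFF TEXT==="


def build_template(responses):
    """Build a text template that sets off lines differing from the first response."""
    rows = [r.splitlines() for r in responses]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("this sketch assumes line-aligned responses")
    out = []
    for variants in zip(*rows):  # the same line position across all responses
        if len(set(variants)) == 1:
            out.append(variants[0])  # identical everywhere: static content
        else:
            out.append(DIFF_START)
            # keep each distinct changed line once, in order of first appearance
            changed = [v for v in dict.fromkeys(variants) if v != variants[0]]
            out.extend(changed)
            out.append(DIFF_END)
    return "\n".join(out)


# Four hypothetical responses to identical requests
responses = [
    "<p>A</p>\n<p>B</p>",
    "<p>A</p>\n<p>B</p>",
    "<p>A</p>\n<p>C</p>",
    "<p>A</p>\n<p>D</p>",
]
template = build_template(responses)
print(template)
```

In this sample the second line changes in the third and fourth responses, so both changed lines appear between the markers while the unchanged first line is carried through as static content.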
Using such a template may enable an analyst to more easily identify key areas of a web page by providing a simplified representation of the page. In addition, use of the template can result in increased accuracy and speed of differential analysis by enabling the analyst to compare only the portions of the responses that are shown to have changed from the first response. Furthermore, content that is not, in fact, dynamic is less likely to be mistakenly identified as dynamic content. Thus, the template can be relied on to accurately indicate actual dynamic content in HTTP responses to identical requests.
Various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
1. A method of identifying dynamic content in responses to web page requests, comprising:
- sending a first request for a web page;
- receiving a first response to the first request;
- sending at least one additional request for a web page, the at least one additional request identical to the first request;
- receiving at least one respective additional response to the at least one additional request;
- comparing, by a computing system having a central processing unit, the first response to each of the at least one additional response;
- identifying, by the computing system, portions of the at least one additional response that are different from corresponding portions of the first response as dynamic content; and
- generating, by the computing system, a template that designates the dynamic content.
2. The method of claim 1, further comprising identifying portions of the at least one additional response that are identical to corresponding portions of the first response as static content.
3. The method of claim 1, wherein a number of the at least one additional request sent is based on a select number of requests.
4. The method of claim 1, wherein a number of the at least one additional request sent is based on passage of a select amount of time.
5. The method of claim 1, wherein generating the template that designates the dynamic content includes tagging the dynamic content in the template.
6. The method of claim 1, further comprising:
- identifying markup language tags in the first response; and
- characterizing the identified markup language tags in the template.
7. The method of claim 6, wherein the characterizing comprises:
- identifying markup language tags associated with content that may not change between responses as unchangeable markup language tags;
- identifying markup language tags associated with content that may, but does not, change between the first response and all subsequent responses as static markup language tags;
- identifying markup language tags associated with content that may change and does change between the first response and any subsequent response as dynamic markup language tags; and
- identifying the content that changes as dynamic content.
8. The method of claim 7, wherein the identifying of the markup language tags comprises consulting markup language rules stored in a storage device to identify the markup language tags associated with content that may change and the markup language tags associated with content that may not change.
9. The method of claim 1, further comprising using the template to analyze responses to requests for web pages, the analyzing to detect unauthorized access of a server that provided the analyzed responses.
10. A non-transitory computer readable storage medium storing instructions that when executed by a computer cause the computer to:
- send a first request for a web page;
- receive a first response to the first request;
- send at least one additional request for a web page, the at least one additional request identical to the first request;
- receive at least one respective additional response to the at least one additional request;
- compare the first response to each of the at least one additional response;
- identify portions of the at least one additional response that are different from corresponding portions of the first response as dynamic content; and
- generate a template that designates the dynamic content.
11. The non-transitory computer readable storage medium of claim 10, wherein the instructions when executed cause the computer to further:
- identify a portion of the at least one additional response that is identical to a corresponding portion of the first response as static content; and
- designate the static content in the template.
12. The non-transitory computer readable storage medium of claim 10, wherein the instructions when executed cause the computer to further:
- use the template to analyze responses to requests for web pages, the analyzing to detect unauthorized access of a server that provided the analyzed responses.
13. A system comprising:
- at least one central processing unit (CPU) to: send a plurality of identical requests for web pages; receive a plurality of responses to the plurality of identical requests; compare a first response of the plurality of responses to a second response of the plurality of responses; identify a portion of the first response that is different from a corresponding portion of the second response as dynamic content; and generate a template that identifies the dynamic content.
14. The system of claim 13, wherein the at least one CPU is to further:
- identify a portion of the first response that is identical to a corresponding portion of the second response as static content; and
- designate the static content in the template.
15. The system of claim 13, wherein the at least one CPU is to further use the template to analyze responses to requests for web pages for detecting unauthorized access of a server that provided the analyzed responses.
Filed: Dec 30, 2014
Publication Date: Apr 30, 2015
Inventor: William Matthew Hoffman (Atlanta, GA)
Application Number: 14/585,805
International Classification: G06F 17/22 (20060101); G06F 17/21 (20060101);