SYSTEM FOR AUTOMATICALLY GENERATING WRAPPER FOR ENTIRE WEBSITES
A system for automatically generating a wrapper for an entire website, the wrapper characterising the structure of the website, the system having a plurality of functional elements, including at least one annotation module to classify components of a page and generate an annotated page, a page classification module to identify functional and informational components of an annotated page, and an action module to identify an action to be taken to further navigate the website, wherein at least one of the annotation module, page classification module and action module is operable in response to a plurality of domain-specific rules, where a domain is understood as a conceptual domain such as real estate, used cars, or electronics.
This application claims the benefit and priority of U.S. Provisional Application 62/057,395 filed on Sep. 30, 2014, the contents of which are hereby incorporated by reference.
The present application relates to a system for automatically generating a wrapper for an entire website, where the wrapper characterises the structure of the website.
BACKGROUND TO THE INVENTION
The identification and retrieval of data from websites is a difficult and pressing issue. While techniques have been developed for ‘crawling’ the web to classify and index websites, and to make the information available for searching and retrieval, extraction of complete data from a site is problematic. It is known for websites to offer APIs to facilitate data extraction, but this is far from universal. Automatic complete extraction of data from a site without a suitable API is difficult because sites are not structured consistently, and page elements such as forms, side bars, and navigation menus can be difficult to correctly identify or interact with. Supervised systems are known, in which a user navigates to a site and identifies relevant data, which the system then uses to direct data extraction, but these are time-consuming and not scalable. Automatic full-site extraction has so far only been used successfully in settings with limited structures, such as title and body extraction from news articles or search engine results. For extracting highly structured data, these approaches are unsuitable.
SUMMARY OF THE INVENTION
According to the present invention there is provided a system for automatically generating a wrapper for a website, the wrapper characterising the structure of the website, the system having a plurality of functional elements, including at least one annotation module to classify components of a page and generate an annotated page, a page classification module to identify functional and informational components of an annotated page, and an action module to identify an action to be taken to further navigate the website, wherein at least one of the annotation module, page classification module and action module is operable in response to a plurality of domain-specific rules, where a domain is understood as a conceptual domain such as real estate, used cars, or electronics.
The annotation module may be responsive to page structure components and text components to identify a domain-specific datatype and associated data value within the page.
The page classification module may comprise at least one analysis module and a first control element, the first control element being responsive to the classified page to select an analysis module.
The at least one analysis module may comprise one or more of a form analysis module, and a results page analysis module.
The page classification module may include a plurality of analysis modules and the first control module may be able to successively select a plurality of analysis modules.
The first control element may be operable to pass control to the action module.
The action module may be operable to select a navigation step and cause a further page of the website to be loaded.
The action module may be operable to select an action comprising one or more of: selecting a link, invoking a form completion module, performing a crawler action or returning to a previous page.
The action which may be selected by the action module may be dependent on the result of the operation of the page classification module.
A plurality of the modules may comprise relational transducers, wherein each relational transducer embodies a set of domain-specific rules defining a relationship between input data and output data.
The plurality of relational transducers may be disposed in a network, such that the execution order of the network of relational transducers varies depending on the page to be classified.
The system may further comprise a control transducer, the control transducer being operable to determine control flow within the network.
The system may further comprise a visual block analysis module to identify graphical elements of the page.
The system may comprise a data extraction system operable to receive the wrapper and navigate the website in accordance with the wrapper to extract data from the website.
An embodiment of the invention is described by way of example only with reference to the accompanying drawings, wherein;
With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
In the present example, a system embodying the invention will be described with reference to data extraction from sites in a particular knowledge domain, estate agent (real estate) data. However, the invention is applicable to extracting structured information relating to any other domain.
Throughout this description, we will refer to relational transducers. A relational transducer in this context refers to any computational element, however implemented, which is defined by particular input and output relations. Relational transducers can be self-contained, in that they interact through a shared memory and have no knowledge of each other. Because transducers are self-contained and are directed to a specific set of relations between inputs and outputs, the individual relational transducers can be relatively simple. Advantageously, in the present example the relational transducers are generally written as Datalog programs. Relational transducers, arranged into a transducer network, form the basic components of the system described herein. The transducer network provides integration and communication between the transducers in a way that represents an ideal trade-off among the system's primary integration goals:
(1) Isolation: Transducers communicate through a transactional shared memory and have no other knowledge of each other;
(2) Resumable: Transducers can be executed repeatedly, possibly continuing previously suspended computation, e.g., if new input data has become available.
(3) Complexity: Transducers are generally Datalog programs with controlled value invention, retaining Datalog's polynomial data complexity but with significant performance benefits.
(4) Data partitioning: The relations (in the shared memory) are strictly partitioned into fine-granular transducer scopes.
The relational transducers are resumable: they can yield processing and may be called again on the same page, returning new facts. Resumption is monotone, i.e., additional calls to a transducer may produce additional output, but never retract previously derived facts. Resumable transducers are further distinguished into state-driven and input-driven transducers. State-driven transducers maintain state between calls and may therefore produce new facts even if called with the same input. These are typically transducers that iterate over some collection, e.g., all links on a page, and maintain the position in the iteration in their state relations. Input-driven transducers may also be called multiple times, but may produce new data only if additional input is provided. Typically, these transducers are called only once or twice per page. A resumable transducer is exhausted if further calls yield no new data.
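By way of illustration only, the distinction between state-driven and input-driven resumption may be sketched in Python as follows. The class names, the `step` and `run` functions, and the representation of facts as tuples are assumptions made for this sketch and are not part of the described system, in which transducers are generally written as Datalog programs.

```python
from typing import FrozenSet, Set, Tuple

Fact = Tuple[str, ...]  # hypothetical fact representation, e.g. ("link", "/page2")


class StateDrivenLinkIterator:
    """Emits one link fact per call, keeping its position between calls."""

    def __init__(self) -> None:
        self.position = 0  # state relation: index of the next link to emit

    def step(self, inputs: FrozenSet[Fact]) -> Set[Fact]:
        links = sorted(f for f in inputs if f[0] == "link")
        if self.position >= len(links):
            return set()  # exhausted: further calls yield no new data
        fact = ("next_link", links[self.position][1])
        self.position += 1
        return {fact}


class InputDrivenPriceCollector:
    """Produces new facts only when previously unseen input arrives."""

    def __init__(self) -> None:
        self.seen: Set[Fact] = set()

    def step(self, inputs: FrozenSet[Fact]) -> Set[Fact]:
        fresh = {f for f in inputs if f[0] == "text" and f not in self.seen}
        self.seen |= fresh
        return {("price_candidate", f[1]) for f in fresh if f[1].startswith("$")}


def run(transducer, inputs: FrozenSet[Fact], memory: Set[Fact]) -> bool:
    """Calls the transducer once and reports whether it added new facts.

    Facts are only ever added to the shared memory, never removed,
    which keeps resumption monotone.
    """
    new = transducer.step(inputs) - memory
    memory |= new
    return bool(new)
```

In this sketch the state-driven iterator yields a new fact on each call with unchanged input until the links are exhausted, whereas the input-driven collector yields nothing on a repeated call unless the input has grown.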
Relational transducers described herein are each one of three general types:
(1) Stateless phenomenological transducers encoding phenomenological patterns, such as record and attribute identification. These patterns are domain independent, but query the possibly domain-dependent phenomenological knowledge. These transducers are typically input-driven resumable.
(2) Stateful guarded finite state transducers (gFSTs) encode finite state transducers where transitions are guarded by first-order formulas. All exploration decisions are delegated to such transducers, which are typically state-driven resumable. The transducers used for form filling are examples of such transducers.
(3) External programs are required for certain tasks such as interaction with the browser. The corresponding components are designed such that they can be formalised as relational transducers with an infinite background relation. These transducers are input-driven (e.g., for wrapper induction as discussed in the following example).
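By way of illustration only, a guarded finite state transducer of the second type may be sketched in Python as follows. Guards are modelled here as Python predicates over a set of facts, standing in for first-order formulas. The state names, relation names, and the `form_filler` example are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

Fact = Tuple[str, ...]
Guard = Callable[[FrozenSet[Fact]], bool]


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Guard   # stands in for a first-order formula over the facts
    output: Fact   # fact emitted when the transition fires
    target: str


class GuardedFST:
    def __init__(self, start: str, transitions: List[Transition]) -> None:
        self.state = start
        self.transitions = transitions

    def step(self, facts: FrozenSet[Fact]) -> Optional[Fact]:
        """Fires the first enabled transition from the current state, if any."""
        for t in self.transitions:
            if t.source == self.state and t.guard(facts):
                self.state = t.target
                return t.output
        return None


def has(relation: str) -> Guard:
    """Guard that holds when at least one fact of the given relation exists."""
    return lambda facts: any(f[0] == relation for f in facts)


# Hypothetical form-filling transducer: fill a field, then submit.
form_filler = GuardedFST(
    start="idle",
    transitions=[
        Transition("idle", has("form_field"), ("action", "fill_field"), "filling"),
        Transition("filling", has("submit_button"), ("action", "submit"), "submitted"),
    ],
)
```

Because the current state is retained between calls, repeated calls with the same facts advance the transducer through its states, which corresponds to the state-driven resumable behaviour described above.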
A system embodying the present invention, as illustrated in
A system embodying the invention is shown in outline in
With reference to
Within the initialisation module 11, element 110 loads the next page of the website to be examined, whether the initial top-level site URL, or a URL for a subsequent page as described below. At element 111, the success of this operation is checked and, if the load has failed, the page failure element is invoked. The loaded HTML document is parsed into a Document Object Model (“DOM”) to facilitate subsequent operations. In annotation module 12, the received and parsed page is annotated using an annotator element 120. Advantageously, annotator element 120 uses a plurality of labelled and named entity recognisers (“LNER”s) to identify domain-specific datatypes and values within the page. Using rules derived from the relevant domain knowledge, the LNERs serve to identify entities of interest, taking into account the HTML structure and CSS formatting, in addition to the text components of the page.
The success of this step is checked at 121 and element 200 is invoked in the event of failure. As shown at 122 and 123, a probing element may be invoked to assess the page structure if necessary. Visual block identification module 124 identifies graphical and visual elements of the page, such as frames and images.
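By way of illustration only, a much simplified annotator may be sketched in Python as follows, using the standard-library `html.parser` module. The `RULES` table, the `Annotator` class, and the tag-path representation are assumptions made for this sketch. The LNERs of the described system also take CSS formatting into account, which is not modelled here.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical domain rules for real estate: datatype name -> pattern.
RULES = {
    "PRICE": re.compile(r"\$\s?\d[\d,]*"),
    "BEDROOMS": re.compile(r"(\d+)\s+bed(room)?s?\b", re.IGNORECASE),
}

Annotation = Tuple[str, str, str]  # (datatype, matched value, enclosing tag path)


class Annotator(HTMLParser):
    """Walks the page and records typed values together with their tag path."""

    def __init__(self) -> None:
        super().__init__()
        self.path: List[str] = []
        self.annotations: List[Annotation] = []

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # Pop back to the matching open tag to tolerate unclosed elements.
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        for datatype, pattern in RULES.items():
            for match in pattern.finditer(data):
                self.annotations.append(
                    (datatype, match.group(0), "/".join(self.path))
                )


def annotate(html: str) -> List[Annotation]:
    """Returns all annotations found in the given HTML string."""
    annotator = Annotator()
    annotator.feed(html)
    annotator.close()
    return annotator.annotations
```

Each annotation in this sketch pairs a domain-specific datatype with its value and its structural position, so that later modules can reason about both the text and the page structure.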
On completion of the annotation and identification of the various components of the page, the page is then passed to page classification module 13. In this example, a plurality of modules 131, 132, 133 are shown which may be invoked by first control element 16′ of the control transducer 16, to identify or interact with functional components of the page (such as forms) or information components of the page (such as the results records of interest). Each of the analysis modules 131, 132, 133 has an associated guard 131a, 132a, 133a which may implement guard rules for the respective analysis module, to indicate whether the respective module is ready or appropriate for execution. In this example, module 131, referred to as OPAL, is selected when a form is identified on the page, and analyses the form to identify acceptable inputs and resultant behaviour. If result pages are identified, one of modules 132, 133 is run to identify regularities in the structure of a page, typically resulting from data-publishing templates. In this example, module 132, referred to as AMBER, is realised as a phenomenological transducer that encodes domain-independent rules for detecting patterns on websites. These patterns are combined with template discovery, that is, the detection of regularities in the structure of a page. AMBER's transducer is input-driven resumable: it is called once per page in a sequence of result pages (typically connected by pagination links), each time refining the model of the template underlying the result pages. In addition, AMBER uses the concept of a pivot attribute. A pivot attribute, such as PRICE, is a mandatory attribute of an easy-to-detect type. The system locates these pivot attributes in order to discard regular structures with irrelevant data, or irregular noise in otherwise regular structures, such as advertisements interspersed among records.
This increases the accuracy of record and attribute identification compared to existing template discovery approaches. In most product domains, PRICE or product identifier are ideal pivot attributes. In domains with no regular attributes, presentational attributes such as images or details page links may be chosen as pivot attributes. Textual analysis of the results from analysis modules 132, 133 is then performed by text analysis module 134.
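By way of illustration only, the use of a pivot attribute to discard noise may be sketched in Python as follows. The `Record` representation and the `filter_by_pivot` function are hypothetical. In the described system this filtering is combined with template discovery, which is not shown.

```python
from typing import Dict, List

Record = Dict[str, str]  # hypothetical candidate record: attribute type -> value


def filter_by_pivot(candidates: List[Record], pivot: str = "PRICE") -> List[Record]:
    """Keeps only candidate records that contain the mandatory pivot attribute."""
    return [record for record in candidates if record.get(pivot)]


candidates = [
    {"PRICE": "$250,000", "BEDROOMS": "3"},
    {"TEXT": "Sponsored: cheap mortgages"},  # advertisement between records
    {"PRICE": "$410,000", "BEDROOMS": "4"},
]
records = filter_by_pivot(candidates)
```

In this sketch the advertisement is dropped because it lacks a PRICE value, even though it occupies the same repeated structure as the genuine records.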
In module 14, as discussed in more detail below, actions for further navigation of the website are selected under the control of second control element 16″. The action selected may include filling an identified form as shown at 151, or performing a crawler action as shown at 152.
With reference to
In the event that the operation is successful, generator 160 stores the relevant information to be included in the wrapper. If the identified action is performed at step 161 and found to be successful when checked at step 162, the process repeats. The generator accumulates information about all identified result pages and the navigation paths leading to them, and integrates that information into a coherent wrapper program. It is input-driven resumable: it is called once per page, but accumulates the wrapper information over all the calls for one site. Within a result page sequence, it combines the collected information into a coherent wrapper for the underlying template of these pages, which is likely to be applicable to any other page in the sequence as well.
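By way of illustration only, the accumulation of wrapper information over successive calls may be sketched in Python as follows. The `WrapperBuilder` class, the representation of a navigation path as a tuple of action names, and the use of the most frequently observed selector as the shared template are all assumptions made for this sketch. The description does not specify how the collected information is combined.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NavigationPath = Tuple[str, ...]  # hypothetical: sequence of actions from the start page


@dataclass
class WrapperBuilder:
    """Accumulates, call by call, what was learned about each result-page sequence."""

    # navigation path -> record selectors observed on the pages of that sequence
    observations: Dict[NavigationPath, List[str]] = field(default_factory=dict)

    def add_page(self, path: NavigationPath, record_selector: str) -> None:
        """Called once per result page; only ever adds information."""
        self.observations.setdefault(path, []).append(record_selector)

    def build(self) -> Dict[NavigationPath, str]:
        """Picks, per sequence, the selector seen most often as the shared template."""
        return {
            path: max(set(selectors), key=selectors.count)
            for path, selectors in self.observations.items()
        }
```

A data extraction system receiving such a structure could replay each navigation path and apply the associated selector to every page in the sequence.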
Accordingly, the system comprises a plurality of relational transducers which form a synchronised transducer network. The network is referred to as being synchronised as its control flow is determined by a central controller, itself a relational transducer. Intuitively, a transducer network is a set of transducers with a transactional shared memory which serves as input and output for the transducers. The execution is controlled by the control transducer and a specific area in the memory is reserved for communication between controller and transducers. The network is self-adaptive, as the control flow is dynamically determined from transducer dependencies and their guard rules. Rather than relying on one or a few statically defined control flows for exploring a page or site, this allows the network to form different control flows for the exploration of each individual site. Dependency and guard rules, registered by the individual transducers with the control transducer, thus determine for each transducer separately if it can be executed at a given point. This determination is typically based on the already explored portion of the site or page. For example, a form analysis transducer has a dependency on the transducer that produces annotations for Document Object Model (“DOM”) elements and a guard rule that prevents it from running if there are no form elements on the page. It has the same priority as, e.g., the record identification transducer, which has similar dependencies, but is guarded by requiring the presence of pivot attribute annotations. If both guards are satisfied, the transducers may be executed in parallel. Transducers that yield an interaction with the browser cannot be executed in parallel, but must be sequentialised, as parallel access may break server state or JavaScript execution.
Therefore, for the selection and execution of actions, explicit priorities are used to sequentialise the actions and to prioritise actions with the highest estimated probability to lead to relevant data.
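By way of illustration only, the scheduling role of the control transducer may be sketched in Python as follows. The `Registration` and `Controller` classes, the transducer names, and the policy of running all ready non-browser transducers before any browser action are assumptions made for this sketch. The description states only that dependencies and guard rules determine readiness and that browser interactions are sequentialised by priority.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Set, Tuple

Fact = Tuple[str, ...]


@dataclass
class Registration:
    name: str
    depends_on: Set[str]                      # transducers that must have run first
    guard: Callable[[FrozenSet[Fact]], bool]  # guard rule over the shared memory
    uses_browser: bool = False                # browser interactions must not overlap
    priority: int = 0                         # higher runs first among browser actions


@dataclass
class Controller:
    registry: List[Registration] = field(default_factory=list)
    completed: Set[str] = field(default_factory=set)

    def ready(self, memory: FrozenSet[Fact]) -> List[Registration]:
        """Transducers whose dependencies have run and whose guard holds."""
        return [
            r for r in self.registry
            if r.name not in self.completed
            and r.depends_on <= self.completed
            and r.guard(memory)
        ]

    def next_batch(self, memory: FrozenSet[Fact]) -> List[str]:
        """Returns the names that may run together in the next round.

        Ready transducers without browser access form one parallel batch;
        otherwise the single highest-priority browser action is returned.
        """
        ready = self.ready(memory)
        parallel = [r.name for r in ready if not r.uses_browser]
        if parallel:
            return sorted(parallel)
        browser = sorted((r for r in ready if r.uses_browser), key=lambda r: -r.priority)
        return [browser[0].name] if browser else []


def has(relation: str) -> Callable[[FrozenSet[Fact]], bool]:
    """Guard that holds when at least one fact of the given relation exists."""
    return lambda memory: any(f[0] == relation for f in memory)


# Hypothetical registrations mirroring the example in the text.
controller = Controller([
    Registration("annotate", set(), lambda memory: True),
    Registration("form_analysis", {"annotate"}, has("form")),
    Registration("record_identification", {"annotate"}, has("pivot")),
    Registration("fill_form", {"form_analysis"}, has("form"), uses_browser=True, priority=2),
    Registration("follow_link", {"annotate"}, has("link"), uses_browser=True, priority=1),
])
```

On a page with both a form and pivot annotations, this sketch schedules form analysis and record identification in the same batch, and afterwards issues the browser actions one at a time in priority order. On a page without form elements, the guard keeps form analysis from running at all.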
In the above description, an embodiment is an example or implementation of the invention. The various appearances of “one embodiment”, “an embodiment” or “some embodiments” do not necessarily all refer to the same embodiments.
Although various features of the invention may be described in the context of a single embodiment, the features may also be provided separately or in any suitable combination. Conversely, although the invention may be described herein in the context of separate embodiments for clarity, the invention may also be implemented in a single embodiment.
Furthermore, it is to be understood that the invention can be carried out or practiced in various ways and that the invention can be implemented in embodiments other than the ones outlined in the description above.
Meanings of technical and scientific terms used herein are to be understood as commonly understood by one of ordinary skill in the art to which the invention belongs, unless otherwise defined.
Claims
1. A system for automatically generating a wrapper for a website, the wrapper characterising the structure of the website,
- the system having a plurality of functional elements, including at least one annotation module to classify components of a page and generate an annotated page, a page classification module to identify functional and informational components of an annotated page, and an action module to identify an action to be taken to further navigate the website,
- wherein at least one of the annotation module, page classification module and action module is operable in response to a plurality of domain-specific rules.
2. A system according to claim 1 wherein the annotation module is responsive to page structure components and text components to identify a domain-specific datatype and associated data value within the page.
3. A system according to claim 1 wherein the page classification module comprises at least one analysis module and a first control element, the first control element being responsive to the classified page to select an analysis module.
4. A system according to claim 3 wherein the at least one analysis module may comprise one or more of a form analysis module, and a results page analysis module.
5. A system according to claim 3 wherein the page classification module includes a plurality of analysis modules and the first control module is able to successively select a plurality of analysis modules.
6. A system according to claim 3 wherein the first control element is operable to pass control to the action module.
7. A system according to claim 1 wherein the action module is operable to select a navigation step and cause a further page of the website to be loaded.
8. A system according to claim 7 wherein the action module is operable to select an action comprising one or more of: selecting a link, invoking a form completion module, performing a crawler action or returning to a previous page.
9. A system according to claim 8 wherein the action which may be selected by the action module is dependent on the result of the operation of the page classification module.
10. A system according to claim 1 wherein a plurality of the modules comprise relational transducers, wherein each relational transducer embodies a set of domain-specific rules defining a relationship between input data and output data.
11. A system according to claim 1 wherein the plurality of relational transducers are disposed in a network, such that the execution order of the network of relational transducers varies depending on the page to be classified.
12. A system according to claim 1 further comprising a control transducer, the control transducer being operable to determine control flow within the network.
13. A system according to claim 1 comprising a visual block analysis module to identify graphical elements of the page.
14. A system according to claim 1 comprising a data extraction system operable to receive the wrapper and navigate the website in accordance with the wrapper to extract data from the website.
Type: Application
Filed: Sep 30, 2015
Publication Date: Nov 3, 2016
Inventors: Georg GOTTLOB (Oxford), Tim FURCHE (Oxford), Giovanni GRASSO (Oxford), Christian SCHALLHART (Oxford), Giorgio ORSI (Oxford)
Application Number: 14/871,027