Method for extracting content from structured or unstructured text documents

A method for selecting textual content within a document. Text is selected by applying pattern recognition to the document's structure or to the content itself. A pattern recognition rule selects the desired text by identifying the start and/or end positions of the content in the document. The delineated content is then said to be enclosed in an envelope. A series of envelopes may be used to identify the desired content. Successive envelopes are defined relative to a previous envelope. The contents of any envelope within a series, including the final envelope, may be extracted for use by other documents.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority from U.S. Provisional Patent Application No. 60/263,574, filed on Jan. 22, 2001, entitled “SYSTEM AND METHOD FOR DESIGNING, DEPLOYING AND MANAGING MOBILE APPLICATIONS.”

FIELD OF THE INVENTION

[0002] The present invention relates generally to the endeavor of reusing or repurposing the contents of documents for use in other documents or applications. More particularly, the invention relates to a generic method for selecting/extracting a body of content from a textual document.

BACKGROUND OF THE INVENTION

[0003] The Internet has been a greatly successful medium that allows for the sharing of and access to essential information. This success also stems from the Internet's newfound ability to carry out transactions. Traditionally, the Internet has been accessed using web browsers running on personal computers linked to the Internet.

[0004] However, with the advent of new web technologies, users may now access the same information from a variety of different devices using disparate standards. The new devices not only run on different software systems than existing websites and applications, they often use different media to transmit data, such as PSTN or wireless networks. More often than not, this makes such devices incompatible with existing sites. For example, a website built using the HTML markup language and designed for personal computers using HTML-based browsers cannot operate with Internet-enabled wireless phones that use Wireless Markup Language-based browsers.

[0005] In order to support these new devices and standards, a new breed of application will be built. A cost-effective solution for building these applications is to extract information from existing web sites, rather than implementing new systems from scratch. Thus, there is a need for a method to automatically extract information from current web sites and transform it for new application formats. This is referred to as repurposing content. Fundamental to this endeavor is the task of identifying the desired content or functionality within a web site for reuse.

[0006] A typical prior art approach for content and functional identification involves specifying the absolute location of the content, based on its location within the structure of the page's source code. However, this approach and others like it tend to be unreliable in practice, as web pages change in content and structure periodically. For example, a selection may be defined as ‘Select the third paragraph’ for an HTML-based web page. As seen in FIG. 1, this would result in the selection, “The quick brown fox slyly jumped over the lazy dog.” However, if a new paragraph is inserted at the beginning of the document as seen in FIG. 2, then the same selection definition would yield “Starlight, starbright, first star I see tonight, I wish I may, I wish I might, have the wish I wish tonight.”
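The fragility of position-based selection can be reproduced in a few lines. The following is a minimal Python sketch under stated assumptions: the function name nth_paragraph, the regular-expression paragraph matcher, and the placeholder paragraph texts are hypothetical, since FIGS. 1 and 2 are not reproduced here. Only the two quoted sentences come from the example above.

```python
import re

FOX = "The quick brown fox slyly jumped over the lazy dog."
STAR = ("Starlight, starbright, first star I see tonight, I wish I may, "
        "I wish I might, have the wish I wish tonight.")

def nth_paragraph(html: str, n: int) -> str:
    """Return the text of the n-th <p> element (1-based), purely by position."""
    paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)
    return paragraphs[n - 1]

# Assumed layout of the page in FIG. 1; the first paragraph is a placeholder.
original = "<p>Placeholder first paragraph.</p>" + f"<p>{STAR}</p>" + f"<p>{FOX}</p>"

# Assumed layout of FIG. 2: the same page with a paragraph inserted at the top.
revised = "<p>Placeholder inserted paragraph.</p>" + original

before = nth_paragraph(original, 3)  # the intended selection
after = nth_paragraph(revised, 3)    # same rule, different content
```

The selection rule is unchanged between the two calls, yet it returns different content once the page structure shifts by one paragraph.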

[0007] Given that it is common for web pages to change in structure and content regularly, this problem suggests that a different and improved approach to identifying and selecting content from web pages and other computer-based documents is valuable. This method must be robust enough to operate successfully even after reasonable changes in structure and content. While the need for the present invention arose from work involving web sites and web applications, the invention is not limited exclusively to the domain of web sites and web applications. Numerous other applications will be apparent.

SUMMARY OF THE INVENTION

[0008] The invention presents a method to select content from text documents that may be extracted for use by other systems. A primary advantage of the present invention is that it selects content correctly and reliably from documents that may change in content or structure over time. Preferably, the present invention achieves content selection by applying a series of selection commands in succession. The selection commands successively narrow the scope of the selected content until the required content is reached. The selected content is said to be enclosed in a selection envelope. Selection envelopes are comprised of two virtual markers that delineate the boundaries of each envelope. An envelope is defined by positioning these virtual markers around a specified body of content in the document.

[0009] The definition of a selection envelope may be made relative to a previously defined envelope. This definition is based on various, non-limited means of identifying bodies of content or structures within a document. These means include, but are not limited to, computer-based functions and methods.

[0010] One non-limiting advantage of the invention is that it presents a method for defining selection commands for both structured and unstructured documents. Structured documents can be interpreted as having structural content and textual/character content. Unstructured documents can only be interpreted as having textual/character content.

[0011] Using a powerful and extensible command set, such as one described herein, it is possible for an operator to create robust selection commands that correctly function, even on constantly changing documents. The method is preferably embodied in a software-based development environment executing on a computer and manipulated by an operator. The operator may use this software to create a set of instructions for the selection of content from a given document. These instructions may then be executed by a computer-based, run-time entity to select a body of content. Once selected, the content may be ‘repurposed’ by other documents.

[0012] A non-limited series of selection commands may be defined for a document. Each successive command specifies a smaller envelope, or child envelope, defined relative to a preceding, or parent, envelope. Each successive command further “narrows” in on a desired body of content. In summary, this method of content identification is referred to as Iterative Relative Enveloping (IRE).

Glossary of Terms

[0013] Begin marker: A virtual demarcation that signifies the commencement of a content envelope within the body of a web page.

[0014] Content selection envelope: See Selection envelope.

[0015] DTD: See Structured document.

[0016] End marker: A virtual demarcation that signifies the completion of a content envelope within the body of a web page.

[0017] Extraction command set: A set of selection envelopes. Applied to a set of source documents, an extraction set yields all the data to be extracted from the source for repurposing by another application.

[0018] IRE: See Iterative Relative Enveloping.

[0019] Iterative Relative Enveloping (IRE): An iterative process of selecting successively smaller envelopes of content. After selecting the first envelope of content, successive envelopes are all defined relative to the previous envelope.

[0020] Regular expression (regex): A pattern matching language to express how a computer program/human should look for a specified pattern in text. Regular expressions are composed of literal characters and metacharacters. Literal characters are normal text characters. Metacharacters combine literal characters according to a set of rules, similar to how arithmetic operators combine smaller (numeric) expressions.

[0021] Selection command: A function used to locate a specific piece of content within a document. If the content is located, begin and end markers may be placed adjacent to the content.

[0022] Selection envelope: A function of a set of domain-specific selection commands. The application of a selection envelope on a source document selects the desired data element(s).

[0023] Structured document: A structured document is a document whose contents follow a set of rules. Usually the rules are based on XML meta-language rules. XML is a World Wide Web Consortium standard that allows other languages to be formally defined; it is not an application unto itself. Languages defined using XML meta-language rules are referred to as XML-conformant languages, or in short, XML languages. XML language rules are defined in two formats: Document Type Definition (DTD) or XML Schema Definition (XSD) format. A DTD is a set of rules governing the element types that are allowed in an XML document and the rules for specifying the allowed content and attributes of each element type. The DTD also declares all the external entities referenced within the document and notations that can be used. A schema definition is essentially equivalent to a DTD definition, with the additional ability to define the element and attribute types.

[0024] Unstructured document: Any text document. A stream of textual data does not need to follow any structural rules. One can treat a structured document as an unstructured document if needed.

[0025] Web application: See Web site.

[0026] Web page: A computer file that can be viewed by an end user in a web browser. These pages may be constructed in a variety of computer languages, such as HTML, WML, VoiceXML, XHTML, or any other suitable language. At present, HTML is the most prevalent source language for web pages.

[0027] Web site: A computer-based system of logical instructions, presentation files and data organized to form an interactive source of information accessible via computer networks.

[0028] XML: See Structured document.

[0029] The foregoing has outlined some of the pertinent aspects of the present invention. These aspects are merely illustrative of some of the more prominent features and applications of the present invention. Other benefits can be understood by applying the invention in a different manner or modifying the invention, as described below. These and other features and advantages of the present invention will be best understood from the following drawings and detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0030] FIG. 1 illustrates the selection of the third paragraph of a HTML document.

[0031] FIG. 2 illustrates the selection of the third paragraph of the HTML document in FIG. 1, after a paragraph has been inserted.

[0032] FIG. 3 is a flow diagram illustrating the process of repurposing content according to a preferred embodiment of the present invention.

[0033] FIG. 4 illustrates the creation of a selection envelope by applying a selection command to a sample document.

[0034] FIG. 5 illustrates the selection of an object in the structured hierarchy of a document.

[0035] FIG. 6 illustrates the selection of content within a stream of content.

[0036] FIG. 7 illustrates the relationship between multiple selection commands and selection envelopes, assuming every envelope is nested completely within its parent.

[0037] FIG. 8 illustrates a child envelope that is relative to and nested within a parent envelope.

[0038] FIG. 9 illustrates a child envelope that is relative to but only partially overlapping a parent envelope.

[0039] FIG. 10 illustrates child envelopes that are relative to but outside parent envelopes.

[0040] FIG. 11 illustrates the selection of two objects in the structured hierarchy of a document.

[0041] FIG. 12 illustrates the selection of two strings within a stream of content.

[0042] FIG. 13A illustrates the process of defining selection commands in a selection envelope to identify the desired content according to a preferred embodiment of the present invention.

[0043] FIG. 13B is a flow diagram illustrating the creation of a selection command based on the document type and selection need.

[0044] FIG. 14 illustrates the application of a selection command to select an object in the structured hierarchy of a document.

[0045] FIG. 15 illustrates the application of a selection command to select a string within a stream of content.

[0046] FIG. 16 is a viewable version of a sample web page, as rendered in a web browser.

[0047] FIG. 17 is the HTML source for the sample web page in FIG. 16.

[0048] FIG. 18 illustrates a selection envelope surrounding the first table in the sample web page.

[0049] FIG. 19 illustrates a selection envelope surrounding the second table in the sample web page.

[0050] FIG. 20 illustrates a begin marker placed before the string “Section Title” and an end marker placed at the end of the document.

[0051] FIG. 21 illustrates a selection envelope surrounding the first paragraph in the parent envelope shown in FIG. 20.

[0052] FIG. 22 illustrates how the sample page shown in FIG. 21 may be altered without affecting the selected content.

[0053] FIG. 23 illustrates the selection of the first table row containing the text “Row1.”

[0054] FIG. 24 shows an unstructured document in the form of a news story.

[0055] FIG. 25 illustrates the begin marker placed behind the em dash and the end marker placed after the third paragraph in the example shown in FIG. 24.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

[0056] The present invention provides a method for selecting content from within a document. In the preferred embodiment, the method may be implemented on a computer system, server, and/or software platform. Particularly, the method may be embodied within conventional software that may be implemented by at least one conventional computer system or network (e.g., a plurality of cooperatively linked computers). The system may be operatively and communicatively coupled to a computer network (e.g., the Internet), thereby allowing the method to operate over a network and select content from remote documents or files.

[0057] The discussion below describes the present invention in the following manner: (i) Section I provides a formulaic description of a general method for repurposing content according to a preferred embodiment of the present invention; (ii) Section II provides a definition for selection envelopes; (iii) Section III describes the concept of selection commands and how to create them; (iv) Section IV elaborates on the method using a structured document in HTML; and (v) Section V elaborates on the method using an unstructured example.

[0058] I. General Method of Repurposing Content

[0059] FIG. 3 illustrates a method 1000 for repurposing content between two domains, according to a preferred embodiment of the present invention. A domain (Y) 1001 is an information source. If necessary, the information from domain (Y) 1001 is processed by a transformer (T1) 1002 into a set of textual documents. The information in domain (Y) 1001 may not be textual in origin, so that transformer (T1) 1002 may be required to convert it into text for use by the present invention. The present invention provides a method of selecting and extracting sets of information from text. This method is referred to as Iterative Relative Enveloping (IRE). These extracted sets of data are then passed to an external transformation system (T2) 1005. Transformation system (T2) 1005 may be required to convert the extracted data into a format that is used by a target domain (Y′) 1006. Target domain (Y′) 1006 may be any system that needs to use the information.

[0060] The elements in FIG. 3 may be formally specified as follows: domain (Y) 1001 is the set of documents from the source domain; transformer (T1) 1002 is the external transformation system for transforming the source documents into text documents; selected data (X) 1004 is the set of data desired for extraction; transformer (T2) 1005 is the external transformation system for transforming the output data into the format needed for system (Y′) 1006; and system (Y′) 1006 is the target domain. In some cases transformers T1 and T2 could be null. FIG. 3 illustrates the simplest case in repurposing Y for Y′. In practice, there can be any number of extraction sets and transformers between Y and Y′. The present invention provides a method to extract content from structured and unstructured documents using a set of extraction commands (E) 1003. This method is explained below using a series of equations as follows. When operated on domain (Y) 1001, E generates an extracted data set (X) 1004. This may be represented as follows:

E(Y)=X  (Equation 0)

[0061] where E is the complete extraction command set, and is an ordered set of selection envelopes. Each element is called a selection envelope because it “selects” a portion of text each time it is applied to the source document set. Sets E and X may be defined as follows:

E={s1, s2, s3, . . . , sm}  (Equation 1)

[0062] where E is an ordered set of selection envelopes with cardinality m, and

[0063] each sk is a selection envelope, ∀k such that 0<k≦m

X={x1, x2, x3, . . . , xm}  (Equation 2)

[0064] where X is an ordered set of all extracted data with cardinality m

[0065] and xk is the extracted data set, ∀k such that 0<k≦m

[0066] (The term ‘selection envelope’ is used in two ways. The first refers to a system of instructions that ‘selects’ content. The second refers to a container, or ‘envelope,’ generated by those instructions. This section uses the first definition. The second definition will be described below.)

[0067] The application of a selection envelope on the source document set results in a data element. More specifically, when any selection envelope sk is applied to domain (Y), the result is the corresponding data element xk.

sk(Y)=xk  (Equation 3)

[0068] ∀k such that 0<k≦m
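Equations 0 through 3 can be read as applying an ordered list of envelope functions to the source document set and collecting one data element per envelope. The following Python sketch illustrates that reading only; the type aliases, the function extract, the two toy documents, and the envelopes s1 and s2 are hypothetical.

```python
from typing import Callable, Dict, List

Documents = Dict[str, str]             # Y: document name -> document text
Envelope = Callable[[Documents], str]  # s_k: maps Y to one data element x_k

def extract(E: List[Envelope], Y: Documents) -> List[str]:
    """E(Y) = X: apply each s_k to Y and collect x_k in the same order."""
    return [s_k(Y) for s_k in E]

# Hypothetical source domain with two small text documents.
Y = {
    "y1": "Title: Alpha\nBody: one",
    "y2": "Title: Beta\nBody: two",
}

def s1(docs: Documents) -> str:
    """Hypothetical envelope: the title of document y1."""
    return docs["y1"].split("\n")[0][len("Title: "):]

def s2(docs: Documents) -> str:
    """Hypothetical envelope: the body of document y2."""
    return docs["y2"].split("\n")[1][len("Body: "):]

X = extract([s1, s2], Y)  # x_1 = s1(Y), x_2 = s2(Y)
```

Because E is ordered, the position of each element in X identifies the envelope that produced it.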

[0069] Selection envelopes are composed of instructions called selection commands. The set of all selection commands is typically domain specific. We denote the set of domain specific commands using the letter “C” as follows:

C={c1, c2, c3, . . . , ct}  (Equation 4)

[0070] where C is the set of all selection commands in a domain with cardinality t, and

[0071] ck is a selection command, ∀k such that 0<k≦t

[0072] Each selection envelope is made up of a set of one or more selection commands or selection functions applied in sequence. A selection function is a meta-level command generated by combining various selection commands using logical or programming language constructs. Each selection envelope sk may have a different number of selection commands, as required by the extraction. Each selection command in envelope sk is an instantiation of a command in C, with parameters, or an instantiation of a function that is defined using the selection commands in C, with parameters. Thus, the number of selection commands n in an envelope sk has no relation to t, the cardinality of C. A selection function “f” in equation (5) below is preferably a concatenation of one or more selection command instances.

sk=f(Ck)  (Equation 5)

[0073] where Ck ⊆ C

[0074] For any given selection envelope sk using n selection commands, sk contains n−1 envelopes, and the initial selection envelope is the same as the output of the initial selection command applied on the source document. Further, the selection commands are applied relative to the results of previous selection commands. The selection operator ⊙ in equation (6) indicates the concatenation of any two selection commands. For example, in a non-limiting embodiment ⊙ could be one of “*” and “+,” with “*” being similar to the Boolean “AND” operation, meaning “apply the previous command and this command,” and “+” being similar to the Boolean “OR” operation, meaning “apply the previous command; if false, then evaluate this command.”

skg=ckg(xki)⊙skg−1  (Equation 6)

[0075] ∀k such that 0<k≦m,

[0076] where skg is the envelope with g successively applied commands, 1<g≦n,

[0077] ckg is the gth invocation of a selection command in set Ck,

[0078] xki is the result of the ith selection command, such that 1≦i<g, and

[0079] ⊙ denotes the operation of selection. In the case where g=1,

sk1=ck1(Y)=xk1  (Equation 7)

[0080] The first selection envelope is a result of applying the first selection command on the input set Y.

[0081] Note that, when expanded, equation (6) is also equivalent to (shown without parameters)

skg=ckg⊙(ckg−1⊙( . . . (ck2⊙(ck1)) . . . ))  (Equation 8)

[0082] In equations (6), (7) and (8) each command ckg is the gth instance of command ck or of a function using commands defined in equation (4). These selection commands have required parameters that need to be specified when used. The same command may be applied multiple times in the same selection envelope with different parameters. For the sake of clarity, the parameters of the commands are not shown in the notation. Further, the notation ckg refers to the command of index g in any selection envelope sk.
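The two example operators can be sketched as function combinators. In the Python sketch below, a command is modeled as a function that returns a narrower selection or None when nothing is selected; that convention, the names and_then, or_else and between, and the sample page are all assumptions made for illustration.

```python
from typing import Callable, Optional

# A command maps a selection (text) to a narrower selection, or None on failure.
Command = Callable[[str], Optional[str]]

def and_then(previous: Command, current: Command) -> Command:
    """'*': apply the previous command, then this command to its result."""
    def combined(source: str) -> Optional[str]:
        selected = previous(source)
        return None if selected is None else current(selected)
    return combined

def or_else(previous: Command, current: Command) -> Command:
    """'+': apply the previous command; if it selects nothing, try this one."""
    def combined(source: str) -> Optional[str]:
        selected = previous(source)
        return selected if selected is not None else current(source)
    return combined

def between(start: str, end: str) -> Command:
    """Hypothetical command: select the text between two literal strings."""
    def command(source: str) -> Optional[str]:
        i = source.find(start)
        if i < 0:
            return None
        i += len(start)
        j = source.find(end, i)
        return None if j < 0 else source[i:j]
    return command

page = "header [body (price: 42) end] footer"
body = between("[", "]")
parens = between("(", ")")
braces = between("{", "}")

narrowed = and_then(body, parens)(page)   # brackets first, then parentheses inside
fallback = or_else(braces, parens)(page)  # braces select nothing, so parentheses run
```

Under this reading, "*" narrows the scope step by step, while "+" supplies an alternative when the earlier command fails.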

[0083] Applying each ckg successively on the previous selection envelope results in an intermediate data extraction, xkg. Note that ckg is an instance of a selection command, and may use any combination of the previously determined data sets, along with the initial input Y, as a parameter to the command. To elucidate further, the following shows all the intermediate steps in a selection of n steps:

[0084] sk1=ck1(f{Y}) and sk1(Y)=xk1

[0085] sk2=ck2(f{Y, xk1})⊙(sk1) and sk2(xk1)=xk2 . . .

[0086] skn=ckn(f{Y, xk1, . . . , xkn−1})⊙(skn−1) and skn(xkn−1)=xkn

[0087] Note that xkn, the result of the nth successive selection command, is the same as xk, which is the required extraction data element.

[0088] By repeating the above to extract all the content specified by X, the set E is achieved.

[0089] This exemplary scenario is presented to further elucidate selection envelopes:

[0090] Let Y={y1, y2, y3, . . . , y100} be the source domain of HTML documents

[0091] Let C={c1, c2, c3, c4} be the set of available commands

[0092] where the command set is specifically defined as follows:

[0093] c1 selects the document y to operate in from the set Y

[0094] c2 is a regular expression pattern matcher

[0095] c3 selects tabular data

[0096] c4 returns list data

[0097] Suppose the goal is to extract

[0098] 1. The table between “God . . . value our own.” in document y12.

[0099] 2. The first list in the document y25.

[0100] 3. The first table or first list of document y4.

[0101] Let X={x1, x2, x3} represent the above three data sets,

[0102] and the operation ⊙ is either “*” (logical AND) or “+” (logical OR).

[0103] The method specified herein may be used to determine the extraction command set E={s1, s2, s3} using the command set C.

[0104] The selection envelopes developed using this method are described below:

s1=c3′*(c2′*c1′); C1={c1, c2, c3}

[0105] c1′ selects document y12 from Y; it is an instantiation of c1

[0106] c2′ parameterizes c2 to only include content between “God . . . value our own” in y12

[0107] c3′ further finds a table in between the scope “God . . . value our own” in document y12, based on c3

s2=c4′*c1′; C2={c1, c4}

[0108] c1′ selects document y25 from Y

[0109] c4′ further finds a list in document y25

s3=(c3′*c1′)+(c4′*c1′); C3={c1, c2, c3, c4}

[0110] c1′ selects document y4 from Y

[0111] c4′ further finds a list in document y4

[0112] If a list is available, it returns here, otherwise (the “+” operator)

[0113] c1′ selects document y4 from Y

[0114] c3′ further finds a table in document y4

[0115] These may be applied to input set Y such that

s1(Y)=x1

s2(Y)=x2

s3(Y)=x3
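The exemplary scenario can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the three toy documents are hypothetical stand-ins for y4, y12 and y25, the commands use simple regular expressions instead of a full HTML parser, and a failed selection is modeled as None. The envelope s3 follows the expression as written, trying the table branch before the list branch.

```python
import re
from typing import Dict, Optional

# Hypothetical stand-ins for three of the source documents in Y.
Y: Dict[str, str] = {
    "y4": "<p>intro</p><ul><li>a</li></ul><table><tr><td>t4</td></tr></table>",
    "y12": ("<table><tr><td>outside</td></tr></table>"
            "God <table><tr><td>inside</td></tr></table> value our own."),
    "y25": "<ol><li>first</li></ol><ul><li>second</li></ul>",
}

def c1(docs: Dict[str, str], name: str) -> Optional[str]:
    """c1: select the document to operate in."""
    return docs.get(name)

def c2(text: Optional[str], pattern: str) -> Optional[str]:
    """c2: regular expression pattern matcher; returns the matched text."""
    if text is None:
        return None
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(0) if match else None

def c3(text: Optional[str]) -> Optional[str]:
    """c3: select the first table in the current scope."""
    return c2(text, r"<table\b.*?</table>")

def c4(text: Optional[str]) -> Optional[str]:
    """c4: select the first list in the current scope."""
    return c2(text, r"<(ul|ol)\b.*?</\1>")

def s1(docs):  # s1 = c3' * (c2' * c1')
    return c3(c2(c1(docs, "y12"), r"God.*?value our own"))

def s2(docs):  # s2 = c4' * c1'
    return c4(c1(docs, "y25"))

def s3(docs):  # s3 = (c3' * c1') + (c4' * c1')
    return c3(c1(docs, "y4")) or c4(c1(docs, "y4"))

x1, x2, x3 = (s(Y) for s in (s1, s2, s3))
```

In this sketch the table that precedes the scoped passage in y12 is ignored, because c3′ operates only on the scope produced by c2′.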

[0116] For any specific pair of domains, an extraction system consists of a design phase and an execution phase. During the design phase, an operator of the present invention uses the domain specific extraction commands C to produce an extraction command set E. This is achieved by defining specific selection envelopes to extract each data element. During the execution phase, a run-time system executes the selection envelopes to extract the contents.

[0117] II. Selection Envelopes

[0118] As mentioned before, selection envelopes are used both as instructions for selection of content and as a container for selected content. This second manifestation will now be described.

[0119] As shown in FIG. 4, a selection envelope 1400 is a container for a section of a document, delineated by two markers referred to as the begin marker 1200 and end marker 1300. These markers are virtual delineators that are created only during runtime. The begin marker 1200 defines the beginning of the selection envelope 1400 while the end marker 1300 defines the end of the selection envelope. The selected contents 1500 is what lies between these two markers.
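One possible representation of such a container is a pair of character offsets into an unmodified document. The Python sketch below assumes that representation; the class name Envelope, its field names, and the sample document are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    """A selection envelope held as two virtual markers over a document."""
    document: str
    begin: int  # offset of the begin marker
    end: int    # offset of the end marker (exclusive)

    def __post_init__(self) -> None:
        # The markers must lie inside the document and be correctly ordered.
        if not 0 <= self.begin <= self.end <= len(self.document):
            raise ValueError("markers must satisfy 0 <= begin <= end <= len(document)")

    @property
    def contents(self) -> str:
        """The selected contents: the text lying between the two markers."""
        return self.document[self.begin:self.end]

doc = "Header. Section Title: body text. Footer."
envelope = Envelope(doc, doc.index("Section Title"), doc.index(" Footer."))
```

Because the markers are stored as offsets rather than written into the text, the source document is left untouched, consistent with markers that exist only at runtime.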

[0120] A selection envelope can contain elements from structured or unstructured documents.

[0121] For the purpose of this invention, it is assumed that all structured documents are based on XML meta-language rules. XML is a known World Wide Web Consortium standard. XML allows other languages to be formally defined; it is not an application unto itself. Languages defined using XML meta-language rules are referred to as XML-conformant languages, or in short, XML languages. XML language rules are defined in two formats: Document Type Definition (DTD) or XML Schema Definition (XSD) format. A DTD is a set of rules governing the element types that are allowed in an XML document and the rules for specifying the allowed content and attributes of each element type. The DTD also declares all the external entities referenced within the document and notations that can be used. Stated otherwise, an XML DTD provides a means by which an XML processor can validate the syntax and some of the semantics of an XML document. A schema definition is essentially equivalent to a DTD definition, with the additional ability to define the element and attribute types. XML based languages can be of two types, well formed and strict. Well-formed documents are structurally complete. Strict documents are always accompanied by a rule set (DTD or schema) and strictly follow those rules. This invention applies to both. An HTML document can be treated as a well-formed XML document and used in structural operations. In addition, structured documents have both structural and textual representations.

[0122] Unstructured documents, also known as ‘character’ documents, are textual documents and do not need to follow any structural rules. They are comprised of text symbols that can be of any type and can be ordered in any sequence. A structured document may also be treated as an unstructured document. ASCII text is an example of an unstructured document.

[0123] For structured documents, a selection envelope can contain various arrangements of structures. As shown in FIG. 5, a structured document may be represented as a hierarchical structure 1110. A selection envelope 1410 made of a begin marker 1210 and end marker 1310 may contain any valid structural element, represented by object 1112. Selection envelopes containing structural objects place their begin markers and end markers immediately adjacent to the object so that they exclusively define the desired object. Just as the structure of a document may exist as an abstract system created by an XML processor, the begin and end markers are virtual objects in the document.

[0124] For unstructured documents, a selection envelope can contain contiguous segments of text based on the textual representation of the document. An example of a selection envelope with relation to an unstructured document is shown in FIG. 6. Begin marker 1220 and end marker 1320 are positioned around segments of content within the document.

[0125] More generally, a system of selection envelopes can be defined so that each successive selection envelope, or child envelope, is defined relative to a previously defined envelope, or parent envelope. As shown in FIG. 7, selection envelope 1430 may be defined for source document 1100. Envelope 1430 may then be used to produce envelope 1431 via selection command 1602, and so on. Selection commands are more fully explained below.

[0126] The relationship between a parent envelope and its successor, or child envelope, can take one of three forms. A child selection envelope 1441 may be nested within a parent selection envelope 1440, as shown in FIG. 8; partially overlapping a parent selection envelope, as shown in FIG. 9; or completely outside of a parent selection envelope, as shown in FIG. 10. The scope of the selection is iteratively refined until the desired content has been selected.
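When envelopes are held as offset pairs, the three relationships can be distinguished by comparing marker positions. The following Python sketch assumes half-open (begin, end) spans; the function name relationship and the label strings are hypothetical. Under this sketch, a child that extends beyond both ends of its parent is reported as partially overlapping.

```python
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) offsets, end exclusive

def relationship(parent: Span, child: Span) -> str:
    """Classify a child envelope relative to its parent envelope."""
    p_begin, p_end = parent
    c_begin, c_end = child
    if p_begin <= c_begin and c_end <= p_end:
        return "nested"                  # as in FIG. 8
    if c_end <= p_begin or c_begin >= p_end:
        return "outside"                 # as in FIG. 10
    return "partially overlapping"       # as in FIG. 9

parent = (10, 50)
kinds = [relationship(parent, child) for child in [(20, 30), (40, 60), (60, 70)]]
```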

[0127] Furthermore, multiple sets of selection envelopes may exist simultaneously for a given document when a selection command is applied. Referring to FIG. 11, a structured document 1110 can be seen to have two selection envelopes 1410 and 1411 that contain two different object structures. Referring to FIG. 12, an unstructured document can be seen to also have two selection envelopes.

[0128] The means by which a selection envelope is defined may differ for each envelope in a set. Thus, while a parent envelope may be defined by associating a marker with a certain string, the child selection envelope may be defined by associating a marker with a structural object. The means by which selections are defined will be described in detail later.

[0129] The preceding discussion can be further illuminated by referring to FIG. 13A. This figure illustrates the general process 2000 of creating a series of selection envelopes sk1, sk2, . . . , skn for a document Yk. It corresponds to equations (6), (7) and (8) described above.

[0130] The basic unit for this process is the specification of a selection envelope. Step 2004 specifies the source of information. A source may be a complete document or section of a document. For the first selection envelope, sk1, the source is the entire document Yk. In step 2005, a selection command c is parameterized to operate on Yk. In step 2006, parameterized command ck1 outputs data set xk1, which is the content selected by envelope sk1. Finally, step 2001 evaluates whether the desired content has been selected. If so, then xk1 is output to system Y′ by way of Transformer T2. This completes the process. If the desired content has not yet been selected, then the specification of a second envelope sk2 begins. The source is the set containing document Yk and the output of the previous selection command, xk1.

[0131] The process for specifying sk1 is equivalent to equation (7) above.

[0132] Subsequent selection envelopes are specified using the same process described above. Like the first selection envelope, sk2 is defined by a selection command, ck2, that outputs xk2. At decision gate 2002, it is again evaluated whether the desired content has been selected. If it has, xk2 is output to system Y′ by way of Transformer T2. This is equivalent to equations (6) or (8) where g=2.

[0133] If the desired content has not yet been selected, further envelopes are defined until a final envelope skn is defined. The source for skn is the set containing source document Yk and xk1, xk2, . . . , xkn−1. Selection command ckn outputs xkn, which is deemed to be the correct selection by the final decision gate 2003. This completes the process. This is equivalent to equations (6) or (8) where g=n.
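The process of FIG. 13A can be sketched as a loop in which each command receives the source document together with all earlier results, followed by a decision gate. In the Python sketch below the gate is modeled as a predicate, is_desired, standing in for the evaluation described above; that predicate, the function run_envelopes, the two sample commands, and the sample page are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# A parameterized command sees the source document and all earlier results.
Command = Callable[[str, Sequence[str]], Optional[str]]

def run_envelopes(document: str,
                  commands: Sequence[Command],
                  is_desired: Callable[[str], bool]) -> Optional[str]:
    """Apply commands in order until the decision gate accepts a selection."""
    results: List[str] = []                      # x_k1, x_k2, ...
    for command in commands:
        selected = command(document, results)
        if selected is None:                     # the command selected nothing
            return None
        results.append(selected)
        if is_desired(selected):                 # decision gate
            return selected
    return None

doc = "<html><body><div id='news'><p>Lead story</p><p>Other</p></div></body></html>"

def select_news(document: str, earlier: Sequence[str]) -> Optional[str]:
    """First envelope: the news block, located in the whole document."""
    i = document.find("<div id='news'>")
    j = document.find("</div>", i)
    return None if i < 0 or j < 0 else document[i:j + len("</div>")]

def first_paragraph(document: str, earlier: Sequence[str]) -> Optional[str]:
    """Second envelope: the first paragraph, relative to the previous result."""
    parent = earlier[-1]
    i = parent.find("<p>")
    j = parent.find("</p>", i)
    return None if i < 0 or j < 0 else parent[i + len("<p>"):j]

result = run_envelopes(doc, [select_news, first_paragraph],
                       is_desired=lambda text: "<" not in text)
```

Here the first envelope still contains markup and is rejected by the gate, so a second envelope is defined relative to it and yields the final selection.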

[0134] With this understanding, the detailed workings of selection commands can now be explained.

[0135] III. Selection Commands

[0136] As mentioned above, selection commands define selection envelopes or sets of selection envelopes. This section will describe the relationship between selection commands and selection envelopes.

[0137] For structured documents, the general relationship between selection commands and selection envelopes is illustrated in FIG. 14. A selection command 1610 may identify an object structure composed of a child object 1112 and descendant objects 1113, and thus specify a selection envelope 1410 around the structure. For unstructured documents, this general relationship is illustrated in FIG. 15. A selection command 1620 may define the locations of the virtual begin marker 1220 and virtual end marker 1320 and thus, define a selection envelope 1420.

[0138] Selection commands may use both structural and textual cues to define the selection envelope. Selection commands are not unique or universal; a set of extraction problems may require their own command set based on the markup language of the source document, a set of text operations, and programming language constructs.

[0139] Several prior art systems are based on either structure- or character-based operations. However, no prior art system has provided a combination of the two in the manner provided by the present invention, which offers increased flexibility. Also, the prior art systems based on structure-based operations use position-based information, such as second table, third paragraph, and others. The current invention creates selection commands based not only on position-based structure, but on semantic information (e.g., find the table with title “zzz”) as well. Further, while most prior art methods enable automatic generation of selection commands, providing a method that uses human intervention during design time leads to more robust extractions. With human intervention, the selections can use intrinsic content markers in a document as part of the command, which an automatic system could not do.

[0140] Selection commands can be categorized into 3 different groups including (1) selection commands based on document structure; (2) selection commands based on character patterns or regular expressions; and (3) combined selection commands. Each of these groups is discussed below.

[0141] Group 1. Selection Commands Based on Document Structure

[0142] In any structured document, the “structure” is defined by notation that is interspersed among the document content. For example, in the case of XML-based documents, it is in the form of XML tags. These tags create a hierarchical structure. Thus, when these documents are manifested in memory, an operator can define several commands that capitalize on the document hierarchy. This typically results in a traversal of the non-linear data structures in memory, which is sometimes more efficient than using character-based operations.

[0143] For example, an XML document may contain a hierarchy of chapters, sections, and sub-sections. Once this has been read into memory, locating a certain paragraph of a certain section of a certain chapter becomes a trivial indexing location command. A character-based search, on the other hand, would perform a linear search.

[0144] The disadvantage of such indexing commands is the precise nature of the addressing. For documents that are periodically changing in structure, simply relying on structural commands may be disastrous, as illustrated in FIGS. 1 and 2.

[0145] The following table illustrates a few of the structure/context based selection commands on structured documents.

TABLE 1: Structure/context based selection commands

Command name | Example command instances
Select elements by name | Given a name, select all the elements in the source document matching the name
Select element by location | Select elements by their location, such as the third table of the document, fifth address book entry, or nth occurrence of element k.
Select element by sibling relationship | Select the parent of element m with id = k or find the second sibling of element with id = k
Select element by attribute | Find all elements m whose attribute k has value v.
Select element by counter | Select nth child of root element.

[0146] Indexing commands such as “Select element by location” (e.g. find third table) may still successfully extract data when the structure of the document changes. However, contextual commands such as “Select elements by attribute” (e.g. find tables with title “Foobar”) will be more resilient to structural changes, assuming the content remains the same even if the structure changes. If an XML document is not strict, then some of the contextual commands might not be useful, as the attributes specified by the commands might not be present in the source document. In that case, the most reliable way to identify the elements is to use structural commands. One of ordinary skill in the art will appreciate how to create or extend context-based selection commands based on element order, attributes, and various relationships.
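
A few of the structure/context based commands can be sketched with Python's standard xml.etree.ElementTree module. The function names and the sample document are hypothetical and serve only to illustrate how an index-based command and an attribute-based command behave when a table is added ahead of the existing ones.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
  <table title="Prices"><row>1</row></table>
  <table title="Foobar"><row>2</row></table>
  <table title="Other"><row>3</row></table>
</doc>"""


def select_by_name(root, name):
    """Select all elements matching the name, in document order."""
    return list(root.iter(name))


def select_by_location(root, name, n):
    """Select the nth occurrence (1-based) of the named element."""
    return select_by_name(root, name)[n - 1]


def select_by_attribute(root, name, attr, value):
    """Select all named elements whose attribute has the given value."""
    return [e for e in root.iter(name) if e.get(attr) == value]


original = ET.fromstring(DOC)

# Simulate a structural change: a new table appears before the others.
changed = ET.fromstring(DOC)
changed.insert(0, ET.Element("table", {"title": "New"}))

# The index-based command still returns a table, but a different one.
print(select_by_location(original, "table", 2).get("title"))
print(select_by_location(changed, "table", 2).get("title"))

# The attribute-based command returns the same table in both versions.
print(select_by_attribute(changed, "table", "title", "Foobar")[0].get("title"))
```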

[0147] Group 2. Selection Commands Based on Character Patterns or Regular Expressions

[0148] Pattern- or regular expression-based operations treat text documents, both structured and unstructured, as a stream of characters, ignoring any structural notation that may be interspersed in the document. Most of these commands use patterns in the content itself to identify regions of text. By applying formal language theory, an operator of the present invention may build powerful regular expression commands to search and operate on bodies of text.

[0149] To create pattern-based selection commands, the input (or the contents) of the envelope is considered to be a stream of characters with certain delimiters such as ‘space’, ‘comma’, ‘newline’, and others. Those skilled in the art can appreciate how to create commands using regular expressions to find text containing specified strings and regular expressions. The table below illustrates two such operations:

TABLE 2: Character-based selection commands

Command name | Example command instance
Select text containing | Select text containing the word ‘patents’
Select text matching pattern | Select text matching pattern [1-9][0-9]*(\[0-9][0-9])?

[0150] The present invention allows selection commands to define the position of one or more pairs of virtual begin and end markers within a document, and in so doing, defines a selection envelope or a set of selection envelopes.
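
As a rough illustration, pattern-based commands can be sketched with Python's re module, returning each selection as a pair of begin and end offsets that play the role of the virtual markers. The function names are hypothetical, newline is assumed as the delimiter for the first command, and AMOUNT is an illustrative pattern chosen for this sketch (amounts such as 12.50), not necessarily the pattern intended in the table above.

```python
import re


def select_text_containing(text, word):
    """Return (begin, end) offsets of each line that contains `word`."""
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if word in line:
            spans.append((offset, offset + len(line.rstrip("\n"))))
        offset += len(line)
    return spans


def select_text_matching(text, pattern):
    """Return (begin, end) offsets of every match of a regular expression."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


# Illustrative pattern (an assumption): integers with an optional
# two-digit decimal part.
AMOUNT = r"[1-9][0-9]*(\.[0-9][0-9])?"

notes = "about patents\nnothing here\nmore patents filed"
invoice = "Total: 12.50 due, 7 items"

lines = [notes[b:e] for b, e in select_text_containing(notes, "patents")]
amounts = [invoice[b:e] for b, e in select_text_matching(invoice, AMOUNT)]
print(lines)
print(amounts)
```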

[0151] Group 3. Combining Context- and Pattern-Based Selection Commands Using Programming Language Constructs

[0152] Selection commands that have literal interpretations, such as “Select the third table after the statement ‘Final report:’” or “Select the table with the string ‘Stock Symbol:’ anywhere in the first row,” are compositions created using both structural- and character-based concepts. These are the most flexible and robust commands.

[0153] Those skilled in the art will appreciate how to use programming constructs such as conditionals, loops and variables, in addition to the two types of selection commands described above, to create meta-selection commands. For example:

TABLE 3

    Var k = (Select element ‘i’ with id=‘z’)
    Result := for each element e in k do;
        if e contains the pattern ‘text’
            Select e;
        End-if
    End-for each

[0154] will select all elements with id ‘z’ and containing pattern ‘text’ in the source document.
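
The same meta-selection command might look as follows in Python, using xml.etree.ElementTree for the structural step and re for the pattern step. The function name and the sample document are hypothetical; the sketch assumes the source is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET


def select_by_id_and_pattern(root, tag, id_value, pattern):
    """Select `tag` elements with the given id whose text matches `pattern`."""
    # Structural step: Var k = (Select element 'i' with id='z')
    candidates = [e for e in root.iter(tag) if e.get("id") == id_value]
    # Pattern step: keep each element e in k that contains the pattern
    regex = re.compile(pattern)
    return [e for e in candidates if regex.search("".join(e.itertext()))]


DOC = (
    "<doc>"
    "<i id='z'>some text here</i>"
    "<i id='z'>nothing</i>"
    "<i id='y'>text</i>"
    "</doc>"
)

result = select_by_id_and_pattern(ET.fromstring(DOC), "i", "z", "text")
print([e.text for e in result])
```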

[0155] Designing Selection Commands for Documents

[0156] The general process for creating selection commands is shown in FIG. 13B. This process provides for the creation of selection commands and selection functions.

[0157] In step 2016, the source document is evaluated automatically by a system or manually by an operator to be either a structured or unstructured document. Based on this, step 2017 is to select and parameterize an appropriate selection command. For structured documents, these include but are not limited to structural/contextual selection commands 2013 and pattern-based selection commands 2015. For character-based documents, this includes but is not limited to pattern-based selection commands 2015. Furthermore, the command set for structured documents may be based on combinations of structure/context-based selection commands and pattern-based selection commands. This is accomplished with the use of programmatic language constructs 2014. Programming language constructs 2014 may also be used to enhance selection commands for all documents by providing the ability to add conditions, loops, branching and other constructs.

[0158] The output of this process is a parameterized selection function ckn, which will create a selection envelope skn, similarly to equation (6) above.

[0159] IV. Examples of the Operation of Method 2000 Using a Structured Document

[0160] An application of the present invention is illustrated in the following examples, using a structured document, or more specifically, a web page based on HTML. The examples illustrate the general process as shown in FIG. 13A of extracting data from source Y for use in system Y′. They also illustrate the ability to create robust selection commands via the process shown in FIG. 13B.

[0161] The method 1000 will be defined as follows for the following four examples. The source document Y is an HTML document, seen in rendered form in FIG. 16 and in HTML source view in FIG. 17. The examples will illustrate the creation of four selection envelopes s1, s2, s3, and s4 that respectively identify x1, x2, x3, and x4. As described above, selection envelopes are functions of selection commands ‘c’ that are defined below.

[0162] Specifically, the content selection goals for this example are as follows: Selection envelope, s1, is to contain x1, the first table in source document Y. Envelope s2 is to contain x2, the second table in the document. Envelope s3 is to contain x3, a certain paragraph in the document specified in detail below. Lastly, s4 is to contain x4, a certain paragraph containing a given string specified in detail below.

[0163] For the purposes of these examples, an initial selection envelope exists before any selection commands are specified. This envelope contains the entire source document.

[0164] Let

s1=f(c1)

s2=f(c1)

s3=f(c2, c1)

s4=f(c3)

[0165] Let X={x1, x2, x3, x4} represent the result of the above four selections where

[0166] s1 yields x1

[0167] s2 yields x2

[0168] s3 yields x3

[0169] s4 yields x4

[0170] when applied to source document Y.

[0171] Let E={s1, s2, s3, s4} where E is the complete set of content selected by the selection commands.

[0172] For the purposes of this example, let the total set of selection commands used be C={c1, c2, c3}, where

[0173] c1 is a structural selection command with parameters:

[0174] type—the type of structural object to select; values can be HTML tag set

[0175] instance—the index of occurrence of the type of structure

[0176] inclusion—governs if the identified content is included or excluded

[0177] c2 is a pattern matching selection command that positions the begin and end marker with parameters:

[0178] begin marker string—the text string to be located

[0179] begin marker instance—the index of occurrence of the text string

[0180] begin marker inclusion—governs if the identified content is included or excluded

[0181] end marker string—the text string to be located

[0182] end marker instance—the index of occurrence of the text string

[0183] end marker inclusion—governs if the identified content is included or excluded

[0184] c3 is both a structural and pattern matching selection command that finds a structure that contains a certain string. Its parameters are:

[0185] type—the type of structural object to select; values can be HTML tag set

[0186] instance—the index of occurrence of the type of structure

[0187] string—the text string contained in the structural object

[0188] inclusion—governs if the identified content is included or excluded

[0189] Now that the system has been defined, selection envelopes s1, s2, s3, and s4 are specified per the process illustrated in FIG. 13A and FIG. 13B. Relevant benefits of the invention will also be pointed out.

[0190] A. Selection Envelope s1

[0191] This selection envelope example illustrates the ability to directly identify a structural object within a document by its position or sequential index with reference to a parent selection envelope. Referring to the process seen in FIG. 13A, step 2004 is to define the source information for envelope specification; the source is document Y. For step 2005, a selection command ck1 is to be selected from the set of functions C defined above and then parameterized.

[0192] This calls steps 2016 and 2017 of the process in FIG. 13B. Given that document Y is structured, step 2016 allows structural, pattern-based, or any combination of selection commands c1, c2, or c3 to be used. For the purposes of the example, the desired content x1, which is the first table in the source file Y, is deemed to be reliably extractable by immediately using a single structural selection command c1. Thus for step 2017, a structural selection command c1 is chosen and parameterized as follows:

[0193] type=table

[0194] instance=1

[0195] inclusion=true

[0196] Thus, the output in step 2018 is c1 such that

[0197] c1 defines a resulting selection envelope, s1. This is represented as:

s1=f(c1)

[0198]  which is equivalent to equation (5) above. Stated another way,

s1=c1(Y)=x1

[0199]  where x1=the first instance of a table in document Y.

[0200] As shown in FIG. 18, this command places begin marker 3012 and end marker 3013 so that they immediately surround the HTML table structure of the first table. As the desired content has been selected, the answer for step 2001 is ‘yes’ and the selected content x1 is available for use in Y′.

[0201] B. Selection Envelope s2

[0202] To further elaborate on the use of selection commands, the second table of document Y will be selected for use in Y′. This again illustrates the use of the position or sequential index of an object within a parent selection envelope.

[0203] Again utilizing the process seen in FIG. 13A, step 2004 is to define the source information Y. The desired content x2, the second table in the source file Y, is deemed to be reliably extractable by immediately using a single structural selection command c1. Thus, for step 2005, selection command c1 is selected and parameterized as follows:

[0204] type=table

[0205] instance=2

[0206] inclusion=true

[0207] Thus, the output of c1 is such that

s2=f(c1)

[0208] which is equivalent to equation (5) above. Stated another way,

s2=c1(Y)=x2

[0209] where x2=the second instance of a table in document Y.

[0210] This selects the second table, as shown in FIG. 19. As the desired content has been selected, the answer for step 2001 is ‘yes’ and the selected content x2 is available for use in Y′.
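
The parameterizations used for s1 and s2 can be sketched as a single Python function. This is a simplified illustration, not the method as implemented: it assumes the source is well-formed XHTML so that xml.etree.ElementTree can parse it, and it assumes one possible reading of the inclusion parameter, namely that inclusion=true keeps the enclosing tags while inclusion=false returns only the inner content.

```python
import xml.etree.ElementTree as ET


def c1(source, type_, instance, inclusion=True):
    """Structural selection: the `instance`-th element of tag `type_`."""
    root = ET.fromstring(source)
    element = list(root.iter(type_))[instance - 1]
    if inclusion:
        element.tail = None  # drop text that follows the closing tag
        return ET.tostring(element, encoding="unicode")
    # Assumed meaning of inclusion=false: inner content only.
    inner = element.text or ""
    inner += "".join(ET.tostring(child, encoding="unicode") for child in element)
    return inner


# Hypothetical stand-in for source document Y.
Y = (
    "<html><body>"
    "<table><tr><td>A</td></tr></table>"
    "<p>x</p>"
    "<table><tr><td>B</td></tr></table>"
    "</body></html>"
)

x1 = c1(Y, "table", 1)  # type=table, instance=1, inclusion=true
x2 = c1(Y, "table", 2)  # type=table, instance=2, inclusion=true
print(x1)
print(x2)
```

Real-world HTML is often not well-formed XML, so a production system would likely need a more tolerant parser than the one assumed here.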

[0211] C. Selection Envelope s3

[0212] This next selection envelope example illustrates the ability to use multiple selection commands in series to define a selection envelope for a source document that may change in structure or content. This example also illustrates that different types of selection commands can be specified within the same selection envelope as necessary. Utilizing the process seen in FIG. 13A, step 2004 is to define the source information for envelope specification; in this case, Y.

[0213] To develop the desired selection command, the process of FIG. 13B is followed. Step 2016 dictates that either structural, pattern-based or any combination of selection commands c1, c2, or c3 can be used. For the purposes of the example, the desired content x3, which is the first paragraph after the string “Section Title” in the source file Y, needs two selection commands for reliability, given that source document Y may change. For step 2017, the first selection command is determined to be a pattern-based selection command 2015, as seen in FIG. 13B. Command c2 is chosen and parameterized as follows:

[0214] begin marker string=“Section Title”

[0215] begin marker instance=1

[0216] begin marker inclusion=true

[0217] end marker string=end of document

[0218] end marker instance=n/a

[0219] end marker inclusion=n/a

[0220] Thus, the output c2 is such that:

s31=f(c2)

[0221] which is equivalent to equation (5) above. Stated another way,

s31=c2(Y)=x31

[0222] where x31 can be seen in FIG. 20.

[0223] Referring to step 2001, the desired content has not yet been selected, thus necessitating the definition of another selection envelope. For this second selection envelope, the source in step 2004 is document Y and x31. For step 2005, selection command ck1 has not yet been chosen. To determine c, the process of FIG. 13B is again followed. Step 2016 dictates that either structural, pattern-based or any combination of commands c1, c2, or c3 can be used.

[0224] For step 2017, the selection command for this second envelope is determined to be a structural selection command 2013, as seen in FIG. 13B. Command c1 is parameterized as follows:

[0225] type=paragraph

[0226] instance=1

[0227] inclusion=true

[0228] Thus, c1 is such that:

s32=f(c1)

[0229] which is equivalent to equation (5) above. Stated another way, s32=c1(x31)⊙s31=x32, where x32 can be seen in FIG. 21, in which the begin marker 3003 and end marker 3004 surround the first HTML paragraph in the parent envelope. This is the desired selection x32. Furthermore, according to step 2002 in FIG. 13A, no further selection envelopes need to be defined.

[0230] The robustness of selection envelope s3 is illustrated by showing that it still correctly extracts the desired content from an altered source document. The original HTML source document is shown in FIGS. 16 and 17. The altered HTML source document 3007 is shown in FIG. 22. Specifically, a paragraph 3008, horizontal rule 3009 and table 3010 have been added. The string “Section Title” now resides within table 3010. While these alterations have been made to the source page, the selection command defined for s31 still successfully positions the begin marker 3011 and end marker 3012, for the first selection envelope. Similarly, the selection command defined for s32 successfully positions the begin marker 3013 and end marker 3014, for the second selection envelope.
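
The two-envelope selection for s3, and its behavior on an altered document, can be sketched as follows. The helper names and both sample documents are hypothetical, and the structural step is approximated with a regular expression over plain <p> tags, since the content of the first envelope is a fragment rather than a complete, parseable document.

```python
import re


def select_from_marker(source, marker, instance=1):
    """First envelope: from the nth occurrence of `marker` (included)
    to the end of the document."""
    pos = -1
    for _ in range(instance):
        pos = source.find(marker, pos + 1)
        if pos == -1:
            raise ValueError(f"fewer than {instance} occurrences of {marker!r}")
    return source[pos:]


def first_paragraph(fragment):
    """Second envelope: the first <p>...</p> inside the parent envelope."""
    match = re.search(r"<p>.*?</p>", fragment, flags=re.DOTALL)
    if match is None:
        raise ValueError("no paragraph in parent envelope")
    return match.group(0)


def s3(document):
    x31 = select_from_marker(document, "Section Title")
    return first_paragraph(x31)


ORIGINAL = "<h1>Section Title</h1><p>Wanted text.</p><p>Other.</p>"
# Altered page: a paragraph, a rule and a table were added, and the
# string "Section Title" now sits inside the table.
ALTERED = (
    "<p>New intro.</p><hr/>"
    "<table><tr><td>Section Title</td></tr></table>"
    "<p>Wanted text.</p><p>Other.</p>"
)

print(s3(ORIGINAL))
print(s3(ALTERED))
# A purely positional "first paragraph" command picks the new paragraph.
print(first_paragraph(ALTERED))
```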

[0231] D. Selection Envelope s4

[0232] This selection envelope example illustrates the use of a command that combines structural and pattern-based commands. Yet again, the process of FIG. 13A is used. Step 2004 defines the source information for envelope specification; in this case, the source is document Y, as shown in FIG. 17. For step 2005, a selection command ck1 is to be selected from the set of functions C defined above and then parameterized.

[0233] In order to do this, steps 2016 and 2017 of the process in FIG. 13B are used. Given that document Y is structured, step 2016 of the process seen in FIG. 13B allows either structural, pattern-based or any combination of commands c1, c2, or c3 to be used. For the purposes of the example, the desired content x4 is deemed to be reliably extractable by immediately using a selection command c3. Command c3 combines structural and pattern-based commands using programmatic constructs. Thus for step 2017, both a structural/contextual selection command 2013 and a pattern-based selection command 2015 are selected. The selection command c3 is parameterized as follows:

[0234] type=row

[0235] instance=1

[0236] string=“Row1”

[0237] inclusion=true

[0238] Thus, c3 is such that

[0239] c3 defines a resulting selection envelope, s4 such that:

s4=f(c3)

[0240]  which is equivalent to equation (5) above. Stated another way,

s4=c3(Y)=x4

[0241]  where x4 can be seen in FIG. 23. As the desired content has been selected, the answer for step 2001 is ‘yes’ and the selected content x4 is available for use in Y′.

[0242] As shown in FIG. 23, the begin marker 3005 is placed before the opening structural tag for a table row <tr>, and end marker 3006 is placed immediately after the closing tag </tr> for the same table row. This is the selected and outputted content x4 equivalent to item 1551 in FIG. 13A. As specified by the command, the table row contains a cell with the text “Row1” inside it.
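
A combined command such as c3 can be sketched in Python as below. The sketch assumes well-formed XHTML input, maps the parameter type=row to the HTML tag tr, and omits the inclusion parameter for brevity (the enclosing tags are always kept); the sample table is hypothetical.

```python
import xml.etree.ElementTree as ET


def c3(source, type_, instance, string):
    """Select the `instance`-th element of tag `type_` whose text
    contains `string`, returned with its enclosing tags."""
    root = ET.fromstring(source)
    # Structural step filtered by the pattern step.
    matches = [e for e in root.iter(type_) if string in "".join(e.itertext())]
    element = matches[instance - 1]
    element.tail = None  # drop text that follows the closing tag
    return ET.tostring(element, encoding="unicode")


# Hypothetical stand-in for the table in source document Y.
Y = (
    "<table>"
    "<tr><td>Header</td></tr>"
    "<tr><td>Row1</td><td>10</td></tr>"
    "<tr><td>Row2</td><td>20</td></tr>"
    "</table>"
)

x4 = c3(Y, "tr", 1, "Row1")  # type=row, instance=1, string="Row1"
print(x4)
```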

[0243] V. An Example of the Operation of Method 2000 Using an Unstructured Document

[0244] The present invention can also be applied to non-structured documents. The following is an example of the use of the invention to extract content from a non-structured document as can be seen in FIG. 24. The source document Y 4000 is a news story. The desired content from the document 4000 consists of only one selection: the first three paragraphs.

[0245] The following example will be explained in reference to the extraction process illustrated in FIG. 3 and the equations in Section I above.

[0246] The source domain Y for the system consists of document 4000. An extraction set E can be immediately applied to document 4000, as a transformer T1 is not required to transform the source into text. E is defined in order to produce the desired data set X. In this case,

X={x1}

[0247] where x1 is the complete set of extracted data from document 4000.

[0248] The data set x1 possesses one member element, a string containing the first three paragraphs of the news story in document 4000.

[0249] In order to extract set x1, a selection envelope s1 must be applied to document 4000.

[0250] From equation (5), it follows that s1=f(C1) where C1 is a subset of all the selection commands in the current domain C.

[0251] Let C={c1}, the total set of selection commands used, where

[0252] c1 is a pattern matching selection command that positions the begin and end marker with parameters:

[0253] begin marker string—the text string to be located

[0254] begin marker instance—the index of occurrence of the text string, and

[0255] begin marker inclusion—governs if the identified content is included or excluded

[0256] end marker string—the text string to be located

[0257] end marker instance—the index of occurrence of the text string, and

[0258] end marker inclusion—governs if the identified content is included or excluded

[0259] Utilizing the process seen in FIG. 13A, step 2004 is to define the source information for envelope specification; in this case Yk 1151 is document Y 4000. For step 2005, a selection command ck1 1651 is to be selected from the set of functions C defined above and then parameterized.

[0260] In order to do this, steps 1 through 3 of the process in FIG. 13B are run through. Given that document Y 4000 is unstructured, step 2011 of the process shown in FIG. 13B calls for a pattern-based selection command. For the purposes of the example, the desired content x1, which is the first three paragraphs of the news story in document 4000, is deemed to be reliably extractable by immediately using a single pattern matching selection command c1. Thus, for step 2015, the pattern-based selection command is selected. This selection command is chosen to be c1 and parameterized as follows:

[0261] begin marker string=“-”

[0262] begin marker instance=1

[0263] begin marker inclusion=false

[0264] end marker string=“.¶”

[0265] end marker instance=3

[0266] end marker inclusion=true

[0267] This allows for step 2018 which defines ckn 1652 equal to c1 such that

[0268] c1 defines a resulting selection envelope, s1 such that:

[0269] s1=f(c) where c={c1}

[0270]  which is equivalent to equation (5) above. Stated another way,

s1=c1(Y)=x1

[0271]  where x1 can be seen in FIG. 25.

[0272] As seen in FIG. 25, c1 places the begin marker 4002 after the em dash 4001. c1 also places the end marker 4003 after the carriage return following the third paragraph. The selected content is x1.

[0273] Thus, after applying one selection command described above to system Y, the selection function f(C1) yields the desired data set x1. This data may now be passed to transformer T2 to be converted to a format appropriate for any target domain Y′.
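
The marker-based selection in this example can be sketched in Python as follows. The function names and the sample story are hypothetical. The sketch assumes that the paragraph mark in the end marker string corresponds to a newline character, that the begin marker is the hyphen shown in the parameter list, and that the end marker instance is counted from the begin marker rather than from the start of the document.

```python
def find_nth(text, needle, n, start=0):
    """Offset of the nth occurrence of `needle` at or after `start`."""
    pos = start - 1
    for _ in range(n):
        pos = text.find(needle, pos + 1)
        if pos == -1:
            raise ValueError(f"fewer than {n} occurrences of {needle!r}")
    return pos


def select_between(text, begin, begin_instance, begin_inclusion,
                   end, end_instance, end_inclusion):
    """Place the begin and end markers and return the enclosed text."""
    b = find_nth(text, begin, begin_instance)
    begin_marker = b if begin_inclusion else b + len(begin)
    e = find_nth(text, end, end_instance, start=begin_marker)
    end_marker = e + len(end) if end_inclusion else e
    return text[begin_marker:end_marker]


# Hypothetical stand-in for the news story in document 4000.
STORY = (
    "SAN JOSE - First paragraph.\n"
    "Second paragraph.\n"
    "Third paragraph.\n"
    "Fourth paragraph.\n"
)

x1 = select_between(
    STORY,
    begin="-", begin_instance=1, begin_inclusion=False,
    end=".\n", end_instance=3, end_inclusion=True,
)
print(x1)
```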

[0274] It should be understood that the inventions described herein are provided by way of example only and that numerous changes, alterations, modifications, and substitutions may be made without departing from the spirit and scope of the inventions as delineated within the following claims.

Claims

1) A method for extracting content from a document, comprising the steps of:

creating at least one selection envelope based upon a plurality of selection commands for locating specific content within said document; and
selecting content from said document based upon said at least one selection envelope.
Patent History
Publication number: 20020184188
Type: Application
Filed: Jan 22, 2002
Publication Date: Dec 5, 2002
Inventors: Srinivas Mandyam (San Jose, CA), Krishna Vedati (Sunnyvale, CA), Winston Wang (San Francisco, CA), Cynthia Kuo (Mountain View, CA), Janak Bhalodia (Mountain View, CA)
Application Number: 10056300
Classifications
Current U.S. Class: 707/1
International Classification: G06F007/00;