Automated generation of text analysis systems
A system, method, and computer program for automatically generating text analysis systems are disclosed. Individual passes of a multi-pass text analyzer are created by generating rules from samples supplied by users. Successive passes are created in a cascading fashion by performing partial text analyses employing existing passes. A complete text analyzer interleaves the generated passes with a framework of existing passes. The complete text analysis system can then process texts to identify patterns similar to samples added by users. Generation of rules from samples encompasses a wide range of constructs and granularities that occur in text, from individual words to intrasentential patterns, to sentential, paragraph, section, and other formats that occur in text documents.
Text analysis is an area of computer science that focuses on processing text to extract information through pattern recognition. The decade of the 1990s has seen an unprecedented explosion in work on learning methods for text analysis. Prior text analysis methods rely on unsupervised learning, where the system is responsible for teasing generalizations from texts or samples. One such system is the HASTEN system, described in “SRA: Description of the SRA System as Used for MUC-6,” Krupka, George R., pp. 221-235, Proceedings Sixth Message Understanding Conference (MUC-6), November 1995 (referred to herein as Krupka). Krupka teaches a system for grouping text samples supplied and labeled by users and creating data structures called e-graphs. The system in Krupka then uses a similarity metric to decide if portions of an input text are related to e-graphs that have been created. It applies these collections of e-graphs, called collectors, as sequential processing phases, in order to match each sample set to the input text. Generalization of the elements of e-graphs is performed manually by the developer. There is no notion of generating grammar rules from e-graphs. The work does not establish a method for converting the collectors to rule-based passes of a text analyzer. The work does not describe a way to automatically generate substantial portions of a text analyzer. The system in Krupka requires a large amount of user interaction to perform tasks manually beyond adding and labeling samples, and was applied specifically to create an event level pattern for MUC text analysis. However, Krupka's system does not teach a general and fully automated text analyzer capability.
Another text analysis system is disclosed in Huffman (U.S. Pat. Nos. 5,796,926 and 5,841,895). The Huffman patents deal with text extraction at the event level and teach methods for locating potential event patterns of interest. In essence, Huffman teaches a rigid, inflexible method of searching for specific patterns such as “actor acts on object.”
There is a need for a system that automatically generates text analysis systems from minimal training samples, that retains sufficient intelligence to recognize patterns beyond those described by the training samples, and that is sufficiently flexible to adapt to a variety of applications.
SUMMARY OF THE INVENTION
An embodiment of the present invention includes a generator program 106 that utilizes a hierarchy of user-supplied samples and a text analyzer framework to create complete text analyzer programs. The hierarchy and framework are related in that the top-level concepts of the hierarchy are associated with stubs, or empty regions of passes, in the text analyzer framework. The generator program fills these stub regions with text analyzer passes generated from samples in the hierarchy. A user guides the conversion of the samples to generalized rules for recognizing not only the given samples, but also related patterns that are processed at a later time. Users may supply additional samples in order to process novel patterns that were not anticipated when the initial text analyzer was created. When a text analysis system according to the present invention fails to identify a pattern, a user can simply highlight the unrecognized sample in text and label its components, if necessary, to enable the generator to create a new text analyzer that now recognizes the new sample and related samples processed at a later time. Rather than using a similarity metric, an embodiment of the present invention applies rules that have been automatically generated from samples.
BRIEF DESCRIPTION OF THE DRAWINGS
Directing attention to the drawings,
Generator program 106 produces text analyzer programs by generating rules from samples supplied by users to create individual passes of a multi-pass text analyzer. A sample is a piece of text that users have decided is a unit of interest, such as a name or idiomatic phrase. A sample hierarchy is an index for storing all user-added samples. A rule is a representation for a pattern of interest, which may include associated actions to ensure that the pattern has matched correctly and to record the match in the parse tree. A rule typically associates a concept with a pattern or phrase. When the pattern matches a list of nodes, the matched nodes of the parse tree are condensed or reduced to a node associated with the concept.
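The reduction of matched nodes to a concept node, as described above, can be sketched in Python. This is an illustrative sketch only and not the actual implementation of generator program 106: the names Node, Rule, and apply_rule are hypothetical, and a pattern is assumed to match by exact node name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A parse-tree node: a concept name (or literal token) plus its children."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    """Rewrites a sequence of node names (pattern) to a single concept node."""
    concept: str
    pattern: List[str]


def apply_rule(nodes: List[Node], rule: Rule) -> List[Node]:
    """Scan left to right and reduce every match of the pattern to one node."""
    out: List[Node] = []
    n = len(rule.pattern)
    i = 0
    while i < len(nodes):
        window = nodes[i:i + n]
        if n > 0 and [w.name for w in window] == rule.pattern:
            # The matched nodes become children of the new concept node.
            out.append(Node(rule.concept, window))
            i += n
        else:
            out.append(nodes[i])
            i += 1
    return out


tokens = [Node(t) for t in ["497", "-", "5318"]]
reduced = apply_rule(tokens, Rule("_phone", ["497", "-", "5318"]))
```

After the call, the three token nodes are condensed under a single _phone node, which later rules can then match as one element.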
As used herein, a pass is one step of a multi-step analyzer, in which the generator program 106 traverses a parse tree to execute a set of rules associated with the pass. As used herein, a parse tree is a tree data structure constructed and maintained by the generator program 106 to organize text and all the patterns that have been recognized within the text. Successive passes are created in a cascading fashion by performing partial text analyses employing existing passes. The resulting text analyzer program interleaves the generated passes with a framework of existing passes. The complete text analysis system can then process text to identify patterns similar to samples added by users. Generation of rules from samples encompasses a wide range of constructs and granularities that occur in text, from individual words to intrasentential patterns (such as a grammar), to sentential, paragraph, section, and other formats that occur in text documents.
To exemplify the methods and data structures of the present invention, we use simple telephone number patterns such as
497-5318
(949) 497-5318
Home: (949) 497-5318 (1)
Given the Sample Input Text:
Home: (949) 497-5318 (2)
The output pass displays the parse tree 300 as illustrated in
Text analyzer program 200 has no knowledge of telephone number patterns. If a user wants phone numbers to be grouped under a concept called phone, a sample hierarchy as shown in
Generator program 106 can be invoked to generate a new analyzer by executing the sequence of steps 500 illustrated in the
497\-5318 (3)
It therefore generates [step 516] the raw rule based precisely on what is represented in the parse tree, as follows:
_phone<-497\-5318 @@ (4)
The underscore before phone indicates that this is a non-literal concept. The <- arrow indicates a rewrite of the phrase to the right with the concept to the left. The @@ marker denotes the end of the rule. The backslash preceding the dash means that this dash is to be taken literally, rather than being part of the rule language. At this point, the generator program 106 can attach labeling information to the first element (“497”) and the last element (“5318”) of the phrase, as prefix and suffix, as follows:
_phone<-497 [label=_prefix]\-5318 [label=_suffix] @@ (5)
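The construction of a raw rule from a sample, with optional labels, can be sketched as follows. This is a minimal sketch under stated assumptions: the tokenizer, the function names tokenize and raw_rule, and the spacing of the emitted rule text are all hypothetical, and every non-alphanumeric token is assumed to be escaped with a backslash.

```python
import re
from typing import Dict, List, Optional

# Assumed tokenization: runs of letters, runs of digits, single whitespace
# characters, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s|[^\sA-Za-z0-9]")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def raw_rule(concept: str, sample: str,
             labels: Optional[Dict[int, str]] = None) -> str:
    """Build a raw rule that matches the sample exactly as written.

    labels maps a token position to the label attached to that element.
    """
    labels = labels or {}
    parts = []
    for i, tok in enumerate(tokenize(sample)):
        # Alphanumeric tokens are literal; anything else is backslash-escaped
        # so it is not read as part of the rule language.
        element = tok if tok.isalnum() else "\\" + tok
        if i in labels:
            element += f" [label={labels[i]}]"
        parts.append(element)
    return f"{concept} <- {' '.join(parts)} @@"


plain = raw_rule("_phone", "497-5318")
labeled = raw_rule("_phone", "497-5318", {0: "_prefix", 2: "_suffix"})
```

The sketch separates elements with spaces for readability; the rules shown in this description are printed without that spacing.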
Since there are no other samples (decision step 518) under the phone concept, the generator program 106 has no opportunity to merge and compare samples. Having finished with the samples under this rule concept, the generator program 106 at step 526 creates a new pass called phone for the rule set it has generated (consisting of one rule in this case). The generator program 106 then adds the new pass to the analyzer sequence (step 528), as shown in
Had the phone concept not been in this sample hierarchy, the generator program 106 would have built the rule
_completePhone<-\(949 [label=areaCode]\)\ 497\-5318 @@ (6)
But because the phone sample is also present and the generator program 106 has installed the phone pass within the analyzer, the generator program 106 is given parse tree 550 (
_completePhone<-\(949 [label=areaCode]\)\ _phone @@ (7)
The product of the prior automatically-generated pass is used in building the rules for the current pass called completePhone. The generator program 106 has now built an analyzer for phone numbers that follows the passes illustrated in
Generator program 106 automatically creates the passes of a text analyzer in stepwise fashion, each time using the sequence of passes constructed so far in order to create the next pass of the analyzer. It adds each new pass to a backbone of manually built and previously generated passes.
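The cascading construction of passes can be sketched as follows. This is a simplified illustration, assuming that a rule is a (concept, pattern) pair matched by exact node name and that a parse tree can be stood in for by a flat list of node names; run_pass, partial_analysis, and generate_pass are hypothetical names.

```python
from typing import List, Tuple

Rule = Tuple[str, List[str]]  # (concept, pattern of node names)


def run_pass(names: List[str], rules: List[Rule]) -> List[str]:
    """Apply one pass: reduce each matched pattern to its concept name."""
    out: List[str] = []
    i = 0
    while i < len(names):
        for concept, pattern in rules:
            if pattern and names[i:i + len(pattern)] == pattern:
                out.append(concept)
                i += len(pattern)
                break
        else:
            # No rule matched at this position; keep the node as it is.
            out.append(names[i])
            i += 1
    return out


def partial_analysis(names: List[str], passes: List[List[Rule]]) -> List[str]:
    """Run the sample text through every pass built so far."""
    for rules in passes:
        names = run_pass(names, rules)
    return names


def generate_pass(concept: str, sample: List[str],
                  passes: List[List[Rule]]) -> List[Rule]:
    """Generate the next pass from what the existing passes produce."""
    return [(concept, partial_analysis(sample, passes))]


passes: List[List[Rule]] = []
passes.append(generate_pass("_phone", ["497", "-", "5318"], passes))
passes.append(generate_pass(
    "_completePhone", ["(", "949", ")", " ", "497", "-", "5318"], passes))
```

Because the phone pass already exists when the second sample is analyzed, the rule generated for _completePhone ends in _phone rather than in the literal digits.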
The discussion above describes the generation of one pass per rule concept. Additional modes, specified by the user who constructs the sample hierarchy, enable the rules generated for multiple rule concepts to be merged into a single large pass (step 524), in order to both optimize performance and to enable more sophisticated rule generation that identifies and unifies ambiguous constructs. For example, if “New York” is listed under a rule concept city and a rule concept state, then a unified treatment of these rule concepts can enable the generation of rules such as:
_city [label=state]<-New York @@ (8)
which condenses instances of “New York” to both a city concept and a state concept in a parse tree.
Optimizations
Executing the generator program 106 can be computationally expensive, because each sample in the sample hierarchy requires the text containing it to be partially analyzed, in order to generate the rule corresponding to the sample. Generator program 106 can be modified to keep track of instances where multiple samples under a rule concept derive from the same text. In those cases, the given text need be partially analyzed only once, in order to glean the raw rules for all the samples that derived from that text.
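This optimization can be sketched as a grouping step. The sketch below is illustrative only: Sample and raw_rules_by_text are hypothetical names, and the analyze callback is a stand-in for the partial text analysis, assumed here simply to return the text of the named source.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Sample(NamedTuple):
    concept: str
    source: str  # the text (file) the sample was highlighted in
    start: int   # offset of the sample within that text
    end: int


def raw_rules_by_text(samples: List[Sample],
                      analyze: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Analyze each source text once and reuse it for all of its samples."""
    by_source: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_source[s.source].append(s)

    rules: List[Tuple[str, str]] = []
    for source, group in by_source.items():
        text = analyze(source)  # one partial analysis per text
        for s in group:
            rules.append((s.concept, text[s.start:s.end]))
    return rules
```

With three samples drawn from two texts, analyze is called twice rather than three times.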
In a preferred embodiment, further optimization may be achieved when generator program 106 places user-added samples into a single sample file. Thus, each rule file has an associated sample file. The sample file may be stored in memory 104, disk drive 114, or CD-ROM 116. In this way the number of partial text analyses is reduced for a sample hierarchy with many samples. A further optimization is to regenerate a pass only when its complement of samples has changed. While there is a danger that some subsequent pass may not be updated correctly due to dependencies on the current pass, most of the time this method of generation (“generate dirty”) is adequate for rapid development and testing. Occasionally, a “generate all” function may be invoked to rebuild every pass, ensuring that all passes that need updating are updated.
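The distinction between regenerating only changed passes and regenerating all passes can be sketched as follows. This is a minimal sketch, assuming that a pass is rebuilt from the list of samples under one rule concept; PassCache and its methods are hypothetical names, and dependencies between passes are deliberately not tracked, which mirrors the danger noted above.

```python
from typing import Callable, Dict, List, Tuple


class PassCache:
    """Remembers which samples each pass was last generated from."""

    def __init__(self, build: Callable[[str, List[str]], List[str]]) -> None:
        self._build = build
        self._seen: Dict[str, Tuple[str, ...]] = {}
        self.passes: Dict[str, List[str]] = {}

    def generate_dirty(self, hierarchy: Dict[str, List[str]]) -> List[str]:
        """Rebuild only the passes whose samples changed; return their names."""
        rebuilt = []
        for concept, samples in hierarchy.items():
            key = tuple(samples)
            if self._seen.get(concept) != key:
                self.passes[concept] = self._build(concept, samples)
                self._seen[concept] = key
                rebuilt.append(concept)
        return rebuilt

    def generate_all(self, hierarchy: Dict[str, List[str]]) -> List[str]:
        """Forget what was seen, so that every pass is rebuilt."""
        self._seen.clear()
        return self.generate_dirty(hierarchy)
```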
Rule Generalization and Merging
A preferred embodiment of the present invention also has the capability to generalize and merge raw rules generated directly from samples as illustrated in
At step 560, for each raw rule generated (one per sample), the generator program 106 creates a general rule by iteratively generalizing each element of the raw rule. For example, “497” will be generalized to NUMBER, “Home” will be generalized to ALPHABETIC, “-” to PUNCTUATION, and “ ” to WHITESPACE. At step 562, generator program 106 merges general rules that have identical elements and length. The general rule for “497-5318” will be identical to that for “555-1212,” namely
_phone<-_NUMBER _PUNCTUATION _NUMBER @@ (14)
Therefore the rules for the two samples are merged under this general rule. The general rule retains a list of all the raw rules that gave rise to it. At step 564, generator program 106 traverses the general rules to build the split rules. The split rules require that all raw rules have consistent labeling. So a split rule may appear:
_phone<-_NUMBER [label=_prefix] _PUNCTUATION _NUMBER [label=_suffix] @@ (15)
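Steps 560 and 562 can be sketched as follows. This is an illustrative sketch under stated assumptions: the tokenizer, the four token classes written with a leading underscore, and the names generalize_token, general_rule, and merge_general are hypothetical, and labeling (which drives the split rules of step 564) is left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed tokenization: letter runs, digit runs, whitespace runs, and
# single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|[^\sA-Za-z0-9]")


def generalize_token(tok: str) -> str:
    """Map a literal token to its general class (step 560)."""
    if tok.isdigit():
        return "_NUMBER"
    if tok.isalpha():
        return "_ALPHABETIC"
    if tok.isspace():
        return "_WHITESPACE"
    return "_PUNCTUATION"


def general_rule(sample: str) -> Tuple[str, ...]:
    return tuple(generalize_token(t) for t in TOKEN_RE.findall(sample))


def merge_general(samples: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Merge samples whose general rules are identical (step 562).

    Each general rule keeps the list of samples that gave rise to it.
    """
    merged: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for s in samples:
        merged[general_rule(s)].append(s)
    return dict(merged)


merged = merge_general(["497-5318", "555-1212", "Home: 497-5318"])
```

The first two samples fall under one general rule, while the third, which carries a header word, produces a separate one.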
At step 566, generator program 106 traverses the split rules to generate the constrained rules. Constrained rules are rules whose raw rules all have consistent features, such as length and capitalization.
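One such consistent feature, element length, can be sketched as follows. This is a minimal illustration: length_constraints is a hypothetical name, and the raw rules are assumed to have been merged under one general rule already, so that all token lists have the same number of elements.

```python
from typing import List, Optional


def length_constraints(raw_rules: List[List[str]]) -> List[Optional[int]]:
    """For each element position, return the shared token length,
    or None when the raw rules disagree on it."""
    constraints: List[Optional[int]] = []
    for column in zip(*raw_rules):
        lengths = {len(tok) for tok in column}
        constraints.append(lengths.pop() if len(lengths) == 1 else None)
    return constraints


consistent = length_constraints([["497", "-", "5318"], ["555", "-", "1212"]])
mixed = length_constraints([["497", "-", "5318"], ["49", "-", "1212"]])
```

A position whose result is None would be left unconstrained in the generated rule; capitalization could be checked column by column in the same way.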
A constrained rule may appear:
The above rule constrains the first number to have three digits and the second number to have four digits. At step 568, generator program 106 creates a literal rule for every raw rule. The literal rule is constructed by looking “inside” each element of the phrase as deeply as can be seen in the parse tree. For example, if a raw rule appears
_phone<-_LIST (NUMBER 497)\-_LIST (NUMBER 5318) @@ (17)
the literal rule produced is
_phone<-497\-5318 @@ (18)
At step 570, generator program 106 creates optional rules by comparing the composition of general rules that differ by one element. If that element is not a labeled element, then the two general rules can be merged, with the difference element marked as optional.
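The creation of optional rules at step 570 can be sketched as follows. This sketch assumes that “differ by one element” means that one general rule equals the other with a single extra element; merge_optional is a hypothetical name, and labeled holds the positions of labeled elements in the longer rule.

```python
from typing import FrozenSet, List, Optional, Tuple


def merge_optional(a: Tuple[str, ...], b: Tuple[str, ...],
                   labeled: FrozenSet[int] = frozenset()
                   ) -> Optional[List[Tuple[str, bool]]]:
    """Merge two general rules that differ by one unlabeled element.

    Returns the longer rule as (element, is_optional) pairs, or None
    when the two rules cannot be merged this way.
    """
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    if len(longer) - len(shorter) != 1:
        return None
    for i in range(len(longer)):
        # Removing element i from the longer rule must yield the shorter one.
        if longer[:i] + longer[i + 1:] == shorter:
            if i in labeled:
                return None  # a labeled element may not become optional
            return [(elt, j == i) for j, elt in enumerate(longer)]
    return None


with_space = ("_PUNCTUATION", "_NUMBER", "_PUNCTUATION", "_WHITESPACE",
              "_NUMBER", "_PUNCTUATION", "_NUMBER")
without_space = ("_PUNCTUATION", "_NUMBER", "_PUNCTUATION",
                 "_NUMBER", "_PUNCTUATION", "_NUMBER")
merged_rule = merge_optional(with_space, without_space)
```

In this usage the two rules stand for samples such as “(949) 497-5318” and “(949)497-5318”, and the merged rule marks only the whitespace element as optional.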
By embellishing a sample hierarchy with particular attributes, the manner in which rules are generated is controlled. The need to collect large sample sets in order to calculate statistically plausible generalizations is eliminated. Attributes may be specified to indicate what is to be generalized, what is to be collected as a list, and what is to be retained literally. For example, one attribute may instruct the generator program 106 to always generalize whitespace to a rule element that allows an arbitrary number of space characters. Another attribute may designate a label concept as “closed,” meaning any samples within it are to be collected into a list of only those samples, with no generalization. Other flags control the rule sets to be retained for the pass being generated. If the “constrain” flag is set to “true,” then the constrained list of rules is retained by the generator program 106. Retaining a rule set involves placing it into the final list of rules for the pass under construction. An enhancement to the sample hierarchy is to enable the described attributes to control the way rules are generated for an entire subtree. If some concept within that subtree changes an attribute's value, then that new value controls its subtree, and so on recursively.
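The recursive control of attributes over a subtree can be sketched as a lookup that walks from a concept toward the root. This is an illustrative sketch only: Concept and effective_attr are hypothetical names, and attributes are assumed to be true/false flags with a default of false.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Concept:
    """A concept in the sample hierarchy with locally set attributes."""
    name: str
    parent: Optional["Concept"] = None
    attrs: Dict[str, bool] = field(default_factory=dict)


def effective_attr(concept: Concept, key: str, default: bool = False) -> bool:
    """Return the value set on the nearest ancestor (or on the concept itself)."""
    node: Optional[Concept] = concept
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.parent
    return default


gram = Concept("gram", attrs={"constrain": True})
phrase = Concept("Phrase", gram)
contact = Concept("Contact", phrase, {"constrain": False})
phone = Concept("phoneNumber", contact)
```

Here the value set on gram governs Phrase, while the override on Contact governs phoneNumber and anything else beneath it.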
A nonexhaustive set of attributes may be utilized to allow a user to control the rule sets to be retained in each pass of the analyzer, as below:
The above attributes cause the generator program 106 to retain or discard the corresponding rule sets. For example, if a concept in the sample hierarchy has the constrained attribute set to true, then all the constrained rules generated in that subhierarchy will be retained as part of the final analyzer. An attribute called closed, also with true/false values, controls the way parts of samples are collected into rules. For example, given the samples
497-5318
555-1212 (20)
if the closed attribute is set to true, then the corresponding constrained phone rule appears
_phone<-_LIST (497 555)\-_LIST (5318 1212) @@ (21)
That is, each element of the pattern is a “closed set,” which collects any values found in the set of samples. If the CLOSED attribute is set to false, the constrained rule is
Because white space and punctuation are often secondary in importance, a WHITE attribute with values true/false can specify that whitespace in samples generalizes to the rule element
_WHITE [min=0 max=infinity] (23)
that is, any number of white spaces, regardless of the particular type and number of whitespace characters in the set of samples.
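The effect of setting the closed attribute to true, described earlier for the two phone samples, can be sketched as follows. This is a minimal sketch: closed_rule is a hypothetical name, the samples are assumed to be tokenized already and to have equal numbers of elements, and the spacing of the emitted rule text is chosen for readability.

```python
from typing import List


def closed_rule(concept: str, samples: List[List[str]]) -> str:
    """Collect, per element position, exactly the values seen in the samples."""
    elements = []
    for column in zip(*samples):
        values = list(dict.fromkeys(column))  # unique values, order preserved
        if len(values) == 1:
            tok = values[0]
            # A single shared value stays literal (escaped if not alphanumeric).
            elements.append(tok if tok.isalnum() else "\\" + tok)
        else:
            elements.append("_LIST (" + " ".join(values) + ")")
    return f"{concept} <- {' '.join(elements)} @@"


rule = closed_rule("_phone", [["497", "-", "5318"], ["555", "-", "1212"]])
```

The resulting rule accepts only combinations of the collected values, with no generalization to other numbers.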
Other attributes can control the actions that get built for the generated rules and their components. For example, a QUICKSEM attribute with values true/false generates actions for semantic information to be copied automatically when a rule matches text. In the phone number example, the QUICKSEM attribute would cause the automatic creation of a data item called “prefix” with value “497” and a second data item called “suffix” with value “5318” in the _phone node, given that the _phone rule matched a text string such as “497-5318.” The LABEL (or LAYER) attribute takes a name as its value and leads to the generation of a label action in the associated rules that get generated.
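The copying of semantic information under the QUICKSEM attribute can be sketched as follows. This is an illustrative sketch, not the generated action itself: quicksem is a hypothetical name, the node is represented as a plain dictionary, and labels are assumed to map element positions to label names.

```python
from typing import Dict, List


def quicksem(concept: str, matched: List[str], labels: Dict[int, str]) -> dict:
    """Copy each labeled element of a match into a data item on the new node."""
    data = {label.lstrip("_"): matched[i] for i, label in labels.items()}
    return {"concept": concept, "text": "".join(matched), "data": data}


node = quicksem("_phone", ["497", "-", "5318"], {0: "_prefix", 2: "_suffix"})
```

For the text string “497-5318”, the node carries a prefix item with value “497” and a suffix item with value “5318”.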
User Interface
- “concept” “gram” “LiteralPhrase” “HeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “ContactHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “ObjectiveHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “EducationHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “ExperienceHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “SkillsHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “PresentationsHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “PublicationsHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “ReferencesHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “HeaderPhrase” “OtherHeaderPhrase”
- “concept” “gram” “LiteralPhrase” “Others”
- “concept” “gram” “LiteralPhrase” “Others” “degreeInMajor”
- “concept” “gram” “LiteralPhrase” “Others” “WebLinks”
- “concept” “gram” “LiteralPhrase” “Others” “emailHeader”
- “concept” “gram” “LiteralPhrase” “Others” “minorKey”
- “concept” “gram” “LiteralPhrase” “Caps”
- “concept” “gram” “LiteralPhrase” “Caps” “cityPhrase”
- “concept” “gram” “LiteralPhrase” “Caps” “statephrase”
- “concept” “gram” “LiteralPhrase” “Caps” “companyPhrase”
- “concept” “gram” “LiteralPhrase” “Caps” “degreePhrase”
- “concept” “gram” “LiteralPhrase” “Caps” “countryPhrase”
- “concept” “gram” “LiteralPhrase” “Caps” “skillsPhrase”
- “concept” “gram” “LiteralPhrase” “Caps” “naturalLanguages”
- “concept” “gram” “LiteralPhrase” “Caps” “software”
- “concept” “gram” “LiteralPhrase” “Caps” “hardware”
- “concept” “gram” “LiteralPhrase” “Caps” “certifications”
- “concept” “gram” “LiteralPhrase” “Caps” “field”
- “concept” “gram” “LiteralPhrase” “Caps” “Thesis”
- “concept” “gram” “LiteralPhrase” “Caps” “jobTitle”
- “concept” “gram” “LiteralPhrase” “Caps” “jobphrase”
- “concept” “gram” “Word”
- “concept” “gram” “Word” “Syntax”
- “concept” “gram” “Word” “Syntax” “posPREP”
- “concept” “gram” “Word” “Syntax” “posDET”
- “concept” “gram” “Word” “Syntax” “posPRO”
- “concept” “gram” “Word” “Syntax” “posCONJ”
- “concept” “gram” “Word” “HeaderWord”
- “concept” “gram” “Word” “HeaderWord” “ContactHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “ObjectiveHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “EducationHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “ExperienceHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “SkillsHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “PresentationsHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “PublicationsHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “ReferencesHeaderWord”
- “concept” “gram” “Word” “HeaderWord” “OtherHeaderWord”
- “concept” “gram” “Word” “headerMod”
- “concept” “gram” “Word” “openPunct”
- “concept” “gram” “Word” “closePunct”
- “concept” “gram” “Word” “resumeWord”
- “concept” “gram” “Word” “Present”
- “concept” “gram” “Word” “Direction”
- “concept” “gram” “Word” “adjDirection”
- “concept” “gram” “Word” “PostalUnit”
- “concept” “gram” “Word” “PostalRoad”
- “concept” “gram” “Word” “monthWord”
- “concept” “gram” “Word” “monthNum”
- “concept” “gram” “Word” “Season”
- “concept” “gram” “Word” “PostalState”
- “concept” “gram” “Word” “jobTitleRoot”
- “concept” “gram” “Word” “jobMod”
- “concept” “gram” “Word” “companyRoot”
- “concept” “gram” “Word” “companyModroot”
- “concept” “gram” “Word” “companyMod”
- “concept” “gram” “Word” “ProgrammingLanguage”
- “concept” “gram” “Word” “cityMod”
- “concept” “gram” “Word” “cityWord”
- “concept” “gram” “Word” “Names”
- “concept” “gram” “Word” “Names” “femaleName”
- “concept” “gram” “Word” “Names” “maleName”
- “concept” “gram” “Word” “Names” “surName”
- “concept” “gram” “Word” “fieldName”
- “concept” “gram” “Word” “subOrg”
- “concept” “gram” “Word” “softwareWord”
- “concept” “gram” “Phrase”
- “concept” “gram” “Phrase” “Contact”
- “concept” “gram” “Phrase” “Contact” “humanName”
- “concept” “gram” “Phrase” “Contact” “humanName” “prefixName”
- “concept” “gram” “Phrase” “Contact” “humanName” “firstName”
- “concept” “gram” “Phrase” “Contact” “humanName” “middleName”
- “concept” “gram” “Phrase” “Contact” “humanName” “lastName”
- “concept” “gram” “Phrase” “Contact” “humanName” “suffixName”
- “concept” “gram” “Phrase” “Contact” “cityStateZip”
- “concept” “gram” “Phrase” “Contact” “cityStateZip” “cityName”
- “concept” “gram” “Phrase” “Contact” “cityStateZip” “stateName”
- “concept” “gram” “Phrase” “Contact” “cityStateZip” “zipCode”
- “concept” “gram” “Phrase” “Contact” “cityStateZip” “zipSuffix”
- “concept” “gram” “Phrase” “Contact” “cityStateZip” “country”
- “concept” “gram” “Phrase” “Contact” “cityState”
- “concept” “gram” “Phrase” “Contact” “cityState” “cityName”
- “concept” “gram” “Phrase” “Contact” “cityState” “stateName”
- “concept” “gram” “Phrase” “Contact” “phoneExtension”
- “concept” “gram” “Phrase” “Contact” “phoneExtension” “extendWord”
- “concept” “gram” “Phrase” “Contact” “phoneExtension” “extension”
- “concept” “gram” “Phrase” “Contact” “phoneNumber”
- “concept” “gram” “Phrase” “Contact” “phoneNumber” “countryCode”
- “concept” “gram” “Phrase” “Contact” “phoneNumber” “areaCode”
- “concept” “gram” “Phrase” “Contact” “phoneNumber” “prefix”
- “concept” “gram” “Phrase” “Contact” “phoneNumber” “suffix”
- “concept” “gram” “Phrase” “Contact” “phonePhrases”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneHomeFaxPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneHomeFaxPhrase” “HomeFax”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneWorkPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneWorkPhrase” “Work”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneFaxPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneFaxPhrase” “Fax”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phonePagerPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phonePagerPhrase” “Pager”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneCellPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneCellPhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneHomePhrase” “Cell”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneHomePhrase”
- “concept” “gram” “Phrase” “Contact” “phonePhrases” “phoneHomePhrase” “Home”
- “concept” “gram” “Phrase” “Contact” “unitRoom”
- “concept” “gram” “Phrase” “Contact” “unitRoom” “unit”
- “concept” “gram” “Phrase” “Contact” “unitRoom” “room”
- “concept” “gram” “Phrase” “Contact” “addressLine”
- “concept” “gram” “Phrase” “Contact” “addressLine” “streetNumber”
- “concept” “gram” “Phrase” “Contact” “addressLine” “streetName”
- “concept” “gram” “Phrase” “Contact” “addressLine” “road”
- “concept” “gram” “Phrase” “Contact” “addressLine” “direction”
- “concept” “gram” “Phrase” “Contact” “addressLine” “postdirection”
- “concept” “gram” “Phrase” “Contact” “addressLine” “POBox”
- “concept” “gram” “Phrase” “Contact” “email”
- “concept” “gram” “Phrase” “Contact” “email” “accountName”
- “concept” “gram” “Phrase” “Contact” “email” “machineName”
- “concept” “gram” “Phrase” “Contact” “email” “companyName”
- “concept” “gram” “Phrase” “Contact” “email” “domainName”
- “concept” “gram” “Phrase” “Contact” “url”
- “concept” “gram” “Phrase” “Contact” “url” “urlheader”
- “concept” “gram” “Phrase” “Contact” “url” “protocol”
- “concept” “gram” “Phrase” “Contact” “url” “machineName”
- “concept” “gram” “Phrase” “Contact” “url” “companyName”
- “concept” “gram” “Phrase” “Contact” “url” “domainName”
- “concept” “gram” “Phrase” “Contact” “url” “directory”
- “concept” “gram” “Phrase” “Contact” “Height”
- “concept” “gram” “Phrase” “Contact” “Height” “feet”
- “concept” “gram” “Phrase” “Contact” “Height” “inches”
- “concept” “gram” “Phrase” “Education”
- “concept” “gram” “Phrase” “Education” “degree”
- “concept” “gram” “Phrase” “Education” “major”
- “concept” “gram” “Phrase” “Education” “minor”
- “concept” “gram” “Phrase” “Education” “university”
- “concept” “gram” “Phrase” “Experience”
- “concept” “gram” “Phrase” “Experience” “company”
- “concept” “gram” “Phrase” “SingleDate”
- “concept” “gram” “Phrase” “SingleDate” “numA”
- “concept” “gram” “Phrase” “SingleDate” “numB”
- “concept” “gram” “Phrase” “SingleDate” “monthSD”
- “concept” “gram” “Phrase” “SingleDate” “daySD”
- “concept” “gram” “Phrase” “SingleDate” “yearSD”
- “concept” “gram” “Phrase” “SingleDate” “seasonSD”
- “concept” “gram” “Phrase” “DateRange”
- “concept” “gram” “Phrase” “DateRange” “fromDate”
- “concept” “gram” “Phrase” “DateRange” “dateSep”
- “concept” “gram” “Phrase” “DateRange” “toDate”
- “concept” “gram” “Part”
- “concept” “gram” “Part” “addressPart”
- “concept” “gram” “Part” “educationPart”
- “concept” “gram” “Part” “experiencePart”
Machinery for Adding and Managing Samples
While a command line interface may be utilized by an embodiment of the present invention, the preferred embodiment utilizes a graphical user interface (GUI) to manage the sample hierarchy. A specialized pull-down menu enables rapid highlighting and labeling of samples and their components within a text. By selecting a concept in the sample hierarchy and then highlighting a text, the highlighted text sample is placed under the sample hierarchy concept, as in
In another aspect of the user interface of the present invention, a form tool 580 (
Additional tools associated with the sample hierarchy are the Attribute Window and the Properties Window. The Properties Window 582 (
Sample manager 586 is responsible for bookkeeping to track the file that each sample originated from and the offsets of the sample and its labels within that file. The user may further associate a sample file with any concept in the sample hierarchy. If the user creates such an association, then the system creates copies of samples, their labels, and their offsets in the sample file. Sample files enable faster and more efficient generation of the text analyzer by minimizing the volume of text that must be analyzed to generate the rules for the analyzer. The sample manager 586 enables the user to perform functions such as associating a sample file, dissociating a sample file, opening the associated sample file, deleting the samples under a concept, and similar manipulations.
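The bookkeeping performed when a sample file is associated with a concept can be sketched as follows. This is a simplified illustration under stated assumptions: Label, Sample, and build_sample_file are hypothetical names, the default sample file name is invented for the example, offsets are character offsets, and the sample file is assumed to hold one sample per line.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Label:
    name: str
    start: int  # offsets are relative to the start of the containing file
    end: int


@dataclass
class Sample:
    concept: str
    source_file: str
    start: int
    end: int
    labels: List[Label] = field(default_factory=list)


def build_sample_file(samples: List[Sample], texts: Dict[str, str],
                      sample_file: str = "phone.samples"
                      ) -> Tuple[str, List[Sample]]:
    """Copy samples into one file body and rebase their offsets onto it."""
    lines: List[str] = []
    copies: List[Sample] = []
    pos = 0
    for s in samples:
        text = texts[s.source_file][s.start:s.end]
        shift = pos - s.start  # moves original offsets into the new file
        copies.append(Sample(
            s.concept, sample_file, pos, pos + len(text),
            [Label(l.name, l.start + shift, l.end + shift) for l in s.labels]))
        lines.append(text)
        pos += len(text) + 1  # account for the newline after each sample
    return "".join(line + "\n" for line in lines), copies
```

Only the short sample file then needs to be partially analyzed during generation, rather than every original text.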
The left panel 590 in
We have described a system, method, and computer readable medium for generating text analyzers from samples. The users of a text analyzer need not understand how rules are generated in order to maintain and enhance the capabilities of the text analyzer. Nonprogrammer and nonlinguist users can add samples that the text analyzer does not identify, in order to expand the processing power of the text analyzer. While the present invention has been illustrated and described in detail, it is to be understood that numerous modifications may be made to the preferred embodiment without departing from the spirit of the invention.
Claims
1. A method for generating a text analysis program for recognizing patterns appearing in text and extracting information from said patterns, the method comprising the steps of:
- (a) providing a sample hierarchy, said sample hierarchy comprising samples of text;
- (b) extracting at least one rule from said sample hierarchy, said rule describing how to process a portion of text;
- (c) generating a pass from said rule, said pass containing instructions to operate a text analyzer; and
- (d) constructing a text analyzer containing said pass.
2. The method of claim 1, wherein said rule is generalized into multiple rules and multiple passes.
3. The method of claim 1, wherein multiple passes are added to said text analyzer.
4. The method of claim 3, wherein said multiple passes are arranged in a cascading manner having a sequence of passes such that rules associated with a pass are applied to subsequent passes.
5. The method of claim 1, wherein the samples are associated with offset values, said offset values identifying locations in a parse tree data structure, said parse tree containing concepts stored at locations identified by said offsets.
6. The method of claim 4, further comprising the step of allowing a user to control the extraction of rules from the sample hierarchy.
7. The method of claim 5, further comprising the step of allowing a user to designate properties associated with said rules, said properties controlling rule generation for a portion of the sample hierarchy.
8. The method of claim 5, wherein said concepts are retrieved from said parse tree and processed to form said rule.
9. The method of claim 6, further comprising the step of allowing a user to designate attributes associated with said rules, said attributes guiding the application of said rules.
10. The method of claim 1, wherein multiple rules are generalized and merged into a single rule if there is a difference between the multiple rules.
11. The method of claim 10, wherein said samples may be contained in a sample file.
12. A sample hierarchy data structure for use in a text analyzer system, said sample hierarchy comprising an index for storing samples, said samples comprising portions of text, said samples used to generate rules for identifying patterns appearing in text, said samples used to derive information from said identified patterns, said rules generated by parsing said text samples, said index organized such that passes comprising operational steps and rules are generated in an order wherein simple patterns are recognized by said text analyzer, and said recognized simple patterns are used by said text analyzer system and used to iteratively recognize more complex patterns.
13. A computer readable medium containing instructions which, when executed by a computer, generate a text analysis program for recognizing patterns appearing in text and extracting information from said patterns, by:
- (a) providing a sample hierarchy, said sample hierarchy comprising samples of text;
- (b) extracting at least one rule from said sample hierarchy, said rule describing how to process a portion of text;
- (c) generating a pass from said rule, said pass containing instructions to operate a text analyzer; and
- (d) constructing a text analyzer containing said pass.
Type: Application
Filed: Dec 2, 2004
Publication Date: Apr 21, 2005
Inventors: Amnon Meyers (Laguna Beach, CA), David de Hilster (Long Beach, CA)
Application Number: 11/003,061