Systems And Methods For Defining And Processing Text Segmentation Rules
Computer-implemented methods and systems are provided for text segmentation of textual data. Rules are accessed that define how the input stream is to be segmented into textual data elements through pattern matching. The one or more rules are applied to the input stream to determine the textual data elements in the input stream which are then provided as output.
The technology described herein relates generally to systems and methods for processing textual data. More specifically, the technology described herein relates to performing text segmentation.
BACKGROUND
For written natural languages, it can be difficult to programmatically break phrases into meaningful elements, a process known as text segmentation. This is evident in any language and is particularly evident when trying to parse such languages as Korean, Japanese, or other Asian languages where fixed word delimiters (e.g., “white-space”) are typically not used. The written symbols of such languages represent spoken syllables, and a reader is required to understand the meaning and context of the surrounding symbols in order to derive the meaning of a given phrase. Additionally, text segmentation can pose a unique and difficult problem for natural language processing systems, because comprehending languages typically requires an extensive corpus of knowledge specific to the language being processed. This lexicon can be challenging and expensive to obtain, and it is usually massive in size.
SUMMARY
In accordance with the teachings herein, computer-implemented systems and methods are provided to process input textual data and segment such data. As an illustration, a computer-implemented method and system are provided for context-sensitive text segmentation of textual data. Rules are accessed that define how the input stream is to be segmented into textual data elements through pattern matching. The one or more rules are applied to the input stream to determine the textual data elements in the input stream which are then provided as output.
As another example, a computer-implemented method and system are provided for integrating textual data from disparate data sources in order to have data standardization with respect to the textual data. An input stream of textual data is received from one or more of the disparate data sources. The input stream of textual data is related to a predetermined category. One or more character-level rules are accessed that are related to the predetermined category and that define how the input stream is to be segmented into textual data elements through pattern matching. The one or more rules are applied to the input stream to determine the textual data elements in the input stream. The textual data elements are provided to a morphological parser. The morphological parser provides semantic analysis of the textual data elements for use in integrating the textual data elements in order to have data standardization with respect to the textual data.
As yet another example, a computer-implemented system and method are provided to process input textual data and segment such data in a context-sensitive manner, without the need to have delimiter characters present in the textual data. If a user wishes to process a large amount of textual data consisting of Korean characters, and that text does not include delimiter characters, a text segmentation system nonetheless allows the user to segment the text on the basis of rules the user defines. Once the input textual data is segmented, the output textual data elements may then be further analyzed by known methods.
A text segmentation system 110 may be executed on one or more servers 120. The one or more servers 120, in turn, may be connected to one or more data stores 130, which may store the input, output, or both of the text segmentation system 110. Users 140 may access the text segmentation system 110 over one or more networks 150 that are linked to the one or more servers 120 on which the text segmentation system 110 executes.
As depicted in
The output produced by the example text segmentation system 110 is one or more textual data elements 220. These data elements 220 represent the “segments” produced by the text segmentation system's application of the user-defined rules 210 to the input stream 200. The textual data elements 220 may form the output from the text segmentation system 110 and be passed as input to a morphological parsing system 230, which may further process the data elements 220.
In order to process an input character string and produce output tokens, a text segmentation system 110 is configured with several parameters. First, the text segmentation system 110 defines a set of initial flags. The initial flags are variable names, and they serve to initialize the system with a pre-determined state. Second, a user of the text segmentation system 110 provides an ordered set of segmentation rules. The example system may use any number of rules necessary to fully segment the input textual data, and each rule may consist of several different fields as discussed below.
As depicted in
The dictionary 330 and regular expression 310 fields are different ways of identifying matches within an input character string. A dictionary 330 could contain literal strings that the text segmentation system attempts to identify within the input character string. A regular expression 310 defines symbolically an acceptable set of strings for which a text segmentation system would search within the input character string. These regular expressions 310 could, for instance, take the form of known Perl-style regular expressions.
Regardless of the approach used, though, the system may find that more than one rule in the list 300 applies at any given time. To disambiguate these situations, the system selects the longest matching substring within the input character string and, if more than one substring of the same length was matched, the system proceeds to select the top-most rule in the list 300 that produced one of the longest-matching substrings.
The system maintains a set of context state variables 320, also called the flag state. These flags are Boolean variables, analogous to switches, and may either exist in a given state, or not. If used, these flags operate to determine the proper ordering and application of the user-defined text segmentation rules. For maximum flexibility in the segmentation of input textual data, rules are analyzed at the character level. This allows fine-grained control over the text segmentation process and also permits the application of the systems and methods described herein across a broad set of languages.
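The longest-match, top-most-rule disambiguation described above can be illustrated with a minimal Python sketch. The Rule record, its field names, and the helper functions match_length and select_rule are hypothetical and are introduced only for illustration; they are not taken from the system described herein. The sketch also assumes that a regular expression contributes whatever match Python's re engine reports at the given position, and it ignores the flag state.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical rule record; field names are illustrative assumptions."""
    name: str
    regex: Optional[str] = None                           # regular expression field
    dictionary: List[str] = field(default_factory=list)   # literal strings


def match_length(rule: Rule, text: str, pos: int) -> int:
    """Return the length of the match the rule produces at `pos`, or 0 if none.

    When a rule has both a regex and a dictionary, the longer result is used.
    """
    best = 0
    if rule.regex is not None:
        m = re.compile(rule.regex).match(text, pos)
        if m:
            best = max(best, m.end() - pos)
    for literal in rule.dictionary:
        if literal and text.startswith(literal, pos):
            best = max(best, len(literal))
    return best


def select_rule(rules: List[Rule], text: str, pos: int) -> Optional[Tuple[Rule, int]]:
    """Scan the ordered rule list and return (rule, length) for the winner.

    The longest match wins; on a tie, the rule nearer the top of the list wins.
    """
    winner = None
    for rule in rules:
        length = match_length(rule, text, pos)
        # Strictly greater, so an earlier rule keeps the win on equal length.
        if length > 0 and (winner is None or length > winner[1]):
            winner = (rule, length)
    return winner
```

For instance, given rules for digits, lowercase alphanumerics, and a dictionary containing "12ab", in that order, the alphanumeric rule would be selected at the start of "12ab-": it ties with the dictionary rule on length but appears higher in the list.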
Once a rule has successfully performed a textual match 425 (and its matching substring is longer than any previous match), the system checks at 430 the current flag state and evaluates it against the matching rule's prerequisite expression, using Boolean logic operators AND, OR, and NOT to test for a flag's existence. A true result means that the rule's input criteria have been satisfied and the rule becomes the “satisfier” for this input position, as shown at 435, until, possibly, a better satisfier is found further down in the rule list. At 440, if there are rules remaining in the list that have not yet been scanned, then the system returns to 420 and resumes scanning the rule list.
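As one way to picture the prerequisite check at 430, the following Python sketch evaluates a prerequisite expression against the set of flags that currently exist. The textual expression syntax, the support for parentheses, the precedence order (NOT, then AND, then OR), and the treatment of an empty expression as always satisfied are all assumptions made for this sketch; the function name evaluate_prerequisite is hypothetical.

```python
import re
from typing import List, Optional, Set


def evaluate_prerequisite(expression: str, flags: Set[str]) -> bool:
    """Evaluate an expression such as "A AND NOT (B OR C)" against `flags`.

    A flag name evaluates to True if the flag exists in `flags`.
    An empty expression is treated as always satisfied (an assumption).
    """
    tokens: List[str] = re.findall(r"\(|\)|[A-Za-z_][A-Za-z0-9_]*", expression)
    if not tokens:
        return True
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or() -> bool:
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()        # always parse, so tokens are consumed
            value = value or rhs
        return value

    def parse_and() -> bool:
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not() -> bool:
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom() -> bool:
        tok = peek()
        if tok is None or tok in ("AND", "OR", ")"):
            raise ValueError(f"unexpected token in expression: {tok!r}")
        advance()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        return tok in flags          # flag existence test

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token in expression: {tokens[pos]!r}")
    return result
```

Under these assumptions, a rule whose prerequisite is "SEARCH_PREF AND NOT DONE" would be eligible as a satisfier only while SEARCH_PREF exists in the flag state and DONE does not (DONE being a made-up flag name for illustration).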
If no rules remain, the system determines at 445 whether a rule satisfier was identified by the system. In the event that no suitable satisfier was located for the current input position, the system returns to 410 to determine if there is additional text to be segmented, and if so, the system advances the pointer position on the input stream exactly one character toward the end of the stream and returns to 420 to scan the rule list at the new pointer position. If a satisfier was found, on the other hand, as shown at 450, the system positions the input pointer to the string position immediately following the last character of the matched substring and segments the input stream as discussed below. At 455, the system optionally sets its flag state to the configuration specified by the satisfier's output flags field. This may be implemented as an overwrite operation, so that if any input flags are to be preserved, they are reassigned using the satisfier's output flags. The system returns to 410 to determine whether the input stream contains additional text to be segmented and the process continues as described.
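The control flow of steps 410 through 455 might be sketched in Python as follows. This is an illustration under stated assumptions, not the system's actual implementation: the Rule record and its fields are hypothetical, rules match dictionary-style literals only, the prerequisite is modeled as a callable on the flag state, and "longer than any previous match" is interpreted as longer than the current satisfier's match. The flag name SEARCH_CITY used in the usage note is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Set, Tuple


@dataclass
class Rule:
    """Hypothetical rule record; all field names are illustrative."""
    name: str
    literals: List[str]                             # dictionary-style literal strings
    prerequisite: Callable[[Set[str]], bool]        # test on the current flag state
    output_flags: Optional[FrozenSet[str]] = None   # None: leave the flag state alone


def match_length(rule: Rule, text: str, pos: int) -> int:
    """Length of the longest literal of `rule` found at `pos`, or 0."""
    return max((len(s) for s in rule.literals if s and text.startswith(s, pos)),
               default=0)


def scan(text: str, rules: List[Rule],
         initial_flags: Iterable[str]) -> Tuple[List[Tuple[int, int, str]], Set[str]]:
    """Return the satisfied matches as (start, end, rule name) plus the final flags."""
    flags = set(initial_flags)
    matches: List[Tuple[int, int, str]] = []
    pos = 0
    while pos < len(text):                          # 410: more text to segment?
        satisfier, best_len = None, 0
        for rule in rules:                          # 420/440: scan the rule list
            length = match_length(rule, text, pos)  # 425: textual match
            # 430: check the prerequisite only for a longer match than so far
            if length > best_len and rule.prerequisite(flags):
                satisfier, best_len = rule, length  # 435: new satisfier
        if satisfier is None:                       # 445: no satisfier found
            pos += 1                                # advance exactly one character
            continue
        matches.append((pos, pos + best_len, satisfier.name))
        pos += best_len                             # 450: move past the match
        if satisfier.output_flags is not None:      # 455: overwrite the flag state
            flags = set(satisfier.output_flags)
    return matches, flags
```

Because the flag state is overwritten rather than merged, a rule that needs SEARCH_PREF to remain set after it fires would have to list SEARCH_PREF again in its own output flags.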
As part of step 450, the system determines from the rule that produced the satisfier how the system should segment the input string, given what it has learned from the matching process. To do this, the system checks the segment before and segment after options, which are optional. If neither is specified, no segmentation is performed for the current match. The actual segmentation process sets markers at specific character positions in the input character string. The segment before option instructs the system to place a marker before the first character of the substring matched by the current satisfier rule. Similarly, the segment after setting instructs the system to place a marker after the end of the last character of the matched substring. The settings for the segmentation flags may vary depending on the needs of the situation at hand and the structure of the match rules that are defined for a particular input character string. When the end of the input character string has been reached, the system then breaks the string into tokens using the segmentation markers that were created along the way.
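The marker placement and final token break described above can be sketched in Python as follows. The tuple layout for a match, the function names place_markers and break_into_tokens, and the choice to drop empty tokens are assumptions made for this sketch. The romanized strings stand in for the Japanese characters of the example.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical match record: (start, end, segment_before, segment_after),
# where `end` is the position immediately following the matched substring.
Match = Tuple[int, int, bool, bool]


def place_markers(matches: Iterable[Match]) -> Set[int]:
    """Collect marker positions from the segment before/after options."""
    markers: Set[int] = set()
    for start, end, before, after in matches:
        if before:
            markers.add(start)   # marker before the first matched character
        if after:
            markers.add(end)     # marker after the last matched character
    return markers


def break_into_tokens(text: str, markers: Set[int]) -> List[str]:
    """Break `text` at the marker positions once the end of input is reached.

    Markers at position 0 or at the end of the text produce no empty tokens.
    """
    tokens: List[str] = []
    prev = 0
    for cut in sorted(markers):
        if prev < cut <= len(text):
            tokens.append(text[prev:cut])
            prev = cut
    if prev < len(text):
        tokens.append(text[prev:])
    return tokens
```

With the input "HOKKAIDOSAPPORO1", a match on HOKKAIDO with only segment after set and a match on SAPPORO with both options set would yield markers at positions 8 and 15, and the tokens "HOKKAIDO", "SAPPORO", and "1". A match with neither option set places no marker, so the string is left unbroken at that match.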
The output produced by the text segmentation system 110 is one or more textual data elements 520. These data elements 520 represent the “segments” produced by the text segmentation system's application of the user-defined rules 510 to the input textual data. The textual data elements 520 may form the output from the text segmentation system 110 and be passed as input to a morphological parsing system 530, which may further process the data elements 520 from a semantic perspective. Further, the textual data elements 520 may be incorporated into a common database 540.
Thus, a text segmentation system may be used as part of a system designed to standardize textual data from disparate input sources and load the standardized data into a common database that then may be further utilized by users or other applications. The textual data elements 520 produced by the example text segmentation system also may be subjected to further analytical techniques. For example, a clustering algorithm can be used to analyze and categorize the textual data elements 520. Alternatively, or in conjunction with the above-described data analysis techniques, data identification techniques may be used to determine one or more data types represented within the textual data elements 520.
In the example, the category of textual data to be segmented is address location type data. The input textual data in the example contains Japanese characters. As mentioned previously, there are no language restrictions on the textual input, and the same is true with regard to predetermined categories. Any category of textual data may be segmented as described herein. In the example, the category is address data, but other types of personal data, such as names, telephone numbers, or government identification numbers could be segmented, as could categories such as financial or accounting data, positional coordinate data, or any other type of information that may be represented textually. In this way, the text segmentation system concerns itself with only a small subset of the entire language structure by focusing on a particular pre-determined category of phrases, such as a collection of names or addresses. This obviates the need to accumulate or purchase a large lexicon of knowledge for these scenarios since such databases typically require large amounts of memory and disk space. Therefore in this operational scenario, the text segmentation system allows a user to define, for a specific category of phrases, segmentation and word categorization heuristics that can be passed into a natural language processing system.
As another example of a specific category of phrases, row 610 in the example user interface 600 shows a rule for segmenting textual data in Japanese characters that describes addresses in the Hokkaido prefecture. The rule causes the text segmentation system to search a vocabulary to attempt to find textual matches, and the prerequisite flag SEARCH_PREF indicates that this example includes a necessary precondition before a textual match may constitute a “satisfier.” If the match does constitute a satisfier, then in this example, the output flags are used by the system to set the flag state. Also, the system is instructed to chop the input textual data after the HOKKAIDO match. Further, the example user interface 600 provides a user with the ability to add notes to each rule, so that, for example, a future user would be able to better understand the structure of the rules and their function and precedence.
While examples have been used to disclose the invention, including the best mode, and also to enable any person skilled in the art to make and use the invention, the patentable scope of the invention is defined by claims, and may include other examples that occur to those skilled in the art. Accordingly the examples disclosed herein are to be considered non-limiting.
It is further noted that the systems and methods may be implemented on various types of computer architectures, such as for example on a single general purpose computer (as shown at 1010 on
Further, the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
In addition, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, etc.) that contain instructions (e.g., software) for use in execution by a processor to perform the methods' operations and implement the systems described herein.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
Claims
1. A computer-implemented method for integrating textual data from disparate data sources in order to have data standardization with respect to the textual data, comprising:
- receiving an input stream of textual data from one or more of the disparate data sources;
- wherein the input stream of textual data is related to a predetermined category;
- accessing character-level rules that are related to the predetermined category and that define how the input stream is to be segmented into textual data elements through pattern matching;
- applying the rules to the input stream to determine the textual data elements in the input stream; and
- outputting the textual data elements to a morphological parser;
- wherein the morphological parser provides semantic analysis of the textual data elements for integrating the textual data elements in order to have data standardization with respect to the textual data.
2. The method of claim 1, wherein the rules are based on regular expressions, dictionary lists, and combinations thereof.
3. The method of claim 2, wherein the regular expressions define symbolically an acceptable set of strings for searching within the input stream of textual data.
4. The method of claim 2, wherein the dictionary lists contain literal strings to identify literal strings within the input stream of textual data.
5. The method of claim 2, wherein the operation of the rules is altered based upon a value stored in a context state variable.
6. The method of claim 1, wherein the disparate data sources include text files, relational databases, data analysis applications, network-based applications, manually-input data, and combinations thereof.
7. The method of claim 6, wherein the input stream from the disparate data sources does not contain delimiters.
8. The method of claim 7, wherein the input stream comprises words from an Asian language.
9. The method of claim 8, wherein because the data sources are disparate, the outputted data elements vary with respect to a formatting standard.
10. The method of claim 1, wherein the rules are defined by a user through a graphical user interface.
11. The method of claim 1, wherein the textual data elements segmented from the input stream are integrated into a common database.
12. The method of claim 1, wherein a clustering algorithm is used to analyze and categorize the textual data elements.
13. The method of claim 1, wherein data identification techniques are used to determine one or more data types of the textual data elements.
14. The method of claim 1, wherein because the input stream of textual data is related to the pre-determined category, the user-defined rules only are required to be related to the pre-determined category, thereby reducing storage size requirements with respect to the user-defined rules.
15. The method of claim 1, wherein the pre-determined category is an address location category, or a name category, or a phone number category, or an occupation category.
16. The method of claim 1, wherein the user-defined rules and the input stream of textual data are stored in one or more computer-readable data stores.
17. The method of claim 1, wherein if more than one of the rules applies to a plurality of substrings of characters in the input stream, then selecting longest-matching substring within the substrings to determine a textual data element in the input stream.
18. The method of claim 17, wherein the rules are arranged in an ordered list, wherein if more than one substring of the same length was selected, then selecting top-most of the rules that produced one of the longest-matching substrings.
19. A computer-implemented system for integrating textual data from disparate data sources in order to have data standardization with respect to the textual data, comprising:
- one or more computer-readable data stores for storing an input stream of textual data from a data source;
- wherein the input stream of textual data is related to a predetermined category;
- one or more computer-readable data stores for storing character-level rules that are related to the predetermined category and that define how the input stream is to be segmented into textual data elements through pattern matching;
- instructions configured to operate on a data processor and to apply the rules to the input stream to determine the textual data elements in the input stream; and
- instructions configured to operate on the data processor for outputting the textual data elements to a morphological parser;
- wherein the morphological parser provides semantic analysis of the textual data elements for use in integrating the textual data elements in order to have data standardization with respect to the textual data.
20. A computer-readable medium including processor-executable instructions for a system for integrating textual data from disparate data sources in order to have data standardization with respect to the textual data, comprising:
- instructions for receiving an input stream of textual data from a data source;
- wherein the input stream of textual data is related to a predetermined category;
- instructions for accessing character-level rules that are related to the predetermined category and that define how the input stream is to be segmented into textual data elements through pattern matching;
- instructions for applying the rules to the input stream to determine the textual data elements in the input stream; and
- instructions for outputting the textual data elements to a morphological parser;
- wherein the morphological parser provides semantic analysis of the textual data elements for use in integrating the textual data elements in order to have data standardization with respect to the textual data.
Type: Application
Filed: Oct 27, 2008
Publication Date: Apr 29, 2010
Patent Grant number: 8326809
Inventor: Peter Anthony Vetere (Raleigh, NC)
Application Number: 12/258,887
International Classification: G06K 9/34 (20060101);