NATURAL LANGUAGE PROCESSING SYSTEM AND METHOD
A natural language processing system is disclosed herein. Embodiments of the NLP system perform hand-written rule-based operations that do not rely on a trained corpus. Rules can be added or modified at any time to improve the accuracy of the system, and to allow the same system to operate on unstructured plain text from many disparate contexts (e.g. news articles, tweets, and medical articles) without harming accuracy for any one context. Embodiments also include a language decoder (LD) that generates information which is stored in a three-level framework (word, phrase, clause). The LD output is easily leveraged by various software applications to analyze large quantities of text from any source in a more sophisticated and flexible manner than previously possible. A query language (LDQL) for information extraction from NLP parsers' output is disclosed, with emphasis on its embodiment implemented for the LD. The use of LDQL for knowledge extraction is also presented, using the example of an application named Knowledge Browser.
This application is a continuation of U.S. patent application Ser. No. 14/071,631, filed Nov. 4, 2013, which claims the benefit of U.S. Provisional Patent Application No. 61/721,792, filed Nov. 2, 2012. Both of these applications are incorporated by reference herein in their entirety.
FIELD OF THE INVENTION
Inventions disclosed and claimed herein are in the field of natural language processing (NLP).
BACKGROUND
Natural language processing (NLP) systems are computer-implemented methods for taking natural language input (for example, computer-readable text) and operating on the input so as to generate output that is useful for computers to derive meaning. Examples of NLP system applications include spell checkers/grammar checkers, machine translation systems, and speech-to-text systems. Increasingly, there is interest in developing methods for machines to more intelligently interpret human language input data (such as text) for the purpose of directing the computer as if it were another person who could understand speech. One application for such methods is search engines that receive a typed query from a person and perform web searches to attempt to generate a set of meaningful answers to the query. An important subclass of NLP systems is NLP parsers, especially grammatical parsers such as Part-of-Speech taggers, constituency parsers, and dependency parsers, and shallow semantic parsers such as SRL (Semantic Role Labeling) systems. Their role is to preprocess text and add information to words to prepare the text for further usage. Current NLP systems are mostly built on top of NLP parsers, and their features and accuracy strongly rely on the information produced by these parsers. The quality of the information delivered by these parsers is strongly correlated with the efficiency of NLP systems.
All current parsers are dependent on corpora and therefore on the context in which those corpora were written. Typically a corpus consists of correctly written, grammatically correct sentences with common syntactic structures, which are manually annotated by humans. The system is then trained using this corpus.
This is one reason that traditional NLP parsers are most accurate on the same type of content they were trained on (the same corpus). That is why ever-changing language, such as user-generated content (e.g. reviews, tips, comments, tweets, social media content), presents a challenge for NLP parsers built with machine learning techniques. Such content often includes grammatically incorrect sentences and non-standard usage of language, as well as emoticons, acronyms, strings of non-letter characters and so on. This content is constantly changing and expanding with different words and syntactic structures. All of this content carries meaningful information that is easy for humans to understand, but it is still difficult for NLP applications to extract meaning out of it.
One way in which current NLP parsers can be updated or improved (for better accuracy or for extracting additional information) is to modify the existing corpus, or to create a new corpus or re-annotate the existing one, and retrain the system with it to understand new content. However, this is a tedious, costly and time-consuming process. All current NLP parsers, especially those using machine-learning algorithms, rely on a corpus as training data annotated by linguists with predefined tags (e.g. the Penn Treebank).
If there were a need to distinguish the pronominal and adjectival uses of "that" (giving them different POS tags in different contexts), linguists would need to manually re-annotate all the sentences in the whole corpus that contain the word "that", considering the context of each usage, and then retrain the parser.
Building a particular application on top of an NLP parser requires building a module to transform the NLP parser output into usable data. The application using the parser's output could be coded in a programming language, use a rule-based system, be trained with machine learning techniques, or be created with a combination of any of the above methods.
Using NLP parsers can be challenging due to the need to understand the structure of the output and its parameters (which requires expert knowledge). One of the challenges for NLP parsers is to provide a consistent structure of information. Also, the output of an NLP parser relies on the quality of the input text data.
For example, consider these sentences:
-
- 1. John likes math.
- 2. John likes to learn.
- 3. John likes learning math in the evening.
Using grammar parsers, in each case you will get a different notation for the object that John likes.
In constituency parsers the number of levels (depth) in the parse tree depends on the length and the grammatical structure of the processed sentence. In the given example, the first sentence has 3 levels, the second sentence has 5 levels and the third has 6 levels in the tree representation.
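The growth in tree depth can be sketched with simplified, hand-written bracketings; the bracketings below are illustrative stand-ins, not actual parser output:

```python
# Illustrative constituency bracketings for the three example sentences;
# each tree is a nested tuple (label, children...), a leaf is a plain string.
def depth(tree):
    """Number of levels in a constituency tree."""
    if isinstance(tree, str):          # a leaf word
        return 0
    return 1 + max(depth(child) for child in tree[1:])

# Hand-written, simplified bracketings (a real parser may differ):
s1 = ("S", ("NP", "John"), ("VP", "likes", ("NP", "math")))
s2 = ("S", ("NP", "John"),
      ("VP", "likes", ("S", ("VP", "to", ("VP", "learn")))))
s3 = ("S", ("NP", "John"),
      ("VP", "likes",
       ("S", ("VP", "learning", ("NP", "math"),
              ("PP", "in", ("NP", "the", "evening"))))))

print(depth(s1), depth(s2), depth(s3))  # prints: 3 5 6
```

Even with these toy bracketings, the depth varies with sentence length and structure, so code extracting "what John likes" must navigate a different shape each time.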
In state-of-the-art dependency parsers the structure of the output and number of levels in the dependency tree representation also vary. Adding even one word in the sentence can alter the grammatical information of all the other words.
The given example about John would produce a different structure for each sentence. The first sentence requires extracting dependents of the "dobj" relation connected to the word "likes"; the second, all dependents of the "xcomp" relation connected to the word "likes"; and in the third example there is a need to analyze all governors connected to dependents of the "xcomp" relation of the word "likes".
All of the above is the reason why it is difficult for people, especially non-linguists (developers, analysts), to use the parser output and write rules to adjust it to their current needs. For example, to write an information extraction engine that extracts information about product features from reviews, you could use a constituency or dependency parser, but you would need to write complex algorithms to search through the parse tree. To move to another domain (e.g. extracting information from twitter), the algorithms must be redesigned and part of the code rewritten.
To deal with these problems NLP systems use machine learning techniques. This approach has some limitations in terms of accuracy and amount of extracted information.
There are query languages to process structured data (e.g. SQL for relational databases, Cypher for graph databases, SPARQL for RDF-tagged texts (Resource Description Framework)), but there are no languages designed directly to query the structure of natural language (the output of an NLP parser).
It would be desirable to have an efficient framework for storing information decoded from text. It should provide an invariant and consistent way of storing information that is insensitive to different types of input. With such a framework, it would be possible for non-experts to write efficient rules on top of an NLP parser's output.
It would be desirable to have a parser for natural language processing that is built fully algorithmically, so that it allows constant improvement in accuracy and the addition of new features without building or re-annotating any corpus. It would also be desirable to have an NLP system that is more capable than current NLP parsers of dealing with non-typical grammatical input, deals well with the constantly-changing language of the web, and produces accurate output which can be stored in an efficient framework of information.
It would also be desirable to have a query language that can be used on the logical layer across different input contexts allowing humans to write efficient rules for extracting information and is capable of effectively leveraging many NLP systems.
Embodiments of inventions disclosed herein include improvements on current NLP systems and methods that are especially relevant to processing input that consists of plain text from different types of context. As such, the embodiments disclosed herein provide a highly accessible platform for natural language processing. The output of the disclosed NLP system is easily leveraged by various software applications to analyze large quantities of text from any source in a more sophisticated and flexible manner than previously possible.
The Language Decoder (LD) is a novel, fully algorithmic NLP parser that decodes information out of text and stores it in a three-level framework capable of handling various types of texts from different domains, such as reviews, news, formal documents, tweets, comments, etc.
The algorithmic nature of the system allows it to achieve high accuracy on user-generated content.
Embodiments of the NLP system can thus work properly on different kinds of domains at the same time.
This three-level hierarchical framework of processed text is leveraged by embodiments of a language decoder query language LDQL as further described herein. The LDQL is particularly easy to use for developers, without requiring specific linguistics training. However, other rule-based query languages could conceivably be developed for extraction (by query) of information from text processed by the LD.
Other systems and applications 106 are systems, including commercial systems and associated software applications, that have the capability to access and use the output of the NLP system 202 through one or more application programming interfaces (APIs) as further described below. For example, other systems/applications 106 can include an online application offering its users a search engine for answering specific queries. End users 212 include individuals who might use applications 106 through one or more end user devices 212A. User devices 212A include, without limitation, personal computers, smart phones, tablet computers, and so on. In some embodiments, end users 212 access NLP system 202 directly through one or more APIs presented by NLP system 202.
The LD output 205 can also be operated on by embodiments of an LD query language (LDQL) 204. LDQL 204 is described according to various embodiments below as an improved query language designed to take advantage of LD output 205. However LDQL 204 can also operate on the output of any prior NLP system. Also provided in various embodiments is a higher level of APIs (as compared to LD output 205) for providing machine learning systems 104 and other systems/applications 106 more intuitive access to LD output 205. Other systems/applications 106 can include semantic databases, Freebase™, Wordnet™, etc.
In general, LD APIs 206 are relatively easy to use for other systems seeking to manipulate LD output 205. However, machine learning systems 104 and other systems/applications 106 can also directly write queries using LDQL 204.
The word tagger 203A labels each word with a tag from the proprietary tagset previously described. The predicate finder 203B builds predicates from verb components. Then the clause divider 203C joins words into scraps, groups them into separate clauses, and determines relationships and types of those clauses. Next the accommodator 203D converts scraps within clauses into phrases and determines relationships and types of those phrases. At the end of the process, the accommodator determines types and relationships of each word within each phrase. The output 205 of the accommodator as shown is in a human readable form in contrast to tree-style output of various prior systems. The original sequence of the input words is preserved and the relationships, types, tags, etc. are simple to view.
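The four-stage pipeline above can be sketched as follows; the stage internals are hypothetical placeholders, and only the ordering and the word/clause/phrase data flow follow the description:

```python
# A minimal sketch of the four-stage LD pipeline: tagger -> predicate
# finder -> clause divider -> accommodator. Stage bodies are stand-ins.
def word_tagger(text):
    # assign a (here: dummy) tag to every word
    return [(w, "TAG") for w in text.split()]

def predicate_finder(tagged):
    # mark verb components as predicates (placeholder: none found)
    return {"words": tagged, "predicates": []}

def clause_divider(stage):
    # group words into scraps, then scraps into clauses (placeholder:
    # a single clause holding every word as one scrap)
    return {"clauses": [{"scraps": [stage["words"]]}]}

def accommodator(stage):
    # convert scraps within clauses into typed phrases (placeholder type)
    clauses = [{"phrases": [{"words": s, "type": "unknown"}
                            for s in c["scraps"]]}
               for c in stage["clauses"]]
    return {"clauses": clauses}

def language_decoder(text):
    return accommodator(clause_divider(predicate_finder(word_tagger(text))))

out = language_decoder("John likes math")
print(out["clauses"][0]["phrases"][0]["words"])
```

The point of the sketch is the shape of the output: words remain in their original sequence, grouped into phrases inside clauses, rather than being rearranged into a tree.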
Word Tagger Methodology
Contrary to all prior-art word taggers, the present word tagger does not use any machine learning techniques and contains only human-written disambiguation rules.
Word tagger rules are described by a language resembling well-known regular expressions, which is comprehensible and easy to change for human operators. Hence, the process of making changes is relatively fast, especially in comparison to prior-art machine-learning-based algorithms.
A set of rules is divided into two groups (referred to herein as the "exact group" and the "inexact group") with different usages within the tagging algorithm.
Contrary to machine learning approaches, the present word tagger uses its own dictionary to recognize words and assign all possible tags to them. It is proven that word taggers using machine learning techniques achieve higher accuracy when tagging known tokens (words) than unknown ones. The term "known token" means the token appeared in the training corpus at least once. The difference in accuracy between known and unknown tokens is about 7-8% for good parsers (see for example http://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf). Embodiments of the present word tagger allow new words to be added directly to the referenced dictionary with all possible tags. For example, "google" can be added with tags related to both noun and verb. The present word tagger contains rules responsible for resolving that kind of ambiguity, so it can automatically handle the problem.
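A minimal sketch of such a dictionary lookup; the entries and tag names are invented for illustration:

```python
# Hypothetical dictionary mapping each word to every tag it may carry.
DICTIONARY = {
    "google": {"noun", "verb"},
    "likes":  {"verb", "noun"},
    "math":   {"noun"},
}

def possible_tags(word):
    """Return every tag the dictionary allows for a word, or an empty
    set for an unknown word (which can then simply be added)."""
    return DICTIONARY.get(word.lower(), set())

def add_word(word, tags):
    # new vocabulary is added directly, with no retraining step
    DICTIONARY.setdefault(word.lower(), set()).update(tags)

add_word("tweet", {"noun", "verb"})
print(sorted(possible_tags("google")))  # ambiguous: noun or verb
```

The ambiguity left by the lookup ("google" as noun or verb) is then resolved by the disambiguation rules described below.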
Word Tagger Output
The present word tagger provides a unique tagset.
The most important differences between the present tagset and prior-art tagsets are:
-
- The present tagset does not include particles as a separate tag. The definition of a particle is very inconsistent; therefore, a distinction between adverbs and prepositions is preferred.
- The present tagset provides a distinction between prepositions and subordinating conjunctions.
- For determiners, the present tagset provides a distinction between adjectival and pronominal function.
- For pronouns, the present tagset provides a distinction between relative and interrogative function.
The process of the Word Tagger
The input to the word tagger 203A consists of plain text. The text can come from any source, such as articles, documents, user generated content, or any other source. Referring to
Exact rules 508 and inexact rules 510 are two groups of rules. Each rule examines the context of a word with an ambiguous tag (that is, more than one possible tag). Such context is not limited or restricted; it can address any element of the text, in both directions at the same time. This is a range aspect of the claimed invention describing the facility to access or examine any element of the text before or after an ambiguous element, without limit.
Another aspect of the word tagger 203A is the manner of expression employed. In order to examine the context of a word, rules permit the creation of patterns and sub-patterns which address words and/or their tags and which can be subject to the following operations:
-
- conditionality (for single element or subpattern) (corresponding to “?” regular expression operator),
- alternatives (for single element or subpattern) (corresponding to “|” regular expression operator),
- repetition (for single element or subpattern) (corresponding to “+” and “*” regular expression operator),
- negation (for single element or subpattern) (corresponding to the "!" regular expression operator).
This form of rules expands their expressiveness in comparison to prior static look-ups and ranges ("next word is . . . ", "there is a . . . word in range of 3 words before", etc.). This in effect allows a better description of the situations in which a given rule should apply.
In an embodiment, the rules are applied separately for each sentence according to the following algorithm:
1. Stop the algorithm at any point when there is no ambiguity left in the current sentence.
2. For all ambiguous words, try to apply all exact rules in the proper order.
3. Apply inexact rules until any one is met. Go to step 2.
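The three steps above can be sketched as follows; the rule representation (a function returning a tag or None) and the example "that" rule are invented for illustration:

```python
# Sketch of the exact/inexact rule loop. A "rule" inspects a sentence
# around position i and either resolves the tag there or returns None.
def disambiguate(sentence, exact_rules, inexact_rules):
    """sentence: list of (word, set_of_possible_tags); mutated in place."""
    while any(len(tags) > 1 for _, tags in sentence):          # step 1
        applied = False
        for i, (word, tags) in enumerate(sentence):            # step 2
            if len(tags) > 1:
                for rule in exact_rules:
                    tag = rule(sentence, i)
                    if tag is not None:
                        sentence[i] = (word, {tag})
                        applied = True
                        break
        if applied:
            continue
        for i, (word, tags) in enumerate(sentence):            # step 3
            if len(tags) > 1:
                for rule in inexact_rules:
                    tag = rule(sentence, i)
                    if tag is not None:
                        sentence[i] = (word, {tag})
                        applied = True
                        break
            if applied:
                break                       # one inexact hit, back to step 2
        if not applied:
            # no rule matched; pick an arbitrary tag to guarantee termination
            for i, (word, tags) in enumerate(sentence):
                if len(tags) > 1:
                    sentence[i] = (word, {sorted(tags)[0]})
    return sentence

# Invented example rule: "that" is a determiner before a noun, else a pronoun.
def that_rule(sent, i):
    if sent[i][0] == "that":
        nxt = sent[i + 1][1] if i + 1 < len(sent) else set()
        return "det" if nxt == {"noun"} else "pron"
    return None

s = [("that", {"det", "pron"}), ("book", {"noun"})]
print(disambiguate(s, [that_rule], []))
```

The fallback branch is an addition of this sketch, not part of the described algorithm; it only keeps the toy loop from running forever when no rule fires.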
Methodology of Improving the Word Tagger and Resolving Exceptions
At 602 an exception indication is received within the system 202. It is determined which word caused the exception (604). The indicated word is searched for in the dictionary 502 (606). If the word is not in the dictionary, it is added to the dictionary at 608; then it is checked whether there is an existing set of rules for that set of tags (612). If so, it is checked whether the problem is solved (618); if not, a new set of rules is created for the new set of tags (614), and then it is checked whether the problem is solved (618).
If, after checking 606, the word is in the dictionary, it is determined which rule is responsible for the exception, and the responsible rule is edited or a new rule is added 610. In an embodiment, the editing is performed by a person. In another embodiment, the editing is performed by a machine according to a separate rule-making algorithm.
If the exception is resolved (at 618), the process ends. If the exception is not resolved, the process returns to 604 to examine which word caused the exception.
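The resolution loop above can be sketched as follows, with trivial placeholder repair actions standing in for the human or automated edits:

```python
# Sketch of the exception-resolution loop (602-618): an unknown word is
# added to the dictionary; a known word triggers a rule edit. The repair
# actions and the "solved" check are placeholders.
def resolve(word, dictionary, rules, is_solved):
    while not is_solved():                          # 618
        if word not in dictionary:                  # 606
            dictionary[word] = {"noun"}             # 608: add the word
        else:
            rules.append(f"rule for {word!r}")      # 610: edit/add a rule

attempts = []
def is_solved():
    # placeholder: pretend the second repair attempt fixes the exception
    attempts.append(1)
    return len(attempts) > 2

dictionary, rules = {}, []
resolve("google", dictionary, rules, is_solved)
print(dictionary, rules)
```

The same add-word-or-edit-rule loop runs until the exception no longer reproduces, which is what allows the tagger to improve without any retraining.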
The above methodology of resolving exceptions to the system may become an automated process.
Clause Divider Methodology
Referring to
Embodiments of the clause divider 203C comprise an algorithm emulating the human reasoning behind dividing text into clauses. To this end, a sequential approach is employed rather than applying patterns to an entire sentence. In effect, collections of one or more words (scraps) are considered sequentially (which is similar to what a human actually does while hearing or reading a sentence), and an attempt is made to assign each scrap to build a clause "on the fly", constantly trying and rejecting different alternatives. This approach simplifies aspects of language decoding, such as the handling of nested relative clauses, and facilitates the creation of simpler and more accurate rules than those based on patterns.
Output of the Clause Divider
According to embodiments, the clause divider provides division into clauses. A clause can possess at most one predicate (either finite, or non-finite such as infinitives, participles and gerunds) and all of its arguments. The LD provides connections between clauses based on criteria concerning their function towards their superiors. Clauses can be connected directly or through nodes. Nodes are sentence elements provided to connect clauses, e.g. coordinating or subordinating conjunctions. The LD provides a unique classification of clauses which corresponds to the LD system 202 classification of phrases. Main clauses are distinguished. Other clauses can function as subjects, objects, complements and attributes, and therefore can be labeled with the proper function name (e.g. attribute clause).
In an embodiment, the clause divider includes at least the following characteristics:
With reference to one of the current, typical dependency parsers:
1. Embodiments provide explicit division into clauses. In the referenced parser, derivation from the relations structure is required.
2. Coordinating nodes (elements connecting two coordinated clauses, e.g. "and", "but", "or") are distinguished by embodiments. The referenced parser does not provide a distinction between a relation connecting two coordinated clauses and a relation connecting two coordinated phrases or words. In addition, elements connecting a subordinated clause to its superior clause (e.g. "which", "after", "when") are distinguished by embodiments. In a typical dependency parser the main connection holds between two verbs representing the clauses; hence the equivalent of a subordinate node has to be connected with a proper verb.
3. In the clause divider 203C, classification is based on different criteria (than in the typical dependency parser) and in consequence not every type from one classification is equivalent to a subset of the types of the other classification (although in a number of situations it is).
4. Types of clauses in the LD 202 are equivalent to types of phrases, making the LD system 202 more coherent in comparison to the typical parser. For example, some of referenced parser's relations corresponding to clause types are common with those for phrases and/or words.
With reference to one of the current, typical constituency parsers:
1. A typical constituency parser provides a classification that is based on criteria concerning grammatical construction. This is in sharp contrast to the clause divider 203C, whose criteria are based on a clause's function towards its superior. In effect, every type from one classification can match almost any type from the other, depending on the particular situation.
2. The LD 202 treats nodes connecting clauses (e.g. “and”, “but”, “which”, “after”) as separate elements on a clause level, whereas the referenced parser includes them into the following clause.
The Process of the Clause Divider
More precisely, a scrap is a maximal set of words that will certainly form a phrase either by itself or along with some other scrap(s). That is, if two elements are able to form two separate phrases, but this is not evident at the point of creating scraps, they become separate scraps. There is an exception to the foregoing. Specifically, prepositions are joined in scraps with following noun phrases in principle, despite the fact that they will later be divided into two separate phrases.
At 704, the clause divider 203C makes a preliminary estimate of which scraps are the strongest candidates to introduce a new clause (that is, which scraps are responsible for dividing the sentence into clauses). The output of the preliminary estimate process 704 is received by a main divider engine 706, which includes a scrap dispatcher and a correctness evaluator. The scrap dispatcher assigns scraps to appropriate clauses. In an embodiment, this assignment is based on constant interaction between the scrap dispatcher and the correctness evaluator. The correctness evaluator evaluates each decision made by the scrap dispatcher by determining whether the decision generated a correct clause or not. The output of the main divider engine 706 is received by a nodes extraction process 708. This process extracts nodes as separate elements on the clause level. Relationships between clauses are established by a clause connection detection process 710. Clause types are detected (e.g. subject, object, complement) by a clause type classification process 712. The output 714 of the clause divider is text divided into sentences and clauses, which contain words (along with their tags from the word tagger) grouped into scraps. The grouping serves only as an aid in prior and subsequent processes and is not itself a crucial element of the output. The words can also be considered as assigned directly to clauses. Moreover, each clause has its proper grammatical information assigned, along with its type and connection to its superior clause.
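The dispatcher/evaluator interaction can be sketched as follows; the divider-candidate list and the one-predicate-per-clause check are invented simplifications of the actual rules:

```python
# Sketch of the scrap dispatcher / correctness evaluator loop: scraps
# are considered sequentially; each is tried in the current clause and
# rejected (opening a new clause) if the evaluator objects.
DIVIDERS = {"that", "and", "which", "but"}   # invented candidate list

def divide(scraps, preds):
    """scraps: list of scrap strings; preds: set of predicate scraps."""
    clauses = [[]]
    for scrap in scraps:
        if scrap in DIVIDERS and clauses[-1]:     # candidate opens a clause
            clauses.append([scrap])
            continue
        clauses[-1].append(scrap)                 # dispatcher: try current
        if sum(s in preds for s in clauses[-1]) > 1:
            clauses[-1].pop()                     # evaluator: >1 predicate
            clauses.append([scrap])               # so open a new clause
    return clauses

print(divide(["John", "thinks", "that", "Mary", "likes", "math"],
             {"thinks", "likes"}))
```

In the full system the node "that" would subsequently be extracted as a separate clause-level element by the nodes extraction process; this sketch stops at the division step.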
Methodology of Improving the Clause Divider 203C and Resolving Exceptions
By analyzing cases of wrong divisions, connections and types of clauses, or when adjusting the clause divider 203C to a different type of content, one can decide whether to:
-
- add or modify new divider candidates.
- add or modify rules in choosing possible alternative divisions.
- add or modify rules in correctness evaluation.
This provides flexibility and continuous improvement for different types of input context.
At 802 an exception indication is received within the system 202. It is determined if the clause is correctly divided (804). If not, it is determined if clause dividing scraps were chosen correctly (806). If not, a new candidate is added or the rules responsible for choice are changed (808).
After the changes (808), or if it was determined that the dividing scraps were chosen correctly (806), it is determined if scraps were correctly distributed to clauses (810). If not, scrap dispatcher or correctness evaluator rules are adjusted (814).
After the changes in the rules of the scrap dispatcher or correctness evaluator (814), or if it was earlier determined that scraps were correctly distributed to clauses (810) or that the clause was correctly divided (804), it is determined if the connections between clauses are correct (812). If not, the rules governing connection determination are adjusted (816).
After adjusting the determination rules (816), or if the connections between clauses were determined to be correct (812), it is determined if the clause classification is correct (818). If so, the process ends. Otherwise the clause classification rules are adjusted (820).
The above methodology of resolving exceptions to the system may become an automated process.
Accommodator Methodology
Referring briefly to
The accommodator 203D receives words grouped into scraps and further into clauses (along with connections and types). In other embodiments the accommodator operates on words grouped directly into clauses. A phrase creator module 1602 detects the boundaries of every phrase. Then a phrase connection detection module 1604 detects connections between phrases, relying on a semantic n-gram database 1603. A phrase type classification module 1606 denotes types for each phrase in every clause in a sentence. Then a word component classification module 1608 assigns components to each word within all of the phrases. Finally, a word connection detection module 1610 detects connections between each word within a phrase.
The accommodator 203D output 1612 consists of words grouped into phrases, which are further grouped into clauses. The phrases and words have proper connections and types assigned. In an embodiment the accommodator 203D output is the final output of the LD system 203, or LD output API 205 as shown in
An advantage of the accommodator over prior modules that perform analogous tasks is that the accommodator 203D determines connections between phrases and their grammatical functions in a way that is similar to the way in which a human processes language. The accommodator 203D sequentially considers separate scraps and chooses, from a set of syntactic possibilities, the one that semantically makes the most sense. In an embodiment this is achieved in part through using a database 1603 containing n-grams (representing these syntactic possibilities) along with the quantity of their occurrences in a large corpus. Ambiguities in interpretation are always reduced to a set of syntactic possibilities consisting of a few elements and are then resolved on a semantic basis. The result is simple and intuitive for a human to understand. Thus, a human can readily see and understand the decisions the LD 203 makes and, when appropriate, correct its mistakes by modifying rules so that the system is continually improved.
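The semantic selection step can be sketched as follows; the n-grams and their counts are invented stand-ins for the semantic n-gram database 1603:

```python
# Sketch of choosing among a small set of syntactic attachment
# possibilities by corpus frequency. Counts are invented for illustration.
NGRAM_COUNTS = {
    ("eat", "with", "fork"): 120,       # verb attachment, seen often
    ("pizza", "with", "fork"): 2,       # noun attachment, rarely seen
    ("eat", "with", "anchovies"): 3,
    ("pizza", "with", "anchovies"): 95,
}

def best_attachment(candidates):
    """candidates: one n-gram tuple per syntactic possibility; return
    the possibility seen most often in the corpus."""
    return max(candidates, key=lambda ng: NGRAM_COUNTS.get(ng, 0))

# "John eats pizza with a fork": does "with a fork" modify the verb
# ("eat") or the noun ("pizza")? The ambiguity is reduced to two
# syntactic possibilities and resolved on a semantic (frequency) basis.
choice = best_attachment([("eat", "with", "fork"),
                          ("pizza", "with", "fork")])
print(choice)
```

Because each ambiguity is reduced to a handful of candidates before the frequency lookup, a human can inspect exactly which alternatives were considered and why one won.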
The accommodator employs an unambiguous criterion for determining connections between phrases, which makes its performance reliable and its output easily understood by humans. This allows the output of the LD 203 to be more readily incorporated into other existing language applications and their respective rules. As an example, to determine the connection between certain prepositional phrases, the accommodator employs the so-called "it" test, which is illustrated in
Referring to
The next operation is word categorization. The word “construction” gets a specifier component because it is a noun specifying a phrase core. Referring to
Methodology of Improving the Accommodator and Resolving Exceptions
At 1502 an exception indication is received within the system 202. It is determined if the phrase is built correctly (1504). If not, it is determined if the problem results from the phrase creator (1506). If so, the rules for creating phrases are adjusted (1508). If it was determined that the problem of a wrongly built phrase does not result from the phrase creator (1506), it is determined if the problem results from scraps (1510). If so, the scrapper engine is repaired (1514). After the scrapper engine was repaired (1514) or the rules for creating phrases were adjusted (1508), or if it was determined that the problem does not result from scraps (1510) or that the phrase was built correctly (1504), it is determined if the connections of the phrases are correct (1512). If not, the connection rules are changed or the semantic n-gram database is edited (1518). After these changes (1518), or if the connections between phrases are determined to be correct (1512), it is determined if the type of the phrase is correct (1520). If so, the process ends. Otherwise the rules for determining the type of phrase are adjusted (1522).
The above methodology of resolving exceptions to the system may become an automated process.
Methodology for Improving Language Decoder and Resolving Exceptions
At 2402 an exception indication is received within the system XX. It is determined if the input text is correctly tokenized (2404). If not, rules in the tokenizer are added or changed (2408).
After the changes (2408), or if the text was correctly tokenized, it is determined whether the text is correctly divided into sentences (2406). If not, rules in the sentence divider are added or changed (2412).
After the changes (2412), or if the text was correctly divided into sentences, it is determined whether the word tagger worked correctly (2410). If not, the word tagger exception resolving process is started (600).
After the changes (600), or if the word tagger worked correctly, it is determined whether the predicate finder worked correctly (2414). If not, rules in the predicate finder are added or changed (2420).
After the changes (2420), or if the predicate finder worked correctly, it is determined whether the clause divider worked correctly (2418). If not, the clause divider exception resolving process is started (800).
After the changes (800), or if the clause divider worked correctly, it is determined whether the accommodator worked correctly (2432). If not, the accommodator exception resolving process is started (1500).
The above methodology of resolving exceptions to the system may become an automated process.
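The ordered checks above can be sketched as follows; the check and repair callables are placeholders, and only the stage order follows the text:

```python
# Sketch of the LD-level resolution flow: each pipeline stage is
# verified in order, and only a failing stage is repaired.
STAGES = ("tokenizer", "sentence divider", "word tagger",
          "predicate finder", "clause divider", "accommodator")

def resolve_ld_exception(checks, repairs):
    """checks/repairs: dicts mapping a stage name to a callable."""
    for stage in STAGES:
        if not checks[stage]():
            repairs[stage]()

# Placeholder example: pretend only the clause divider misbehaved.
log = []
checks = {s: (lambda s=s: s != "clause divider") for s in STAGES}
repairs = {s: (lambda s=s: log.append(s)) for s in STAGES}
resolve_ld_exception(checks, repairs)
print(log)  # only the failing stage is repaired
```

Walking the stages in pipeline order matters: an error introduced by an early stage (e.g. tokenization) would otherwise be misattributed to a later one.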
LD Output as a Three-Level Framework of Information Interface
The output of the LD module 203 is a three-level framework for storing information (words, phrases and clauses). The three-level structure enables efficient organization of information. Embodiments of the three-level LD output framework capture a maximum amount of the information coded into text while maintaining the simplest possible structure for easy access. The first step was to design an efficient framework for storing information, and then to use only algorithms written by humans (no machine learning techniques and no corpus) to decode the information from any given text. The output of the LD thus conveys the logical structure of information stored in text, along with the grammatical structure.
The three-level LD output structure is compositional and predictable and consists of a relatively small number of elements on each level. Having a minimal set of components and relations speeds up the learning process, simplifies writing extraction rules, and can reduce the chances of miscategorization. The output of the LD is effectively an easy-to-access interface standardizing the information coded into natural language.
Description of the Three-Level Framework
A three-level structure consisting of a word level, a phrase level and a clause level stores the information coded in text. The LD 203 output is a three-level structure, invariant with respect to the text processed through the system. The text could be short, simple sentences written in proper grammar, short twitter content with improper grammar, or long reviews with improper grammar. This contrasts with prior systems, in which the parameters are responsible for attributes and structure at the same time, which can produce varying results depending on the context of some fragment of text.
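A minimal sketch of the three-level structure as data types; the field names and types are illustrative, not the actual LD schema:

```python
# Hypothetical word/phrase/clause containers mirroring the three levels.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    tag: str              # from the word tagger's tagset
    component: str = ""   # role within its phrase (e.g. core, specifier)

@dataclass
class Phrase:
    words: List[Word]
    type: str             # e.g. subject, object, complement, attribute
    connection: int = -1  # index of a superior phrase, if any

@dataclass
class Clause:
    phrases: List[Phrase]
    type: str             # e.g. main, object clause, attribute clause
    tense: str = ""       # additional clause-level layer

sentence = [Clause(
    phrases=[Phrase([Word("John", "noun")], "subject"),
             Phrase([Word("likes", "verb")], "predicate"),
             Phrase([Word("math", "noun")], "object")],
    type="main", tense="present simple")]
print(sentence[0].phrases[2].words[0].text)
```

Whatever the input text looks like, the containers stay the same; only their contents vary, which is what makes rules written against this structure portable across domains.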
Consider an example from the BACKGROUND section:
-
- 1. John likes math.
- 2. John likes to learn.
- 3. John likes learning math in the evening.
When using the LD parser, in each case one gets a consistent notation for the object which John likes. In the first example, “math” is an object; in the second example, “to learn” is an object clause (its role is the same, but on a different level); and in the third example, “learning math in the evening” is also an object clause. This approach allows the grammatical layer to be separated from the logical layer, so that a single rule can cover many different syntactic structures of a sentence.
As a result, information extraction rules written on top of the LD are efficient: fewer rules need to be written to capture more information, and the information is less ambiguous.
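As an informal illustration (not the LD's actual output format), the following Python sketch models how the liked object can be addressed uniformly whether it is filled by a phrase or by a subordinate clause; all field names here are assumptions:

```python
# Illustrative model of three-level LD output (hypothetical field names).
def find_liked(decoded_sentence):
    """Return the text filling the 'object' role of the main clause."""
    main = decoded_sentence["clauses"][0]
    for part in main["parts"]:
        if part["role"] == "object":
            return part["text"]
    return None

# 1. "John likes math."     -> the object is a phrase
s1 = {"clauses": [{"parts": [
    {"role": "subject", "text": "John", "level": "phrase"},
    {"role": "predicate", "text": "likes", "level": "phrase"},
    {"role": "object", "text": "math", "level": "phrase"},
]}]}

# 2. "John likes to learn." -> the object is a clause, same role name
s2 = {"clauses": [{"parts": [
    {"role": "subject", "text": "John", "level": "phrase"},
    {"role": "predicate", "text": "likes", "level": "phrase"},
    {"role": "object", "text": "to learn", "level": "clause"},
]}]}

print(find_liked(s1))  # math
print(find_liked(s2))  # to learn
```

A single extraction rule (here, one function) covers both syntactic variants because the role label, not the grammatical form, identifies the object.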
In an embodiment, a clause is a group of phrases that form a single minimal statement. Its internal meaning can be analyzed separately (and this is done by the LD on the phrase level), but only its context in combination with other clauses can lead to understanding the text and its meaning. This is because the original, literal meaning of a clause is often significantly altered by its relation to other clauses. The LD provides such information about the relations between clauses, as well as their functions in the whole utterance.
The three-level framework consists of elements on each level and allows for the addition of layers storing new kinds of information. This three-level structure allows the decoded information to be integrated with semantic databases (e.g., Freebase™, WordNet™), ontologies, and taxonomies in order to add additional semantic information into existing components on the phrase and/or word level.
The clause level has an additional layer of information about the tense (e.g., present simple, past simple) and construction (e.g., positive, negative, question). In an embodiment, different layers of information are kept separate for several reasons. Layers of abstraction can be easily formed by combining only the relevant types of information and ignoring the others (e.g., if one needs only information about phrase division and not phrase types). Thanks to the separation of layers, it is possible to add other, new layers (e.g., coreference) in addition to the existing ones without distorting the information already present.
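A minimal Python sketch of this layer separation, using hypothetical layer names, might look as follows:

```python
# Hypothetical clause with annotations kept in separate, independent layers.
clause = {
    "phrases": [{"type": "subject", "text": "he"},
                {"type": "predicate", "text": "did not go"}],
    "layers": {"tense": "past simple", "construction": "negative"},
}

def add_layer(clause, name, value):
    """Attach a new annotation layer; existing layers stay untouched."""
    if name in clause["layers"]:
        raise ValueError("layer already present")
    clause["layers"][name] = value

# A new layer (e.g. coreference) is added without distorting existing ones.
add_layer(clause, "coreference", {"he": "John"})
print(sorted(clause["layers"]))  # ['construction', 'coreference', 'tense']
```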
The three-level framework can be seen and treated as an interface for information extraction from text. The information can be coded into different layers of the text. For example, in “He owns that red car,” the information about the car is described by “red” on the word level. In “The boy threw that ball,” the information about which object was thrown is coded on the phrase level. In “It started to rain after we got home,” the circumstances are stored on the clause level. It is also possible to include higher levels on which information can be stored; e.g., causation can be seen as a pattern of clauses connected with certain nodes (“[If] you eat too much chocolate [then] your stomach will hurt”), so the information is coded at a level higher than the clause. Likewise, it is possible to include yet higher levels, building patterns on top of other patterns. The three-level framework of the LD output reflects this natural division of information into separate levels and provides a foundation for creating higher-level patterns to capture information stored in multiple sentences.
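For instance, a higher-level causation pattern over clauses could be sketched in Python as follows (the clause structure and node names are illustrative assumptions):

```python
# Hypothetical higher-level pattern: causation as a pair of clauses
# joined by "if"/"then" connecting nodes.
def find_causation(clauses):
    """Return (cause, effect) texts if an if/then clause pair is found."""
    for i, clause in enumerate(clauses[:-1]):
        if clause.get("node") == "if" and clauses[i + 1].get("node") == "then":
            return (clause["text"], clauses[i + 1]["text"])
    return None

clauses = [
    {"node": "if", "text": "you eat too much chocolate"},
    {"node": "then", "text": "your stomach will hurt"},
]
print(find_causation(clauses))
```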
Language Decoder Query Language
1. General Description and Purpose
LDQL 204 is a declarative, domain-specific query language for extracting information from the outputs of text-structuring NLP systems. It is based on first-order predicate calculus and enables users to predefine their own formulas (including recursive definitions).
LDQL 204 queries can express a wide range of actions, from simply listing all subjects to actions as complicated as, e.g., finding opinions about people.
LDQL 204 queries have an SQL-like syntax, which is relatively easy for a human operator to write and read, and does not require linguistic knowledge. The queries can also be created automatically, as described below.
Our implementation of LDQL 204 is optimized for the Language Decoder; however, LDQL 204 can also be implemented for other text-structuring NLP systems, e.g., the Stanford Parser or Illinois SRL.
It is possible (e.g. with the use of LD 203 and LDQL 204) to formulate queries in natural language, and then translate them to [high-level] LDQL 204 queries.
2. LDQL in Comparison to Other Query Languages
There are some prior-art query languages that have been used for NLP (e.g., Prolog, SPARQL, IQL). Compared with them, LDQL 204 has some unique features:
-
- it was designed for querying NLP parsers' output: its types and relations reflect the 3-level information structure, regardless of a particular parser's output structure,
- it is based on full first-order predicate calculus,
- users can define their own formulas (including recursive ones) within the language, and use them in queries, or other definitions.
LDQL 204 queries have an SQL-like syntax: a SELECT section containing the goals of extraction, an optional FROM section for fixing the search range (search by clauses, sentences, or the whole text at once), and an optional WHERE section containing restrictions on extracted objects in the form of (syntactically sugared) first-order formulas, where the variables range over the 3-level structure of words, phrases and clauses (regardless of a particular parser's output structure).
E.g., to extract pairs of subject-predicate phrases from a given text, we could use the following LDQL 204 query:
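As an informal, hypothetical illustration of what such a query computes (not the actual LDQL syntax or LD output format), a Python sketch over an assumed 3-level structure might look like this:

```python
# Hypothetical evaluator mimicking a "SELECT subject, predicate" style
# query over an assumed 3-level text structure (illustrative field names).
def select_subject_predicate(text_structure):
    results = []
    for sentence in text_structure["sentences"]:
        for clause in sentence["clauses"]:
            subj = next((p["text"] for p in clause["phrases"]
                         if p["type"] == "subject"), None)
            pred = next((p["text"] for p in clause["phrases"]
                         if p["type"] == "predicate"), None)
            if subj is not None and pred is not None:
                results.append((subj, pred))
    return results

text = {"sentences": [{"clauses": [
    {"phrases": [{"type": "subject", "text": "John"},
                 {"type": "predicate", "text": "likes"},
                 {"type": "object", "text": "math"}]},
]}]}
print(select_subject_predicate(text))  # [('John', 'likes')]
```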
LDQL's orientation towards formula composition encourages users to build new formulas out of previously defined ones, which we consider crucial in dealing with natural language's complexity; for example, instead of the following pair of queries:
We would rather abstract the common pattern for finding the “subject→predicate←object” connection:
which makes the queries more “granular,” and thus easier to maintain (in analogy with subroutines in imperative programming languages).
As far as we are aware, this is the first use of a full first-order query language for unstructured-text information extraction. The presence of the EXISTS quantifier (in addition to the Boolean connectives AND, OR and NOT present in, e.g., SQL, Prolog or, in some flavor, SPARQL), together with the ability to write recursive formula definitions, makes LDQL 204 an expressively stronger formalism than, e.g., pure SQL, Prolog, or even SPARQL (which has some quantification constructs, but no recursive formula definitions). An example benefit of having such expressive power (also present in, e.g., Gödel, or, indirectly, Cypher, or any imperative language like Java) is the ability to describe closures of relations:
Suppose we would like to know whether two phrases are linked by a sequence of one or more “→” connections. In LDQL 204 we could simply write a recursive formula:
This literally reads: “x and y are linked if x→y, or x→z for some z such that z is linked with y”; notice the circularity (recursion). Because of LDQL's expressive strength, some measures need to be taken to avoid infinite evaluations (the “x→z AND NOT z→x” and “z!=x AND z!=y” restrictions).
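The behavior of such a recursive “linked” formula can be sketched in Python, where a visited set plays the role of the restrictions that prevent infinite evaluation (the encoding of “→” connections as a set of pairs is an assumption):

```python
# Sketch of the recursive "linked" relation: x and y are linked if x -> y,
# or x -> z for some intermediate z that is itself linked with y.
def linked(x, y, arrows, visited=None):
    visited = visited or set()
    if (x, y) in arrows:
        return True
    for (a, z) in arrows:
        # the visited-set guard prevents infinite recursion on cycles
        if a == x and z not in visited and z != y:
            if linked(z, y, arrows, visited | {x}):
                return True
    return False

arrows = {("a", "b"), ("b", "c"), ("c", "a")}  # note: contains a cycle
print(linked("a", "c", arrows))  # True, via a -> b -> c
print(linked("a", "d", arrows))  # False, and terminates despite the cycle
```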
3. Merging LDQL with External Systems
LDQL 204 could be implemented to operate on the output of any text-structuring NLP system (e.g., the Stanford Parser), provided that the implementation contains accessors to the attributes that the given system offers.
LDQL 204 can be connected to external semantic bases (e.g., Freebase, WordNet), lexical databases (e.g., WordNet), domain-specific databases, ontologies, word banks, taxonomies, etc., in order to supply it with more semantic content.
LDQL 204 rules can be hand-written; generated automatically, either by connecting LDQL 204 to sources of semantic content (as above) or by machine learning means (in the case of LD's output, it is possible to extract common structures out of annotated text corpora by means of unification heuristics or, e.g., Koza's genetic programming); or generated semi-automatically by merging any of the methods mentioned.
4. LDQL Implementation
The LDQL 204 implementation for LD 203 consists of three modules: parser, optimizer and compiler, as shown in the accompanying figure.
The parser module 3508 processes the text of the LDQL script 3502 (i.e., a list of definitions 3504 and queries 3506) in order to produce its abstract representation.
The optimizer module 3510 takes the parser's output and adds annotations to guide the compiler.
The compiler module 3512 takes the annotated abstract representation and generates an output program 3514 (in PHP, C, or some other imperative language) which, given valid NLP parser output 103, returns the query's results 3516 for that output.
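The parser→optimizer→compiler pipeline above can be sketched minimally in Python (with the compiled program modeled as a closure rather than generated PHP/C source; all names and the toy script format are illustrative assumptions):

```python
# Toy pipeline: parse an LDQL-like script, annotate it, compile it into
# a program that runs over NLP parser output (all structures hypothetical).
def parse(script_text):
    # toy "abstract representation": just the phrase type to select
    return {"select_type": script_text.split()[-1]}

def optimize(ast):
    ast["annotated"] = True  # stand-in for compiler-guiding annotations
    return ast

def compile_query(ast):
    wanted = ast["select_type"]
    def program(parser_output):
        return [p["text"] for p in parser_output["phrases"]
                if p["type"] == wanted]
    return program

program = compile_query(optimize(parse("SELECT subject")))
output = {"phrases": [{"type": "subject", "text": "John"},
                      {"type": "object", "text": "math"}]}
print(program(output))  # ['John']
```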
Using the output of the Language Decoder, it is possible to build language understanding applications or language understanding engines. This can be done by applying, on top of the LD, one or more of the following: LDQL rules, raw code (e.g., PHP, Java, C#, .NET, C++), or machine learning algorithms (supervised or unsupervised).
Usage of Language Decoder
The Language Decoder is a component technology for building applications in many areas, including (but not limited to):
-
- Information Extraction
- Sentiment Analysis and Extraction
- Event and Relationship Extraction
- Opinion Mining
- Text Mining
- Document indexing
- Text Summarization/curation of information
- Speech processing
- Question&Answering
- Text proofing
- Translations
- Natural Language Search/Structured Search
- Query expansion
- Automated scoring (essay scoring)
The Language Decoder may be used to process any type of text, including (but not limited to):
-
- User generated content,
- Social media content, microblogging (e.g. twitter)
- Reviews (e.g. review sites content, websites, wikipedia, etc.)
- Formal language documents
- Articles and news
- Biomedical free text
- Research papers
Systems built on top of the Language Decoder can serve many domains, including (but not limited to):
-
- Big data
- Customer feedback analytics
- Social listening/sentiment analytics
- Email analysis
- Text Analytics
- Search
- Voice Search
- Advertising
- Predictive Analytics
- Google Glass™ and other augmented-reality devices
- Voice interface
Information Extraction Example: Knowledge Browser
Bubble visualization is a concept for showing a multi-level structure of knowledge representation (extracted by LDQL) in the form of clickable bubbles.
This representation can vary the size, color and position of bubbles to represent frequency and additional properties (for example, the color palette can represent the polarity of detected object sentiments).
This multilevel approach allows browsing (e.g., zooming in or out) through the summarized knowledge encoded in the analysed text, which can help in understanding the wider context along with the more exact properties of the extracted data.
This concept does not restrict the data to a specific format or order; pairs (object, opinion), triplets, quadruples (suggestion, aim, suggestion modifier, aim modifier) and many more are suitable here. Additional categorization can also be applied on top of the extracted data to improve knowledge integrity and understanding. This can be achieved in multiple ways, from applying external lexicons or semantic databases (e.g., Freebase, FrameNet), through human-made categorization, to logical correlations.
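A minimal Python sketch of deriving bubble properties from extracted (object, opinion) pairs, with size taken from frequency and color from sentiment polarity (the exact mapping is an assumption):

```python
# Hypothetical mapping from extracted (object, opinion) pairs to bubble
# visual properties: size reflects frequency, color reflects polarity.
from collections import Counter

pairs = [("battery", "great"), ("battery", "poor"), ("screen", "sharp"),
         ("battery", "great")]
freq = Counter(obj for obj, _ in pairs)

def bubble(obj, polarity):
    """Return illustrative visual properties for one bubble."""
    return {"label": obj, "size": 10 * freq[obj],
            "color": "green" if polarity > 0 else "red"}

print(bubble("battery", +1))  # {'label': 'battery', 'size': 30, 'color': 'green'}
```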
The bubble visualization concept is used in the information extraction example application described below.
With reference to
With reference to
With reference to
Referring to
Referring to
Referring to
Bubble size and bubble color are just examples of visual properties that can be used to convey the desired information. Any other visual characteristic that can be varied in a similar manner would be just as appropriate.
The various functions or processes disclosed herein may be described as data and/or instructions embodied in various computer-readable media, in terms of their behavioral, register transfer, logic component, transistor, layout geometries, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the internet and/or other computer networks via one or more data transfer protocols (e.g., HTTP, FTP, SMTP, etc.). When received within a computer system via one or more computer-readable media, such data and/or instruction-based expressions of components and/or processes under the system described may be processed by a processing entity (e.g., one or more processors within the computer system) in conjunction with execution of one or more other computer programs.
Aspects of the systems and methods described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits (ASICs). Some other possibilities for implementing aspects of the system include: microcontrollers with memory (such as electronically erasable programmable read only memory (EEPROM)), embedded microprocessors, firmware, software, etc. Furthermore, aspects of the system may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies like complementary metal-oxide semiconductor (CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense: that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
The above description of illustrated embodiments of the systems and methods is not intended to be exhaustive or to limit the systems and methods to the precise forms disclosed. While specific embodiments of, and examples for, the systems components and methods are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the systems, components and methods, as those skilled in the relevant art will recognize. The teachings of the systems and methods provided herein can be applied to other processing systems and methods, not only for the systems and methods described above.
The elements and acts of the various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the systems and methods in light of the above detailed description.
In general, in the following claims, the terms used should not be construed to limit the systems and methods to the specific embodiments disclosed in the specification and the claims, but should be construed to include all processing systems that operate under the claims. Accordingly, the systems and methods are not limited by the disclosure, but instead the scope of the systems and methods is to be determined entirely by the claims.
While certain aspects of the systems and methods are presented below in certain claim forms, the inventors contemplate the various aspects of the systems and methods in any number of claim forms. For example, while only one aspect of the systems and methods may be recited as embodied in machine-readable medium, other aspects may likewise be embodied in machine-readable medium. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the systems and methods.
Claims
1. (canceled)
2. A system for natural language processing (NLP) comprising:
- at least one processor configured to execute a process comprising extracting and structuring information from output of an NLP parser, the process further comprising,
- receiving input comprising output from the NLP parser;
- receiving a description of conditions characterizing fragments of the NLP parser output, wherein the fragments correspond to information to be extracted; and
- receiving a description of a form in which the extracted information is to be structured, wherein the description of conditions and the description of a form are provided as one or more queries, wherein each query comprises, a target section describing the form in which the extracted information is to be structured; and a conditions section describing conditions characterizing the fragments of the NLP parser output to be extracted.
3. The system of claim 2, wherein:
- the conditions section comprises one or more atomic formulas connected with zero or more logical connectives, and wherein a set of possible atomic formulas comprises: relation of identity; relation of membership; relation of inclusion; relations capturing properties of character strings; relations capturing the structure of NLP parser output; and
- the set of possible logical connectives comprises, negation; conjunction; and alternative.
4. The system of claim 3, wherein:
- the description of conditions and the description of a form are provided in a form of one or more queries, and zero or more user-defined predicate definitions;
- a conditions section of each query, and each user-defined predicate definition comprises a first order logic formula with variables ranging over fragments of the NLP parser output; and
- the set of possible atomic formulas comprises, first order predicate of identity; first order predicate of membership; first order predicate of inclusion; first order predicates capturing properties of character strings; first order predicates capturing the structure of NLP parser output; and each of the user-defined predicates; and
- each group of two or more queries may comprise a same user-defined predicate.
5. The system of claim 2, further comprising a compiler, wherein the process further comprises converting description of conditions and the description of a form into a program, wherein input of the program comprises the NLP parser output.
6. The system of claim 5, wherein:
- the program comprises a sequence of one or more operations to be performed on the NLP parser output in order to extract and structure the information; and
- the compiler comprises, a parser for converting the description of conditions and the description of a form, into an abstract syntax tree; a proper compiler, for converting the abstract syntax tree into the sequence of the operations to be performed on the NLP parser output.
7. The system of claim 6, wherein the compiler further comprises an optimizer configured to modify the abstract syntax tree and the sequence of the operations.
8. The system of claim 6, wherein the proper compiler produces a standalone executable.
9. The system of claim 6, wherein the system further comprises a virtual machine, and wherein input of the virtual machine comprises the NLP parser output.
10. The system of claim 9, wherein input of the virtual machine further comprises the sequence of the operations to be performed on the NLP parser output.
11. The system of claim 10, wherein the proper compiler produces the bytecode for the virtual machine.
12. A computer-implemented method for natural language queries of textual data, the method comprising:
- at least one processor executing a process comprising extracting and structuring information from output of an NLP parser, the process further comprising, receiving input comprising output from the NLP parser; receiving a description of conditions characterizing fragments of the NLP parser output, wherein the fragments correspond to information to be extracted; and receiving a description of a form in which the extracted information is to be structured, wherein the description of conditions and the description of a form are provided as one or more queries, wherein each query comprises, a target section describing the form in which the extracted information is to be structured; and a conditions section describing conditions characterizing the fragments of the NLP parser output to be extracted.
13. The method of claim 12, wherein:
- the conditions section comprises one or more atomic formulas connected with zero or more logical connectives, and wherein a set of possible atomic formulas comprises: relation of identity; relation of membership; relation of inclusion; relations capturing properties of character strings; relations capturing the structure of NLP parser output; and wherein the set of possible logical connectives comprises, negation; conjunction; and alternative.
14. The method of claim 13, wherein:
- the description of conditions and the description of a form are provided in a form of one or more queries, and zero or more user-defined predicate definitions;
- a conditions section of each query, and each user-defined predicate definition comprises a first order logic formula with variables ranging over fragments of the NLP parser output; and
- the set of possible atomic formulas comprises, first order predicate of identity; first order predicate of membership; first order predicate of inclusion; first order predicates capturing properties of character strings; first order predicates capturing the structure of NLP parser output; and each of the user-defined predicates; and
- each group of two or more queries may comprise a same user-defined predicate.
15. The method of claim 12, further comprising a compiler, wherein the process further comprises converting description of conditions and the description of a form into a program, wherein input of the program comprises the NLP parser output.
16. The method of claim 15, wherein:
- the program comprises a sequence of one or more operations to be performed on the NLP parser output in order to extract and structure the information; and
- the compiler comprises, a parser for converting the description of conditions and the description of a form into an abstract syntax tree; a proper compiler, for converting the abstract syntax tree into the sequence of the operations to be performed on the NLP parser output.
17. The method of claim 16, wherein the compiler further comprises an optimizer configured to modify the abstract syntax tree and the sequence of the operations.
18. The method of claim 16, wherein the proper compiler produces a standalone executable.
19. The method of claim 16, wherein the system further comprises a virtual machine, and wherein input of the virtual machine comprises the NLP parser output.
20. The method of claim 19, wherein input of the virtual machine further comprises the sequence of the operations to be performed on the NLP parser output.
21. The method of claim 20, wherein the proper compiler produces the bytecode for the virtual machine.
Type: Application
Filed: Sep 4, 2015
Publication Date: Mar 3, 2016
Applicant: Fido Labs Inc. (Palo Alto, CA)
Inventors: Michal Wroczynski (Gdynia), Tomasz Krupa (Sopot), Gniewosz Leliwa (Gdansk), Piotr Wiacek (Gdansk), Michal Stanczyk (Gdansk)
Application Number: 14/845,508