SYSTEMS AND METHODS FOR IDENTIFYING CLAIMS IN ELECTRONIC TEXT
A system, method, and computer program for identifying claims associated with electronic text are provided. In an approach, electronic text is accessed. Linguistic content associated with the electronic text is identified. A linguistic structure is generated based on the linguistic content identified. The linguistic structure is compared to a claim template. A claim is identified based on the comparison.
The present application claims the benefit under 35 U.S.C. 120 as a continuation of application Ser. No. 12/006,716, filed Jan. 4, 2008, which claims the benefit of provisional application 60/878,880, filed Jan. 5, 2007, the entire contents of which are hereby incorporated by reference for all purposes as if fully set forth herein.
BACKGROUND
1. Field of the Invention
The present invention relates generally to natural language processing, and more particularly to systems and methods for identifying claims in electronic text.
2. Background Art
Conventionally, natural language processing systems may be utilized to process electronic text. Natural language processing may identify, for example, various file formats, character encoding schemes, parts-of-speech tagging, syntactic parsing, and so forth. Reasons for processing the electronic text range from storing and retrieving information to evaluating the electronic text to create and manage taxonomies.
Vast numbers of claims are made every year by numerous organizations, companies and individuals. For example, millions of product and service claims are made every year by many thousands of companies marketing through various communications channels. Governments, agencies and politicians regularly communicate claims to the general public and voters. Increasingly, marketing, advertising, communications and messaging are conducted through electronic channels, including traditional channels such as television and radio, and emerging electronic channels, such as the Internet, as well as cell phones and other handheld or wireless communications devices.
Claims of various kinds are of interest to millions of shoppers, marketing and purchasing professionals, public relations and communications professionals, business and political strategists, various organizations, the general public, and regulators, such as the FTC, FDA and SEC. For example, product and service claims may be used by shoppers to find suitable products and make buying decisions, by marketers to assess competitive offerings and position product offerings, by purchasing agents to support purchasing decisions and contracts, and by regulators to find and stop deceptive advertising and marketing practices. Political claims may be used by political strategists and candidates to characterize opponents and stake out positions on issues. Other kinds of claims are useful to various audiences. Claims may be located and analyzed via manual review or search engines. Unfortunately, manual review and analysis of the electronic text is typically time consuming and inconsistent, and search engines often produce voluminous results.
SUMMARY OF THE INVENTION
A system, method, and computer program for identifying claims associated with electronic text are provided. Electronic text is accessed. Linguistic content associated with the electronic text is identified. A linguistic structure is generated based on the linguistic content identified. The linguistic structure is compared to a claim template. A claim is identified based on the comparison.
Referring now to
The claims engine 104 accesses various electronic text via the network 106 or otherwise, and ultimately identifies claims from the electronic text (sources of electronic text are discussed further in connection with
Within product and service claims, many variants exist, including but not limited to: performance claims, energy efficiency claims, financial claims, and environmental or “green” claims, i.e. claims that a product or service is made and acts in a neutral or positive manner with respect to the earth and natural systems. Some claims have legal meanings and significance. For example, specific types of claims may comprise “health benefit claims”, i.e. claims to treat, cure, mitigate, prevent or diagnose disease; “structure/function claims”, i.e. claims to maintain normal function of the body; “safety claims”, i.e. claims that a product or service is safe to use; “efficacy claims”, i.e. claims that a product or service is effective for its intended use; and “nutrient content and health claims”, i.e. claims that a food provides a health or nutritional benefit.
The network 106 may comprise the Internet, a local area network, a peer to peer network, and so forth. Alternatively, the claims engine 104 may access the electronic text locally, such as from a storage medium located on the same computer on which the claims engine 104 resides.
The users 102 may access the claims engine 104 directly or via the network 106, as discussed herein. The claims engine 104 may communicate with claims storage 108. The users 102 may also communicate with claims storage 108 directly or via a network 106. The claims storage 108 may comprise more than one storage medium, claims databases, and so forth, according to exemplary embodiments. The users 102 may perform a search of the claims storage 108 via the claims engine 104, such as by querying the claims engine 104 for specific claims. Alternatively, the users 102 may perform a search of the claims storage 108 directly. The query may be based on attributes associated with the claims, for example. The claims identified by the claims engine 104 may be analyzed and presented to the users 102 in any manner. Analysis and presentation of the claims is discussed further in connection with
According to some embodiments, data mining and rules-based algorithms may be employed. A plurality of external standards may be incorporated as a reference for assessment or evaluation of the claims and associated attribute values, discussed herein. Queries comprising a plurality of attribute values, and a succession of such queries, may be used to target and narrow results. Results may be displayed on a plurality of display devices.
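A minimal sketch of how a succession of attribute-based queries might narrow a result set is given below. The record fields (claim_type, product, source), the query helper, and the sample records are illustrative assumptions and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimRecord:
    """Hypothetical stored claim with a few illustrative attributes."""
    text: str
    claim_type: str
    product: str
    source: str


def query(records, **criteria):
    """Return the records whose attributes equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Toy contents standing in for the claims storage 108 (assumed data).
storage = [
    ClaimRecord("Goji Juice treats cancer", "health benefit", "Goji Juice", "blog"),
    ClaimRecord("Goji Juice is better than chemotherapy", "superiority", "Goji Juice", "website"),
    ClaimRecord("Brand X cuts energy bills", "energy efficiency", "Brand X", "website"),
]

# A succession of queries narrows the result set step by step.
by_product = query(storage, product="Goji Juice")
narrowed = query(by_product, source="website")
```

Each call filters the output of the previous one, so a user could start from a broad attribute and add criteria until the result set is small enough to review.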
The electronic text may originate from various information sources 202 and may derive from various kinds of documents or media, such as press releases, product labels, product information and specification sheets, marketing literature, advertisements, articles, blogs, movies, songs or other forms of unstructured text. Further, the electronic text may be received through a plurality of media channels, such as the Internet, television, radio, RSS feeds, newspapers, magazines, blogs, email, and so forth, or physically received on a storage medium such as a compact disc. The volume of electronic text analyzed may be small, as in a single document, or very large, such as the Internet. The analysis may be an isolated instance, or be continuous or recurring. Various and multiple natural language and linguistic processing techniques, such as information extraction, may be utilized by the claims engine 104. Various and multiple key attributes associated with the claims identified by the claims engine 104 may also be extracted.
The claims engine 104 may identify claims from the electronic text, and may store the claims in claims storage 108, in a database associated with the claims storage 108 discussed in
A user application 204 may communicate with the claims storage 108 and/or with the claims engine 104. Various user applications 204 may be in communication with the claims storage 108. Each application may utilize the claims stored in the claims storage 108. The user applications 204 may include, but are not limited to: a shopping application, which may utilize the claims to direct potential purchasers to suitable products or services, or push products and services to targeted potential buyers; an advertising application, which may utilize the claims to direct an advertisement to targeted recipients; a buyer-seller matching application, which may utilize the claims to match appropriate buyers and sellers in ecommerce or electronic auctions; a compliance application, which may utilize the claims to find and identify noncompliant or illegal advertising or marketing practices; and a procurement application, which may utilize the claims to identify products and services meeting procurement specifications. One or more users 102 may interact with the user application 204 in order to utilize the claims for various purposes discussed herein, such as to locate products, perform analysis, and so forth. In alternative embodiments, the users may access the user application 204 via the claims engine 104.
The claims engine 104 may also include a monitoring module 304. The monitoring module 304 accesses the electronic text, either via the network 106 or locally, such as by accessing the electronic text from the same device that comprises the monitoring module 304. The monitoring module 304 may search the network 106, storage, databases, the information sources 202 discussed in
According to some embodiments, the monitoring schedule may be different for different sources of electronic text. For example, websites that tend to include a large number of claims may be monitored more frequently. Conversely, sources of claims determined to be consistent with regulatory standards associated with claims may be monitored less frequently. The monitoring module 304 may utilize various information discovery tools including searching, sorting, organizing, and browsing by full text or fields (attributes).
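The per-source schedule described above could be sketched as follows. The one-week base interval, the threshold of 100 claims, and the function names are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta


def monitoring_interval(claim_count, compliant, base=timedelta(days=7)):
    """Pick a revisit interval for one source (illustrative rule only)."""
    if compliant:
        # Sources found consistent with regulatory standards are visited less often.
        return base * 4
    if claim_count >= 100:
        # Claim-heavy sources are visited more often.
        return base / 7
    return base


def next_visit(last_visit, claim_count, compliant):
    """Compute when the source should next be monitored."""
    return last_visit + monitoring_interval(claim_count, compliant)
```

For instance, next_visit(datetime(2024, 1, 1), 500, False) schedules a claim-heavy source for the following day, while a compliant source would wait four weeks under these assumed values.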
A linguistic structure module 306 identifies linguistic content associated with the electronic text. For example, identifying linguistic content may comprise identifying one or more linguistic features in the text and identifying one or more relationships between linguistic features. Identifying linguistic features may comprise, but is not limited to: tokenizing electronic text into segments of text, for example, identifying words, punctuation, white space, capital and lower case letters, sentences and so forth; identifying words for their part of speech, such as nouns, verbs, adjective, adverbs, and so forth; identifying groups of words, such as noun phrases, verb phrases, and prepositional phrases, and identifying entities, such as people, places, and things, such as in a health-related domain, identifying specific diseases. Other characteristics of the electronic text may be identified as linguistic features, and thus may be part of the linguistic content. Identifying relationships between linguistic features may comprise, but is not limited to: identifying subjects, verbs and objects of sentences and therefore identifying subject-verb-object relationships; identifying semantic roles of the sentence parts and therefore semantic relationships between the sentence parts. Other types of relationships identified may include, but are not limited to, grammatical, co-referring, semantic, and discourse relationships. A linguistic structure is generated based on the linguistic content identified by the linguistic structure module 306. A linguistic structure is any combination of two or more linguistic features, or a combination of at least one feature and one relationship. Linguistic structures may be stored in a linguistic database in the linguistic structure module, in the claims storage 108, or in any other storage. Linguistic structures may be indexed in the claims storage 108 or any other storage to enable rapid and efficient retrieval.
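One possible in-memory representation of features, relationships, and the linguistic structure built from them is sketched below. The class names and field names are hypothetical; only the validity rule (two or more features, or at least one feature and one relationship) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    """One linguistic feature: a span of text plus a label such as a part of speech."""
    text: str
    label: str


@dataclass(frozen=True)
class Relationship:
    """A labeled link between two features, e.g. verb -> object."""
    kind: str
    head: Feature
    dependent: Feature


@dataclass
class LinguisticStructure:
    features: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def is_valid(self):
        """Two or more features, or at least one feature and one relationship."""
        return len(self.features) >= 2 or (
            len(self.features) >= 1 and len(self.relationships) >= 1
        )


juice = Feature("Goji Juice", "noun phrase")
treats = Feature("treats", "verb")
cancer = Feature("cancer", "noun")

structure = LinguisticStructure(
    features=[juice, treats, cancer],
    relationships=[
        Relationship("subject", treats, juice),
        Relationship("object", treats, cancer),
    ],
)
```

Keeping features and relationships as separate lists makes it straightforward to index either one for retrieval, although an actual embodiment may store them quite differently.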
The linguistic structure module 306 may also comprise information extraction agents that extract other information associated with the claim from the electronic text. Such other information may include, but is not limited to people, places, things or times. For example, a company name or a person's name may be extracted, even though a company name or a person's name does not constitute a claim. Other information may be stored with the claims.
A template module 308 includes one or more claim templates, such as structured claim templates or statistical claim templates. Statistical claim templates may comprise statistical models associated with claims and/or text examples. For example, in looking for “health benefit claims” of interest to the FDA, an example structured claim template for “health benefit claim” may be defined as a set of criteria containing: 1. presence of a “change verb”, defined as any verb appearing in WordNet that is a hyponym of the verb “change” (an external reference), and 2. presence of a “disease term” defined as any term appearing as a node in the Diseases branch in the Medical Subject Headings tree (MeSH, also an external reference), and 3. presence of a predicate-object relationship between the “change verb” and the “disease term” in which the “disease term” is the syntactic object of the “change verb.” Statistical claim templates may include statistical models of linguistic features associated with claims and/or text examples. Various linguistic forms exist for claims. The template module 308 may generate or access a template for each form. Templates may be stored in a query database contained in the template module, in the claims storage 108, or in any other storage. The templates may be utilized to identify claims from the electronic text based on the linguistic content.
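The three criteria of the example “health benefit claim” template can be sketched as a check over relation triples. The two word sets below are hand-picked stand-ins that have not been checked against WordNet or MeSH, and the triple format (head, relation, dependent) is an assumption made for this sketch.

```python
# Toy lexicons standing in for the external references named above.
# A real system would derive them from WordNet and MeSH; these entries
# are assumptions for illustration only.
CHANGE_VERBS = {"treat", "cure", "reduce"}
DISEASE_TERMS = {"cancer", "diabetes", "arthritis"}


def matches_health_benefit_template(relations):
    """relations: iterable of (head, relation, dependent) triples of lemmas."""
    for head, relation, dependent in relations:
        if (
            head in CHANGE_VERBS            # criterion 1: a change verb is present
            and dependent in DISEASE_TERMS  # criterion 2: a disease term is present
            and relation == "object"        # criterion 3: the term is the verb's object
        ):
            return True
    return False
```

A structure such as [("treat", "subject", "goji juice"), ("treat", "object", "cancer")] satisfies all three criteria, whereas a structure in which the disease term is only the subject of the verb does not.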
The claims engine 104 may include an analysis module 310. The analysis module 310 may analyze the claims identified by the claims engine 104. The analysis may be utilized to recognize trends, patterns, outliers, correlations, relevance, similarity, and other attributes, such as whether a particular claim is likely to be false, misleading, or illegal. The analysis module may utilize external standards or information to evaluate claims, such as FDA-approved labels and product information sheets to identify an “off-label” claim.
The analysis module 310 may include data mining agents for automatically analyzing claims. The data mining agents may consist of a series of software algorithms that query the claims storage 108, retrieve lists or records, and perform some analysis (such as statistical analysis, pattern matching, or further NLP). Accordingly, the monitoring module 304 and the analysis module 310 may be utilized to provide a partially or completely automated claims analysis system, such as by performing an automated search and identification of the electronic text and the identification and analysis of the claims associated with the electronic text. The analysis module 310 may utilize the data mining agents to generate reports for further analysis, or the analysis module 310 may augment records of the claims by storing information to the claims storage 108.
A claims presentation module 312 may also be provided for modifying the claims. The claims may be modified in accordance with regulatory standards, such as regulations set forth by the Food and Drug Administration (“FDA”) or Federal Election Commission. Alternatively, the claims presentation module 312 may suggest modifications to the claims detected in the electronic media, for example, to a provider of the electronic media, such as a website sponsor that displays the electronic media including the claims. The monitoring module 304 may monitor the electronic media specifically for claims that should have been modified based on suggestions, for example, from a government body or other authority.
A claims presentation module 314 may display the claims to users, such as analysts, shoppers, procurement professionals, regulators, strategists, communications specialists, researchers, and so forth. The display may be in response to a query for types of claims, for example. Any type of presentation may be provided by the claims presentation module 314. According to some embodiments, claim alerts based on user-specified criteria can be set up and accessed via a web interface or pushed via email message. For example, a campaign manager may be alerted to the dissemination of a political claim made by an opponent.
The results to a query may be displayed in a variety of ways. For example, a results set may be presented as a list of claims meeting the query. For each claim, a summary of each of the claims and the associated information may be generated. There may be links from the summary to a cached copy of the text, in which claims are highlighted for easier examination. Highlighted claims are shown in
Although various modules are shown in
Example text 402 comprises the electronic text discussed herein. Although the electronic text may include more than the example text 402 shown in
The example text 402 is utilized to generate example linguistic structures 404. The example linguistic structures 404 may include linguistic features such as parts of speech and identified entities such as ingredients, products and diseases, and may include relationships between linguistic features such as subject-predicate-object relations, and so forth.
The example linguistic structures 404 are then compared to an example claim template, such as a structured claim template for an ability claim 406, to identify the claims. As discussed herein, the linguistic structures 404 may alternatively be statistically analyzed for a match with a statistical claim template based on a training set of text examples. Based on the comparison, the claim may be identified. For example, if the linguistic structures 404 match the ability claim template 406 in
In
The text example containing the text “Cancer can also be treated by drinking Goji Juice” is identified as a claim because the text has a subject, predicate and object, all in a subject-predicate-object relationship, and the subject contains a term identified as an ingredient, the predicate contains a term that is a “change verb”, i.e. a hyponym of the verb “change” in WordNet, and the object is a “disease term”, i.e. a word found as a node in the MeSH Disease branch. A claim may therefore be extracted from the electronic text and may be represented, as in
As discussed herein, various other claim templates may be utilized to identify claims, such as identity, attribute, attribution, and superiority claim templates. Examples of other templates include an identity claim template that may take the form “Goji Juice is a cancer-killer,” an attribute claim template that may take the form “Goji Juice has cancer-killing properties,” an attribution claim template that may take the form “4 out of 5 doctors recommend Goji Juice for cancer”, and a superiority claim template that may take the form “Goji Juice is better than chemotherapy for treating cancer.” As discussed herein, any claim templates, such as structured templates or statistical templates, may be employed for comparison with the linguistic structures 404 and identification of the claims.
As discussed herein, a claim may be identified from the comparison between the linguistic structures and the claim templates on the basis of a statistical probability rather than an exact match. For example, a claim may be identified because the criteria set forth in a statistical claim template have a 75% chance of being satisfied by one or more of the linguistic structures.
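A probability-based identification of this kind might be sketched as a simple threshold test. The function name and the input format are assumptions; the probabilities are presumed to be supplied by some statistical claim template, which is not modeled here.

```python
def identify_by_probability(scored_structures, threshold=0.75):
    """Keep structures whose template-match probability meets the threshold.

    scored_structures: iterable of (structure, probability) pairs, where the
    probability is assumed to come from a statistical claim template.
    """
    return [s for s, p in scored_structures if p >= threshold]
```

The default of 0.75 mirrors the 75% figure in the example above; in practice the threshold would be a tunable setting that trades missed claims against false identifications.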
As discussed herein, the claims identified based on the comparison between the linguistic structures and the claim templates may be stored in the claims storage 108 for later analysis or access. According to some embodiments, any user may access the claims for evidence, support, research, or for any other reason.
The claims identified along with other related information may be stored in the claims storage 108, which may then be searched, as shown in
At step 604, linguistic content associated with the electronic text is identified. For example, linguistic features may be identified by: breaking down or tokenizing the electronic text into words, sentences, punctuation, white spaces, and so forth; identifying words for their part of speech, such as nouns, verbs, adjectives, adverbs, and so forth; identifying word groups constituting phrases; and identifying entities, such as places, people, companies, times, products, and so forth.
Using semantic role labeling, or other techniques, the words, phrases, and/or entities may be labeled as sentence parts (e.g., subject, object, predicate), thereby identifying linguistic roles that the words, phrases, and/or entities play and the linguistic relationships between them, thus further identifying linguistic content. For example, the herbal medicine electronic media 502 shown in
According to some embodiments, natural language processing is utilized to process the electronic text, such as by converting file formats and character encoding schemes, part-of-speech tagging, syntactic parsing, information extraction, automated text categorization, word sense disambiguation, text segmentation, relationship mining, event detection, toponym resolution, and creation and management of taxonomies, lexicons, and knowledge bases.
At step 606, a linguistic structure based on the linguistic content identified is generated, such as the exemplary linguistic structure 404 shown in
At step 608, the linguistic structures are compared with a claim template. The claim template may comprise a structured claim template, such as the ability claim template discussed in
At step 610, a claim is identified based on the comparison. For example, the electronic text giving rise to linguistic structures that match the predetermined claim template, such as the structured claim template or the statistical claim template discussed herein, may be deemed claims. The electronic text giving rise to linguistic structures that meet pre-set threshold probability criteria set forth by a statistical template may also be identified as one or more claims.
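Steps 604 through 610 can be tied together in a deliberately naive sketch. The regular-expression tokenizer, the rule that everything after the first known verb is the object, and the two word sets are crude assumptions standing in for real part-of-speech tagging, parsing, and external lexicons.

```python
import re

CHANGE_VERBS = {"treats", "cures", "prevents"}   # assumed toy lexicon
DISEASE_TERMS = {"cancer", "diabetes"}            # assumed toy lexicon


def identify_content(sentence):
    """Step 604: tokenize into lower-cased word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def generate_structure(tokens):
    """Step 606: naive subject-verb-object split around the first known verb."""
    for i, token in enumerate(tokens):
        if token in CHANGE_VERBS:
            return {"subject": tokens[:i], "verb": token, "object": tokens[i + 1:]}
    return None


def matches_template(structure):
    """Step 608: compare the structure with a simple structured template."""
    return (
        structure is not None
        and bool(structure["subject"])
        and any(t in DISEASE_TERMS for t in structure["object"])
    )


def identify_claims(text):
    """Step 610: return the sentences whose structures match the template."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        s for s in sentences
        if matches_template(generate_structure(identify_content(s)))
    ]
```

Applied to the text "Goji Juice treats cancer. Our store opens at nine.", only the first sentence is returned, because the second contains no verb from the assumed lexicon.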
As discussed herein, the claims may also be stored for future access. The claims may be stored in the claims storage 108 discussed in
After the claims and/or other associated information, such as the additional information 506 shown in
The identified claims, as well as other information associated with the claims, may be presented using natural language generation tools, text summarization, or information visualization systems, for example. As discussed herein, any type of presentation of the claims identified by the claims engine 104 may be utilized.
According to some embodiments, the claims identified by the claims engine 104 may be automatically “red-flagged.” For example, instances in which claims have a high likelihood of violating the law may be red-flagged, e.g. supplement claims about diagnosis, cure, mitigation, treatment and prevention of disease. Further, the monitoring module 304 may automatically conduct “follow-up” monitoring to verify that offending companies associated with the detected claims have complied with FDA requests or orders for corrective action, such as modifications to the claims, as discussed herein.
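Automatic red-flagging of this kind could be sketched as a rule over stored claim records. The record keys (text, product_type) and the substring matching on word stems are assumptions made for illustration; substring matching is crude and would, for example, also fire on the word "secure".

```python
# Stems for the verbs the passage associates with legally sensitive supplement claims.
SENSITIVE_STEMS = ("diagnos", "cure", "mitigat", "treat", "prevent")


def red_flag(claims):
    """Return copies of the claims annotated with a boolean 'red_flag' field."""
    flagged = []
    for claim in claims:
        text = claim["text"].lower()
        is_flagged = claim.get("product_type") == "supplement" and any(
            stem in text for stem in SENSITIVE_STEMS
        )
        # Copy the record so the stored original is left unchanged.
        flagged.append({**claim, "red_flag": is_flagged})
    return flagged
```

The flag is stored alongside the claim rather than replacing it, so follow-up monitoring could later compare flagged records against newer versions of the same source.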
The processor 704 executes instructions. The memory 706 permanently or temporarily stores data. Some examples of the memory 706 are RAM and ROM. The storage 708 also permanently or temporarily stores data. Some examples of the storage 708 are hard disks and disk drives.
The embodiments discussed herein are illustrative. As these embodiments are described with reference to illustrations, various modifications or adaptations of the methods and/or specific structures described may become apparent to those skilled in the art.
The above-described components and functions can be comprised of instructions that are stored on a computer-readable storage medium. The instructions can be retrieved and executed by a processor. Some examples of instructions are software, program code, and firmware. Some examples of storage media are memory devices, tape, disks, integrated circuits, and servers. The instructions are operational when executed by the processor to direct the processor to operate in accord with the invention. Those skilled in the art are familiar with instructions, processor(s), and storage media.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. For example, any of the elements associated with the claims engine 104 may employ any of the desired functionality set forth hereinabove. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments.
Claims
1. A data processing method comprising:
- receiving a corpus of documents containing one or more product claim annotations;
- identifying, based on the one or more product claim annotations in the corpus of documents, one or more template linguistic structures likely to indicate presence of a product claim;
- accessing electronic text;
- identifying linguistic content associated with the electronic text, wherein the linguistic content includes a plurality of linguistic features;
- generating a linguistic structure based on the linguistic content identified, wherein the linguistic structure identifies at least a relationship between the plurality of linguistic features;
- identifying a particular product claim within the electronic text based on comparing the linguistic structure to the one or more template linguistic structures;
- wherein the method is performed by one or more computing devices.
2. The method of claim 1, wherein identifying the one or more template linguistic structures comprises using one or more machine learning techniques including: maximum entropy, support vector machine, neural network, nearest neighbor, hidden Markov model, conditional random fields, or maximum entropy Markov model.
3. The method of claim 1, wherein the one or more product claim annotations identify at least one or more boundaries of a claim within the corpus of documents and a type categorization for the claim.
4. The method of claim 1, wherein the one or more template linguistic structures specify one or more features including: lexical entities, grammatical relations, semantic meanings, or argumentative structure.
5. The method of claim 1, wherein identifying the claim within the electronic text involves tagging the claim within the electronic text with one or more annotations identifying the linguistic structure.
6. The method of claim 5, wherein the one or more product claim annotations specify features including: product name, product type, product benefit, object of the product benefit, or product user category.
7. The method of claim 1, further comprising:
- determining whether the particular product claim is misleading based on one or more factors including: absence of risk information, absence of required supporting references, non-compliance with a set of rules, presence of false benefits, incorrect side effect information, omissions of material facts, obfuscated risk disclosure, amount or type of evidence presented for the particular product claim, presence of anecdotal evidence, references to government agency evaluations, or references to historical or traditional use of a product;
- in response to determining that the particular product claim is misleading, storing data that flags the particular product claim as misleading.
8. The method of claim 1, further comprising: generating, based on the particular product claim, one or more product recommendations for a user.
9. The method of claim 8, wherein the one or more product recommendations include one or more of: a list of suitable products or services for the user, an advertisement to the user identifying one or more products, or data specifying one or more sellers which sell the one or more products.
10. The method of claim 8, wherein generating the one or more product recommendations is based on input supplied by the user.
11. The method of claim 1, further comprising: generating, based on the particular product claim, one or more of: a product description, a product review, or a comparison between one or more different products.
12. A computer-readable storage medium storing one or more instructions which, when executed by one or more processors, cause the one or more processors to perform:
- receiving a corpus of documents containing one or more product claim annotations;
- identifying, based on the one or more product claim annotations in the corpus of documents, one or more template linguistic structures likely to indicate presence of a product claim;
- accessing electronic text;
- identifying linguistic content associated with the electronic text, wherein the linguistic content includes a plurality of linguistic features;
- generating a linguistic structure based on the linguistic content identified, wherein the linguistic structure identifies at least a relationship between the plurality of linguistic features;
- identifying a particular product claim within the electronic text based on comparing the linguistic structure to the one or more template linguistic structures.
13. The computer-readable storage medium of claim 12, wherein the instructions for identifying the one or more template linguistic structures comprise instructions which when executed by the one or more processors cause using one or more machine learning techniques including: maximum entropy, support vector machine, neural network, nearest neighbor, hidden Markov model, conditional random fields, or maximum entropy Markov model.
14. The computer-readable storage medium of claim 12, wherein the one or more product claim annotations identify at least one or more boundaries of a claim within the corpus of documents and a type categorization for the claim.
15. The computer-readable storage medium of claim 12, wherein the one or more template linguistic structures specify one or more features including: lexical entities, grammatical relations, semantic meanings, or argumentative structure.
16. The computer-readable storage medium of claim 12, wherein the instructions which when executed cause identifying the claim within the electronic text comprise instructions which when executed by the one or more processors cause tagging the claim within the electronic text with one or more annotations identifying the linguistic structure.
17. The computer-readable storage medium of claim 16, the one or more product claim annotations specify features including: product name, product type, product benefit, object of the product benefit, or product user category.
18. The computer-readable storage medium of claim 12, comprising instructions which when executed by the one or more processors cause performing:
- determining whether the particular product claim is misleading based on one or more factors including: absence of risk information, absence of required supporting references, non-compliance with a set of rules, presence of false benefits, incorrect side effect information, omissions of material facts, obfuscated risk disclosure, amount or type of evidence presented for the particular product claim, presence of anecdotal evidence, references to government agency evaluations, or references to historical or traditional use of a product;
- in response to determining that the particular product claim is misleading, storing data that flags the particular product claim as misleading.
19. The computer-readable storage medium of claim 12, comprising instructions which when executed by the one or more processors cause performing generating, based on the particular product claim, one or more product recommendations for a user.
20. The computer-readable storage medium of claim 19, wherein the one or more product recommendations include one or more of: a list of suitable products or services for the user, an advertisement to the user identifying one or more products, or data specifying one or more sellers which sell the one or more products.
21. The computer-readable storage medium of claim 19, comprising instructions which when executed by the one or more processors cause generating the one or more product recommendations is based on input supplied by the user.
22. The computer-readable medium of claim 12, comprising instructions which when executed by the one or more processors cause generating, based on the particular product claim, one or more of: a product description, a product review, or a comparison between one or more different products.
23. A data processing method comprising:
- accessing electronic text;
- identifying linguistic content associated with the electronic text, wherein the linguistic content includes a plurality of linguistic features;
- generating a linguistic structure based on the linguistic content identified, wherein the linguistic structure identifies at least a relationship between the plurality of linguistic features;
- identifying a particular product claim within the electronic text based on comparing the linguistic structure to a claim template;
- generating, based on the particular product claim, one or more product recommendations for a user;
- wherein the method is performed by one or more computing devices.
24. The method of claim 23, wherein the one or more product recommendations include one or more of: a list of suitable products or services for the user, an advertisement to the user identifying one or more products, data specifying one or more sellers which sell the one or more products, a product description, a product review, or a comparison between one or more different products.
25. The method of claim 23, wherein generating the one or more product recommendations is based on input supplied by the user.
26. A computer-readable storage medium storing one or more instructions which, when executed by one or more processors, cause the one or more processors to perform:
- accessing electronic text;
- identifying linguistic content associated with the electronic text, wherein the linguistic content includes a plurality of linguistic features;
- generating a linguistic structure based on the linguistic content identified, wherein the linguistic structure identifies at least a relationship between the plurality of linguistic features;
- identifying a particular product claim within the electronic text based on comparing the linguistic structure to a claim template;
- generating, based on the particular product claim, one or more product recommendations for a user;
- wherein the method is performed by one or more computing devices.
27. The computer-readable medium of claim 26, wherein the one or more product recommendations include one or more of: a list of suitable products or services for the user, an advertisement to the user identifying one or more products, data specifying one or more sellers which sell the one or more products, a product description, a product review, or a comparison between one or more different products.
28. The computer-readable medium of claim 26, wherein generating the one or more product recommendations is based on input supplied by the user.
Type: Application
Filed: Mar 7, 2014
Publication Date: Jul 3, 2014
Applicant: Linguastat, Inc. (San Francisco, CA)
Inventors: JOHN HELLWIG (Pacifica, CA), JOHN M. PIERRE (Pacifica, CA), MARK H. BUTLER (Moraga, CA)
Application Number: 14/201,526
International Classification: G06F 17/24 (20060101);