SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR DATA MINING AND AUTOMATICALLY GENERATING HYPOTHESES FROM DATA REPOSITORIES
Various embodiments of the present invention provide systems, methods, and computer programs for generating a hypothesis. Specifically, some method embodiments include steps for accessing a system for extracting relationships and determining a relationship rule defining a relationship among a plurality of phrases and a plurality of concepts stored in the system for extracting relationships. Such embodiments further provide steps for parsing a plurality of documents in a data repository according to the relationship rule and generating a hypothesis comprising a previously unknown combination of phrases and concepts being at least partially determined from the parsed plurality of documents. Various embodiments also provide a step for presenting the hypothesis to a user so as to indicate the previously unknown combination.
This application is a continuation of co-pending International Application No. PCT/US2007/063983, filed Mar. 14, 2007, the contents of which are incorporated by reference in entirety, and which claims priority to U.S. Patent Application Ser. No. 60/782,935, filed Mar. 15, 2006.
FIELD OF THE INVENTION
Various embodiments of the present invention relate generally to the field of query generation, information retrieval, and data mining with respect to data repositories (such as literature and/or record databases, for example).
BACKGROUND
The large volume of scientific literature provides a goldmine for the extraction of useful knowledge and information in support of practical decision-making as well as academic research. However, many of the currently available search engines that query various data repositories offer very limited searching, indexing, and categorizing functionality and fall short of fully exploring and utilizing such data resources. As an example, the Medical Literature Analysis and Retrieval System Online (“MEDLINE”), the U.S. National Library of Medicine's (NLM) premier bibliographic database, contains citations and references for approximately 13 million journal articles in the life sciences, with a concentration on biomedicine. Each year, the exponentially increasing amount of biomedical literature in the MEDLINE database poses tremendous challenges to the ultimate users of the database, typically scientific researchers. To date, a small number of academic papers have proposed and discussed the idea of systematically generating hypotheses from biomedical literature in databases like MEDLINE so as to facilitate biomedical researchers' discoveries and possibly even suggest potential research directions. However, existing work in this area has focused on generating only one type of hypothesis, namely, “a potential pairwise relation”, which does not fully represent most patterns and rules embedded in the document corpus.
Furthermore, existing querying and/or discovery processes as discussed in these papers are usually conducted in a “retrieval mode” which necessarily implies that users must know what knowledge and information they need so that they can provide at least one concept of their search interest to initiate the discovery process. In many cases, however, users may not know how to express their knowledge and information needs or even may not realize and/or appreciate an existing information need. For instance, a given biomedical researcher may never be independently motivated to research a relation between a certain gene and a certain disease that as a matter of fact may be predicted from existing relationships within several recent publications. In addition, different types of users always have different knowledge and information needs based on their respective backgrounds and/or profiles, even if they issue the same query to the same database. For example, for a query of “Diabetics” to MEDLINE, a biomedical researcher may want to acquire some potential research directions for this disease, a medical practitioner may wish to keep current on state-of-the-art diagnosis progress, and a patient may want to ensure that the treatment plan prescribed by her physician is reasonable in light of current treatment options. In summary, each user brings different levels of expertise and different interests to a given query of a given database. Currently available query systems do not address this issue.
In light of the above, a need exists for an improved method, system and computer program product for automatically generating different types of hypotheses from data repositories. There is a further need for automatic analysis of a user's scope of interest and effective delivery of hypotheses, information and knowledge that match the user's interests and information needs.
BRIEF SUMMARY
The needs outlined above are met by the present invention which, in various embodiments, provides systems and methods that overcome many of the technical problems discussed above, as well as other technical problems, with regard to the generation and display of potential hypotheses based on written works selected from a database. Specifically, in one embodiment, the invention provides a method and computer program product for generating a hypothesis. In some embodiments, the method and/or computer program product may comprise accessing a system for extracting relationships, wherein the system for extracting relationships comprises a plurality of phrases and a plurality of concepts. In various embodiments, the system for extracting relationships may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a semantic database; a metathesaurus; and combinations of such systems.
The method and/or computer program product also comprises determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. In some embodiments, the determined relationship rule may include, but is not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept. Some embodiments may further comprise a step for storing the determined relationship rule for later or repeated use in a subsequent parsing step as described further herein.
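By way of non-limiting illustration only, the four rule types listed above might be held in a simple in-memory structure such as the following Python sketch. The class name `RelationshipRules`, its field names, and the sample entries are hypothetical and are introduced here solely for illustration; they are not taken from any particular system for extracting relationships.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipRules:
    """Hypothetical container for the four rule types described above."""

    # assignment of a phrase to a concept
    phrase_to_concept: dict[str, str] = field(default_factory=dict)
    # assignment of a phrase to a relationship identifier linking two concepts
    phrase_to_relation: dict[str, str] = field(default_factory=dict)
    # assignment of a concept to a semantic category
    concept_to_category: dict[str, str] = field(default_factory=dict)
    # hierarchical arrangement: child concept -> parent concept
    child_to_parent: dict[str, str] = field(default_factory=dict)

    def ancestors(self, concept: str) -> list[str]:
        """Walk the hierarchy from a child concept up through its parents."""
        chain = []
        seen = {concept}
        while concept in self.child_to_parent:
            concept = self.child_to_parent[concept]
            if concept in seen:  # guard against accidental cycles
                break
            seen.add(concept)
            chain.append(concept)
        return chain


# Illustrative entries only.
rules = RelationshipRules(
    phrase_to_concept={"aspirin": "Aspirin", "acetylsalicylic acid": "Aspirin"},
    phrase_to_relation={"treats": "TREATS"},
    concept_to_category={"Aspirin": "Pharmacologic Substance"},
    child_to_parent={
        "Aspirin": "Salicylates",
        "Salicylates": "Anti-Inflammatory Agents",
    },
)
```

Storing the rules in one object of this kind is one possible way to make them available for later or repeated use in a subsequent parsing step.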
The method and/or computer program product may also comprise parsing a plurality of documents in a data repository according to the relationship rule, wherein the plurality of documents each comprise at least a portion of one of the plurality of phrases and the plurality of concepts. In various embodiments, the data repository may include, but is not limited to: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and combinations of such data repositories.
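As a minimal sketch of such a parsing step, and assuming that the relationship rule takes the form of a phrase-to-concept assignment, each document might be reduced to the set of concepts whose phrases it contains. The function name `extract_concepts`, the whole-phrase matching strategy, and the sample documents are assumptions made for illustration only; an actual embodiment may use any suitable text-processing technique.

```python
import re


def extract_concepts(text: str, phrase_to_concept: dict[str, str]) -> set[str]:
    """Return the concepts whose assigned phrases occur in the text.

    Matching is case-insensitive and anchored at word boundaries, which is
    an illustrative simplification rather than a disclosed requirement.
    """
    found = set()
    for phrase, concept in phrase_to_concept.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(concept)
    return found


# Hypothetical phrase-to-concept rule and documents.
phrase_to_concept = {
    "compound a": "Compound A",
    "cmpd a": "Compound A",
    "disease x": "Disease X",
}
documents = [
    "Compound A was evaluated in patients with disease X.",
    "Cmpd A is poorly soluble.",
]
parsed = [extract_concepts(doc, phrase_to_concept) for doc in documents]
```

Here two different phrases resolve to the same concept, so both documents are recognized as mentioning Compound A even though they use different wording.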
The method and/or computer program product embodiments may also comprise steps for generating a hypothesis comprising a previously unknown combination, wherein the previously unknown combination includes one of at least one of the plurality of phrases and at least one of the plurality of concepts. The previously unknown combination may be at least partially determined from the parsed plurality of documents.
In some embodiments, at least a portion of the plurality of documents may comprise at least one of a first concept, a second concept, and a third concept. According to some such embodiments, the parsing step described herein may further comprise: detecting a first relationship between the first and second concepts; detecting a second relationship between the second and third concepts; detecting a third relationship between the first and third concepts; and determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships. Furthermore, according to some such embodiments, the step for generating the hypothesis may also comprise generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.
In some additional embodiments, at least a portion of the plurality of documents may comprise at least one of a first concept, a second concept, and a plurality of linking concepts. According to some such embodiments, the parsing step may further comprise: detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts. Furthermore, in some such embodiments, the step for generating the hypothesis may further comprise generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts. Furthermore, in some such embodiments, the parsing step may further comprise determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of concepts present in both the first portion and the second portion of the plurality of linking concepts.
Furthermore, in some other method and/or computer program embodiments, at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step may further comprise: detecting a first relationship between the first concept and the second concept; detecting a second relationship between the second concept and the third concept; and determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships. In some such embodiments, the step for generating the hypothesis may further comprise generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts. Furthermore, in some such embodiments, the parsing step may further comprise assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts. In various embodiments, the known secondary relationship may comprise a common semantic category including both the first and third concepts. Furthermore, in some such embodiments, the relationship rule generated in the determining step may comprise the common semantic category used to assess the strength of the potential pairwise relationship between the first and third concepts.
Various method and/or computer program product embodiments may also comprise presenting the hypothesis so as to indicate the previously unknown combination. In some such embodiments, the step for presenting the hypothesis may comprise presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. Furthermore, according to some embodiments, the visual representation presented in the display comprises an interactive icon configured to be selectable by the user. According to such embodiments, the interactive icon may be further configured to modify the display when selected by the user.
Various method and/or computer program product embodiments of the present invention may also comprise various steps for optimizing the generated hypothesis to meet the information needs of a particular user. For example, some embodiments may comprise steps for identifying a portion of the plurality of documents in the data repository associated with the user, and creating a user profile based at least in part on the identified documents. The created user profile may be indicative of a user information need. Some such embodiments may further comprise a step for modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need. In some such embodiments, the created user profile may comprise at least one semantic category and the method and/or computer program product may further comprise a step for filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category present in the user profile.
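The profile-based filtering described above may be sketched, purely by way of example, as follows. The sketch assumes that each document associated with the user has already been reduced to a set of concepts and that a concept-to-category rule is available. The function names `build_user_profile` and `filter_hypotheses`, and all sample data, are hypothetical.

```python
def build_user_profile(user_documents, concept_to_category):
    """Collect the semantic categories found in a user's own documents."""
    profile = set()
    for concepts in user_documents:
        for concept in concepts:
            category = concept_to_category.get(concept)
            if category is not None:
                profile.add(category)
    return profile


def filter_hypotheses(hypotheses, profile, concept_to_category):
    """Keep only hypotheses whose every concept falls in a profiled category."""
    return [
        hypothesis
        for hypothesis in hypotheses
        if all(concept_to_category.get(c) in profile for c in hypothesis)
    ]


# Hypothetical rule, user documents, and candidate hypotheses.
concept_to_category = {
    "Compound A": "Drug",
    "Compound B": "Drug",
    "Disease X": "Disease",
    "Gene G": "Gene",
}
user_documents = [{"Compound A", "Disease X"}]
profile = build_user_profile(user_documents, concept_to_category)

candidates = [("Compound B", "Disease X"), ("Gene G", "Disease X")]
filtered = filter_hypotheses(candidates, profile, concept_to_category)
```

In this example the user has published only on drugs and diseases, so the hypothesis involving a gene is filtered out. Requiring every concept to match the profile is one assumed reading of "corresponding substantially"; a looser embodiment might instead rank hypotheses by the fraction of matching concepts.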
Various embodiments of the present invention may also provide systems for mining information from a data repository comprising a plurality of documents to produce a hypothesis. The data repository may include, but is not limited to: a biomedical literature database; a medical records database; a chemical literature database; a computer science literature database; a physics literature database; a legal literature database; a psychology literature database; a social science literature database; a news periodical database; a business journal database; and a combination of such databases.
The system comprises a system for extracting relationships comprising a plurality of phrases and a plurality of concepts. The system for extracting relationships may include, but is not limited to: a vocabulary database corresponding to a selected subject area; a predetermined lexicon; a semantic network; a metathesaurus; and combinations of such systems for extracting relationships. Various system embodiments further comprise a host computing element in communication with the system for extracting relationships for accessing the system. The host computing element is configured for determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts. The host computing element may be configured for determining a relationship rule that includes, but is not limited to: an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts; an assignment of at least one of the plurality of phrases to a relationship identifier, wherein the relationship identifier links a first one of the plurality of concepts to a second one of the plurality of concepts; an assignment of at least one of the plurality of concepts to a semantic category; an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and/or a combination of such relationship rules. Some system embodiments may further comprise a memory device in communication with the host computing element, wherein the memory device is configured for storing the determined relationship rule for later or repeated use in a subsequent parsing step.
Furthermore, the host computing element may also be configured for parsing the plurality of documents in a literature database according to the relationship rule, wherein the plurality of documents each comprises at least a portion of one of the plurality of phrases and the plurality of concepts. Furthermore, the host computing element is configured for generating the hypothesis comprising a previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. The previously unknown combination generated by the host computing element may be at least partially determined from the parsed plurality of documents. Furthermore, some system embodiments may also comprise a user interface in communication with the host computing element, wherein the user interface is configured for presenting the hypothesis so as to indicate the previously unknown combination. In some system embodiments, the user interface may present the hypothesis as a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts. In some such system embodiments, the user interface may present the visual representation as an interactive icon configured to be selectable by the user. The interactive icon may be further configured to modify the display when selected by the user.
In some system embodiments, the host computing element may also be configured for customizing and/or optimizing the presented hypothesis for a particular user. For example, in some embodiments, the host computing element may identify a portion of the plurality of documents in the data repository associated with a user and thereby create a user profile based at least in part on the identified documents. The user profile created by the host computing element in such embodiments may be indicative of a user information need. The created user profile may also comprise at least one semantic category and the host computing element may therefore filter the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category. Furthermore, in some system embodiments, the host computing element may modify the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.
Thus the systems, methods, and computer program products for generating and displaying potential hypotheses based on written works selected from a database, as described in the embodiments of the present invention, provide many advantages that may include, but are not limited to: providing a conceptual research system configured for mining raw materials from the large amounts of literature in a given data repository to generate potential hypotheses for future directed research; providing a research system and method capable of uncovering previously unknown and/or unappreciated combinations of concepts and/or phrases in a data repository; providing a conceptual research system capable of defining a user profile that is indicative of a particular user's information needs and modifying a proposed conceptual research hypothesis based at least in part on the defined user profile; and providing a conceptual research system that is configurable for mining usable data (and generating proposed hypotheses) in a variety of different types of data repositories.
In the description below, reference is made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
The present inventions will now be described with reference to the accompanying drawings, in which some, but not all embodiments of the inventions are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.
As shown in
Many of the exemplary embodiments described herein relate generally to the generation of hypotheses related to biomedical literature and/or research such that the various embodiments described herein may be capable of achieving the technical effect of producing proposed hypotheses that may lead to breakthroughs in the application of certain combinations of drugs to certain diseases or disease states. It should be understood, however, that the various embodiments described herein may be used to parse and/or mine other types of data repositories 20 for potentially groundbreaking research topics. For example, the various embodiments herein may be configured for parsing and/or analyzing documents found in data repositories 20 that may include, but are not limited to: biomedical literature databases; medical records databases; chemical literature databases; computer science literature databases; physics literature databases; legal literature databases; psychology literature databases; social science literature databases; news periodical databases; business journal databases; and combinations of such databases. The term “document” as used herein may include, but is not limited to: published journal articles; text strings (such as, for example, a physician's comments in a medical record entry); file records (such as a particular medical record); resumes and/or curriculum vitae; a thesis; a numerical string of data; a patent document (including, for example, issued patents, patent applications, and publicly-available patent prosecution documents); online journal articles; internet web pages; material safety data sheets; pharmaceutical and/or chemical data sheets; advertisements; reported court case and/or administrative proceedings; news articles; letters; and combinations of such materials.
It should be further understood that the generated hypotheses may be implied by patterns embedded in the document corpus of such data repositories such that appropriate relationship rules (as described further herein) may be determined and subsequently applied to the data repository in a substantially automatic “mining mode” to generate hypotheses that may be completely beyond the expectation of a system user.
In accordance with another embodiment, the present invention analyzes various semantic relations among the concepts involved in the identified hypotheses and provides visualization of these relations in an intuitive way. Particular documents in support of each of these relations may be identified to the system users for their further research. In addition, specific search results can be customized for particular researchers based on their specified or potential interests. In operation, a given researcher's interests are identified by automatically analyzing any prior publications or papers related to this researcher. Furthermore, in some embodiments, search results are verified using an independent resource.
As shown in
Referring to
Referring to
As shown in
The parsing step 120 may comprise performing various quantitative and/or qualitative operations on the component key phrases and/or concepts. As shown generally in
Various method embodiments may further comprise step 130 for generating a hypothesis (that may include, but is not limited to: potential chain hypotheses 132, potential substitution hypotheses 134, and/or potential pairwise hypotheses 136). The hypotheses generated in step 130 may comprise a previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts and may thus be used as the basis for “conceptual research” wherein a researcher is presented with a potential hypothesis that suggests and/or identifies a research topic or direction that has not been addressed in previous research (as documented by the documents in the data repository 20). As described herein, the previously unknown combination (embodied in the generated hypothesis) may be at least partially determined from the parsed (see step 120, for example) plurality of documents present in the data repository 20.
In some embodiments, a “chain” relationship may be established in the parsing step 120 among three or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified and/or stored relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first and second concepts; (2) detecting a second relationship between the second and third concepts; (3) detecting a third relationship between the first and third concepts; and (4) determining a potential chain relationship 132 among the first, second, and third concepts at least partially from the detected first, second, and third relationships. In such embodiments, the generating step 130 may further comprise generating a chain hypothesis 132 comprising the previously unknown combination of the first, second, and third concepts in a “chain” combination.
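The chain detection outlined in items (1) through (4) above might be sketched as follows. This is an illustrative simplification only: it assumes that each document has already been reduced to a set of concepts, that a "relationship" between two concepts is detected whenever they appear in the same document, and that a combination is "previously unknown" when no single document contains all three concepts. The function name `chain_hypotheses` is hypothetical.

```python
from itertools import combinations


def chain_hypotheses(documents: list[set[str]]) -> list[tuple[str, str, str]]:
    """Find concept triples related pairwise but never reported together."""
    pairs = set()    # concept pairs seen together in at least one document
    triples = set()  # concept triples seen together in at least one document
    for concepts in documents:
        ordered = sorted(concepts)
        pairs.update(combinations(ordered, 2))
        triples.update(combinations(ordered, 3))

    vocabulary = sorted(set().union(*documents)) if documents else []
    results = []
    for a, b, c in combinations(vocabulary, 3):
        # first, second, and third relationships must all be detected
        all_pairs_known = {(a, b), (b, c), (a, c)} <= pairs
        if all_pairs_known and (a, b, c) not in triples:
            results.append((a, b, c))
    return results


# Hypothetical parsed documents: A-B, B-C, and A-C are each reported,
# but no single document mentions A, B, and C together.
docs = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "D"}]
chains = chain_hypotheses(docs)
```

Because concepts are sorted before pairs and triples are formed, each combination has a single canonical ordering, which keeps the set lookups consistent. Enumerating every triple is adequate for a sketch but would need indexing or pruning for a repository of realistic size.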
For example, such a “chain” relationship may be established among three medical concepts (such as three therapeutic compounds (A, B, C) belonging to the same general class of drugs (as indicated, for example, by a relationship rule comprising an assignment of at least one of the plurality of concepts to a semantic category outlining the class of drug (see relationship rule 111, in
According to some embodiments, a “substitution” relationship may be established in the parsing step 120 among two or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a plurality of linking concepts. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; (2) detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and (3) determining a potential substitution relationship 134 between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts. In such embodiments, the generating step 130 may further comprise generating a substitution hypothesis 134 comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts. In some such method embodiments, the parsing step 120 may further comprise determining a strength of the potential relationship between the first and second concepts in the proposed substitution hypothesis 134 based at least in part on the number of concepts present in both the first portion and the second portion of the plurality of linking concepts.
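One possible sketch of the substitution logic described above follows. It assumes that the linking concepts related to each concept have already been collected into sets, and it uses the raw count of overlapping linking concepts as the strength. The function name `substitution_hypotheses` and the `min_overlap` threshold are assumptions introduced for illustration and are not specified in this disclosure.

```python
def substitution_hypotheses(links, first, second, min_overlap=2):
    """Propose combinations of one concept with the other's unshared links.

    links maps each concept to the set of linking concepts related to it.
    """
    linked_first = links.get(first, set())
    linked_second = links.get(second, set())
    overlap = linked_first & linked_second
    strength = len(overlap)  # number of overlapping linking concepts

    hypotheses = []
    if strength >= min_overlap:
        # linking concepts related only to the first concept are proposed
        # for the second concept, and vice versa
        hypotheses += [(second, c) for c in sorted(linked_first - overlap)]
        hypotheses += [(first, c) for c in sorted(linked_second - overlap)]
    return {"strength": strength, "hypotheses": hypotheses}


# Hypothetical data: compounds A and B share components X1..X3,
# and only compound A is linked to condition Y.
links = {
    "A": {"X1", "X2", "X3", "Y"},
    "B": {"X1", "X2", "X3"},
}
result = substitution_hypotheses(links, "A", "B")
```

In this example the two compounds share three linking concepts, so the sketch proposes the previously unreported combination of compound B with condition Y.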
For example, such a substitution relationship may be established among: (1) a pair of medical concepts (such as two therapeutic compounds (A and B)); (2) a list of component compounds present in both therapeutic compounds A and B (X1, X2, X3, . . . , Xm); and (3) a disease or condition (Y) that is reported as responding positively to treatment with therapeutic compound A. The parsing step 120 may first comprise applying a relationship rule comprising an assignment of therapeutic compounds A and B to a semantic category outlining the common class of drug (see relationship rule 111, in
According to some other embodiments, a “pairwise” relationship may be established by the parsing step 120 among two or more previously unrelated phrases and/or concepts. For example, at least a portion of the plurality of documents present in the data repository 20 may comprise at least one of a first concept, a second concept, and a third concept. In some such embodiments, the parsing step 120 (utilizing one or more previously-identified relationship rules (see elements 111, 112, 113, 114, for example)) may comprise: (1) detecting a first relationship between the first concept and the second concept; (2) detecting a second relationship between the second concept and the third concept; and (3) determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships. In such embodiments, the generating step 130 may further comprise generating a pairwise hypothesis 136 comprising the previously unknown combination of the first and third concepts.
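The pairwise case may be sketched in a similar way. The following assumes that detected relationships are available as unordered concept pairs, and it proposes any two concepts that share at least one intermediate concept but have no detected relationship with each other. The function name `pairwise_hypotheses` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_hypotheses(known_pairs):
    """Map each unreported concept pair to the concepts that link it."""
    neighbors = defaultdict(set)
    for a, b in known_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    hypotheses = {}
    for first, third in combinations(sorted(neighbors), 2):
        if third in neighbors[first]:
            continue  # relationship already known, so not a new combination
        linking = neighbors[first] & neighbors[third]
        if linking:
            hypotheses[(first, third)] = sorted(linking)
    return hypotheses


# Hypothetical detected relationships: A-X and X-B are reported; A-B is not.
known = [("A", "X"), ("X", "B")]
proposed = pairwise_hypotheses(known)
```

Returning the linking concepts alongside each proposed pair allows the supporting relationships to be shown to the user together with the hypothesis.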
According to some such “pairwise” hypothesis embodiments, the parsing step 120 may further comprise assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts. For example, in some embodiments, the known secondary relationship may comprise a common semantic category including both the first and third concepts (as indicated, for example, by a concept-semantic type relationship rule 111 (and/or another relationship rule type), that may be a product of the determining step 110 (as shown generally in
For example, a pairwise hypothesis 136 may be generated in step 130 by first determining that a strong relationship exists between concept A (i.e. a first therapeutic compound A) and concept X (a disease or condition X). This may be accomplished, for example, using the concept-concept relationship output 122 of an initial parsing step 120. In order to complete the generation of a potential pairwise hypothesis 136, step 130 may further comprise determining the existence of a strong relationship (via the concept-concept relationship output 122, for example) between the disease or condition X and the therapeutic compound B. According to some such embodiments, the generating step 130 may further comprise detecting an interesting secondary relationship between concepts A and B (i.e. detecting if therapeutic compounds A and B are in the same or similar semantic category (see element 111,
Referring again to
As shown generally in
Referring to
Similarly,
As shown in
For example, and referring generally to
In some embodiments (as shown, for example in
Thus, various system, method, and/or computer program products of the present invention may tailor the “conceptual research” results returned, for example, as part of the presented hypotheses 132, 134, 136 to meet a user information need that may be ascertained by analyzing a user's publications and/or previous search patterns. As described further herein, various system embodiments of the present invention may comprise a host computing element 700 including one or more memory devices 722, 724, 728 configured for storing a user profile 155 such that each user (identified, for example, by a unique user ID and/or password) may log on to a host computing element 700 so as to utilize the conceptual research capabilities of the various embodiments described more fully herein.
Some method embodiments may further comprise a step for verifying one or more generated hypotheses 132, 134, 136 using at least one independent resource. For example, in some method embodiments, the generated hypotheses 132, 134, 136 may be verified using an independent resource 800 as shown generally in
Some embodiments of the present invention further provide a system for mining information from a data repository 20 comprising a plurality of documents to produce a hypothesis (see elements 132, 134, 136 of
Various system embodiments may further comprise a host computing element 700 (see
In some embodiments, as shown generally in
Referring again to
Furthermore, in some system embodiments, the host computing element 700 may be further configured for performing step 130 for generating one or more potential hypotheses 132, 134, 136 comprising a previously unknown combination of at least one of the plurality of phrases and at least one of the plurality of concepts. As described herein, the previously unknown combination may be at least partially determined from the parsed plurality of documents present in the data repository 20.
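By way of illustration only, one possible form of the generating step 130 for a pairwise hypothesis 136 may be sketched as follows. The sketch assumes that the parsing step yields a numeric association strength for each pair of concepts that co-occur in the parsed documents, that a fixed threshold separates strong relationships from weak ones, and that a shared semantic category serves as the secondary relationship. The function name, the data layout, and the threshold value are hypothetical and are not taken from the embodiments described herein.

```python
from itertools import combinations


def pairwise_hypotheses(strength, category_of, threshold=0.5):
    """Propose concept pairs (A, B) that never co-occur but share a strong link X.

    strength: dict mapping frozenset({c1, c2}) to an association strength
              (assumed output of the parsing step).
    category_of: dict mapping a concept name to its semantic category.
    threshold: assumed cut-off above which a relationship counts as strong.
    """
    def strong(a, b):
        return strength.get(frozenset((a, b)), 0.0) >= threshold

    concepts = sorted({c for pair in strength for c in pair})
    results = []
    for a, b in combinations(concepts, 2):
        if frozenset((a, b)) in strength:
            continue  # A and B already co-occur, so the combination is not new
        if category_of.get(a) is None or category_of.get(a) != category_of.get(b):
            continue  # secondary relationship: same semantic category required
        links = [x for x in concepts
                 if x not in (a, b) and strong(a, x) and strong(x, b)]
        if links:
            results.append((a, b, links))
    return results
```

Under these assumptions, two therapeutic compounds A and B that are each strongly related to a condition X, but that are not themselves related in any parsed document, would be returned as a candidate pairwise combination together with the linking concept X.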
As shown generally in
Referring to
As shown in
Referring now to
A communication device 726 may also be coupled to and/or in communication with the bus 716 for accessing remote computers or servers via the Internet or other network. Such remote computers and/or servers may house, for example, one or more systems for extracting relationships 10 and/or data repositories 20. The communication device 726 may include, but is not limited to: a modem; a network interface card; and/or other interface devices, such as those used for interfacing with Ethernet, Token-ring, or other types of networks. In any event, in this manner, the host computing element 700 may be coupled to and/or in communication with a number of servers via a network infrastructure. The communication device 726 may enable one or more users to selectively access the host computing element 700 so as to take advantage of the relationship rules 111, 112, 113, 114 and/or generated hypotheses 132, 134, 136 that may be generated according to the various embodiments of the present invention.
In addition to providing systems and methods, the present invention also provides computer program products for performing the operations described above. The computer program products have a computer readable storage medium having computer readable program code embodied in the medium. With reference to
In this regard,
Accordingly, blocks or steps of the block diagram, flowchart or control flow illustrations support combinations of steps for performing the specified functions, and program instructions for performing the specified functions. It will also be understood that each block or step of the block diagram, flowchart or control flow illustrations, and combinations of blocks or steps in the block diagram, flowchart or control flow illustrations, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
Many modifications and other embodiments of the invention will come to mind to one skilled in the art to which this invention pertains having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended exemplary inventive concepts. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Claims
1. A method for generating a hypothesis, the method comprising:
- accessing a system for extracting relationships, the system for extracting relationships comprising a plurality of phrases and a plurality of concepts;
- determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts;
- parsing a plurality of documents in a data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts;
- generating a hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and
- presenting the hypothesis so as to indicate the previously unknown combination.
2. A method according to claim 1, wherein the relationship rule is selected from the group consisting of:
- an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts;
- an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts;
- an assignment of at least one of the plurality of concepts to a semantic category;
- an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and
- combinations thereof.
3. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and
- wherein the parsing step further comprises: detecting a first relationship between the first and second concepts; detecting a second relationship between the second and third concepts; detecting a third relationship between the first and third concepts; and determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships; and
- wherein the generating step further comprises generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.
4. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a plurality of linking concepts, and
- wherein the parsing step further comprises: detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts; and
- wherein the generating step further comprises generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts.
5. A method according to claim 4, wherein the parsing step further comprises determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts.
6. A method according to claim 1, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and
- wherein the parsing step further comprises: detecting a first relationship between the first concept and the second concept; detecting a second relationship between the second concept and the third concept; and determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships; and
- wherein the generating step further comprises generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts.
7. A method according to claim 6, wherein the parsing step further comprises assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts.
8. A method according to claim 7, wherein the known secondary relationship comprises a common semantic category including both the first and third concepts.
9. A method according to claim 8, wherein the relationship rule comprises the common semantic category.
10. A method according to claim 1, further comprising:
- identifying a portion of the plurality of documents in the data repository associated with a user;
- creating a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and
- modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.
11. A method according to claim 10, wherein the user profile comprises at least one semantic category and wherein the method further comprises filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.
12. A method according to claim 1, wherein presenting the hypothesis comprises presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.
13. A method according to claim 12, wherein the visual representation comprises an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.
14. A method according to claim 1, wherein the system for extracting relationships is selected from the group consisting of:
- a vocabulary database corresponding to a selected subject area;
- a predetermined lexicon;
- a semantic network;
- a metathesaurus; and
- combinations thereof.
15. A method according to claim 1, wherein the data repository is selected from the group consisting of:
- a biomedical literature database;
- a medical records database;
- a chemical literature database;
- a computer science literature database;
- a physics literature database;
- a legal literature database;
- a psychology literature database;
- a social science literature database;
- a news periodical database;
- a business journal database; and
- combinations thereof.
16. A method according to claim 1, further comprising storing the determined relationship rule for later or repeated use in the subsequent parsing step.
17. A method according to claim 1, further comprising verifying the hypothesis using at least one independent resource.
18. A computer program product for generating a hypothesis based on a plurality of documents in a data repository in a manner that reduces the burden on the data repository, said computer program product comprising a computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising:
- a first set of computer instructions for accessing a system for extracting relationships, the system for extracting relationships comprising a plurality of phrases and a plurality of concepts;
- a second set of computer instructions for determining a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts;
- a third set of computer instructions for parsing the plurality of documents in the data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts;
- a fourth set of computer instructions for generating a hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and
- a fifth set of computer instructions for presenting the hypothesis so as to indicate the previously unknown combination.
19. A computer program product according to claim 18, wherein the relationship rule is selected from the group consisting of:
- an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts;
- an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts;
- an assignment of at least one of the plurality of concepts to a semantic category;
- an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and
- combinations thereof.
20. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and
- wherein the third set of computer instructions for parsing further comprises: a sixth set of computer instructions for detecting a first relationship between the first and second concepts; a seventh set of computer instructions for detecting a second relationship between the second and third concepts; an eighth set of computer instructions for detecting a third relationship between the first and third concepts; and a ninth set of computer instructions for determining a potential chain relationship among the first, second, and third concepts at least partially from the detected first, second, and third relationships; and
- wherein the fourth set of computer instructions for generating further comprises a tenth set of computer instructions for generating a chain hypothesis comprising the previously unknown combination of the first, second, and third concepts.
21. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a plurality of linking concepts, and
- wherein the third set of computer instructions for parsing further comprises: an eleventh set of computer instructions for detecting a first relationship between the first concept and a first portion of the plurality of linking concepts; a twelfth set of computer instructions for detecting a second relationship between the second concept and a second portion of the plurality of linking concepts; and a thirteenth set of computer instructions for determining a potential substitution relationship between the first concept and the second concept at least partially from the detected first and second relationships and a number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts; and
- wherein the fourth set of computer instructions for generating further comprises a fourteenth set of computer instructions for generating a substitution hypothesis comprising the previously unknown combination of at least one of the first and second concepts with a portion of the plurality of linking concepts not present in the number of overlapping concepts.
22. A computer program product according to claim 21, wherein the third set of computer instructions for parsing further comprises a fifteenth set of computer instructions for determining a strength of the potential substitution relationship between the first and second concepts based at least in part on the number of overlapping concepts present in both the first portion and the second portion of the plurality of linking concepts.
23. A computer program product according to claim 18, wherein at least a portion of the plurality of documents comprises at least one of a first concept, a second concept, and a third concept, and
- wherein the third set of computer instructions for parsing further comprises: a sixteenth set of computer instructions for detecting a first relationship between the first concept and the second concept; a seventeenth set of computer instructions for detecting a second relationship between the second concept and the third concept; and an eighteenth set of computer instructions for determining a potential pairwise relationship between the first concept and the third concept at least partially from the detected first and second relationships; and
- wherein the fourth set of computer instructions for generating further comprises a nineteenth set of computer instructions for generating a pairwise hypothesis comprising the previously unknown combination of the first and third concepts.
24. A computer program product according to claim 23, wherein the third set of computer instructions for parsing further comprises a twentieth set of computer instructions for assessing a strength of the potential relationship between the first and third concepts at least partially from a known secondary relationship between the first and third concepts.
25. A computer program product according to claim 24, wherein the known secondary relationship comprises a common semantic category including both the first and third concepts.
26. A computer program product according to claim 25, wherein the relationship rule comprises the common semantic category.
27. A computer program product according to claim 18, further comprising:
- a twenty-first set of computer instructions for identifying a portion of the plurality of documents in the data repository associated with a user;
- a twenty-second set of computer instructions for creating a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and
- a twenty-third set of computer instructions for modifying the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.
28. A computer program product according to claim 27, wherein the user profile comprises at least one semantic category, the computer program product further comprising a twenty-fourth set of computer instructions for filtering the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.
29. A computer program product according to claim 18, wherein the fifth set of computer instructions for presenting the hypothesis comprises a twenty-fifth set of computer instructions for presenting a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.
30. A computer program product according to claim 29, wherein the visual representation comprises an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.
31. A computer program product according to claim 18, wherein the system for extracting relationships is selected from the group consisting of:
- a vocabulary database corresponding to a selected subject area;
- a predetermined lexicon;
- a semantic network;
- a semantic database;
- a metathesaurus; and
- combinations thereof.
32. A computer program product according to claim 18, wherein the data repository is selected from the group consisting of:
- a biomedical literature database;
- a medical records database;
- a chemical literature database;
- a computer science literature database;
- a physics literature database;
- a legal literature database;
- a psychology literature database;
- a social science literature database;
- a news periodical database;
- a business journal database; and
- combinations thereof.
33. A computer program product according to claim 18, further comprising a twenty-sixth set of computer instructions for storing the determined relationship rule for later or repeated use in the subsequent parsing step.
34. A computer program product according to claim 18, further comprising a twenty-seventh set of computer instructions for verifying the hypothesis using at least one independent resource.
35. A system for mining information from a data repository comprising a plurality of documents to produce a hypothesis, the system comprising:
- a system for extracting relationships comprising a plurality of phrases and a plurality of concepts;
- a host computing element in communication with said system for extracting relationships for accessing said system for extracting relationships; wherein said host computing element determines a relationship rule defining a relationship among at least a portion of the plurality of phrases and at least a portion of the plurality of concepts; wherein said host computing element parses the plurality of documents in a data repository according to the relationship rule, the plurality of documents each comprising at least a portion of one of the plurality of phrases and the plurality of concepts; and wherein said host computing element generates the hypothesis comprising a previously unknown combination, the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts, the previously unknown combination being at least partially determined from the parsed plurality of documents; and
- a user interface in communication with said host computing element, said user interface configured for presenting the hypothesis so as to indicate the previously unknown combination.
36. A system according to claim 35, wherein said host computing element determines a relationship rule selected from the group consisting of:
- an assignment of at least one of the plurality of phrases to at least one of the plurality of concepts;
- an assignment of at least one of the plurality of phrases to a relationship identifier, the relationship identifier linking a first one of the plurality of concepts to a second one of the plurality of concepts;
- an assignment of at least one of the plurality of concepts to a semantic category;
- an arrangement of at least a portion of the plurality of concepts in a hierarchical relationship, wherein a first one of the portion of concepts comprises a child concept and a second one of the portion of concepts comprises a parent concept; and
- combinations thereof.
37. A system according to claim 35,
- wherein said host computing element identifies a portion of the plurality of documents in the data repository associated with a user;
- wherein said host computing element creates a user profile based at least in part on the identified documents, the user profile being indicative of a user information need; and
- wherein said host computing element modifies the hypothesis in response to the user profile such that the modified hypothesis at least partially corresponds to the user information need.
38. A system according to claim 37, wherein the user profile comprises at least one semantic category and wherein said host computing element filters the presented hypothesis such that the previously unknown combination includes only at least one phrase and at least one concept corresponding substantially to the at least one semantic category.
39. A system according to claim 35, wherein said user interface presents the hypothesis as a display to a user comprising a visual representation of the previously unknown combination including one of at least one of the plurality of phrases and at least one of the plurality of concepts.
40. A system according to claim 39, wherein said user interface presents the visual representation comprising an interactive icon configured to be selectable by the user, the interactive icon being further configured to modify the display when selected by the user.
41. A system according to claim 35, wherein said system for extracting relationships is selected from the group consisting of:
- a vocabulary database corresponding to a selected subject area;
- a predetermined lexicon;
- a semantic network;
- a semantic database;
- a metathesaurus; and
- combinations thereof.
42. A system according to claim 35, wherein said host computing element is in communication with a data repository selected from the group consisting of:
- a biomedical literature database;
- a medical records database;
- a chemical literature database;
- a computer science literature database;
- a physics literature database;
- a legal literature database;
- a psychology literature database;
- a social science literature database;
- a news periodical database;
- a business journal database; and
- combinations thereof.
43. A system according to claim 35, further comprising a memory device in communication with said host computing element, said memory device configured for storing the determined relationship rule for later or repeated use in the subsequent parsing step.
44. A system according to claim 35, further comprising an independent resource in communication with said host computing element, said independent resource configured for verifying the generated hypothesis.
Type: Application
Filed: Sep 15, 2008
Publication Date: Mar 26, 2009
Inventors: Vijay V. Raghavan (Lafayette, LA), Ying Xie (Kennesaw, GA), Anthony Prestigiacomo (Baton Rouge, LA)
Application Number: 12/210,253
International Classification: G06N 5/02 (20060101);