EFFICIENT PASSAGE RETRIEVAL USING DOCUMENT METADATA
A system, method and computer program product for efficiently retrieving relevant passages to questions based on a corpus of data. A processor device receives an input query and performs a query analysis to obtain searchable query terms. The processor performs: matching metadata associated with one or more documents against the query terms. The document metadata includes one or more of: a title of the documents, one or more user tags or clouds. Then the processor device performs: mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents.
The present invention relates to and claims the benefit of the filing date of commonly-owned, co-pending U.S. Provisional Patent Application No. 61/386,019, filed Sep. 24, 2010, the entire contents and disclosure of which is incorporated by reference as if fully set forth herein.
BACKGROUND
The invention relates generally to information retrieval systems, and more particularly, the invention relates to an automated query/answer system and method implementing a passage retrieval component to conduct a search that identifies passages relevant to a given question using document metadata from a collection including text-based resources.
DESCRIPTION OF THE RELATED ART
An introduction to the current issues and approaches of question answering (QA) can be found in the web-based reference http://en.wikipedia.org/wiki/Question_answering. Generally, QA is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection) the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.
QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.
Closed-domain QA deals with questions under a specific domain, for example medicine or automotive maintenance, and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain QA deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Alternatively, closed-domain QA might refer to a situation where only a limited type of questions are accepted, such as questions asking for descriptive rather than procedural information.
Access to information is currently dominated by two paradigms. First, a database query that answers questions about what is in a collection of structured records. Second, a search that delivers a collection of document links in response to a query against a collection of unstructured data, for example, text or html.
A major unsolved problem in such information query paradigms is the lack of a computer program capable of accurately answering factual questions based on information included in a collection of documents that can be either structured, unstructured, or both. Such factual questions can be either broad, such as “what are the risks of vitamin K deficiency?”, or narrow, such as “when and where was Hillary Clinton's father born?”
It is a challenge to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user. There is a need to further advance the methodologies for answering open-domain questions.
SUMMARY
In one aspect there is provided a computing infrastructure and methodology that conducts question and answering and performs automatic passage retrieval operations in a highly efficient manner.
In one aspect, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving an input query; performing a query context analysis upon the input query to obtain searchable query terms; matching metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents, wherein one or more processor devices performs one or more of the retrieving, performing, matching, mapping, identifying and conducting.
In this aspect, the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
Further to this aspect, prior to matching of metadata associated with one or more documents against the query terms there is performed: extracting document metadata from one or more documents of a corpus of documents; providing the extracted document metadata as a dictionary in a storage device, each document metadata stored in the dictionary being associated with a corresponding document identification (ID), wherein the matching of metadata against the query terms comprises: performing, by the processor device, a dictionary matching.
In an alternate embodiment, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving, at a processor device, an input query; performing, at the processor device, a query context analysis upon the input query to obtain searchable query terms; accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID); performing, by the processor device, a dictionary matching of the metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more document IDs; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the subcorpus using the searchable query terms to obtain passages relevant to the input query from the identified documents.
A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method(s). The method(s) are the same as listed above.
The objects, features and advantages of the invention are understood within the context of the Detailed Description, as set forth below. The Detailed Description is understood within the context of the accompanying drawings, which form a material part of this disclosure, wherein:
In current question and answer systems, one key component is the passage retrieval operations conducted when searching for candidate answers in a heterogeneous collection of structured, semi-structured and unstructured information resources. Passage retrieval operations adapt a search engine at its core to identify passages relevant to a given question from the collection of sources, e.g., text-based sources. Passage retrieval is also relevant to any search application where selecting passages containing, for example, 1-3 sentences is more appropriate than retrieving entire documents either for processing by downstream components, or for presentation to the end user.
Most existing systems performing a passage retrieval operation adopt one of two approaches. The first approach is to adopt a document search engine to retrieve a list of relevant documents using the search engine's internal document ranking criteria, and to apply a custom post-hoc passage scoring algorithm to identify the most relevant text segments from these documents. The second approach is to adopt a search engine with passage retrieval capability and to make use of the engine's internal ranking algorithm to return a set of relevant passages. In either approach, the retrieval process is performed over the entire collection, which typically contains millions of documents or more. This poses an efficiency issue for real-time question answering systems that must deliver answers to users in no more than a few seconds. A typical solution for this problem is to split the search index into multiple subindices on multiple machines so that retrieval against the subindices can be performed in parallel and their results merged. While this solution addresses the efficiency issue, it poses other problems related to merging search results from multiple indices.
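For illustration only, the merge step of the split-index solution described above may be sketched as follows. The result shape (score, passage ID), the function name merge_ranked and the sample scores are hypothetical and do not come from any particular search engine. The sketch assumes each subindex returns its hits already sorted by descending score.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

# Hypothetical result shape: (score, passage_id).
Hit = Tuple[float, str]


def merge_ranked(result_lists: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-subindex hit lists (each sorted by descending score) and keep the top k."""
    # heapq.merge expects inputs sorted ascending by the key, so negate the score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, k))


# Illustrative hits from two subindices searched in parallel.
subindex_a = [(0.9, "a1"), (0.4, "a2")]
subindex_b = [(0.7, "b1"), (0.6, "b2")]
top = merge_ranked([subindex_a, subindex_b], k=3)
print(top)  # [(0.9, 'a1'), (0.7, 'b1'), (0.6, 'b2')]
```

The sketch compares raw scores across subindices. One possible source of the merging problems mentioned above, assumed here rather than stated in the text, is that scores computed against different subindices may not be directly comparable.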
It would be highly desirable to provide a system and method that improves the efficiency of passage retrieval based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.
In one embodiment, the present system and method for efficient passage retrieval against a corpus given a question is applicable and may be part of a Question Answering (QA) system. Alternatively, the system and method for efficient passage retrieval against a corpus given a question may be implemented in non-QA applications, i.e., applications implemented to return a passage, for example, a 1-sentence to 3-sentence passage most relevant to a question, as opposed to an answer per se.
Commonly-owned, co-pending U.S. patent application Ser. No. 12/126,642, titled “SYSTEM AND METHOD FOR PROVIDING QUESTION AND ANSWERS WITH DEFERRED TYPE EVALUATION” and co-pending U.S. patent application Ser. No. 12/152411, titled “SYSTEM AND METHOD FOR PROVIDING ANSWERS TO QUESTIONS” are both incorporated by reference herein, and describe a QA (Question and Answer) system and method in which the present passage retrieval system may be incorporated.
In one embodiment, the present disclosure may extend and complement the effectiveness of a QA or non-QA system and method by improving the efficiency of passage retrieval operations based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.
In one embodiment, the subcorpus selection process is based on a matching algorithm that identifies relevant documents based on the question text and metadata associated with the documents in the collection, such as document titles, user tags (“clouds”), or automatically identified document labels. The passage retrieval process is then restricted to return passages only from this subcorpus, which typically contains several orders of magnitude fewer documents than the entire collection.
The approach to efficient passage retrieval significantly constrains the pool of documents from which passages may be retrieved based on metadata associated with documents, such as document titles and user tags (“clouds”). The efficiency of passage retrieval is improved by providing the ability to dynamically select a subcorpus from which search will take place based on terms in the user question and metadata associated with documents in the corpus. More specifically, the user's input question string is analyzed to extract all matches between question terms and document metadata. Those matched documents comprise a subcorpus from which the system will extract passages for this question.
In a non-limiting example, there is considered the following user question
- “which modern artist was Francoise Gilot, Dr. Jonas Salk's wife, once the companion of?”
In the example, matching document titles against the terms in the question yields five entities: “modern”, “artist”, “Francoise Gilot”, “Jonas Salk”, and “companion”, each of which is identified as a document title in the corpus. It is understood that a term may map to multiple documents with that title. For example, “companion” may map to an article that talks about a caregiver, or an architectural feature of ships, or a character in “Doctor Who”. Using the document identifications (IDs) that correspond to each document title, the documents with the identified document IDs are selected to form a subcorpus consisting of potentially highly relevant documents for answering the given question. The passage retrieval process is then constrained to finding the most relevant passages from this document subcorpus, which may contain on the order of tens of documents, instead of from the entire collection, which may contain millions of documents or more. In this example, several relevant passages may be retrieved, such as “Francoise Gilot (born 1921) is a French born painter and is known as a companion of Picasso between 1944 and 1953” from the document titled “Francoise Gilot”, and “In 1968, they divorced, and in 1970 Salk married Francoise Gilot, the former mistress of Pablo Picasso” from the document titled “Jonas Salk”.
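The example above may be sketched in code as follows. This is a minimal illustration under stated assumptions, not the disclosed matcher: the document IDs, the TITLE_TO_DOC_IDS table, and the longest-match-first strategy over word n-grams are all hypothetical choices made for the sketch.

```python
import re
from typing import Dict, List, Set

# Hypothetical title dictionary: normalized title -> IDs of documents with that title.
TITLE_TO_DOC_IDS: Dict[str, Set[int]] = {
    "modern": {11},
    "artist": {12},
    "francoise gilot": {13},
    "jonas salk": {14},
    "companion": {15, 16, 17},  # e.g. caregiver, ship feature, "Doctor Who" character
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_titles(question: str, dictionary: Dict[str, Set[int]], max_len: int = 3) -> List[str]:
    """Return dictionary titles found in the question, preferring the longest match."""
    tokens = tokenize(question)
    matches: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append(candidate)
                i += n
                break
        else:
            i += 1  # no title starts at this token
    return matches


question = "Which modern artist was Francoise Gilot, Dr. Jonas Salk's wife, once the companion of?"
matched = match_titles(question, TITLE_TO_DOC_IDS)

# The subcorpus is the union of the document IDs behind every matched title.
subcorpus: Set[int] = set()
for title in matched:
    subcorpus |= TITLE_TO_DOC_IDS[title]

print(matched)            # ['modern', 'artist', 'francoise gilot', 'jonas salk', 'companion']
print(sorted(subcorpus))  # [11, 12, 13, 14, 15, 16, 17]
```

Because “companion” maps to three hypothetical documents, the five matched titles yield a subcorpus of seven documents in this sketch.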
More particularly, as shown in
For example, a document containing George W. Bush's 2007 State of the Union address may include the following metadata:
Title: 2007 State of the Union Address
Category: Presidential Addresses, George W. Bush Speeches, . . .
Tags: Security, Iraq, Terrorists, Health, America, . . .
A sample implementation of this matcher component 80 is to represent the metadata in dictionary form and to leverage a dictionary matcher to identify dictionary terms that appear in an input question. For example, any matching component that can identify closed or open domain dictionary terms in text (e.g., legal terms, medical terms, or generic named entities) may be used. Thus, given a piece of text (an input query), the matching algorithm determines from the question text those terms that match entries in the dictionary. In one embodiment, a dictionary matcher includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator, whose functionality is incorporated by reference as if fully set forth herein.
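As a hedged illustration of representing metadata in dictionary form, the sketch below turns metadata records such as the State of the Union example into entries keyed by a normalized term. The record layout, the document IDs "sotu-2007" and "doc-b", and the function name build_metadata_dictionary are assumptions made for the sketch. It does not reproduce the ConceptMapper dictionary format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (document ID, metadata fields).
Record = Tuple[str, Dict[str, object]]


def build_metadata_dictionary(records: Iterable[Record]) -> Dict[str, Set[str]]:
    """Map each normalized metadata term to the IDs of the documents that carry it."""
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, metadata in records:
        terms: List[str] = [str(metadata.get("title", ""))]
        terms += list(metadata.get("category", []))
        terms += list(metadata.get("tags", []))
        for term in terms:
            key = " ".join(term.lower().split())  # lowercase, collapse whitespace
            if key:
                dictionary[key].add(doc_id)
    return dict(dictionary)


records: List[Record] = [
    ("sotu-2007", {
        "title": "2007 State of the Union Address",
        "category": ["Presidential Addresses", "George W. Bush Speeches"],
        "tags": ["Security", "Iraq", "Terrorists", "Health", "America"],
    }),
    # A second, purely hypothetical document that shares one tag.
    ("doc-b", {"title": "Example Article", "tags": ["Iraq"]}),
]

metadata_dictionary = build_metadata_dictionary(records)
print(sorted(metadata_dictionary["iraq"]))  # ['doc-b', 'sotu-2007']
```

Keying entries by term rather than by document means that a single lookup returns every document ID sharing that piece of metadata.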
The matched dictionary entries (question terms) are used to identify a subset of documents for the passage retrieval process. That is, for the query terms that are mapped to the metadata (titles, tags, clouds) of a document in the resource 84, that document's index (or other document identifier) is flagged, tagged, or recorded for its inclusion in a subcorpus. In one embodiment, each dictionary entry in resource 84 encodes the document ID for each document that contains metadata matching that dictionary term. The metadata and associated document information in the dictionary entry that match the terms in the input question is represented as 85 in
The passage retrieval component can be any standard IR (Information Retrieval) search engine 90 that supports both: retrieval of relevant short passages, instead of full documents; and runtime specification of a relevant subcorpus for retrieval. One example IR search engine that satisfies these requirements is the Indri engine from the Lemur Toolkit, http://www.lemurproject.org/indri/, incorporated by reference as if fully set forth herein.
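The two requirements above may be illustrated with a toy retriever. This sketch is not Indri and does not use its API. The term-overlap scoring, the sentence splitter, the document IDs and the function names are hypothetical stand-ins for an engine's internal ranking. Passages are windows of one to three sentences, and only documents whose IDs are supplied at run time are searched.

```python
import re
from typing import Dict, List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve_passages(corpus: Dict[int, str], subcorpus_ids: Set[int], query_terms: Set[str],
                      max_sentences: int = 3, top_k: int = 5) -> List[Tuple[int, int, str]]:
    """Score 1..max_sentences windows, restricted to documents in subcorpus_ids."""
    scored: List[Tuple[int, int, str]] = []
    for doc_id in sorted(subcorpus_ids):
        sentences = split_sentences(corpus.get(doc_id, ""))
        for start in range(len(sentences)):
            for size in range(1, max_sentences + 1):
                window = sentences[start:start + size]
                if len(window) < size:
                    break  # ran past the end of the document
                passage = " ".join(window)
                words = set(re.findall(r"[a-z0-9]+", passage.lower()))
                score = len(query_terms & words)  # toy score: distinct query terms present
                if score:
                    scored.append((score, doc_id, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


corpus = {
    13: "Francoise Gilot (born 1921) is a French born painter and is known as a "
        "companion of Picasso between 1944 and 1953.",
    14: "In 1968, they divorced, and in 1970 Salk married Francoise Gilot, "
        "the former mistress of Pablo Picasso.",
    99: "This document mentions Gilot and Salk but lies outside the subcorpus.",
}
query_terms = {"modern", "artist", "francoise", "gilot", "jonas", "salk", "companion"}
hits = retrieve_passages(corpus, {13, 14}, query_terms)
print([(score, doc_id) for score, doc_id, _ in hits])  # [(3, 13), (3, 14)]
```

Document 99 contains query terms but is never scored, because the subcorpus passed at run time excludes it.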
In further view of
A passage retrieval method 100 employed by the passage retrieval components 75 for improving the efficiency of passage retrieval is described with respect to
That is, in one embodiment, the semi-structured source of information may be formed via off-line processes that extract document metadata from one or more documents of a large corpus of documents. The extracted document metadata is stored as a dictionary in the memory storage device, with each document metadata stored in the dictionary having one or more associated document identifications (IDs) that represent those documents matching the metadata in that dictionary entry.
Then, at 110, the programmed processor device invokes a matching component to match the document metadata against the query terms. As mentioned, a dictionary matcher may be invoked that includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator.
Continuing to 115, there is next performed mapping of the matched document metadata to corresponding one or more document IDs. Then at 120, from the corresponding IDs, there is performed identifying the corresponding matched documents.
In one embodiment, for the matched document metadata found in the dictionary, the corresponding documents indicated by the mapped document IDs are identified, e.g., flagged, tagged or recorded in the corpus in which the actual documents are electronically stored with their ID. Thus, in one embodiment, the identified corresponding matched documents form the subcorpus 92 of documents including only the identified matched metadata documents of the larger corpus of documents. This step invokes corpus construction functionality to identify the subset of flagged, tagged or otherwise identified matched metadata documents obtained from the first corpus 84 (
In an alternate embodiment, there may be further performed, at 125, extracting the identified corresponding matched documents found in step 120 to form the subcorpus 92.
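The two ways of forming the subcorpus described above, flagging matched documents in place or extracting them, may be contrasted in a short sketch. The Document class, its in_subcorpus field and both function names are hypothetical; the disclosure does not prescribe a particular data structure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Document:
    doc_id: int
    text: str
    in_subcorpus: bool = False  # hypothetical per-query flag


def flag_subcorpus(corpus: Dict[int, Document], matched_ids: Iterable[int]) -> None:
    """Flag matched documents in place; clear the flag on every other document."""
    wanted = set(matched_ids)
    for doc in corpus.values():
        doc.in_subcorpus = doc.doc_id in wanted


def extract_subcorpus(corpus: Dict[int, Document]) -> Dict[int, Document]:
    """Collect the flagged documents into a separate subcorpus mapping."""
    return {doc_id: doc for doc_id, doc in corpus.items() if doc.in_subcorpus}


corpus = {i: Document(i, f"text of document {i}") for i in (1, 2, 3, 4)}
flag_subcorpus(corpus, [2, 4, 9])  # ID 9 is not in the corpus and is ignored
subcorpus = extract_subcorpus(corpus)
print(sorted(subcorpus))  # [2, 4]
```

In this sketch the flags are reset on every call, so the selection made for one question does not carry over to the next.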
Then, at 130, the method performs passage retrieval operations against those identified matched metadata documents obtained from the subcorpus 92 formed at step 120 or 125.
Finally, assuming a search engine has internal document ranking ability, then at 135, there is returned the resulting list of ranked passages.
In one embodiment, the passage retrieval process 100,
As mentioned,
Generally, as shown in
The Candidate Answer generation module 30 of architecture 10 generates a plurality of output data structures containing candidate answers based upon the analysis of retrieved data. In
As depicted in
An Answer Ranking module 60 may be invoked to provide functionality for ranking candidate answers and determining a response 99 returned to a user via a user's computer display interface (not shown) or a computer system 22, where the response may be an answer, or an elaboration of a prior answer or request for clarification in response to a question when a high quality answer to the question is not found. A machine learning implementation is further provided where the “answer ranking” module 60 includes a trained model component (not shown) produced using machine learning techniques from prior data.
The processing depicted in
In one embodiment, when employed in a QA system, the system and method of
In one embodiment, UIMA may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The UIMA system, method and computer program may be used to generate answers to input queries. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data and for identifying and annotating a particular type of semantic content. Thus it can be used to analyze a question and to extract entities as possible answers to a question from a collection of documents.
In an alternative environment, modules of
In describing the GATE processing model, any resource whose primary characteristics are algorithmic, such as parsers, generators and so on, is modeled as a Processing Resource (PR). A PR is a Resource that implements the Java Runnable interface. In the GATE Visualisation Model, resources whose task is to display and edit other resources are modeled as Visual Resources. The Corpus Model in GATE is a Java Set whose members are documents. Both Corpora and Documents are types of Language Resources (LRs), with all LRs having a Feature Map (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via an annotation model. Documents have a DocumentContent, which is a text at present (future versions may add support for audiovisual content), and one or more AnnotationSets, which are Java Sets.
Like UIMA, GATE can be used as a basis for implementing natural language dialog systems and multimodal dialog systems having a question answering system as one of the main submodules. The references, incorporated herein by reference above (U.S. Pat. Nos. 6,829,603, 6,983,252, and 7,136,909) enable one skilled in the art to build such an implementation.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Thus, in one embodiment, the system and method for efficient passage retrieval may be performed with data structures native to various programming languages such as Java and C++.
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.
Claims
1. A computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
- receiving an input query;
- performing a query analysis upon said input query to obtain searchable query terms;
- matching metadata associated with one or more documents against said query terms;
- mapping matched document metadata to corresponding one or more documents;
- identifying corresponding matched documents to form a subcorpus of documents; and
- conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents,
- wherein one or more processor devices performs one or more said retrieving, performing, matching, mapping, identifying and conducting.
2. The computer-implemented method of claim 1, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
3. The computer-implemented method of claim 2, wherein prior to matching of metadata associated with one or more documents against said query terms:
- extracting document metadata from one or more documents of a corpus of documents;
- providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with one or more corresponding document identifications.
4. The computer-implemented method of claim 3, wherein said matching of metadata against said query terms comprises: performing, by said processor device, dictionary matching.
5. The computer-implemented method of claim 2, wherein said data corpus comprising document metadata information includes variations of metadata including one or more of: singular and plural forms of metadata terms, and synonyms for metadata terms.
6. The computer-implemented method of claim 2, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
7. The computer-implemented method of claim 2, wherein said identifying corresponding matched documents to form a subcorpus of documents includes tagging or flagging each matched metadata documents in said corpus of documents.
8. The computer-implemented method of claim 2, further comprising: extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
9. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program device comprising a non-transitory storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
- receiving an input query;
- performing a query context analysis upon said input query to obtain searchable query terms;
- matching metadata associated with one or more documents against said query terms;
- mapping matched document metadata to corresponding one or more documents;
- identifying corresponding matched documents to form a subcorpus of documents; and
- conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
10. The computer program product of claim 9, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
11. The computer program product of claim 9, wherein prior to matching of metadata associated with one or more documents against said query terms:
- extracting document metadata from one or more documents of a corpus of documents;
- providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with one or more corresponding document identifications.
12. The computer program product of claim 11, wherein said matching of metadata against said query terms comprises: performing, by said processor device, dictionary matching.
13. The computer program product of claim 9, wherein said data corpus comprising document metadata information includes variations of metadata including one or more of:
- singular and plural forms of metadata terms, and synonyms for metadata terms.
14. The computer program product of claim 10, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
15. The computer program product of claim 10, wherein said identifying corresponding matched documents to form a subcorpus of documents includes tagging or flagging each matched metadata document in said corpus of documents.
16. The computer program product of claim 10, further comprising: extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
17. A computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
- receiving an input query;
- performing a query context analysis upon said input query to obtain searchable query terms;
- accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with one or more corresponding document identifications (IDs);
- performing a dictionary matching of said metadata associated with one or more documents against said query terms;
- mapping matched document metadata to corresponding one or more document IDs;
- identifying corresponding matched documents to form a subcorpus of documents; and
- conducting a search in said subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents, wherein one or more processor devices perform one or more of said receiving, performing query context analysis, accessing, performing dictionary matching, mapping, identifying and conducting.
18. The computer-implemented method of claim 17, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
19. The computer-implemented method of claim 18, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
20. The computer-implemented method of claim 17, wherein said identifying corresponding matched documents to form a subcorpus of documents includes:
- tagging or flagging each matched metadata document in said data corpus; and,
- extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
21. A system for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
- a memory storage device;
- a processor device in communication with the memory storage device that performs a method comprising:
- receiving an input query;
- performing a query context analysis upon said input query to obtain searchable query terms;
- matching metadata associated with one or more documents against said query terms;
- mapping matched document metadata to corresponding one or more documents;
- identifying corresponding matched documents to form a subcorpus of documents; and
- conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
22. The system of claim 21, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
23. The system of claim 22, wherein prior to matching of metadata associated with one or more documents against said query terms:
- extracting document metadata from one or more documents of a corpus of documents;
- providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with a corresponding document identification, wherein said matching of metadata against said query terms comprises performing a dictionary matching.
24. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program product comprising a storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
- receiving, at a processor device, an input query;
- performing, at said processor device, a query context analysis upon said input query to obtain searchable query terms;
- accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID);
- performing, by said processor device, a dictionary matching of said metadata associated with one or more documents against said query terms;
- mapping matched document metadata to corresponding one or more document IDs;
- identifying corresponding matched documents to form a subcorpus of documents; and
- conducting a search in said subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
25. The computer program product of claim 24, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
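By way of non-limiting illustration only, the steps recited in the claims above (building a metadata dictionary keyed to document IDs, dictionary matching against the query, mapping matches to documents, forming a subcorpus, and searching it for passages) may be sketched as follows. This is a minimal sketch, not the claimed implementation: the toy corpus, the function names, the stop-word list, the naive plural handling and the term-overlap scoring are all assumptions introduced here for the example.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus: document ID -> metadata (title, user tags) and text.
CORPUS = {
    "d1": {"title": "Apollo 11", "tags": ["moon landing"],
           "text": "Apollo 11 landed on the Moon in 1969. "
                   "Neil Armstrong was the first person to walk on the Moon."},
    "d2": {"title": "Neil Armstrong", "tags": ["astronaut"],
           "text": "Neil Armstrong was born in Ohio. He commanded Apollo 11."},
    "d3": {"title": "Jupiter", "tags": ["planet"],
           "text": "Jupiter is the largest planet. It has many moons."},
}

# Assumed stop-word list used by the simplified query analysis.
STOP_WORDS = {"who", "what", "where", "which", "was", "is", "the", "a", "have"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def variants(term):
    """Return the term plus a naive singular/plural variant (illustrative)."""
    t = term.lower()
    return {t, t[:-1] if t.endswith("s") else t + "s"}


def build_dictionary(corpus):
    """Map each metadata entry (and its variants) to the IDs of its documents."""
    dictionary = defaultdict(set)
    for doc_id, doc in corpus.items():
        for entry in [doc["title"], *doc["tags"]]:
            for v in variants(entry):
                dictionary[v].add(doc_id)
    return dictionary


def match_metadata(query, dictionary):
    """Dictionary matching: collect the document IDs of every metadata entry
    that occurs as a whole-word phrase in the query."""
    padded_query = " " + " ".join(tokenize(query)) + " "
    doc_ids = set()
    for entry, ids in dictionary.items():
        if " " + " ".join(tokenize(entry)) + " " in padded_query:
            doc_ids |= ids
    return doc_ids


def search_passages(corpus, doc_ids, terms):
    """Search only the subcorpus; score each sentence by query-term overlap."""
    hits = []
    for doc_id in sorted(doc_ids):
        for sentence in re.split(r"(?<=[.!?])\s+", corpus[doc_id]["text"]):
            tokens = set(tokenize(sentence))
            score = sum(1 for t in set(terms) if t in tokens)
            if score:
                hits.append((score, doc_id, sentence))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits


def retrieve(query, corpus, dictionary):
    """Run the whole pipeline and return (subcorpus IDs, ranked passages)."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    subcorpus_ids = match_metadata(query, dictionary)
    return subcorpus_ids, search_passages(corpus, subcorpus_ids, terms)


dictionary = build_dictionary(CORPUS)
subcorpus, passages = retrieve("Where was Neil Armstrong born?", CORPUS, dictionary)
print(subcorpus, passages)
```

In this sketch the query matches only the title "Neil Armstrong", so the passage search runs over a single document rather than the whole corpus. The trade-off is that a relevant passage in a document whose metadata does not match the query is not examined.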
Type: Application
Filed: Sep 24, 2011
Publication Date: Mar 29, 2012
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION (Armonk, NY)
Inventors: Jennifer Chu-Carroll (Hawthorne, NY), David A. Ferrucci (Yorktown Heights, NY)
Application Number: 13/244,347
International Classification: G06F 17/30 (20060101);