ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE

GLOBALSPEC, INC.

The present invention provides a method of minimizing accesses to secondary storage when searching an inverted index for a search term. The method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size. Corresponding computer system and program products are also provided.

Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119 to the following U.S. Provisional Applications, which are herein incorporated by reference in their entirety:

Provisional Patent Application Ser. No. 61/233,411, by Flatland et al., entitled “ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE,” filed on Aug. 12, 2009;

Provisional Patent Application Ser. No. 61/233,420, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION,” filed on Aug. 12, 2009; and

Provisional Patent Application Ser. No. 61/233,427, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER,” filed on Aug. 12, 2009.

This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:

U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A); and

U.S. Non-Provisional patent application Ser. No. ______, by Flatland et al., entitled “SEGMENTING POSTINGS LIST READER” (Attorney Docket No. 1634.070A).

TECHNICAL FIELD

The present invention generally relates to searching an inverted index. More particularly, the invention relates to estimating a posting list size based on document frequency in order to minimize accesses to the posting list stored in secondary storage.

BACKGROUND

The following definition of Information Retrieval (IR) is from the book Introduction to Information Retrieval by Manning, Raghavan and Schütze, Cambridge University Press, 2008:

    • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

An inverted index is a data structure central to the design of numerous modern information retrieval systems. In chapter 5 of Search Engines: Information Retrieval in Practice (Addison Wesley, 2010), Croft, Metzler and Strohman observe:

    • An inverted index is the computational equivalent of the index found in the back of this textbook . . . . The book index is arranged in alphabetical order by index term. Each index term is followed by a list of pages about the word.

In a search system implemented using a computer, an inverted index 100 often comprises two related data structures (see FIG. 1):

    • 1. A lexicon 101 contains the distinct set of terms 102 (i.e., with duplicates removed) that occur throughout all the documents of the index. To facilitate rapid searching, terms in the lexicon are usually stored in sorted order. Each term typically includes a document frequency 104 and a pointer into the other major data structure of the inverted index, the posting file 108. The document frequency is a count of the number of documents in which a term occurs. The document frequency is useful at search time both for prioritizing term processing and as input to scoring algorithms.
    • 2. The posting file 108 consists of one posting list per term in the lexicon, e.g., list 110 for term 112, recording for each term the set of documents in which the term occurs. Each entry in a posting list is called a posting. The number of postings in a given posting list equals the document frequency of the associated lexicon entry. A posting includes at least a document identifier and may include additional information such as: a count of the number of times the term occurs in the document; a list of term positions within the document where the term occurs; and more generally, scoring information that ascribes some degree of importance (or lack thereof) to the fact that the document contains the term.
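
For concreteness, the following minimal sketch renders the lexicon entry and posting described above in Java. The names are illustrative rather than taken from the disclosure, and practical indexes typically compress postings rather than storing plain fields:

    // Illustrative sketch of the two inverted index structures described above.
    // Reference numerals from FIG. 1 are noted in comments; all names are hypothetical.
    class LexiconEntry {
        String term;              // a distinct term (102) from the document collection
        int documentFrequency;    // number of documents containing the term (104)
        long postingListAddress;  // byte offset of the term's posting list in the posting file (108)
    }

    class Posting {
        int documentId;           // identifier of a document containing the term
        int termFrequency;        // optional: occurrences of the term in the document
        int[] positions;          // optional: positions within the document where the term occurs
    }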

When processing a user's query, a computerized search system needs access to the postings of the terms that describe the user's information need. As part of processing the query, the search system aggregates information from these postings, by document, in an accumulation process that leads to a ranked list of documents to answer the user's query.

A large inverted index may not fit into a computer's main memory, requiring secondary storage, typically disk storage, to help store the posting file, lexicon, or both. Each separate access to disk may incur seek time on the order of several milliseconds if it is necessary to move the hard drive's read heads, which is very expensive in terms of runtime performance compared to accessing main memory.

Therefore, it would be helpful to minimize accesses to secondary storage when searching an inverted index, in order to improve runtime performance.

BRIEF SUMMARY OF THE INVENTION

The present invention provides, in a first aspect, a method of minimizing accesses to secondary storage when searching an inverted index for a search term. The method comprises automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.

The present invention provides, in a second aspect, a computer system for minimizing accesses to secondary storage when searching an inverted index for a search term. The computer system comprises a memory, and a processor in communication with the memory to perform a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.

The present invention provides, in a third aspect, a program product for minimizing accesses to secondary storage when searching an inverted index for a search term. The program product comprises a storage medium readable by a processor and storing instructions for execution by the processor for performing a method. The method comprises automatically obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.

The present invention provides, in a fourth aspect, a data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term. The data structure comprises a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies, and a posting list length approximation for each range ID.

These, and other objects, features and advantages of this invention will become apparent from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 depicts one example of an inverted index consisting of a lexicon and corresponding posting file.

FIG. 2 depicts one example of a posting list length approximation table data structure, according to one aspect of the present invention.

FIG. 3 is a flow diagram for one example of a method of reading a posting list in accordance with one or more aspects of the present invention.

FIG. 4 depicts one example of an inverted index with the storage split between main memory and secondary storage.

FIG. 5 is an object oriented instance diagram showing one example of a posting list reader and the main objects it uses, in accordance with the present invention.

FIG. 6 is a block diagram of one example of a computing unit incorporating one or more aspects of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention approximates posting list size, preferably as a length in bytes, according to a term's document frequency. The approximate posting list size is preferably predetermined, and it covers, with high probability, the size of the associated posting list in secondary storage. Knowing the approximate size is useful for minimizing the number of accesses to secondary storage when reading a posting list. For example, if the approximate covering read size is several megabytes or less, a highly efficient strategy is to scoop up the whole posting list in a single access to secondary storage through a single read system call. If the approximate covering read size is larger than the largest available main memory input buffer, then the list can be read, for example, by filling the largest available input buffer several times using a single system call per buffer fill operation, and then doing one more partial read to pick up the remainder of the approximate covering read size. For the rare case where the approximate covering read size does not cover the posting list being read, additional supplemental reads can be issued as necessary.

U.S. Non-Provisional Patent Application entitled “EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION” (Attorney Docket No. 1634.069A), filed concurrently herewith, describes an enhanced buffered reader that can be configured with predetermined buffer fill size strategies. When the posting file is in secondary storage, using an enhanced buffered reader to read a posting list offers advantages over a conventional buffered reader. An enhanced buffered reader can be configured with a predetermined buffer fill size strategy that is based on both the size of the posting list (in bytes, for example) and the size of the available input buffer, ensuring that the minimum required number of system calls to secondary storage is issued. Another advantage is that buffer management details are neatly encapsulated inside the enhanced buffered reader. The detailed description of the present invention assumes a working understanding of enhanced buffered readers.

Reading a posting list with a predetermined buffer fill size strategy requires knowledge of the size of the posting list in terms of buffer elements (typically bytes) before reading begins. When an inverted index is small enough that the lexicon fits entirely into memory, determining the size of a posting list in bytes is simple. For example, referring back to FIG. 1, and assuming that the lexicon is entirely in main memory, simply subtract adjacent posting list addresses to obtain the size in bytes of a given posting list, as sketched below. As will become apparent in the detailed description that follows, an advantage of the present invention is that the sizing information needed for efficient reading from secondary storage can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon.
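
As a sketch of this point, assuming the LexiconEntry layout sketched earlier and assuming posting lists are laid out in the posting file in lexicon order:

    // Sketch: with the whole lexicon in main memory, the byte size of posting
    // list i is the gap between adjacent posting list addresses; the last list
    // ends at the total posting file length.
    static long postingListSizeInBytes(LexiconEntry[] entries, int i, long postingFileLength) {
        long start = entries[i].postingListAddress;
        long end = (i + 1 < entries.length) ? entries[i + 1].postingListAddress
                                            : postingFileLength;
        return end - start;
    }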

The present invention uses a posting list range to predetermine approximate posting list size. A posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency. A posting list is a member of a posting list range if the posting list's document frequency falls within the inclusive minimum and maximum of the range. Posting lists that are part of the same posting list range will have the same approximate posting list size.

As a prerequisite for populating the Posting List Length Approximation Table data structure pictured in FIG. 2, and given an inverted index, the posting lists in the inverted index are partitioned into a collection of non-overlapping ranges whose union is the complete set of posting lists in the index. Each of these ranges is assigned a unique range identifier (rangeId).

One example of a way to accomplish this partitioning of posting lists is through a function called documentFrequencyToRangeIdTranslator, shown below and summarized here. The function takes as input an integer that is the length of a posting list in number of postings, also known as the document frequency. The function returns the ID of the range that includes the posting list whose document frequency was passed in. ln( ) is the natural logarithm function, and ceil( ) is a function that rounds a number with a fractional part to the next higher integer.

       documentFrequencyToRangeIdTranslator

       int documentFrequencyToRangeIdTranslator(int documentFrequency) {
           return ceil(ln(documentFrequency) / ln(2.0));
       }

Table I below shows how the implementation of documentFrequencyToRangeIdTranslator above partitions posting lists into posting list ranges. This implementation has been found to work well in practice with a natural language corpus in which the word distribution adheres to Zipf's law. Each successive rangeId includes twice as many document frequencies as the preceding rangeId.

TABLE I
Sample Range Definitions

    minDocumentFrequency    maxDocumentFrequency    rangeId
    1                       1                       0
    2                       2                       1
    3                       4                       2
    5                       8                       3
    9                       16                      4
    17                      32                      5
    etc.

Other implementations of the documentFrequencyToRangeIdTranslator are possible. The above is merely one example. This function could be implemented in any way that defines a complete non-overlapping partitioning of the posting lists into ranges.

Given an inverted index, the Posting List Length Approximation Table data structure pictured in FIG. 2 can be created in one example as follows. For each posting list range, compute the mean and standard deviation of posting list size as stored in secondary storage, preferably in bytes. Next, create a Posting List Length Approximation object consisting of the rangeId of the current range, the mean of posting list size, and the standard deviation of posting list size. Finally, add a hash table entry to the Posting List Length Approximation Table mapping the rangeId to the Posting List Length Approximation object. One way to carry out this construction is sketched below.
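
A minimal sketch of this construction in Java follows; apart from documentFrequencyToRangeIdTranslator, which mirrors the function shown earlier, the class and method names are hypothetical:

    import java.util.*;

    // Sketch: one per-range approximation record plus a one-pass table builder.
    class PostingListLengthApproximation {
        final int rangeId;
        final double meanLengthBytes;     // mean posting list size in bytes for the range
        final double stdDevLengthBytes;   // standard deviation of posting list size in bytes

        PostingListLengthApproximation(int rangeId, double mean, double stdDev) {
            this.rangeId = rangeId;
            this.meanLengthBytes = mean;
            this.stdDevLengthBytes = stdDev;
        }
    }

    class ApproximationTableBuilder {
        // Java rendering of the translator function shown earlier.
        static int documentFrequencyToRangeIdTranslator(int documentFrequency) {
            return (int) Math.ceil(Math.log(documentFrequency) / Math.log(2.0));
        }

        // Group stored posting list sizes by rangeId, then compute per-range
        // mean and standard deviation, and map rangeId to the approximation.
        static Map<Integer, PostingListLengthApproximation> buildTable(
                int[] documentFrequencies, long[] postingListSizesInBytes) {
            Map<Integer, List<Long>> sizesByRange = new HashMap<>();
            for (int i = 0; i < documentFrequencies.length; i++) {
                int rangeId = documentFrequencyToRangeIdTranslator(documentFrequencies[i]);
                sizesByRange.computeIfAbsent(rangeId, k -> new ArrayList<>())
                            .add(postingListSizesInBytes[i]);
            }
            Map<Integer, PostingListLengthApproximation> table = new HashMap<>();
            for (Map.Entry<Integer, List<Long>> e : sizesByRange.entrySet()) {
                double mean = e.getValue().stream().mapToLong(Long::longValue).average().orElse(0.0);
                double variance = e.getValue().stream()
                        .mapToDouble(s -> (s - mean) * (s - mean)).average().orElse(0.0);
                table.put(e.getKey(), new PostingListLengthApproximation(
                        e.getKey(), mean, Math.sqrt(variance)));
            }
            return table;
        }
    }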

FIG. 2 depicts one example of a data structure 200 for a posting list length approximation table, in accordance with one aspect of the present invention. The data structure comprises a hash table 210 with keys 220 and associated values 230. The keys comprise a plurality of range IDs 240, as described above. The associated values comprise the posting list length approximation information 250. In the presently preferred embodiment, the length approximation information is based on a predetermined length. The information comprises, for example, the corresponding range ID, a mean posting list length, and a standard deviation for the posting list length. The mean length and standard deviation are preferably expressed in bytes.

In one example, in addition to the structure shown in FIG. 2, the posting list length approximation table has an access method getPostingListLengthApproximation(documentFrequency) which returns a Posting List Length Approximation object based on a document frequency passed in. In the present example, the implementation of this method translates the document frequency to a rangeId using the documentFrequencyToRangeIdTranslator function discussed earlier. This rangeId is then used to do a hash table lookup to find the proper Posting List Length Approximation object to return. The resulting Posting List Length Approximation object can then be turned into an approximate covering read size by, for example, adding the desired number of standard deviations to the mean posting list length in bytes, as sketched below.
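
Continuing the sketch above, the access method and the derivation of an approximate covering read size might look like this; the number of standard deviations is an assumed tuning parameter:

    // Sketch: document frequency -> rangeId -> hash table lookup.
    static PostingListLengthApproximation getPostingListLengthApproximation(
            Map<Integer, PostingListLengthApproximation> table, int documentFrequency) {
        int rangeId = ApproximationTableBuilder.documentFrequencyToRangeIdTranslator(documentFrequency);
        return table.get(rangeId);
    }

    // Sketch: mean plus the desired number of standard deviations, in bytes.
    static long approximateCoveringReadSize(PostingListLengthApproximation a, double numStdDevs) {
        // e.g., numStdDevs = 2.0 covers the associated posting list with high probability
        return (long) Math.ceil(a.meanLengthBytes + numStdDevs * a.stdDevLengthBytes);
    }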

One example of how to use a posting list length approximation table to read a posting list efficiently will now be provided with reference to the flow diagram 300 of FIG. 3. The inverted index in this example has a structure similar to that of FIG. 1. In this scenario, the inverted index is large enough that the posting file is entirely in secondary storage and only half of the lexicon fits into main memory.

FIG. 4 shows how the inverted index 400 is divided between main memory 402, storing the lexicon index 404, and secondary storage 406, storing the full lexicon 408 and the posting file 410. Referring to FIG. 4, let N be the total number of terms in the full lexicon in secondary storage. Due to memory constraints, only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory. The full lexicon in secondary storage is preferably organized as a sequence of blocks, e.g., block 412, each of a constant size k (e.g., in bytes) such that any block can accommodate the largest pair of lexicon entries in the lexicon. This causes some internal fragmentation within the full lexicon, but the advantage is that the lexicon index does not need to store explicit disk pointers into the full lexicon. Instead, to locate the block in the full lexicon corresponding to the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon. By design, the lexicon index includes document frequency but does not include posting list size in bytes. The goal is to keep the main memory lexicon data structure as compact as possible. The Posting List Length Approximation Table will provide the needed sizing information for efficient reading of posting lists in secondary storage.

In this example, it is assumed that the search engine implementation uses an object called a Posting List Reader 500, shown in FIG. 5, to read postings from secondary storage during query processing. The Posting List Reader uses a Posting List Length Approximation Table 502 to accurately estimate the sizes of posting lists to be read. It uses an Enhanced Buffered Reader 504 with an internal buffer of size bufsize bytes to read postings 506 from secondary storage using efficient predetermined buffer fill size strategies. Preferably, bufsize is relatively large (for example several megabytes) to facilitate reading large posting lists with relatively few read system calls. The Posting List Reader provides the following access methods:

    • initialize(documentFrequency, postingListAddress)—Prepares the Posting List Reader for reading based on a document frequency and posting list address of a term obtained from the lexicon. After initialization, the readPosting( ) method may be used.

    • readPosting( )—Reads the next posting from the posting list.

When a user runs a query, the search system first parses the query, identifies the terms for which postings are needed to process the query, and locates each of these terms in the lexicon to obtain a document frequency and posting list address for each. Assuming a lexicon structured similarly to that shown in FIG. 4, a term's document frequency and posting list address can be retrieved without accessing secondary storage about half the time by doing a binary search of the lexicon index in main memory, which is very fast. If necessary, a disk seek can be used to find the term in the full lexicon in secondary storage by seeking to offset i*k in the full lexicon and reading the lexicon entries there, where i is the zero-based record offset in the lexicon index of the lexically greatest term that is lexically less than the sought term, and k is the block size of the blocks in the full lexicon. Having obtained a document frequency and posting list address for a term, the search system initializes a Posting List Reader, preparing it to read postings, as discussed below.
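
A minimal sketch of this two-level lookup, under the FIG. 4 layout just described, follows; the block-decoding helper readBlockEntries is hypothetical, since the on-disk entry format is not specified here:

    import java.io.*;
    import java.util.*;

    // Sketch: binary search of the in-memory lexicon index, falling back to a
    // single disk seek into the block-structured full lexicon when necessary.
    class LexiconLookup {
        static LexiconEntry lookup(LexiconEntry[] lexiconIndex, RandomAccessFile fullLexicon,
                                   long k, String term) throws IOException {
            int lo = 0, hi = lexiconIndex.length - 1, i = -1;
            while (lo <= hi) {                          // in-memory binary search, no disk I/O
                int mid = (lo + hi) >>> 1;
                int cmp = lexiconIndex[mid].term.compareTo(term);
                if (cmp == 0) return lexiconIndex[mid]; // hit in the lexicon index
                if (cmp < 0) { i = mid; lo = mid + 1; } else { hi = mid - 1; }
            }
            if (i < 0) return null;                     // term sorts before every indexed term
            fullLexicon.seek(i * k);                    // single disk seek: block of record i
            for (LexiconEntry e : readBlockEntries(fullLexicon)) {
                if (e.term.equals(term)) return e;
            }
            return null;                                // term is not in the index
        }

        // Hypothetical helper: decode the pair of lexicon entries stored in one block.
        static List<LexiconEntry> readBlockEntries(RandomAccessFile f) throws IOException {
            throw new UnsupportedOperationException("depends on the on-disk entry format");
        }
    }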

Returning to FIG. 3, the Posting List Reader receives an initialize request (step 302) that includes a document frequency and a posting list address. The document frequency is the length of the posting list to read in number of postings, and the posting list address is the byte offset in the posting file where the posting list to read starts.

The Posting List Reader obtains a Posting List Length Approximation object (step 304) by calling the getPostingListLengthApproximation method on the Posting List Length Approximation Table pictured in FIG. 2, passing the document frequency to this getter. (The implementation of getPostingListLengthApproximation in turn translates the document frequency passed in to a rangeId using the documentFrequencyToRangeIdTranslator function described earlier and does a hash table lookup in the Posting List Length Approximation table based on the rangeId to obtain the Posting List Length Approximation object.)

The Posting List Reader next obtains the approximate size of the posting list to read (step 306) by getting the mean and standard deviation of posting list length from the Posting List Length Approximation object and adding the desired number of standard deviations to the mean. Let approximateReadSize be the approximate read size calculated in this step.

The next step in initializing the Posting List Reader is to build a predetermined buffer fill size strategy (step 308) for use with the Enhanced Buffered Reader. A predetermined buffer fill size strategy is an ordered sequence of (fillSize, numTimesToUse) pairs, where fillSize indicates how much of the Enhanced Buffered Reader's internal input buffer to fill when a buffer fill is needed, and numTimesToUse indicates how many times to use the associated fillSize. There are two cases to consider, based on the relative sizes of the bufsize (the Enhanced Buffered Reader's internal buffer size) and approximateReadSize.

Case 1: approximateReadSize <= bufsize; and

Case 2: approximateReadSize > bufsize.

A discussion of these cases follows.

Case 1: approximateReadSize <= bufsize

Build a two-stage predetermined buffer fill size strategy as indicated below in Table II.

TABLE II

    Stage    Fill Size              Number of Times to Use
    1        approximateReadSize    1
    2        8 kilobytes            Repeat as necessary

The above two-stage strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will with high probability result in a single disk seek and read of exactly approximateReadSize bytes. As many supplemental 8 kilobyte reads as necessary may then be issued to handle the relatively rare case when the approximateReadSize is insufficient.

Case 2: approximateReadSize > bufsize

For this discussion, let “/” represent the operation of integer division, and “%” represent the operation of integer modulo.

In this case, a predetermined buffer fill size strategy is built that generally has three stages, as indicated in the following table. However, the second stage is not necessary when bufsize divides approximateReadSize evenly.

TABLE III

    Stage    Fill Size                        Number of Times to Use
    1        bufsize                          approximateReadSize / bufsize
    2        approximateReadSize % bufsize    1
    3        8 kilobytes                      Repeat as necessary

The above strategy, when installed in an Enhanced Buffered Reader and used to read the posting list, will utilize the available input buffer of size bufsize bytes to read approximateReadSize bytes of data using a minimal number of disk seeks and minimal data transfer. The approximateReadSize is sufficient to read the entire posting list with high probability; however, as many supplemental 8 kilobyte reads as necessary will be issued to handle the relatively rare case when the approximateReadSize is insufficient.
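
The following sketch renders both cases of step 308 in Java; the FillStage record and the repeat-as-necessary sentinel are assumptions introduced for illustration:

    import java.util.*;

    // Sketch of step 308: build the predetermined buffer fill size strategy
    // of Tables II and III as an ordered list of (fillSize, numTimesToUse) pairs.
    class FillSizeStrategy {
        static final int REPEAT_AS_NECESSARY = -1;           // assumed sentinel value
        static final long SUPPLEMENTAL_READ_SIZE = 8 * 1024; // 8 kilobytes

        record FillStage(long fillSize, int numTimesToUse) {}

        static List<FillStage> build(long approximateReadSize, long bufsize) {
            List<FillStage> strategy = new ArrayList<>();
            if (approximateReadSize <= bufsize) {
                // Case 1 (Table II): one read of exactly approximateReadSize bytes
                strategy.add(new FillStage(approximateReadSize, 1));
            } else {
                // Case 2 (Table III): whole-buffer fills, then the remainder
                strategy.add(new FillStage(bufsize, (int) (approximateReadSize / bufsize)));
                long remainder = approximateReadSize % bufsize;
                if (remainder > 0) {                         // stage 2 omitted on even division
                    strategy.add(new FillStage(remainder, 1));
                }
            }
            // Supplemental 8 KB reads for the rare case of an under-estimate
            strategy.add(new FillStage(SUPPLEMENTAL_READ_SIZE, REPEAT_AS_NECESSARY));
            return strategy;
        }
    }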

Referring once again to FIG. 3, the next step in initializing the Posting List Reader is to seek the Enhanced Buffered Reader to the start of the posting list (step 310). The posting list address that was passed to the initialize request (step 302) is forwarded to the Enhanced Buffered Reader's seek method.

Finally, the predetermined buffer fill size strategy of step 308 is installed in the Enhanced Buffered Reader (step 312), by calling the appropriate setter. The posting list reader is now ready to start processing read requests for postings (step 314). As the search system's search logic issues read requests as desired, the Enhanced Buffered Reader automatically initiates buffer refilling as needed using read sizes consistent with good runtime performance when accessing secondary storage.
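
Pulling steps 302 through 312 together, the initialization might be rendered as follows. This sketch reuses the earlier helper sketches; the Enhanced Buffered Reader's seek and strategy-setter method names, as well as the fields table, reader, bufsize and NUM_STD_DEVS, are assumptions rather than names from the original disclosure:

    // Sketch of Posting List Reader initialization, steps 302-312 of FIG. 3.
    void initialize(int documentFrequency, long postingListAddress) throws IOException {
        PostingListLengthApproximation a =
                getPostingListLengthApproximation(table, documentFrequency);   // step 304
        long approximateReadSize =
                approximateCoveringReadSize(a, NUM_STD_DEVS);                  // step 306
        List<FillSizeStrategy.FillStage> strategy =
                FillSizeStrategy.build(approximateReadSize, bufsize);          // step 308
        reader.seek(postingListAddress);                                       // step 310
        reader.setPredeterminedBufferFillSizeStrategy(strategy);               // step 312
        // step 314: readPosting() may now be called repeatedly
    }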

As shown in FIG. 6, one example of a data processing system 600 suitable for storing and/or executing program code includes at least one processor 610 coupled directly or indirectly to memory elements through a system bus 620. As known in the art, the memory elements include, for instance, data buffers 630 and 640, local memory employed during actual execution of the program code, bulk storage 650, and cache memory which provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/Output or I/O devices 660 (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.

The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer program product for minimizing accesses to secondary storage for a posting list when searching an inverted index for a search term. The computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by a computer for performing a method. The method includes, for instance, automatically obtaining a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.

Methods and systems relating to one or more aspects of the present invention are also described and claimed herein. Further, services relating to one or more aspects of the present invention are also described and may be claimed herein.

Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.

In one aspect of the present invention, an application can be deployed for performing one or more aspects of the present invention. As one example, the deploying of an application comprises providing computer infrastructure operable to perform one or more aspects of the present invention.

As a further aspect of the present invention, a computing infrastructure can be deployed comprising integrating computer readable code into a computing system, in which the code in combination with the computing system is capable of performing one or more aspects of the present invention.

As yet a further aspect of the present invention, a process for integrating computing infrastructure comprising integrating computer readable code into a computer system may be provided. The computer system comprises a computer readable medium, in which the computer readable medium comprises one or more aspects of the present invention. The code in combination with the computer system is capable of performing one or more aspects of the present invention.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In one example, a computer program product includes, for instance, one or more computer readable media to store computer readable program code means or logic thereon to provide and facilitate one or more aspects of the present invention. The computer program product can take many different physical forms, for example, disks, platters, flash memory, etc., including those above.

Program code embodied on a computer readable medium may be transmitted using an appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language, such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method of minimizing accesses to secondary storage when searching an inverted index for a search term, the method comprising:

obtaining by at least one computing unit a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage; and
reading by the at least one computing unit at least a portion of the posting list into memory based on the predetermined size.

2. The method of claim 1, wherein the size is a length in bytes.

3. The method of claim 1, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.

4. The method of claim 3, wherein the reading comprises issuing a single read system call.

5. The method of claim 3, wherein the predetermined minimum size comprises a size of a main memory input buffer.

6. The method of claim 1, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.

7. The method of claim 6, wherein the predetermined minimum size comprises a size of a main memory input buffer.

8. The method of claim 7, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.

9. The method of claim 1, wherein the obtaining comprises:

partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.

10. The method of claim 9, wherein each successive maximum document frequency is twice that of an immediate prior one.

11. A computer system for minimizing accesses to secondary storage when searching an inverted index for a search term, the computer system comprising:

a memory; and
a processor in communication with the memory to perform a method, the method comprising: obtaining a predetermined size of a posting list for the search term based on document frequency for the search term, wherein the posting list is stored in secondary storage; and reading at least a portion of the posting list into memory based on the predetermined size.

12. The system of claim 11, wherein the size is a length in bytes.

13. The system of claim 11, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.

14. The system of claim 13, wherein the reading comprises issuing a single read system call.

15. The system of claim 13, wherein the predetermined minimum size comprises a size of a main memory input buffer.

16. The system of claim 11, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.

17. The system of claim 16, wherein the predetermined minimum size comprises a size of a main memory input buffer.

18. The system of claim 17, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.

19. The system of claim 11, wherein the obtaining comprises:

partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.

20. The system of claim 19, wherein each successive maximum document frequency is twice that of an immediate prior one.

21. A program product for minimizing accesses to secondary storage when searching an inverted index for a search term, the program product comprising:

a storage medium readable by a processor and storing instructions for execution by the processor for performing a method, the method comprising: obtaining by at least one computing unit a predetermined size of a posting list for the search term, the predetermined size based on document frequency for the search term, wherein the posting list is stored in secondary storage; and reading by the at least one computing unit at least a portion of the posting list into memory based on the predetermined size.

22. The program product of claim 21, wherein the size is a length in bytes.

23. The program product of claim 21, wherein if the size obtained is a predetermined minimum size or less, then the reading comprises reading all of the posting list at once.

24. The program product of claim 23, wherein the reading comprises issuing a single read system call.

25. The program product of claim 23, wherein the predetermined minimum size comprises a size of a main memory input buffer.

26. The program product of claim 21, wherein if the predetermined size is greater than a predetermined minimum size, then the reading comprises performing a plurality of read operations.

27. The program product of claim 26, wherein the predetermined minimum size comprises a size of a main memory input buffer.

28. The program product of claim 27, wherein the performing comprises filling a largest available main memory input buffer a minimum number of times.

29. The program product of claim 21, wherein the obtaining comprises:

partitioning all posting lists in the inverted index into a plurality of non-overlapping ranges, each range having a minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and
using the range ID to look up the predetermined size.

30. The program product of claim 29, wherein each successive maximum document frequency is twice that of an immediate prior one.

31. A data structure for use in minimizing accesses to data stored in secondary storage when searching an inverted index for a search term, the data structure comprising:

a posting list length approximation table, comprising a hash table, the hash table comprising: a plurality of range IDs, each range ID corresponding to a subset of posting lists of predetermined similar size and representing a non-overlapping range of document frequencies; and a posting list length approximation for each range ID.

32. The data structure of claim 31, wherein the posting list length approximation is a length in bytes.

33. The data structure of claim 32, wherein the posting list length approximation comprises a mean length and a standard deviation length.

Patent History
Publication number: 20110040761
Type: Application
Filed: Aug 11, 2010
Publication Date: Feb 17, 2011
Applicant: GLOBALSPEC, INC. (East Greenbush, NY)
Inventors: Steinar Flatland (Clifton Park, NY), Jeff J. Dalton (Northampton, MA)
Application Number: 12/854,726