Patents by Inventor John W. Tukey

John W. Tukey has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Method and apparatus for information access employing overlapping clusters

Patent number: 5999927

Abstract: The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents, and more particularly to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.

Type: Grant

Filed: April 24, 1998

Date of Patent: December 7, 1999

Assignee: Xerox Corporation

Inventors: John W. Tukey, Jan O. Pedersen

Method of ordering document clusters given some knowledge of user interests

Patent number: 5911140

Abstract: A method of automatically ordering the presentation of document clusters generated from a ranked corpus of documents. First, the corpus is ordered into a plurality of clusters. Next, a rank is determined for each cluster based upon the rank of a document within that cluster. Afterward, the clusters are presented to a computer user in the order determined by their rank.

Type: Grant

Filed: December 14, 1995

Date of Patent: June 8, 1999

Assignee: Xerox Corporation

Inventors: John W. Tukey, Jan O. Pedersen
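
The ordering described in the abstract of patent 5911140 can be illustrated with a minimal sketch. The abstract says only that a cluster's rank is based on the rank of a document within it; taking the best (lowest-numbered) document rank is an assumption made here for illustration, and the function and variable names are hypothetical.

```python
from typing import Dict, List


def order_clusters_by_rank(clusters: Dict[str, List[str]],
                           doc_rank: Dict[str, int]) -> List[str]:
    """Order cluster ids by the best (lowest) rank among their documents.

    Assumption: rank 1 is the most relevant document, and a cluster
    inherits the rank of its best-ranked member.
    """
    def cluster_rank(cluster_id: str) -> int:
        return min(doc_rank[doc] for doc in clusters[cluster_id])

    return sorted(clusters, key=cluster_rank)


# Toy data: three clusters over five ranked documents.
clusters = {"a": ["d3", "d5"], "b": ["d1", "d4"], "c": ["d2"]}
doc_rank = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
ordered = order_clusters_by_rank(clusters, doc_rank)
print(ordered)  # ['b', 'c', 'a']
```

Cluster "b" is presented first because it contains the top-ranked document, even though it also contains a lower-ranked one.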

Automatic method of identifying drop words in a document image without performing character recognition

Patent number: 5850476

Abstract: A method of automatically identifying drop words in a document image without performing character recognition to generate an ASCII representation of the document text. First, the document image is analyzed to identify word equivalence classes, each of which represents at least one word of the multiplicity of words included in the document. Second, for each word equivalence class, the likelihood that it is not a drop word is determined. Third, document length is analyzed to determine whether the document is short. For a short document, the number of word equivalence classes identified as drop words based upon their likelihood is proportional to document length. For long documents, a fixed number of word equivalence classes are identified as drop words based upon the likelihood that they are not drop words.

Type: Grant

Filed: December 14, 1995

Date of Patent: December 15, 1998

Assignee: Xerox Corporation

Inventors: Francine R. Chen, John W. Tukey
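
The length-dependent selection rule in the abstract of patent 5850476 can be sketched as follows. The abstract gives no numeric values, so the threshold for a "short" document, the proportionality constant, and the fixed count are all hypothetical placeholders, as are the names used.

```python
from typing import Dict, List

# All three constants are illustrative assumptions, not values from the patent.
SHORT_DOC_MAX_WORDS = 500   # documents below this length count as "short"
WORDS_PER_DROP_WORD = 50    # short documents: one drop word per 50 words
FIXED_DROP_COUNT = 10       # long documents: fixed number of drop words


def select_drop_words(not_drop_likelihood: Dict[str, float],
                      doc_length: int) -> List[str]:
    """Pick the word equivalence classes least likely to be content words."""
    if doc_length < SHORT_DOC_MAX_WORDS:
        count = doc_length // WORDS_PER_DROP_WORD  # proportional to length
    else:
        count = FIXED_DROP_COUNT                   # fixed for long documents
    # Lowest likelihood of NOT being a drop word comes first.
    ranked = sorted(not_drop_likelihood, key=lambda c: not_drop_likelihood[c])
    return ranked[:count]


# Toy likelihoods for four word equivalence classes.
likelihood = {"c1": 0.05, "c2": 0.9, "c3": 0.1, "c4": 0.7}
short_doc_drops = select_drop_words(likelihood, doc_length=100)
long_doc_drops = select_drop_words(likelihood, doc_length=5000)
print(short_doc_drops)  # ['c1', 'c3']
```

In a real system the keys would identify word image equivalence classes rather than recognized text, since no character recognition is performed.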

Automatic method of generating thematic summaries from a document image without performing character recognition

Patent number: 5848191

Abstract: A method of automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text. The method begins with decomposition of the document image into text blocks and text lines. Using the median x-height of text blocks, the main body of text is identified. Afterward, word image equivalence classes and sentence boundaries within the blocks of the main body of text are determined. The word image equivalence classes are used to identify thematic words. These, in turn, are used to score the sentences within the main body of text, and the highest scoring sentences are selected for extraction.

Type: Grant

Filed: December 14, 1995

Date of Patent: December 8, 1998

Assignee: Xerox Corporation

Inventors: Francine R. Chen, Dan S. Bloomberg, John W. Tukey
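
The final scoring and selection steps in the abstract of patent 5848191 can be sketched on already-extracted data. The sketch assumes each sentence is a list of integer equivalence-class ids, that thematic words are the most frequent classes once drop words are excluded, and that a sentence's score is its count of thematic words. None of these details are stated in the abstract, and all names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def summarize(sentences: List[List[int]], drop_classes: Set[int],
              n_thematic: int = 2, n_sentences: int = 2) -> List[int]:
    """Return indices of the highest-scoring sentences, in document order."""
    # Count every non-drop word equivalence class across the main text.
    counts = Counter(c for s in sentences for c in s if c not in drop_classes)
    # Assumption: the most frequent remaining classes are the thematic words.
    thematic = {c for c, _ in counts.most_common(n_thematic)}
    # Score each sentence by how many thematic word images it contains.
    scores = [sum(1 for c in s if c in thematic) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:n_sentences]
    return sorted(top)


# Toy input: class 0 plays the role of a drop word such as "the".
sentences = [[0, 1, 2], [0, 3, 3, 1], [0, 4, 3, 1, 1, 3]]
summary = summarize(sentences, drop_classes={0})
print(summary)  # [1, 2]
```

Because the input is class ids rather than text, the same selection works without ever knowing which words the images represent.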

Method of ordering document clusters without requiring knowledge of user interests

Patent number: 5787420

Abstract: A computerized method of ordering document clusters for presentation after browsing a corpus of documents that presents document clusters in a logical fashion in the absence of any indication of the computer user's interests. The method begins by grouping the corpus into a plurality of clusters, each having a centroid and including at least one document. Next, for each cluster a degree of similarity between that cluster and every other cluster is determined by finding a dot product between each cluster centroid and every other cluster centroid. The similarity information is then used to determine an order of presentation for the plurality of clusters in a way that maximizes the degree of similarity between adjacent clusters.

Type: Grant

Filed: December 14, 1995

Date of Patent: July 28, 1998

Assignee: Xerox Corporation

Inventors: John W. Tukey, Jan O. Pedersen
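
The centroid-similarity ordering in the abstract of patent 5787420 can be illustrated with a greedy sketch. The abstract does not say how the ordering is optimized, so the nearest-neighbor chaining below is a swapped-in heuristic chosen for brevity. It tends to place similar clusters next to each other but does not guarantee a maximum. All names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def order_by_similarity(centroids: List[Sequence[float]]) -> List[int]:
    """Greedily chain clusters so that each one follows its most similar
    not-yet-placed neighbor (a heuristic, not an exact optimizer)."""
    if not centroids:
        return []
    order = [0]                                  # arbitrary starting cluster
    remaining = list(range(1, len(centroids)))
    while remaining:
        last = centroids[order[-1]]
        nxt = max(remaining, key=lambda i: dot(last, centroids[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy centroids in a three-term vector space.
centroids = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0),
             (0.8, 0.6, 0.0), (0.0, 0.6, 0.8)]
presentation_order = order_by_similarity(centroids)
print(presentation_order)  # [0, 2, 3, 1]
```

An exact formulation would search over all orderings for the largest sum of adjacent similarities, which is costly for more than a handful of clusters.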

Method and apparatus for information access employing overlapping clusters

Patent number: 5787422

Abstract: The present invention is a method and apparatus for document clustering-based browsing of a corpus of documents, and more particularly to the use of overlapping clusters to improve recall. The present invention is directed to improving the performance of information access methods and apparatus through the use of non-disjoint (overlapped) clustering operations. The present invention is further described in terms of two possible methods for expanding document clusters so as to achieve the overlap, and a method for increasing precision through the use of the overlapped clusters.

Type: Grant

Filed: January 11, 1996

Date of Patent: July 28, 1998

Assignee: Xerox Corporation

Inventors: John W. Tukey, Jan O. Pedersen

Detecting function words without converting a scanned document to character codes

Patent number: 5455871

Abstract: A method and apparatus detects function words in a first image of a scanned document without first converting the image to character codes. Function words include determiners, prepositions, articles, and other words that play a largely grammatical role, as opposed to words such as nouns and verbs that convey topic information. Non-content based morphological characteristics of image units are predetermined as well as the presence or omission of character ascenders and descenders in image units. Predetermined characteristics of function word image units are compared with the image units of an image and when a match occurs, the image unit is identified as a function word. Conversely when no matching characteristics occur, the image unit is identified as a non-function word. Additionally, image units are classified and identified as containing only upper case characters, only lower case characters, only digits, and mixed character types.

Type: Grant

Filed: May 16, 1994

Date of Patent: October 3, 1995

Assignee: Xerox Corporation

Inventors: Dan S. Bloomberg, John W. Tukey, M. Margaret Withgott

Scatter-gather: a cluster-based method and apparatus for browsing large document collections

Patent number: 5442778

Abstract: Scatter-Gather is a computer based document browsing method which operates in time proportional to a number of documents in a target corpus. The Scatter-Gather method includes: preparing an initial ordering of the corpus using, for example, an off-line computational method; determining a summary of the initial ordering of the corpus for interactive utility; and providing a further ordering of the corpus using, for example, an on-line non-deterministic method. The step of an off-line preparation of an initial ordering of a corpus is non-time-dependent, thus an accurate initial ordering is prepared. The step of determining a summary includes determining a summary for presentation to a user without scrolling on a CRT. The step of providing a further ordering includes truncated group average agglomerate clustering, merging disjointed document sets, center finding, assign-to-nearest and other refinement methods.

Type: Grant

Filed: November 12, 1991

Date of Patent: August 15, 1995

Assignee: Xerox Corporation

Inventors: Jan O. Pedersen, David Karger, Douglass R. Cutting, John W. Tukey
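
Of the refinement methods named in the abstract of patent 5442778, assign-to-nearest is the simplest to illustrate. The sketch below assumes documents and cluster centers are term-weight vectors compared by dot product. The abstract does not specify the representation or the similarity measure, and the names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def assign_to_nearest(docs: List[Sequence[float]],
                      centers: List[Sequence[float]]) -> List[List[int]]:
    """Place each document index in the group of its most similar center."""
    groups: List[List[int]] = [[] for _ in centers]
    for i, doc in enumerate(docs):
        best = max(range(len(centers)), key=lambda j: dot(doc, centers[j]))
        groups[best].append(i)
    return groups


# Toy corpus of four documents and two cluster centers.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
centers = [(1.0, 0.0), (0.0, 1.0)]
groups = assign_to_nearest(docs, centers)
print(groups)  # [[0, 1], [2, 3]]
```

Each document is compared with every center once, so this step runs in time proportional to the number of documents for a fixed number of centers.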

Iterative technique for phrase query formation and an information retrieval system employing same

Patent number: 5278980

Abstract: An information retrieval system and method are provided in which an operator inputs one or more query words which are used to determine a search key for searching through a corpus of documents, and which returns any matches between the search key and the corpus of documents as a phrase containing the word data matching the query word(s), a non-stop (content) word next adjacent to the matching word data, and all intervening stop-words between the matching word data and the next adjacent non-stop word. The operator, after reviewing one or more of the returned phrases can then use one or more of the next adjacent non-stop-words as new query words to reformulate the search key and perform a subsequent search through the document corpus. This process can be conducted iteratively, until the appropriate documents of interest are located. The additional non-stop-words from each phrase are preferably aligned with each other (e.g., by columnation) to ease viewing of the "new" content words.

Type: Grant

Filed: August 16, 1991

Date of Patent: January 11, 1994

Assignee: Xerox Corporation

Inventors: Jan O. Pedersen, Per-Kristian Halvorsen, Douglass R. Cutting, John W. Tukey, Eric A. Bier, Daniel G. Bobrow
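
The phrase returned for each match, as described in the abstract of patent 5278980, can be sketched over a tokenized text. The sketch assumes the "next adjacent" non-stop word is the one that follows the match and uses a small hypothetical stop list. Function and variable names are illustrative only.

```python
from typing import List, Set

# Hypothetical stop list; a real system would use a much larger one.
STOP_WORDS: Set[str] = {"the", "of", "a", "in", "to", "and"}


def phrases_for(query: str, tokens: List[str],
                stop_words: Set[str] = STOP_WORDS) -> List[List[str]]:
    """For each occurrence of `query`, return the match, any intervening
    stop words, and the next non-stop (content) word."""
    results = []
    for i, tok in enumerate(tokens):
        if tok != query:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in stop_words:
            j += 1                      # skip over intervening stop words
        if j < len(tokens):             # a following content word exists
            results.append(tokens[i:j + 1])
    return results


tokens = "retrieval of the documents and retrieval systems".split()
phrases = phrases_for("retrieval", tokens)
# Candidate new query words are the last word of each returned phrase.
next_words = [p[-1] for p in phrases]
print(phrases)     # [['retrieval', 'of', 'the', 'documents'], ['retrieval', 'systems']]
print(next_words)  # ['documents', 'systems']
```

An operator could pick one of the `next_words` as an additional query word and repeat the search, which corresponds to the iterative reformulation the abstract describes.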