Patents by Inventor Mark J. Tomko

Mark J. Tomko has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Identifying potential duplicates of a document in a document corpus

Patent number: 9195714

Abstract: According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
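
The scoring step in the abstract above can be illustrated with a small sketch. The abstract does not say how ordinal position maps to a score or how scores from different results sets are combined, so the reciprocal-rank weighting, the summation across queries, the `top_k` cutoff, and the `min_score` threshold below are all assumptions, as are the function names and the toy search function.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a search function maps a query string to an
# ordered list of document IDs, best match first.
SearchFn = Callable[[str], Sequence[str]]


def score_candidates(queries: List[str], search: SearchFn, top_k: int = 10) -> Dict[str, float]:
    """Score each returned document by its ordinal position in each results set.

    Assumption: position p (1-based) contributes 1/p, and contributions
    from different queries are summed.
    """
    scores: Dict[str, float] = defaultdict(float)
    for query in queries:
        results = list(search(query))[:top_k]
        for position, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / position
    return dict(scores)


def select_candidates(scores: Dict[str, float], min_score: float) -> List[str]:
    """Keep documents whose score meets the (assumed) threshold, best first."""
    selected = [doc_id for doc_id, score in scores.items() if score >= min_score]
    return sorted(selected, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy stand-in for a real search index.
_TOY_RESULTS = {
    "acme blender": ["doc2", "doc1", "doc3"],
    "blender 500w": ["doc2", "doc3"],
}


def toy_search(query: str) -> Sequence[str]:
    return _TOY_RESULTS.get(query, [])


scores = score_candidates(list(_TOY_RESULTS), toy_search)
candidates = select_candidates(scores, min_score=0.8)
```

Here `doc2` ranks first for both queries and so scores highest, while `doc1` appears once in second place and falls below the threshold.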

Type: Grant

Filed: February 17, 2011

Date of Patent: November 24, 2015

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Mark J. Tomko, Grant M. Emery, Vijai Mohan

Duplicate entry detection system and method

Patent number: 8046372

Abstract: A computer system and method for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After receiving a first document, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents with scores corresponding to each candidate document. For each candidate document with a score above a threshold, filtering is performed on each candidate document to determine whether each candidate document is a true duplicate of the first document. A set of candidate documents with a score above the threshold that were not disqualified as candidate documents is then provided.

Type: Grant

Filed: May 25, 2007

Date of Patent: October 25, 2011

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Mark J. Tomko, Grant M. Emery, Vijai Mohan, Egidio Terra

Determining variation sets among product descriptions

Patent number: 7970773

Abstract: Systems and methods for determining a set of variation-phrases from a collection of documents in a document corpus are presented. Potential variation-phrase pairs among the various documents in the document corpus are identified. The identified potential variation-phrase pairs are then added to a variation-phrase set. The potential variation-phrase pairs in the variation-phrase set are filtered to remove those potential variation-phrase pairs that do not satisfy predetermined criteria. After filtering the variation-phrase set, the resulting variation-phrase set is stored in a data store.

Type: Grant

Filed: September 27, 2007

Date of Patent: June 28, 2011

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Xiaoxin Yin, Mark J. Tomko, Grant M. Emery, Vijai Mohan, Egidio Terra

Filtering invalid tokens from a document using high IDF token filtering

Patent number: 7908279

Abstract: Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
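
One possible reading of the filtering step above is that a chosen field sets an IDF ceiling and rarer tokens elsewhere in the document are dropped. The sketch below follows that reading only. The smoothed IDF formula, the treatment of unseen tokens, the choice of `title` as the reference field, and all function names are assumptions, not details taken from the patent.

```python
import math
from typing import Dict, List


def inverse_document_frequency(corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF, log((N + 1) / (df + 1)); the smoothing is an assumption."""
    n = len(corpus)
    doc_freq: Dict[str, int] = {}
    for tokens in corpus:
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log((n + 1) / (df + 1)) for t, df in doc_freq.items()}


def filter_high_idf_tokens(
    document: Dict[str, List[str]],
    reference_field: str,
    idf: Dict[str, float],
    unseen_idf: float,
) -> Dict[str, List[str]]:
    """Drop tokens whose IDF exceeds the highest IDF in the reference field."""
    reference_tokens = document.get(reference_field, [])
    if not reference_tokens:
        return document  # no ceiling available; leave the document unchanged
    ceiling = max(idf.get(t, unseen_idf) for t in reference_tokens)
    return {
        field: [t for t in tokens if idf.get(t, unseen_idf) <= ceiling]
        for field, tokens in document.items()
    }


# Toy corpus of tokenized documents.
corpus = [
    ["acme", "blender", "red"],
    ["acme", "toaster"],
    ["acme", "blender", "blue"],
    ["zenith", "kettle", "red"],
]
idf = inverse_document_frequency(corpus)
unseen = math.log(len(corpus) + 1)  # IDF assigned to tokens absent from the corpus

document = {"title": ["acme", "blender"], "description": ["blender", "xq9z", "red"]}
filtered = filter_high_idf_tokens(document, "title", idf, unseen)
```

In this toy run the unseen token `xq9z` is rarer than anything in the title, so it is removed from the description, while `blender` and `red` are kept.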

Type: Grant

Filed: September 17, 2007

Date of Patent: March 15, 2011

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Mark J. Tomko, Grant M. Emery, Vijai Mohan, Egidio Terra

Comparison engine for identifying documents describing similar subject matter

Patent number: 7904462

Abstract: Systems and methods are presented for determining whether a first document is a potential duplicate of a second document such that the two documents describe the same or substantially the same subject matter, wherein the first and second documents include attribute data in attribute fields. A set of rules is obtained for determining whether the first document is a potential duplicate of the second document. For each rule in the set of rules, a determination is made as to whether data in a first set of attributes of the first document is contained in a second set of attributes of the second document. According to the results of the evaluated rules in the rules set, a determination is made as to whether the first document is a potential duplicate of the second document. If, according to the evaluated rules in the rules set, the first document is determined to be a potential duplicate of the second document, a reference to the first document is stored in a set of potential duplicates of the second document.
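
The rule evaluation described above might look roughly like the following. The abstract does not define what "contained in" means or how rule results are combined, so this sketch assumes case-insensitive whitespace tokens, treats containment as a subset test, and by default requires every rule to hold. The `ContainmentRule` class, the function names, and the sample attributes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ContainmentRule:
    """Hypothetical rule: data in these attributes of the first document
    must be contained in these attributes of the second document."""
    first_attributes: Tuple[str, ...]
    second_attributes: Tuple[str, ...]


def _tokens(document: Dict[str, str], attributes: Iterable[str]) -> Set[str]:
    words: Set[str] = set()
    for name in attributes:
        words.update(document.get(name, "").lower().split())
    return words


def rule_holds(rule: ContainmentRule, first: Dict[str, str], second: Dict[str, str]) -> bool:
    needed = _tokens(first, rule.first_attributes)
    available = _tokens(second, rule.second_attributes)
    # Assumption: a rule with no data to check does not count as satisfied.
    return bool(needed) and needed <= available


def is_potential_duplicate(
    first: Dict[str, str],
    second: Dict[str, str],
    rules: Iterable[ContainmentRule],
    combine: Callable[[Iterable[bool]], bool] = all,  # assumed: all rules must hold
) -> bool:
    rule_list = list(rules)
    return bool(rule_list) and combine(rule_holds(r, first, second) for r in rule_list)


def record_if_duplicate(first_id, first, second_id, second, rules, store: Dict[str, List[str]]) -> None:
    """Store a reference to the first document under the second document's ID."""
    if is_potential_duplicate(first, second, rules):
        store.setdefault(second_id, []).append(first_id)


rules = [
    ContainmentRule(("brand",), ("brand", "title")),
    ContainmentRule(("model",), ("title",)),
]
first = {"brand": "Acme", "model": "X200"}
second = {"title": "Acme X200 Stand Mixer", "brand": "ACME"}

potential_duplicates: Dict[str, List[str]] = {}
record_if_duplicate("A", first, "B", second, rules, potential_duplicates)
```

Passing `combine=any` would instead flag a pair when a single rule holds, which is the other obvious reading of "according to the results of the evaluated rules".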

Type: Grant

Filed: December 10, 2007

Date of Patent: March 8, 2011

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Mark J. Tomko, Grant M. Emery, Vijai Mohan, Egidio Terra

Identifying potential duplicates of a document in a document corpus

Patent number: 7895225

Abstract: According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.

Type: Grant

Filed: December 6, 2007

Date of Patent: February 22, 2011

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Aswath Manoharan, Mark J. Tomko, Grant M. Emery, Vijai Mohan

Generating similarity scores for matching non-identical data strings

Patent number: 7814107

Abstract: A system and method for determining the likelihood of two documents describing substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
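
As a rough illustration of the token-pair matrix described above, the sketch below scores every pair, keeps those above a threshold, and aggregates the kept scores. The per-pair measure (`difflib.SequenceMatcher.ratio` from the Python standard library), the 0.8 threshold, and the aggregation (best match per token of the first document, normalized by the larger distinct-token count) are assumptions made for the example. The abstract does not specify any of them.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def token_similarity(a: str, b: str) -> float:
    """Assumed per-pair measure: difflib's ratio, a value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def matched_pairs(first: List[str], second: List[str], threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Walk the full token-pair matrix and keep pairs scoring above the threshold."""
    matches = []
    for a in first:
        for b in second:
            score = token_similarity(a, b)
            if score > threshold:
                matches.append((a, b, score))
    return matches


def document_similarity(first: List[str], second: List[str], threshold: float = 0.8) -> float:
    """Aggregate matched-pair scores into one document-level score in [0, 1]."""
    if not first or not second:
        return 0.0
    best: Dict[str, float] = {}
    for a, _b, score in matched_pairs(first, second, threshold):
        best[a] = max(best.get(a, 0.0), score)
    return sum(best.values()) / max(len(set(first)), len(set(second)))


# "blendr" is a near match for "blender"; "red" has no counterpart.
score = document_similarity(["blender", "500w"], ["blendr", "500w", "red"])
```

Normalizing by the larger token set penalizes a document that contains extra unmatched tokens, which is one simple way to keep a short listing from scoring as a full match against a longer one.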

Type: Grant

Filed: May 25, 2007

Date of Patent: October 12, 2010

Assignee: Amazon Technologies, Inc.

Inventors: Srikanth Thirumalai, Egidio Terra, Vijai Mohan, Mark J. Tomko, Grant M. Emery, Aswath Manoharan
