Abstract: A system, method and computer program product for identifying near- and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
Type:
Grant
Filed:
August 16, 2012
Date of Patent:
August 6, 2013
Assignee:
MSC Intellectual Properties B.V.
Inventors:
Johannes C. Scholtes, Siebe Bloembergen
Abstract: A system, method and computer program product for identifying near- and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
Type:
Grant
Filed:
March 30, 2011
Date of Patent:
August 21, 2012
Assignee:
MSC Intellectual Properties B.V.
Inventors:
Johannes C. Scholtes, Siebe Bloembergen
Abstract: A system, method and computer program product for identifying near- and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
Type:
Grant
Filed:
April 30, 2008
Date of Patent:
April 19, 2011
Assignee:
MSC Intellectual Properties B.V.
Inventors:
Johannes C. Scholtes, Siebe Bloembergen
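
The steps named in the abstract (read, filter, take the N most frequent words, run a quorum search with threshold M, sort by relevancy) can be illustrated with a minimal Python sketch. This is not the patented implementation. It rests on several assumptions that do not come from the listing: the "user settings" filter is modeled as a small hypothetical stopword list, a quorum match means a candidate document contains at least M of the N query words, and relevancy is simply the number of matched words. The function names top_words, quorum_search and find_duplicates are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "user settings" filter mentioned in the abstract.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_words(text, n, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens (ties broken alphabetically)."""
    counts = Counter(w for w in tokenize(text) if w not in stopwords)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n]


def quorum_search(query_words, docs, m):
    """Return (doc_id, hits) for every document containing at least m query words.

    Results are sorted by hit count, highest first (assumed relevancy measure).
    """
    results = []
    for doc_id, text in docs.items():
        vocab = set(tokenize(text))
        hits = sum(1 for w in query_words if w in vocab)
        if hits >= m:
            results.append((doc_id, hits))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


def find_duplicates(docs, n, m):
    """For each document, list the other documents that pass the M-of-N quorum."""
    out = {}
    for doc_id, text in docs.items():
        query = top_words(text, n)
        matches = quorum_search(query, docs, m)
        # Drop the trivial self-match.
        out[doc_id] = [(d, h) for d, h in matches if d != doc_id]
    return out


docs = {
    "a": "contract renewal contract terms payment payment schedule",
    "b": "contract renewal contract terms payment payment schedule draft",
    "c": "weather report sunny weather warm",
}
print(find_duplicates(docs, n=4, m=3))
```

In this sketch, setting M equal to N demands that every one of the top words appears in the candidate, which pushes the search toward exact duplicates, while a lower M tolerates more variation and surfaces near duplicates. Documents with fewer than M distinct filtered words never match anything here, which a real system would need to handle separately.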