Abstract: Systems, methods, and computer readable media for staging a corpus of electronic communication documents for analysis, for example, via a content analysis platform. The staging may include a staging platform accessing the corpus of electronic communication documents. For each electronic communication document within the corpus, the staging platform may generate a fingerprint based upon the output of a hash function executed upon a set of characteristics corresponding to each segment within the electronic communication document. The staging platform may analyze the generated fingerprints to generate a plurality of threaded conversations that do not include electronic communication documents that fail to convey any new information. The systems and methods may also include detecting and flagging any segments within an electronic communication document that may have been mutated by its author.
Type:
Grant
Filed:
July 8, 2016
Date of Patent:
June 30, 2020
Assignee:
RELATIVITY ODA LLC
Inventors:
Michael DiSalvo, Jeffrey Gilles, Brandon Gauthier
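The per-segment fingerprinting described in the staging abstract above can be illustrated with a minimal Python sketch. It is not the patented method: the choice of characteristics (sender plus whitespace-normalized body), the use of SHA-256, the newest-first segment order, and the function names `segment_hash`, `fingerprint`, and `is_redundant` are all assumptions made for illustration.

```python
import hashlib


def segment_hash(sender: str, body: str) -> str:
    """Hash an assumed set of segment characteristics: sender and normalized body."""
    # Collapse whitespace and lowercase so quoting artifacts do not change the hash.
    normalized_body = " ".join(body.split()).lower()
    payload = f"{sender.strip().lower()}\x1f{normalized_body}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def fingerprint(segments):
    """Build a document fingerprint from (sender, body) segments, newest first."""
    return tuple(segment_hash(sender, body) for sender, body in segments)


def is_redundant(fp, other_fp):
    """Return True if fp is non-empty and matches the trailing segments of a longer other_fp.

    Under the assumptions above, such a document conveys nothing that the
    longer document does not already contain.
    """
    return 0 < len(fp) < len(other_fp) and other_fp[-len(fp):] == fp


# Hypothetical two-message thread: the reply quotes the original message.
original = [("alice", "Lunch at noon?")]
reply = [("bob", "Sure."), ("alice", "Lunch  at noon?")]

fp_original = fingerprint(original)
fp_reply = fingerprint(reply)
```

In this sketch the original message would be dropped from the threaded conversation because the reply already contains its only segment. A segment whose hash differs from the corresponding segment in an earlier message could be flagged as possibly altered, although how the patent detects such changes is not specified in the abstract.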
Abstract: Systems and methods for generating visualizations of a set of processed electronic documents are disclosed. According to certain aspects, a set of clusters may be generated to reflect similarities among content of a set of electronic documents. An electronic device may generate a visualization of the set of clusters, where the visualization may include a set of representations corresponding to the set of clusters. A user interface may display the visualization, where the representations may be positioned to reflect similarities and differences between a set of documents included in a target cluster and additional sets of documents included in additional clusters.
Abstract: A method for efficiently grouping electronic documents that are likely textual near-duplicates includes processing first and second electronic documents to determine respective sets of character sequence counts. The processing includes, for each document, (1) identifying non-contiguous character sequences expressed within the document text, with each character sequence corresponding to a different starting position within the text and including at least a first character at the respective starting position and a second character at a pre-defined offset from the respective starting position, and (2) determining character sequence counts for each unique character sequence within the identified character sequences. The method also includes generating one or more similarity metrics, at least by comparing the sets of character sequence counts determined for the first and second electronic documents.
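As a rough illustration of the offset-based character sequences in the abstract above, the following Python sketch pairs the character at each starting position with the character a fixed offset later, counts each unique pair, and compares two documents. The offset of 2, the two-character sequence length, the multiset Jaccard metric, and the names `skip_pair_counts` and `count_similarity` are assumptions; the abstract does not specify them.

```python
from collections import Counter


def skip_pair_counts(text: str, offset: int = 2) -> Counter:
    """Count non-contiguous pairs: text[i] joined with text[i + offset].

    An offset of 2 or more skips at least one character, so the pair is
    non-contiguous. The default value is an illustrative assumption.
    """
    return Counter(text[i] + text[i + offset] for i in range(len(text) - offset))


def count_similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity of two count sets (an assumed metric)."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    # Two empty count sets are treated as identical.
    return shared / total if total else 1.0


counts = skip_pair_counts("abcabc")
same = count_similarity(skip_pair_counts("abcd"), skip_pair_counts("abcd"))
different = count_similarity(skip_pair_counts("abcd"), skip_pair_counts("wxyz"))
```

Because each sequence depends only on two characters at fixed positions, a single pass over the text produces all counts, which is consistent with the efficiency goal stated in the abstract.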
Abstract: A method for efficiently grouping electronic documents that are likely textual near-duplicates includes processing first and second electronic documents to determine respective sets of character sequence counts. The processing may include, for each document, identifying a plurality of non-contiguous character sequences expressed within the document text, with each character sequence including at least one character from each of at least two different words in the text, and determining character sequence counts for each unique character sequence within the identified character sequences. The method also includes generating one or more similarity metrics, at least by comparing the sets of character sequence counts determined for the first and second electronic documents. The method may also include using the similarity metric(s) to calculate a similarity score, and assigning, based on the similarity score, the second electronic document to a same document group as the first electronic document.
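The cross-word variant and the grouping step in the abstract above can be sketched in Python as follows. This is one possible reading, not the claimed method: taking the first character of each of two adjacent words, scoring with multiset Jaccard, the 0.7 threshold, and the names `cross_word_counts`, `similarity_score`, and `assign_groups` are all assumptions.

```python
from collections import Counter


def cross_word_counts(text: str) -> Counter:
    """Count sequences made of the first character of each of two adjacent words."""
    words = text.lower().split()
    return Counter(first[0] + second[0] for first, second in zip(words, words[1:]))


def similarity_score(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity of two count sets (an assumed scoring rule)."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 1.0


def assign_groups(docs, threshold: float = 0.7):
    """Place each document in the first group whose representative scores at or above threshold.

    The threshold is an arbitrary illustrative value.
    """
    groups = []  # each entry: (representative counts, list of document indices)
    for index, text in enumerate(docs):
        counts = cross_word_counts(text)
        for representative, members in groups:
            if similarity_score(representative, counts) >= threshold:
                members.append(index)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append((counts, [index]))
    return [members for _, members in groups]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "completely unrelated text about patent claims",
]
groups = assign_groups(docs)
```

Here the first two documents differ by one word and land in the same group, while the third starts its own group. Comparing each new document only against a group representative keeps the number of comparisons small, at the cost of making the result depend on document order.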