Patents Assigned to Stratify, Inc.

Language identification for documents containing multiple languages

Patent number: 8938384

Abstract: Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.

Type: Grant

Filed: July 16, 2012

Date of Patent: January 20, 2015

Assignee: Stratify, Inc.

Inventor: Sauraj Goswami
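The second embodiment in the abstract above (segmenting at transitions between character sets) can be illustrated with a minimal sketch. It is not taken from the patent: it assumes the first word of a character's Unicode name is a usable stand-in for its character set, and the helper names `script_of` and `segment_by_script` are hypothetical. Per-segment language scoring is left out.

```python
import unicodedata

def script_of(ch):
    """Return a rough script label for a letter, or None for non-letters.

    Assumption: the first word of the Unicode character name
    (e.g. LATIN, CYRILLIC) approximates the character set.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def segment_by_script(text):
    """Split text wherever the script of consecutive letters changes."""
    segments = []
    current_script, current = None, []
    for ch in text:
        script = script_of(ch)
        # A letter in a new script closes the segment collected so far.
        if script is not None and current_script is not None and script != current_script:
            segments.append((current_script, "".join(current).strip()))
            current = []
        if script is not None:
            current_script = script
        current.append(ch)
    if current and current_script is not None:
        segments.append((current_script, "".join(current).strip()))
    return segments

segments = segment_by_script("Hello world Привет мир")
```

Each resulting segment could then be scored against only the candidate languages written in that script, which is the role the abstract assigns to the segment scores.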
Systems and methods for interactively analyzing communication chains based on messages

Patent number: 8862670

Abstract: A pool of messages, e.g., e-mails and/or other electronic documents that each correspond to a communication from a sender to a recipient, is analyzed to identify communication chains between a source and a target. Sender and recipient identifiers extracted from the messages are used to detect direct and indirect communication links between pairs of entities. Information related to the identified communication chains can be presented to a user via an interactive network graph that supports iterative analysis of the communication-chain data.

Type: Grant

Filed: January 26, 2007

Date of Patent: October 14, 2014

Assignee: Stratify, Inc.

Inventors: Hakan Ancin, David Bayer, Kumar Maddalli, Joy Thomas
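As an illustration only, the idea of detecting direct and indirect links between a source and a target can be sketched as a breadth-first search over sender/recipient pairs. The function names and the sample messages are hypothetical and do not come from the patent.

```python
from collections import deque

def build_links(messages):
    """Map each sender to the set of recipients it wrote to directly."""
    links = {}
    for sender, recipient in messages:
        links.setdefault(sender, set()).add(recipient)
    return links

def shortest_chain(messages, source, target):
    """Return the shortest chain of entities from source to target, or None."""
    links = build_links(messages)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Sorted only to make the traversal order deterministic.
        for nxt in sorted(links.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical (sender, recipient) pairs extracted from a pool of messages.
messages = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]
chain = shortest_chain(messages, "alice", "dave")
```

A chain of length two is a direct link; longer chains are indirect links through intermediaries, which is the kind of data an interactive network graph would display.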
Rapid notification system

Patent number: 8788601

Abstract: Improved techniques of fulfilling a request to perform a task involve a master computer placing the request in a first queue and a copy of the request in a second queue, the second queue being frequently accessed by a set of worker computers which rapidly scans the second queue for requests to fulfill. If, during the scanning, a worker computer determines that it has a capability to fulfill the request, the worker computer removes the copy of the request from the second queue. Furthermore, if the copy of the request remains in the second queue after a brief time period, it is clear that the set of worker computers is unable to perform the task. In this case, the master computer takes a remedial action such as notifying a client computer which sent the request that the worker computers, as currently configured, are unable to perform the task.

Type: Grant

Filed: May 26, 2011

Date of Patent: July 22, 2014

Assignee: Stratify, Inc.

Inventors: Anand Rajasekar, Pankaj Nayal
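The two-queue arrangement described in the abstract above can be sketched in a few lines. This is a single-process simplification under stated assumptions: the class names `Master` and `Worker`, the request fields, and the `check_unclaimed` method are hypothetical, and the "brief time period" is reduced to one explicit check after a scan.

```python
class Master:
    def __init__(self):
        self.pending = []  # first queue: requests awaiting fulfillment
        self.notify = []   # second queue: copies that workers scan

    def submit(self, request):
        self.pending.append(request)
        self.notify.append(dict(request))  # place a copy in the second queue

    def check_unclaimed(self):
        """Ids of copies no worker removed; candidates for remedial action."""
        return [r["id"] for r in self.notify]

class Worker:
    def __init__(self, capabilities):
        self.capabilities = set(capabilities)

    def scan(self, master):
        """Remove and return the copies this worker is able to fulfill."""
        claimed = [r for r in master.notify if r["task"] in self.capabilities]
        master.notify = [r for r in master.notify if r["task"] not in self.capabilities]
        return claimed

master = Master()
master.submit({"id": 1, "task": "ocr"})
master.submit({"id": 2, "task": "transcode"})
claimed = Worker({"ocr"}).scan(master)
unclaimed = master.check_unclaimed()
```

Here request 2 stays in the second queue, so the master would know that no worker, as currently configured, can perform it and could notify the client.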
Phrase based document clustering with automatic phrase extraction

Patent number: 8781817

Abstract: Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.

Type: Grant

Filed: March 4, 2013

Date of Patent: July 15, 2014

Assignee: Stratify, Inc.

Inventors: Joy Thomas, Karthik Ramachandran
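To make the statistical idea in the abstract above concrete, here is a minimal sketch that scores two-word candidates with pointwise mutual information, one possible mutual-information-style metric. It is not the patented algorithm: the restriction to bigrams, the whitespace tokenizer, the `min_count` cutoff, and the function name `score_bigrams` are all assumptions made for illustration.

```python
import math
from collections import Counter

def score_bigrams(documents, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (in bits)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to tell a phrase from a chance co-occurrence
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores

docs = [
    "hidden markov model for mail",
    "the hidden markov model works",
    "the model for the mail",
]
scores = score_bigrams(docs)
```

Pairs with a high positive score co-occur more often than their individual frequencies predict and are candidates for meaningful phrases; a clustering step could then treat them as single features alongside individual words.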
Automated parsing of e-mail messages

Patent number: 8527436

Abstract: An automated parser for e-mail messages identifies component parts such as header, body, signature, and disclaimer. The parser uses a hidden Markov model (HMM) in which the lines making up an e-mail are treated as a sequence of observations of a system that evolves according to a Markov chain having states corresponding to the component parts. The HMM is trained using a manually-annotated set of e-mail messages, then applied to parse other e-mail messages. HMM-based parsing can be further refined or expanded using heuristic post-processing techniques that exploit redundancy of some component parts (e.g., signatures, disclaimers) across a corpus of e-mail messages.

Type: Grant

Filed: August 30, 2010

Date of Patent: September 3, 2013

Assignee: Stratify, Inc.

Inventors: Vamsi Salaka, Joy Thomas
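The decoding step of an HMM-based line parser can be sketched with the standard Viterbi algorithm. Everything numeric below is a hand-picked, hypothetical value; the abstract says the model is trained on manually annotated e-mails, and that training step is not shown. The three states, the three line features, and the `featurize` helper are likewise simplifying assumptions.

```python
import math

STATES = ["header", "body", "signature"]

# Hypothetical probabilities; a real parser would estimate them from annotated e-mails.
START = {"header": 0.9, "body": 0.09, "signature": 0.01}
TRANS = {
    "header": {"header": 0.7, "body": 0.29, "signature": 0.01},
    "body": {"header": 0.01, "body": 0.84, "signature": 0.15},
    "signature": {"header": 0.01, "body": 0.04, "signature": 0.95},
}
EMIT = {
    "header": {"field": 0.9, "text": 0.05, "dashes": 0.05},
    "body": {"field": 0.05, "text": 0.93, "dashes": 0.02},
    "signature": {"field": 0.05, "text": 0.45, "dashes": 0.5},
}

def featurize(line):
    """Reduce a line to one coarse observation symbol."""
    if line.strip().startswith("--"):
        return "dashes"
    head = line.split(":", 1)[0]
    if ":" in line and head.replace("-", "").isalpha():
        return "field"
    return "text"

def viterbi(lines):
    """Return the most likely state for each line, computed in log space."""
    obs = [featurize(line) for line in lines]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

email = ["From: a@example.com", "Subject: hi", "See you at noon.", "-- ", "Alice"]
labels = viterbi(email)
```

With these assumed numbers the five lines are labeled header, header, body, signature, signature. The heuristic post-processing mentioned in the abstract would operate on such labels across a corpus.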
LANGUAGE IDENTIFICATION FOR DOCUMENTS CONTAINING MULTIPLE LANGUAGES

Publication number: 20130191111

Abstract: Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.

Type: Application

Filed: July 16, 2012

Publication date: July 25, 2013

Applicant: Stratify, Inc.

Inventor: Sauraj GOSWAMI
PHRASE BASED DOCUMENT CLUSTERING WITH AUTOMATIC PHRASE EXTRACTION

Publication number: 20130185060

Abstract: Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.

Type: Application

Filed: March 4, 2013

Publication date: July 18, 2013

Applicant: STRATIFY, INC.

Inventor: STRATIFY, INC.
Adaptive routing of documents to searchable indexes

Patent number: 8484221

Abstract: Documents are assigned to one or more indexes in a document indexing system on the basis of document properties such as total number of tokens in the document, number of numeric tokens in the document, number of alphabetic tokens in the document, size of the document, and metadata associated with the document. Based on statistical distributions of document properties (over a large number of documents), different indexes can be defined, and a document router can direct a particular document to one index or another based on the properties of the particular document. In some implementations, certain document properties may be used to identify a nonrelevant document, or garbage document, so that it is either not indexed or assigned to an index dedicated for such documents.

Type: Grant

Filed: May 25, 2010

Date of Patent: July 9, 2013

Assignee: Stratify, Inc.

Inventors: Kumar Maddali, Joy Thomas
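A document router of the kind described in the abstract above might look like the following sketch. The index names and every threshold are hypothetical placeholders; the abstract says such boundaries are derived from statistical distributions over a large number of documents, which this example does not attempt.

```python
def document_properties(text):
    """Count total, numeric, and alphabetic tokens in a whitespace-tokenized text."""
    tokens = text.split()
    return {
        "total": len(tokens),
        # Allow a single decimal point in numeric tokens, e.g. "34.5".
        "numeric": sum(t.replace(".", "", 1).isdigit() for t in tokens),
        "alphabetic": sum(t.isalpha() for t in tokens),
    }

def route(text, large_threshold=1000):
    """Pick an index name from document properties (thresholds are assumptions)."""
    props = document_properties(text)
    total = props["total"]
    if total == 0:
        return "garbage"
    numeric_share = props["numeric"] / total
    alpha_share = props["alphabetic"] / total
    if numeric_share < 0.2 and alpha_share < 0.2:
        return "garbage"  # mostly symbols: kept out of the main indexes
    if numeric_share > 0.5:
        return "numeric"
    if total >= large_threshold:
        return "large"
    return "default"

index_name = route("please review the attached contract")
```

Routing on cheap, precomputed properties keeps the decision independent of the document's meaning, so it can run before any content analysis.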
Phrase-based document clustering with automatic phrase extraction

Patent number: 8392175

Abstract: Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.

Type: Grant

Filed: May 21, 2010

Date of Patent: March 5, 2013

Assignee: Stratify, Inc.

Inventors: Joy Thomas, Karthik Ramachandran
Composite locality sensitive hash based processing of documents

Patent number: 8244767

Abstract: Reliable identification of highly similar documents allows such documents to be treated as identical for purposes of document analysis. Identification of highly similar documents can be based on a composite hash value or other value for which the likelihood of two documents having the same value is high if and only if the documents have a high degree of similarity. Prior to performing content based analysis, the composite hash value for the current document is determined and compared to composite hash values of previously analyzed documents. If a match is found, the results of the analysis of the previous document can be applied to the current document. If no match is found, the current document is analyzed.

Type: Grant

Filed: May 21, 2010

Date of Patent: August 14, 2012

Assignee: Stratify, Inc.

Inventors: Hakan Ancin, Rajashekhar Goli, Ankita Bakshi, Kumar Maddali, Joy Thomas, Karthik Ramachandran
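The check-before-analyze flow in the abstract above can be sketched as a cache keyed by a composite hash. The hash construction here is an assumption, not the patented one: it concatenates a few MinHash-style minima over word shingles, so identical shingle sets always match and highly similar documents match with high probability. The names `shingles`, `composite_hash`, and `AnalysisCache` are hypothetical.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles of a case-folded text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def composite_hash(text, num_hashes=4):
    """Join several seeded min-hashes over the shingles into one key."""
    parts = []
    doc_shingles = shingles(text)
    for seed in range(num_hashes):
        parts.append(min(
            hashlib.sha1(f"{seed}:{s}".encode("utf-8")).hexdigest()[:8]
            for s in doc_shingles
        ))
    return "-".join(parts)

class AnalysisCache:
    """Reuse analysis results for documents whose composite hashes match."""

    def __init__(self, analyze):
        self.analyze = analyze
        self.results = {}
        self.calls = 0  # how many times the expensive analysis actually ran

    def process(self, text):
        key = composite_hash(text)
        if key not in self.results:
            self.calls += 1
            self.results[key] = self.analyze(text)
        return self.results[key]

cache = AnalysisCache(lambda t: len(t.split()))
first = cache.process("The quick brown fox jumps")
second = cache.process("the  QUICK brown fox   jumps")
```

The second call differs only in case and spacing, so it reuses the stored result and the analysis function runs once.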
Language identification for documents containing multiple languages

Patent number: 8224641

Abstract: Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.

Type: Grant

Filed: November 19, 2008

Date of Patent: July 17, 2012

Assignee: Stratify, Inc.

Inventor: Sauraj Goswami
Automated identification of documents as not belonging to any language

Patent number: 8224642

Abstract: An “impostor profile” for a language is used to determine whether documents are in that language or no language. The impostor profile for a given language provides statistical information about the expected results of applying a language model for one or more other (“impostor”) languages to a document that is in fact in the given language. After a most likely language for a test document is identified, the impostor profile is used together with the scores for the test document in the various impostor languages to determine whether to identify the test document as being in the most likely language or in no language.

Type: Grant

Filed: November 20, 2008

Date of Patent: July 17, 2012

Assignee: Stratify, Inc.

Inventor: Sauraj Goswami
AUTOMATED PARSING OF E-MAIL MESSAGES

Publication number: 20120054135

Abstract: An automated parser for e-mail messages identifies component parts such as header, body, signature, and disclaimer. The parser uses a hidden Markov model (HMM) in which the lines making up an e-mail are treated as a sequence of observations of a system that evolves according to a Markov chain having states corresponding to the component parts. The HMM is trained using a manually-annotated set of e-mail messages, then applied to parse other e-mail messages. HMM-based parsing can be further refined or expanded using heuristic post-processing techniques that exploit redundancy of some component parts (e.g., signatures, disclaimers) across a corpus of e-mail messages.

Type: Application

Filed: August 30, 2010

Publication date: March 1, 2012

Applicant: Stratify, Inc.

Inventors: Vamsi Salaka, Joy Thomas
ADAPTIVE ROUTING OF DOCUMENTS TO SEARCHABLE INDEXES

Publication number: 20110191347

Abstract: Documents are assigned to one or more indexes in a document indexing system on the basis of document properties such as total number of tokens in the document, number of numeric tokens in the document, number of alphabetic tokens in the document, size of the document, and metadata associated with the document. Based on statistical distributions of document properties (over a large number of documents), different indexes can be defined, and a document router can direct a particular document to one index or another based on the properties of the particular document. In some implementations, certain document properties may be used to identify a nonrelevant document, or garbage document, so that it is either not indexed or assigned to an index dedicated for such documents.

Type: Application

Filed: May 25, 2010

Publication date: August 4, 2011

Applicant: Stratify, Inc.

Inventors: Kumar Maddali, Joy Thomas
PHRASE-BASED DOCUMENT CLUSTERING WITH AUTOMATIC PHRASE EXTRACTION

Publication number: 20110191098

Abstract: Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.

Type: Application

Filed: May 21, 2010

Publication date: August 4, 2011

Applicant: Stratify, Inc.

Inventors: Joy Thomas, Karthik Ramachandran
Techniques for organizing data to support efficient review and analysis

Patent number: 7945600

Abstract: Techniques for organizing a corpus of electronic documents. The electronic documents are organized in a manner that facilitates review of the documents. The documents are organized into a concept-based hierarchical collection of folders based upon contents of the documents.

Type: Grant

Filed: March 4, 2005

Date of Patent: May 17, 2011

Assignee: Stratify, Inc.

Inventors: Joy Aloysius Thomas, Mohana Krishna Lakhamraju, George Manianghat Mathew, Pangal Pandurang Nayak, Gollakota Venkata Ramana, John O. Lamping
COMPOSITE LOCALITY SENSITIVE HASH BASED PROCESSING OF DOCUMENTS

Publication number: 20110087669

Abstract: Reliable identification of highly similar documents allows such documents to be treated as identical for purposes of document analysis. Identification of highly similar documents can be based on a composite hash value or other value for which the likelihood of two documents having the same value is high if and only if the documents have a high degree of similarity. Prior to performing content based analysis, the composite hash value for the current document is determined and compared to composite hash values of previously analyzed documents. If a match is found, the results of the analysis of the previous document can be applied to the current document. If no match is found, the current document is analyzed.

Type: Application

Filed: May 21, 2010

Publication date: April 14, 2011

Applicant: Stratify, Inc.

Inventors: Hakan Ancin, Rajashekhar Goli, Ankita Bakshi, Kumar Maddali, Joy Thomas
CLUSTERING OF NEAR-DUPLICATE DOCUMENTS

Publication number: 20110087668

Abstract: Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.

Type: Application

Filed: August 27, 2010

Publication date: April 14, 2011

Applicant: Stratify, Inc.

Inventors: Joy Thomas, Sauraj Goswami, Vamsi Salaka
Method and system for guided cluster based processing on prototypes

Patent number: 7877388

Abstract: A method (and system) for clustering a plurality of items. Each of the items includes information. The method includes inputting a plurality of items. The items are provided into a clustering process. The method also inputs an initial organization structure into the clustering process. The initial organization structure includes one or more categories, at least one of the categories being associated with one of the items. The method processes the plurality of items based upon at least the initial organization structure and the information in each of the items, and determines a resulting organization structure based upon the processing. The resulting organization structure relates to the initial organization structure.

Type: Grant

Filed: October 31, 2007

Date of Patent: January 25, 2011

Assignee: Stratify, Inc.

Inventors: John O. Lamping, Ramana Venkata, Shashidhar Thakur, Samdeer Siruguri
Techniques for sharing content information with members of a virtual user group in a network environment without compromising user privacy

Patent number: 7822812

Abstract: Techniques for sharing content information between members of a virtual user group without compromising the privacy of the members. A user can identify content information to be shared with other members of a virtual user group using a user computer system. The content information is then communicated to the other members of the virtual user group and can be accessed by members of the virtual user group in such a manner that the privacy of the user and of the other members of the virtual user group is not compromised. The present invention preserves user privacy by controlling and minimizing the amount of user-related information available/accessible to server systems hosting the virtual user groups.

Type: Grant

Filed: January 3, 2007

Date of Patent: October 26, 2010

Assignee: Stratify, Inc.

Inventors: Rakesh Mathur, Ramesh Subramonian, Ramana Venkata, Pangal P. Nayak, Joy A. Thomas