Patents by Inventor Matthias Galle

Matthias Galle has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Computer-implemented method for distributional detection of machine-generated text

Patent number: 12361214

Abstract: There is disclosed a computer-implemented method for detecting machine-generated documents in a collection of documents including machine-generated and human-authored documents. The computer-implemented method includes computing a set of long-repeated substrings (such as super-maximal repeats) with respect to the collection of documents and using a subset of the long-repeated substrings to designate documents containing the subset of the repeated substrings as machine-generated. The documents designated as machine-generated serve as positive examples of machine-generated documents and a set of documents including at least one human-authored document serves as negative examples of machine-generated documents. A plurality of classifiers are trained with a dataset including both the positive and negative examples of machine-generated documents. Classified output of the classifiers is then used to detect an extent to which a given document of the dataset is machine-generated.

Type: Grant

Filed: August 5, 2022

Date of Patent: July 15, 2025

Assignee: Naver Corporation

Inventors: Matthias Galle, Hady Elsahar, Joseph Rozen, German Kruszewski
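
As a rough illustration of the first step in the abstract above, the sketch below flags documents that share a long repeated span with other documents, so they can serve as positive examples. It is a simplification: fixed-length word n-grams stand in for the long repeated substrings (such as super-maximal repeats), and classifier training is omitted. The names `flag_by_repeats`, `n`, and `min_docs` are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def flag_by_repeats(docs, n=5, min_docs=2):
    """Return indices of documents that share a word n-gram with other documents.

    Assumption: fixed-length word n-grams approximate the long repeated
    substrings (e.g. super-maximal repeats) named in the abstract.
    """
    owners = defaultdict(set)  # n-gram -> indices of documents containing it
    for idx, text in enumerate(docs):
        words = text.split()
        for i in range(len(words) - n + 1):
            owners[tuple(words[i:i + n])].add(idx)
    flagged = set()
    for doc_ids in owners.values():
        if len(doc_ids) >= min_docs:
            flagged |= doc_ids
    return sorted(flagged)

docs = [
    "the quick brown fox jumps over the lazy dog today",
    "yesterday the quick brown fox jumps over a fence",
    "a completely unrelated human written sentence here",
]
positives = flag_by_repeats(docs, n=5)
print(positives)  # indices of documents to use as positive examples
```

The flagged indices would then be paired with known human-authored documents to build a training set for the classifiers.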

Methods for unsupervised prediction of performance drop due to domain shift

Patent number: 11907663

Abstract: A system includes: a natural language processing (NLP) model trained in a training domain and configured to perform natural language processing on an input dataset; an accuracy module configured to: calculate a domain shift metric based on the input dataset; and calculate a predicted decrease in accuracy of the NLP model attributable to domain shift relative to the training domain based on the domain shift metric; and a retraining module configured to selectively trigger a retraining of the NLP model based on the predicted decrease in accuracy of the NLP model.

Type: Grant

Filed: April 26, 2021

Date of Patent: February 20, 2024

Assignee: NAVER FRANCE

Inventors: Matthias Galle, Hady Elsahar
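
The pipeline in the abstract above (shift metric, predicted accuracy drop, retraining trigger) can be sketched as follows. The abstract does not name a specific metric, so this example assumes Jensen-Shannon divergence between unigram distributions and a hypothetical linear mapping from shift to accuracy drop. The names `slope` and `max_drop` are illustrative parameters only.

```python
import math
from collections import Counter

def unigram_dist(texts):
    """Relative word frequencies over a list of non-empty texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, which lies in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def predicted_drop(shift, slope=0.4):
    # Assumption: a linear regressor fitted elsewhere; slope is a placeholder.
    return slope * shift

def should_retrain(train_texts, input_texts, max_drop=0.1):
    shift = js_divergence(unigram_dist(train_texts), unigram_dist(input_texts))
    return predicted_drop(shift) > max_drop

train = ["the movie was great", "the film was boring"]
print(should_retrain(train, train))                         # same domain
print(should_retrain(train, ["quarterly revenue rose"]))    # disjoint vocabulary
```

In practice the mapping from shift metric to accuracy drop would be learned from domains where labels are available; the constant here only shows where that model plugs in.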

Abstractive multi-document summarization through self-supervision and control

Patent number: 11797591

Abstract: A method for generating enriched training data for a multi-source transformer neural network for generation of a summary of one or more passages of input text comprises creating, from a plurality of input text sets, training points each comprising an input text subset of the input text set and a corresponding reference input text from the input text set, wherein the size of the input text subset is a predetermined number. Control codes are selected based on reference features corresponding to categorical labels of reference texts in the created training points. The input text is enriched with the selected control codes to generate enriched training data.

Type: Grant

Filed: March 5, 2021

Date of Patent: October 24, 2023

Assignee: NAVER CORPORATION

Inventors: Matthias Galle, Maximin Coavoux, Hady Elsahar

Unsupervised aspect-based multi-document abstractive summarization

Patent number: 11494564

Abstract: A multi-document summarization system includes: an encoding module configured to receive multiple documents associated with a subject and to, using a first model, generate vector representations for sentences, respectively, of the documents; a grouping module configured to group first and second ones of the sentences associated with first and second aspects into first and second groups, respectively; a group representation module configured to generate a first vector representation based on the first ones of the sentences and a second vector representation based on the second ones of the sentences; a summary module configured to: using a second model: generate a first sentence regarding the first aspect based on the first vector representation; and generate a second sentence regarding the second aspect based on the second vector representation; and store a summary including the first and second sentences in memory in association with the subject.

Type: Grant

Filed: March 27, 2020

Date of Patent: November 8, 2022

Assignee: NAVER CORPORATION

Inventors: Hady Elsahar, Maximin Coavoux, Matthias Galle

Identifying repeat subsequences by left and right contexts

Patent number: 9760546

Abstract: A system and method are disclosed for identifying repeat subsequences in an input sequence that have at least a threshold number x of different left contexts and at least a threshold number y of different right contexts. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order, comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix represent occurrences of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.

Type: Grant

Filed: May 24, 2013

Date of Patent: September 12, 2017

Assignee: XEROX CORPORATION

Inventor: Matthias Galle
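
A minimal sketch of the idea in the abstract above: build a suffix array and a longest-common-prefix (LCP) array, then keep repeats seen with at least x distinct left contexts and y distinct right contexts. This is not the patented traversal. It builds both arrays naively, rescans neighbours for each candidate, and only enumerates candidates that appear as LCP values (right-maximal repeats). The boundary markers `^` and `$` are assumptions and must not occur in the input.

```python
def suffix_array(s):
    # Naive construction: fine for a sketch, too slow for long inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        lcp[k] = n
    return lcp

def repeats_with_contexts(s, x=2, y=2):
    """Map each qualifying repeat to (#left contexts, #right contexts)."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    found = {}
    for k in range(1, len(sa)):
        length = lcp[k]
        if length == 0:
            continue
        # Widen to all adjacent suffixes sharing a prefix of this length.
        lo = k - 1
        while lo > 0 and lcp[lo] >= length:
            lo -= 1
        hi = k
        while hi + 1 < len(sa) and lcp[hi + 1] >= length:
            hi += 1
        starts = [sa[j] for j in range(lo, hi + 1)]
        left = {s[p - 1] if p > 0 else "^" for p in starts}
        right = {s[p + length] if p + length < len(s) else "$" for p in starts}
        if len(left) >= x and len(right) >= y:
            found[s[starts[0]:starts[0] + length]] = (len(left), len(right))
    return found

print(repeats_with_contexts("xabyzabw", x=2, y=2))
```

In the string `xabyzabw`, the repeat `ab` is preceded by `x` and `z` and followed by `y` and `w`, so it passes both thresholds, while `b` is always preceded by `a` and is filtered out when x is 2.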

Method and system for motif extraction in electronic documents

Patent number: 9483463

Abstract: A method, system, and computer program product for extracting text motifs from electronic documents are disclosed. A user provides a largest-maximal repeat or a super-maximal repeat as a first text block. The occurrences of the first text block are detected to identify second text blocks in the vicinity of those occurrences on the basis of pre-defined parameters. The text motifs are determined by combining the first text block and the second text block. Finally, the text motifs are extracted from the electronic documents.

Type: Grant

Filed: September 10, 2012

Date of Patent: November 1, 2016

Assignee: Xerox Corporation

Inventors: Matthias Galle, Jean-Michel Renders

Incremental computation of repeats

Patent number: 9268749

Abstract: A method of updating a suffix tree includes providing an initial suffix tree based on a first sequence of symbols drawn from an alphabet. The suffix tree includes existing nodes representing respective subsequences occurring in the first sequence of symbols. The existing nodes are associated with information relating to membership of the subsequences in at least one class of repeat subsequences. A second sequence of symbols is received and the initial suffix tree is updated to form an updated suffix tree by adding new nodes representing subsequences occurring in the second sequence of symbols that are not represented by the existing nodes. The subsequences represented by the new nodes are ordered in a new node data structure, which is processed to update the information relating to the at least one class of repeat subsequences associated with at least some of the nodes in the updated suffix tree.

Type: Grant

Filed: October 7, 2013

Date of Patent: February 23, 2016

Assignee: XEROX CORPORATION

Inventors: Matias D. Tealdi, Matthias Galle

Reconstructing documents from n-gram information

Patent number: 9201980

Abstract: A method for reconstruction includes providing a directed input graph generated from a set of n-grams and statistics for the n-grams, edges of the graph being joined through nodes of the graph. Each edge has an associated label and a multiplicity of at least one. Each of the n-grams in the set is represented by a respective one of the labels, whereby an Eulerian cycle through the graph traverses each edge the respective multiplicity of times. Reduction rules are applied iteratively to generate a refined graph which is both irreducible and equivalent to the input graph. Information is output based on the labels of the refined graph.

Type: Grant

Filed: November 19, 2013

Date of Patent: December 1, 2015

Assignee: XEROX Corporation

Inventors: Matias D. Tealdi, Matthias Galle
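
To make the input graph in the abstract above concrete, the sketch below builds a de Bruijn-style multigraph from n-gram counts. Each n-gram becomes a labeled edge from its (n-1)-gram prefix to its (n-1)-gram suffix, with the count as its multiplicity. The circular reading of the document and the `$` end marker are assumptions made here so that an Eulerian cycle can exist. The reduction rules themselves are not shown.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams of the sequence read circularly, with '$' as end marker."""
    seq = list(tokens) + ["$"]
    size = len(seq)
    return Counter(
        tuple(seq[(i + j) % size] for j in range(n)) for i in range(size)
    )

def build_graph(counts):
    """One edge per distinct n-gram: (source node, target node, label, multiplicity)."""
    return [(g[:-1], g[1:], g, m) for g, m in counts.items()]

def is_balanced(edges):
    """Necessary condition for an Eulerian cycle: weighted in-degree equals
    weighted out-degree at every node."""
    balance = Counter()
    for src, dst, _label, mult in edges:
        balance[src] += mult
        balance[dst] -= mult
    return all(v == 0 for v in balance.values())

counts = ngram_counts("to be or not to be".split(), n=2)
edges = build_graph(counts)
print(len(edges), is_balanced(edges))
```

An Eulerian cycle that uses each edge as many times as its multiplicity spells out a sequence with exactly these n-gram counts. Several such cycles may exist, which is why a reduction to an irreducible equivalent graph is useful.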

Bag-of-repeats representation of documents

Patent number: 9183193

Abstract: A system and method for representing a textual document based on the occurrence of repeats are disclosed. The system includes a sequence generator which defines a sequence representing words forming a collection of documents. A repeat calculator identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once. A representation generator generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats.

Type: Grant

Filed: February 12, 2013

Date of Patent: November 10, 2015

Assignee: XEROX CORPORATION

Inventor: Matthias Galle
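
A small sketch of the representation described in the abstract above: each document is described by how often it contains word subsequences that occur more than once in the collection. To keep the example short, repeats are limited to word n-grams of up to `max_len` words and are found by brute-force counting. Both choices are assumptions of this sketch, and `bag_of_repeats` is a hypothetical name.

```python
from collections import Counter

def word_ngrams(words, max_len):
    """Yield every contiguous word subsequence of length 1..max_len."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def bag_of_repeats(docs, max_len=3):
    """Return one {repeat: count} mapping per document."""
    tokenized = [d.lower().split() for d in docs]
    # Counting per document keeps repeats from spanning document boundaries.
    per_doc = [Counter(word_ngrams(words, max_len)) for words in tokenized]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    repeats = {g for g, c in total.items() if c > 1}  # occur more than once
    return [{g: c for g, c in counts.items() if g in repeats} for counts in per_doc]

docs = ["the cat sat", "the cat ran", "a dog ran"]
bags = bag_of_repeats(docs)
print(bags[0])
```

Unlike a plain bag-of-words, the multi-word repeat `("the", "cat")` becomes a feature of its own, while words that appear only once in the collection drop out.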
RECONSTRUCTING DOCUMENTS FROM n-GRAM INFORMATION

Publication number: 20150142853

Abstract: A method for reconstruction includes providing a directed input graph generated from a set of n-grams and statistics for the n-grams, edges of the graph being joined through nodes of the graph. Each edge has an associated label and a multiplicity of at least one. Each of the n-grams in the set is represented by a respective one of the labels, whereby an Eulerian cycle through the graph traverses each edge the respective multiplicity of times. Reduction rules are applied iteratively to generate a refined graph which is both irreducible and equivalent to the input graph. Information is output based on the labels of the refined graph.

Type: Application

Filed: November 19, 2013

Publication date: May 21, 2015

Applicant: Xerox Corporation

Inventors: Matias D. Tealdi, Matthias Galle

INCREMENTAL COMPUTATION OF REPEATS

Publication number: 20150100304

Abstract: A method of updating a suffix tree includes providing an initial suffix tree based on a first sequence of symbols drawn from an alphabet. The suffix tree includes existing nodes representing respective subsequences occurring in the first sequence of symbols. The existing nodes are associated with information relating to membership of the subsequences in at least one class of repeat subsequences. A second sequence of symbols is received and the initial suffix tree is updated to form an updated suffix tree by adding new nodes representing subsequences occurring in the second sequence of symbols that are not represented by the existing nodes. The subsequences represented by the new nodes are ordered in a new node data structure, which is processed to update the information relating to the at least one class of repeat subsequences associated with at least some of the nodes in the updated suffix tree.

Type: Application

Filed: October 7, 2013

Publication date: April 9, 2015

Applicant: Xerox Corporation

Inventors: Matias D. Tealdi, Matthias Galle

IDENTIFYING REPEAT SUBSEQUENCES BY LEFT AND RIGHT CONTEXTS

Publication number: 20140350917

Abstract: A system and method are disclosed for identifying repeat subsequences in an input sequence that have at least a threshold number x of different left contexts and at least a threshold number y of different right contexts. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order, comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix represent occurrences of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.

Type: Application

Filed: May 24, 2013

Publication date: November 27, 2014

Applicant: Xerox Corporation

Inventor: Matthias Galle

Full and semi-batch clustering

Patent number: 8880525

Abstract: A method for clustering documents is provided. Each document is represented by a multidimensional data point. The data points are initially assigned to a respective cluster and serve as their initial representative points. Thereafter, in an iterative process, the data points are clustered among the clusters, by assigning the data points to the clusters based on a comparison measure of each data point with the cluster or its representative point, and a threshold of the comparison measure. Based on this clustering, a new representative point for each of the clusters can be computed. Optionally, overlapping clusters are merged. For the next iteration, the new representative points are used as the representative points. An assignment of the documents to the clusters is output, based on a clustering of the data points in the latest iteration. Multiple batches may be processed, retaining the initial clusters to which the original batch was assigned.

Type: Grant

Filed: April 2, 2012

Date of Patent: November 4, 2014

Assignee: Xerox Corporation

Inventors: Matthias Galle, Jean-Michel Renders
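
The iterative loop in the abstract above can be sketched as follows, under several assumptions not stated in the source: cosine similarity is the comparison measure, the representative point is the mean of a cluster's members, and "merging overlapping clusters" is reduced to collapsing clusters with identical membership. The function and parameter names are hypothetical, and semi-batch processing is not shown.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def threshold_cluster(points, threshold=0.9, max_iter=10):
    """Return clusters as sorted lists of point indices."""
    reps = list(points)  # every point starts as its own representative
    clusters = []
    for _ in range(max_iter):
        # Assign each point to every cluster whose representative is close enough.
        members = [
            frozenset(i for i, p in enumerate(points) if cosine(p, r) >= threshold)
            for r in reps
        ]
        # Collapse clusters with identical membership and drop empty ones.
        new_clusters = sorted({m for m in members if m}, key=sorted)
        if new_clusters == clusters:
            break  # assignment is stable
        clusters = new_clusters
        reps = [mean([points[i] for i in c]) for c in clusters]
    return [sorted(c) for c in clusters]

points = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
print(threshold_cluster(points))
```

Because assignment depends on a threshold rather than on a fixed number of clusters, the number of clusters falls out of the data. With this rule a point may also belong to more than one cluster.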

BAG-OF-REPEATS REPRESENTATION OF DOCUMENTS

Publication number: 20140229160

Abstract: A system and method for representing a textual document based on the occurrence of repeats are disclosed. The system includes a sequence generator which defines a sequence representing words forming a collection of documents. A repeat calculator identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once. A representation generator generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats.

Type: Application

Filed: February 12, 2013

Publication date: August 14, 2014

Applicant: Xerox Corporation

Inventor: Matthias Galle

METHOD AND SYSTEM FOR MOTIF EXTRACTION IN ELECTRONIC DOCUMENTS

Publication number: 20140074455

Abstract: A method, system, and computer program product for extracting text motifs from electronic documents are disclosed. A user provides a largest-maximal repeat or a super-maximal repeat as a first text block. The occurrences of the first text block are detected to identify second text blocks in the vicinity of those occurrences on the basis of pre-defined parameters. The text motifs are determined by combining the first text block and the second text block. Finally, the text motifs are extracted from the electronic documents.

Type: Application

Filed: September 10, 2012

Publication date: March 13, 2014

Applicant: Xerox Corporation

Inventors: Matthias Galle, Jean-Michel Renders

FULL AND SEMI-BATCH CLUSTERING

Publication number: 20130262465

Abstract: A method for clustering documents is provided. Each document is represented by a multidimensional data point. The data points are initially assigned to a respective cluster and serve as their initial representative points. Thereafter, in an iterative process, the data points are clustered among the clusters, by assigning the data points to the clusters based on a comparison measure of each data point with the cluster or its representative point, and a threshold of the comparison measure. Based on this clustering, a new representative point for each of the clusters can be computed. Optionally, overlapping clusters are merged. For the next iteration, the new representative points are used as the representative points. An assignment of the documents to the clusters is output, based on a clustering of the data points in the latest iteration. Multiple batches may be processed, retaining the initial clusters to which the original batch was assigned.

Type: Application

Filed: April 2, 2012

Publication date: October 3, 2013

Applicant: Xerox Corporation

Inventors: Matthias Galle, Jean-Michel Renders