Patents by Inventor Matthias Gallé

Matthias Gallé has filed for patents to protect the following inventions. This listing includes pending patent applications as well as patents already granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11907663
    Abstract: A system includes: a natural language processing (NLP) model trained in a training domain and configured to perform natural language processing on an input dataset; an accuracy module configured to: calculate a domain shift metric based on the input dataset; and calculate a predicted decrease in accuracy of the NLP model attributable to domain shift relative to the training domain based on the domain shift metric; and a retraining module configured to selectively trigger a retraining of the NLP model based on the predicted decrease in accuracy of the NLP model.
    Type: Grant
    Filed: April 26, 2021
    Date of Patent: February 20, 2024
    Assignee: NAVER FRANCE
    Inventors: Matthias Galle, Hady Elsahar
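As an illustration of the retraining trigger described in the abstract above, the following sketch uses the Jensen–Shannon divergence between unigram distributions as the domain shift metric and a linear mapping from shift to predicted accuracy drop. Both choices, and the `slope` and `max_drop` constants, are assumptions for illustration; the abstract does not fix them.

```python
import math
from collections import Counter

def unigram_dist(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence, base 2, so the value lies in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def should_retrain(train_tokens, input_tokens, slope=0.3, max_drop=0.05):
    """Predict the accuracy decrease from the shift metric and decide
    whether to trigger retraining. `slope` (drop per unit of divergence)
    and `max_drop` are illustrative calibration constants."""
    shift = js_divergence(unigram_dist(train_tokens), unigram_dist(input_tokens))
    predicted_drop = slope * shift
    return predicted_drop > max_drop, predicted_drop
```

Identical distributions yield zero shift and no retraining; fully disjoint vocabularies yield the maximum divergence of 1 and trigger it.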
  • Patent number: 11797591
    Abstract: A method for generating enriched training data for a multi-source transformer neural network for generation of a summary of one or more passages of input text comprises creating, from a plurality of input text sets, training points each comprising an input text subset of the input text set and a corresponding reference input text from the input text set, wherein the size of the input text subset is a predetermined number. Control codes are selected based on reference features corresponding to categorical labels of reference texts in the created training points. The input text is enriched with the selected control codes to generate enriched training data.
    Type: Grant
    Filed: March 5, 2021
    Date of Patent: October 24, 2023
    Assignee: NAVER CORPORATION
    Inventors: Matthias Galle, Maximin Coavoux, Hady Elsahar
  • Publication number: 20230214605
    Abstract: Methods and systems for unsupervised training of a neural multilingual sequence-to-sequence (seq2seq) model. Denoising adapters for each of one or more languages are inserted into an encoder and/or a decoder of the seq2seq model. Parameters of the one or more denoising adapters are trained on a language-specific denoising task using monolingual text for each of the one or more languages. Cross-attention weights of the seq2seq model with the trained denoising adapter layers are fine-tuned on a translation task in at least one of the one or more languages with parallel data.
    Type: Application
    Filed: September 9, 2022
    Publication date: July 6, 2023
    Inventors: Alexandre BÉRARD, Laurent BESACIER, Matthias GALLÉ, Ahmet ÜSTÜN
  • Publication number: 20230109734
    Abstract: There is disclosed a computer-implemented method for detecting machine-generated documents in a collection of documents including machine-generated and human-authored documents. The computer-implemented method includes computing a set of long-repeated substrings (such as super-maximal repeats) with respect to the collection of documents and using a subset of the long-repeated substrings to designate documents containing the subset of the repeated substrings as machine-generated. The documents designated as machine-generated serve as positive examples of machine-generated documents and a set of documents including at least one human-authored document serves as negative examples of machine-generated documents. A plurality of classifiers are trained with a dataset including both the positive and negative examples of machine-generated documents. Classified output of the classifiers is then used to detect an extent to which a given document of the dataset is machine-generated.
    Type: Application
    Filed: August 5, 2022
    Publication date: April 13, 2023
    Applicant: Naver Corporation
    Inventors: Matthias GALLE, Hady ELSAHAR, Joseph ROZEN, German KRUSZEWSKI
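A minimal sketch of the labeling step in the abstract above, using character n-grams shared across documents as a stand-in for the super-maximal repeats the method computes (which require suffix-array machinery). The `n`, `min_docs`, and `min_hits` parameters are illustrative.

```python
from collections import Counter

def repeated_ngrams(docs, n=4, min_docs=2):
    """Character n-grams occurring in at least `min_docs` documents,
    a simple proxy for long repeats shared across the collection."""
    seen = Counter()
    for doc in docs:
        seen.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return {g for g, c in seen.items() if c >= min_docs}

def label_machine_generated(docs, repeats, min_hits=1):
    """Mark documents containing repeated substrings as positive
    (machine-generated) training examples."""
    return [sum(g in doc for g in repeats) >= min_hits for doc in docs]
```

The resulting boolean labels would serve as the positive examples for the downstream classifier ensemble.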
  • Patent number: 11494564
    Abstract: A multi-document summarization system includes: an encoding module configured to receive multiple documents associated with a subject and to, using a first model, generate vector representations for sentences, respectively, of the documents; a grouping module configured to group first and second ones of the sentences associated with first and second aspects into first and second groups, respectively; a group representation module configured to generate a first vector representation based on the first ones of the sentences and a second vector representation based on the second ones of the sentences; a summary module configured to: using a second model: generate a first sentence regarding the first aspect based on the first vector representation; and generate a second sentence regarding the second aspect based on the second vector representation; and store a summary including the first and second sentences in memory in association with the subject.
    Type: Grant
    Filed: March 27, 2020
    Date of Patent: November 8, 2022
    Assignee: NAVER CORPORATION
    Inventors: Hady Elsahar, Maximin Coavoux, Matthias Galle
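The grouping and group-representation modules in the abstract above can be sketched with plain cosine similarity: each sentence vector is assigned to its nearest aspect vector, and each group is averaged into a single representation. The nearest-aspect assignment rule and mean pooling are assumptions; the abstract does not specify them.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def group_by_aspect(sentence_vecs, aspect_vecs):
    """Assign each sentence vector to the nearest aspect vector, then
    average each non-empty group into one group representation."""
    groups = {i: [] for i in range(len(aspect_vecs))}
    for vec in sentence_vecs:
        best = max(range(len(aspect_vecs)), key=lambda i: cosine(vec, aspect_vecs[i]))
        groups[best].append(vec)
    reps = {i: [sum(col) / len(vecs) for col in zip(*vecs)]
            for i, vecs in groups.items() if vecs}
    return groups, reps
```

Each group representation would then condition the second model's generation of one summary sentence per aspect.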
  • Publication number: 20220147721
    Abstract: Multilingual neural machine translation systems having monolingual adapter layers and bilingual adapter layers for zero-shot translation include an encoder configured for encoding an input sentence in a source language into an encoder representation and a decoder configured for processing output of the encoder adapter layer to generate a decoder representation. The encoder includes an encoder adapter selector for selecting, from a plurality of encoder adapter layers, an encoder adapter layer for the source language to process the encoder representation. The decoder includes a decoder adapter selector for selecting, from a plurality of decoder adapter layers, a decoder adapter layer for a target language for generating a translated sentence of the input sentence in the target language from the decoder representation.
    Type: Application
    Filed: November 8, 2021
    Publication date: May 12, 2022
    Applicant: Naver Corporation
    Inventors: Matthias GALLE, Alexandre BERARD, Laurent BESACIER, Jerin PHILIP
  • Publication number: 20210342377
    Abstract: A method for generating enriched training data for a multi-source transformer neural network for generation of a summary of one or more passages of input text comprises creating, from a plurality of input text sets, training points each comprising an input text subset of the input text set and a corresponding reference input text from the input text set, wherein the size of the input text subset is a predetermined number. Control codes are selected based on reference features corresponding to categorical labels of reference texts in the created training points. The input text is enriched with the selected control codes to generate enriched training data.
    Type: Application
    Filed: March 5, 2021
    Publication date: November 4, 2021
    Inventors: Matthias GALLE, Maximin COAVOUX, Hady ELSAHAR
  • Publication number: 20210342544
    Abstract: A system includes: a natural language processing (NLP) model trained in a training domain and configured to perform natural language processing on an input dataset; an accuracy module configured to: calculate a domain shift metric based on the input dataset; and calculate a predicted decrease in accuracy of the NLP model attributable to domain shift relative to the training domain based on the domain shift metric; and a retraining module configured to selectively trigger a retraining of the NLP model based on the predicted decrease in accuracy of the NLP model.
    Type: Application
    Filed: April 26, 2021
    Publication date: November 4, 2021
    Applicant: NAVER FRANCE
    Inventors: Matthias GALLE, Hady ELSAHAR
  • Publication number: 20210303796
    Abstract: A multi-document summarization system includes: an encoding module configured to receive multiple documents associated with a subject and to, using a first model, generate vector representations for sentences, respectively, of the documents; a grouping module configured to group first and second ones of the sentences associated with first and second aspects into first and second groups, respectively; a group representation module configured to generate a first vector representation based on the first ones of the sentences and a second vector representation based on the second ones of the sentences; a summary module configured to: using a second model: generate a first sentence regarding the first aspect based on the first vector representation; and generate a second sentence regarding the second aspect based on the second vector representation; and store a summary including the first and second sentences in memory in association with the subject.
    Type: Application
    Filed: March 27, 2020
    Publication date: September 30, 2021
    Applicant: NAVER CORPORATION
    Inventors: Hady ELSAHAR, Maximin COAVOUX, Matthias GALLE
  • Patent number: 10546009
    Abstract: A computer-implemented system and method provide for mapping a set of strings onto an ontology which may be represented as a graph. The method includes receiving a set of strings, each string denoting a respective object. For each of the strings, a pairwise similarity is computed between the string and each of a set of objects in the ontology. For each of a set of candidate subsets (subgraphs) of the set of objects, a global score is computed, which is a function of the pairwise similarities between the strings and the objects in the subset and a tightness score. The tightness score is computed on the objects in the subset with a submodular function. An optimal subset is identified from the set of candidate subsets based on the global scores. Strings in the set of strings are mapped to the objects in the optimal subset, based on the pairwise similarities.
    Type: Grant
    Filed: October 22, 2014
    Date of Patent: January 28, 2020
    Assignee: CONDUENT BUSINESS SERVICES, LLC
    Inventors: Matthias Gallé, Nikolaos Lagos
  • Patent number: 10339122
    Abstract: A computer-implemented linking system and method provide for linking actionable phrases in a first document to other documents in a document corpus. The method includes identifying at least one actionable phrase in a first document. The actionable phrase may include an action, its direct object, and any modifier of the direct object. For each identified actionable phrase, the document corpus is searched to identify other documents, which are scored using a scoring function that takes into account occurrences of words of the actionable phrase in each identified document. The actionable phrase is linked to at least a part of one of the most highly ranked documents in the set of documents.
    Type: Grant
    Filed: September 10, 2015
    Date of Patent: July 2, 2019
    Assignee: Conduent Business Services, LLC
    Inventors: Nikolaos Lagos, Matthias Gallé, Alexandr Chernov
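The scoring step above can be illustrated with a bag-of-words count: each corpus document is scored by how many of its tokens occur in the actionable phrase, and matching documents are returned by decreasing score. This is a deliberately simple stand-in for the patent's scoring function.

```python
def rank_documents(phrase, corpus):
    """Rank documents by the number of tokens they share with the
    actionable phrase; only documents with a positive score are kept."""
    words = set(phrase.lower().split())
    scored = sorted(
        ((sum(tok in words for tok in text.lower().split()), doc_id)
         for doc_id, text in corpus.items()),
        reverse=True)
    return [doc_id for score, doc_id in scored if score > 0]
```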
  • Patent number: 10311046
    Abstract: A pruning method includes representing a set of sequences in a data structure. Each sequence s includes a first symbol w and a context c of at least one symbol. Some of the sequences are associated with a conditional probability p(w|c), based on observations of cw in training data. For others, p(w|c) is computed as a function of the probability p(w|c′) of the respective symbol w in a back-off context c′, p(w|c′) being based on observations of sequence c′w in the training data. A scoring function ƒ(cw) value is computed for each sequence in the set, based on p(w|c) for the sequence and a probability distribution p(s) of each symbol in the sequence if it is removed from the set of sequences. Iteratively, one of the represented sequences is selected to be removed, based on the computed scoring function values, and the scoring function values of remaining sequences are updated.
    Type: Grant
    Filed: September 12, 2016
    Date of Patent: June 4, 2019
    Assignee: Conduent Business Services, LLC
    Inventors: Matias Hunicken, Matthias Gallé
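A toy version of the pruning loop above: explicit probabilities back off to shorter contexts once pruned, and the n-gram whose removal costs least is removed iteratively. The score used here (observation count times the log-ratio of the explicit probability to its back-off estimate) is a simplified stand-in for the patent's ƒ(cw).

```python
import math

def backoff_prob(table, context, word, uniform=1e-3):
    """p(word | context), backing off to successively shorter contexts
    when the full sequence is absent from the table."""
    while context:
        if (context, word) in table:
            return table[(context, word)]
        context = context[1:]          # drop the oldest symbol
    return table.get(((), word), uniform)

def prune(table, counts, keep):
    """Iteratively remove the n-gram whose removal costs least until
    only `keep` entries with a non-empty context remain; remaining
    scores are recomputed on each iteration."""
    table = dict(table)
    while sum(1 for c, _ in table if c) > keep:
        def score(entry):
            context, word = entry
            without = dict(table)
            del without[entry]
            return counts[entry] * math.log(table[entry] / backoff_prob(without, context, word))
        victim = min((e for e in table if e[0]), key=score)
        del table[victim]
    return table
```

After pruning, queries for removed n-grams fall through to the back-off estimate, here the unigram probability.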
  • Publication number: 20180204126
    Abstract: A system and method guide the modification of an input feature vector to an automatic classifier model to cause the classifier to give a desired class without modifying the classifier. A user defines costs for independently modifying feature values for at least some of the features in an initial feature vector that the classifier model has given an undesired class. Subspaces are identified in a feature space in which the classifier model classifies feature vectors in the desired class. With a cost function which takes into account the user-defined costs, a modified feature vector is identified in one of the identified subspaces which optimizes the cost function. The modified feature vector or information based thereon is output.
    Type: Application
    Filed: January 17, 2017
    Publication date: July 19, 2018
    Applicant: Xerox Corporation
    Inventor: Matthias Gallé
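A brute-force sketch of the idea above: among candidate feature vectors that the fixed classifier accepts, return the one with the smallest total user-defined modification cost. The exhaustive candidate list replaces the patent's subspace identification and is purely illustrative.

```python
def cheapest_fix(x, costs, classify, candidates):
    """Return the accepted candidate vector whose feature changes
    relative to `x` have the lowest total user-defined cost."""
    def total_cost(y):
        return sum(c for xi, yi, c in zip(x, y, costs) if xi != yi)
    accepted = [y for y in candidates if classify(y)]
    return min(accepted, key=total_cost) if accepted else None
```

Note that the classifier itself is never modified; only the input vector is.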
  • Patent number: 9977817
    Abstract: A system and method provide for identifying coreference from serialized data coming from different services. The method includes generating a tree structure from serialized data. The serialized data includes responses to queries from the different services. The responses each identify a hierarchical relationship between a respective set of objects. Nodes of the tree structure each have a name corresponding to a respective one of the objects. The tree structure is traversed in a breadth first manner and, for each node in the tree structure, a respective pairwise similarity is computed with each of the other nodes of the tree structure. The computed pairwise similarity is compared with a threshold to identify co-referring nodes that refer to a same entity. The threshold is a function of a depth of the node in the tree structure.
    Type: Grant
    Filed: October 20, 2014
    Date of Patent: May 22, 2018
    Assignee: Conduent Business Services, LLC
    Inventors: Matthias Gallé, Nikolaos Lagos
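The traversal and comparison above can be sketched as follows, with `difflib`'s ratio as the pairwise similarity and a threshold that loosens linearly with depth. The particular similarity measure and threshold schedule are assumptions; the abstract only requires the threshold to be a function of node depth.

```python
from collections import deque
from difflib import SequenceMatcher

def bfs_nodes(tree, depth=0):
    """Flatten a nested-dict tree (e.g. parsed service responses) into
    (name, depth) pairs in breadth-first order."""
    queue = deque((name, child, depth) for name, child in tree.items())
    nodes = []
    while queue:
        name, child, d = queue.popleft()
        nodes.append((name, d))
        if isinstance(child, dict):
            queue.extend((n, c, d + 1) for n, c in child.items())
    return nodes

def coreferring_pairs(tree, base=0.9, relax=0.05):
    """Pairs of node names judged to refer to the same entity, using a
    depth-dependent threshold (looser for deeper nodes)."""
    nodes = bfs_nodes(tree)
    pairs = []
    for i, (a, da) in enumerate(nodes):
        for b, db in nodes[i + 1:]:
            threshold = max(0.5, base - relax * min(da, db))
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```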
  • Publication number: 20180011839
    Abstract: A symbol prediction method includes storing a statistic for each of a set of symbols w in at least one context, each context including a string of k preceding symbols and a string of l subsequent symbols, the statistic being based on observations of a string kwl in training data. For an input sequence of symbols, a prediction is computed for at least one symbol in the input sequence, based on the stored statistics. The computing includes, where the symbol is in a context in the sequence not having a stored statistic, computing the prediction for the symbol in that context based on a stored statistic for the symbol in a more general context.
    Type: Application
    Filed: July 7, 2016
    Publication date: January 11, 2018
    Applicant: Xerox Corporation
    Inventors: Matthias Gallé, Matías Hunicken
  • Publication number: 20170351786
    Abstract: A method for modeling a sparse function over sequences is described. The method includes inputting a set of sequences that support a function. A set of prefixes and a set of suffixes for the set of sequences are identified. A sub-block of a full matrix is identified which has the same structural rank as the full matrix. The full matrix includes an entry for each pair of a prefix and a suffix from the sets of prefixes and suffixes. A matrix for the sub-block is computed. A minimal non-deterministic weighted automaton which models the function is computed, based on the sub-block matrix. Information based on the identified minimal non-deterministic weighted automaton is output.
    Type: Application
    Filed: June 2, 2016
    Publication date: December 7, 2017
    Applicant: Xerox Corporation
    Inventors: Ariadna Julieta Quattoni, Xavier Carreras, Matthias Gallé
  • Patent number: 9760546
    Abstract: A system and method are disclosed for identifying, in an input sequence, repeated subsequences that occur with at least x different left contexts and at least y different right contexts. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order, comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix represent occurrences of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.
    Type: Grant
    Filed: May 24, 2013
    Date of Patent: September 12, 2017
    Assignee: XEROX CORPORATION
    Inventor: Matthias Galle
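A compact sketch of the repeat criterion above: candidate repeats are read off the suffix and LCP arrays, and each is kept only if it occurs with at least x distinct left contexts and y distinct right contexts. The naive quadratic construction and occurrence scan replace the patent's single lexicographic traversal.

```python
def suffix_array(s):
    """Lexicographically sorted suffix start positions (naive build)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the suffixes
    starting at sa[i - 1] and sa[i]."""
    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

def context_diverse_repeats(s, x=2, y=2):
    """Repeats with >= x distinct left contexts and >= y distinct right
    contexts ('^' and '$' mark the sequence boundaries)."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    candidates = {s[sa[i]:sa[i] + lcp[i]] for i in range(1, len(sa)) if lcp[i] > 0}
    result = set()
    for r in candidates:
        occs = [i for i in range(len(s) - len(r) + 1) if s.startswith(r, i)]
        lefts = {s[i - 1] if i > 0 else "^" for i in occs}
        rights = {s[i + len(r)] if i + len(r) < len(s) else "$" for i in occs}
        if len(lefts) >= x and len(rights) >= y:
            result.add(r)
    return result
```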
  • Patent number: 9740679
    Abstract: A method for generating an output sequence includes receiving an input sequence of symbols. An output sequence is generated from a reduced directed graph derived from n-gram statistics for a corpus sequence of symbols. The graph includes nodes connected by edges that are labeled with a sequence of symbols and associated with a multiplicity representing a number of occurrences of the sequence of symbols in the corpus sequence. Each path through the graph where each edge is traversed its multiplicity of times reconstructs the corpus sequence. The sequences of symbols in the reduced graph vary in number of symbols. The output sequence from the first iteration, and optionally also output sequences from one or more subsequent iterations, are output. The output sequence may be proposed to an author to assist in generating a document.
    Type: Grant
    Filed: December 8, 2015
    Date of Patent: August 22, 2017
    Assignee: XEROX CORPORATION
    Inventor: Matthias Gallé
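The property in the abstract above, that every path traversing each edge its multiplicity of times reconstructs the corpus, is the defining property of an Eulerian path, so the generation step can be sketched on a character-bigram de Bruijn graph with Hierholzer's algorithm. Fixed-length bigrams are a simplification: the patent's reduced graph carries variable-length sequences.

```python
from collections import Counter, defaultdict

def debruijn_edges(corpus, n=2):
    """One edge per n-gram occurrence, from its (n-1)-gram prefix to
    its (n-1)-gram suffix, with multiplicities."""
    edges = Counter()
    for i in range(len(corpus) - n + 1):
        gram = corpus[i:i + n]
        edges[(gram[:-1], gram[1:])] += 1
    return edges

def eulerian_path(edges, start):
    """Hierholzer's algorithm: traverse every edge exactly its
    multiplicity of times and spell out the reconstructed sequence."""
    out = defaultdict(list)
    for (a, b), m in edges.items():
        out[a].extend([b] * m)
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out[v]:
            stack.append(out[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])
```

With a reduced graph, different edge orderings at branch points would yield different corpus-consistent output sequences to propose to an author.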
  • Publication number: 20170161254
    Abstract: A method for generating an output sequence includes receiving an input sequence of symbols. An output sequence is generated from a reduced directed graph derived from n-gram statistics for a corpus sequence of symbols. The graph includes nodes connected by edges that are labeled with a sequence of symbols and associated with a multiplicity representing a number of occurrences of the sequence of symbols in the corpus sequence. Each path through the graph where each edge is traversed its multiplicity of times reconstructs the corpus sequence. The sequences of symbols in the reduced graph vary in number of symbols. The output sequence from the first iteration, and optionally also output sequences from one or more subsequent iterations, are output. The output sequence may be proposed to an author to assist in generating a document.
    Type: Application
    Filed: December 8, 2015
    Publication date: June 8, 2017
    Applicant: Xerox Corporation
    Inventor: Matthias Gallé
  • Patent number: 9645995
    Abstract: A method for language prediction of a social network post includes generating a social network graph which includes nodes connected by edges. Some of the nodes are user nodes representing users of a social network and some of the nodes are social network post nodes representing social network posts. At least some of the users are authors of social network posts represented by respective social network post nodes. Edges of the graph are associated with respective weights. At least one of the social network post nodes is unlabeled. Language labels are predicted for the at least one unlabeled social network post node which includes propagating language labels through the graph. A language of the social network post is predicted based on the predicted language labels for the social network post node representing that social network post and optionally also based on content-based features.
    Type: Grant
    Filed: March 24, 2015
    Date of Patent: May 9, 2017
    Assignee: CONDUENT BUSINESS SERVICES, LLC
    Inventors: Matthias Gallé, William Radford
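A simplified version of the propagation in the last abstract: users and posts are nodes of a weighted graph, and unlabeled nodes repeatedly adopt the language label with the greatest total edge weight among their labeled neighbours. The fixed round count and hard (non-probabilistic) labels are simplifications of the graph propagation the abstract describes.

```python
from collections import defaultdict

def propagate_labels(edges, seeds, rounds=5):
    """Propagate language labels over a weighted user/post graph.
    `edges` are (node, node, weight) triples; `seeds` maps nodes with
    known labels (e.g. posts of known language) to those labels."""
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))
    labels = dict(seeds)
    for _ in range(rounds):
        updates = {}
        for node in neighbours:
            if node in seeds:
                continue  # seed labels stay clamped
            votes = defaultdict(float)
            for nb, w in neighbours[node]:
                if nb in labels:
                    votes[labels[nb]] += w
            if votes:
                updates[node] = max(votes, key=votes.get)
        labels.update(updates)
    return labels
```

In the patented method the propagated labels would be combined with content-based features for the final language prediction.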