Patents by Inventor Ciprian Chelba
Ciprian Chelba has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20230025739
Abstract: Aspects of the technology employ a machine translation quality prediction (MTQP) model to refine datasets that are used in training machine translation systems. This includes receiving, by a machine translation quality prediction model, a sentence pair of a source sentence and a translated output (802). Feature extraction is then performed on the sentence pair using a set of two or more feature extractors, each of which generates a corresponding feature vector (804). The feature vectors from the set of feature extractors are concatenated (806), and the concatenated vectors are applied to a feedforward neural network, which generates a machine translation quality prediction score for the translated output (808).
Type: Application
Filed: June 29, 2022
Publication date: January 26, 2023
Inventors: Junpei Zhou, Yuezhang Li, Ciprian Chelba, Fangxiaoyu Feng, Bowen Liang, Pidong Wang
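The scoring pipeline in steps 802–808 can be sketched in plain Python. Everything below is illustrative, not the patented model: the two toy feature extractors, the layer sizes, and all network weights are invented for the example.

```python
import math

# Two toy feature extractors (hypothetical stand-ins for the patent's
# extractors): each maps a (source, translation) pair to a feature vector.
def length_ratio_features(src, tgt):
    return [len(tgt.split()) / max(1, len(src.split()))]

def overlap_features(src, tgt):
    s, t = set(src.lower().split()), set(tgt.lower().split())
    return [len(s & t) / max(1, len(s | t))]

def feedforward(x, w1, b1, w2, b2):
    # One hidden layer with ReLU, then a sigmoid output in (0, 1)
    # read as the translation-quality score.
    h = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    z = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def mtqp_score(src, tgt):
    # Steps 804-806: run each extractor, then concatenate the vectors.
    feats = length_ratio_features(src, tgt) + overlap_features(src, tgt)
    # Step 808: apply the feedforward network (arbitrary weights).
    w1 = [[0.5, 1.0], [-0.3, 0.8]]
    b1 = [0.1, 0.0]
    w2 = [1.2, -0.7]
    b2 = 0.05
    return feedforward(feats, w1, b1, w2, b2)

score = mtqp_score("the cat sat", "le chat sat")
```

In a real system the extractors would themselves be learned models and the network would be trained on labeled sentence pairs; only the concatenate-then-score shape is taken from the abstract.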
-
Patent number: 8725509
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, relating to language models stored for digital language processing. In one aspect, a method includes the actions of generating a language model, including: receiving a collection of n-grams from a corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus, and generating a trie representing the collection of n-grams, the trie being represented using one or more arrays of integers, and compressing an array representation of the trie using block encoding; and using the language model to identify a second probability of a particular string of words occurring.
Type: Grant
Filed: June 17, 2009
Date of Patent: May 13, 2014
Assignee: Google Inc.
Inventors: Boulos Harb, Ciprian Chelba, Jeffrey A. Dean, Sanjay Ghemawat
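The trie-as-integer-arrays idea can be sketched as follows. The breadth-first layout, the 1/1000 probability quantization, and the array names are illustrative assumptions standing in for the patented encoding; the point is that the trie becomes flat arrays of small integers, which is what block encoding then compresses.

```python
def build_trie(ngrams):
    """Build a nested trie from {word-tuple: probability}; each node
    maps a word to [probability, children-dict]."""
    root = {}
    for gram, p in ngrams.items():
        node = root
        for w in gram[:-1]:
            node = node.setdefault(w, [None, {}])[1]
        node.setdefault(gram[-1], [None, {}])[0] = p
    return root

def flatten(trie, vocab):
    """Flatten the trie into parallel integer arrays: word ids, child
    counts, quantized probabilities, and child-block offsets.
    Breadth-first order makes each node's children a contiguous,
    id-sorted block, which compresses well under block encoding."""
    words, counts, probs = [], [], []
    queue = [trie]
    while queue:
        node = queue.pop(0)
        for w in sorted(node, key=vocab.get):
            p, children = node[w]
            words.append(vocab[w])
            counts.append(len(children))
            probs.append(0 if p is None else round(p * 1000))
            queue.append(children)
    start = len(trie)  # root's children occupy indices 0..start-1
    first_child = []
    for c in counts:
        first_child.append(start)
        start += c
    return words, counts, probs, first_child

vocab = {"the": 0, "cat": 1}
ngrams = {("the",): 0.4, ("cat",): 0.3, ("the", "cat"): 0.6}
words, counts, probs, first_child = flatten(build_trie(ngrams), vocab)
```

Node `i`'s children then live at indices `first_child[i] .. first_child[i] + counts[i] - 1`, so lookup walks the arrays without pointer chasing.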
-
Patent number: 8706491
Abstract: One feature of the present invention uses the parsing capabilities of a structured language model in the information extraction process. During training, the structured language model is first initialized with syntactically annotated training data. The model is then trained by generating parses on semantically annotated training data enforcing annotated constituent boundaries. The syntactic labels in the parse trees generated by the parser are then replaced with joint syntactic and semantic labels. The model is then trained by generating parses on the semantically annotated training data enforcing the semantic tags or labels found in the training data. The trained model can then be used to extract information from test data using the parses generated by the model.
Type: Grant
Filed: August 24, 2010
Date of Patent: April 22, 2014
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Milind Mahajan
-
Patent number: 8335683
Abstract: The present invention involves using one or more statistical classifiers in order to perform task classification on natural language inputs. In another embodiment, the statistical classifiers can be used in conjunction with a rule-based classifier to perform task classification.
Type: Grant
Filed: January 23, 2003
Date of Patent: December 18, 2012
Assignee: Microsoft Corporation
Inventors: Alejandro Acero, Ciprian Chelba, Ye-Yi Wang, Leon Wong, Brendan Frey
-
Patent number: 8306818
Abstract: Methods are disclosed for estimating language models such that the conditional likelihood of a class given a word string, which is very well correlated with classification accuracy, is maximized. The methods comprise tuning statistical language model parameters jointly for all classes such that a classifier discriminates between the correct class and the incorrect ones for a given training sentence or utterance. Specific embodiments of the present invention pertain to implementation of the rational function growth transform in the context of a discriminative training technique for n-gram classifiers.
Type: Grant
Filed: April 15, 2008
Date of Patent: November 6, 2012
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Alejandro Acero, Milind Mahajan
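The training objective named here, the conditional likelihood of the correct class given the word string, can be sketched for a toy two-class unigram classifier. All probabilities below are invented for illustration, and the growth-transform parameter update itself is omitted; only the objective being maximized is shown.

```python
import math

# Toy per-class unigram language models and class priors
# (hypothetical numbers, not from the patent).
models = {
    "weather": {"rain": 0.5, "sunny": 0.4, "stock": 0.1},
    "finance": {"rain": 0.1, "sunny": 0.1, "stock": 0.8},
}
priors = {"weather": 0.5, "finance": 0.5}

def class_posterior(words, cls):
    # P(c | w) = P(c) P(w | c) / sum_c' P(c') P(w | c'),
    # with unigram P(w | c) = product of per-word probabilities.
    def joint(c):
        p = priors[c]
        for w in words:
            p *= models[c][w]
        return p
    return joint(cls) / sum(joint(c) for c in models)

def conditional_log_likelihood(data):
    # The objective from the abstract: sum over labeled sentences of
    # log P(correct class | sentence). Discriminative training (via the
    # rational function growth transform) adjusts the language-model
    # parameters jointly for all classes to increase this quantity.
    return sum(math.log(class_posterior(ws, c)) for ws, c in data)

data = [(["rain", "sunny"], "weather"), (["stock"], "finance")]
cll = conditional_log_likelihood(data)
```

Maximum-likelihood training would instead fit each class model to its own sentences in isolation; the point of the conditional objective is that parameters of all classes compete on every training example.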
-
Patent number: 8175878
Abstract: Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.
Type: Grant
Filed: December 14, 2010
Date of Patent: May 8, 2012
Assignee: Google Inc.
Inventors: Ciprian Chelba, Thorsten Brants
-
Patent number: 7877258
Abstract: Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.
Type: Grant
Filed: March 29, 2007
Date of Patent: January 25, 2011
Assignee: Google Inc.
Inventors: Ciprian Chelba, Thorsten Brants
-
Publication number: 20100318348
Abstract: One feature of the present invention uses the parsing capabilities of a structured language model in the information extraction process. During training, the structured language model is first initialized with syntactically annotated training data. The model is then trained by generating parses on semantically annotated training data enforcing annotated constituent boundaries. The syntactic labels in the parse trees generated by the parser are then replaced with joint syntactic and semantic labels. The model is then trained by generating parses on the semantically annotated training data enforcing the semantic tags or labels found in the training data. The trained model can then be used to extract information from test data using the parses generated by the model.
Type: Application
Filed: August 24, 2010
Publication date: December 16, 2010
Applicant: Microsoft Corporation
Inventors: Ciprian Chelba, Milind Mahajan
-
Patent number: 7805302
Abstract: One feature of the present invention uses the parsing capabilities of a structured language model in the information extraction process. During training, the structured language model is first initialized with syntactically annotated training data. The model is then trained by generating parses on semantically annotated training data enforcing annotated constituent boundaries. The syntactic labels in the parse trees generated by the parser are then replaced with joint syntactic and semantic labels. The model is then trained by generating parses on the semantically annotated training data enforcing the semantic tags or labels found in the training data. The trained model can then be used to extract information from test data using the parses generated by the model.
Type: Grant
Filed: May 20, 2002
Date of Patent: September 28, 2010
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Milind Mahajan
-
Patent number: 7765102
Abstract: A system and method for creating a mnemonics Language Model for use with a speech recognition software application, wherein the method includes generating an n-gram Language Model containing a predefined large body of characters, wherein the n-gram Language Model includes at least one character from the predefined large body of characters, constructing a new Language Model (LM) token for each of the at least one character, extracting pronunciations for each of the at least one character responsive to a predefined pronunciation dictionary to obtain a character pronunciation representation, creating at least one alternative pronunciation for each of the at least one character responsive to the character pronunciation representation to create an alternative pronunciation dictionary and compiling the n-gram Language Model for use with the speech recognition software application, wherein compiling the Language Model is responsive to the new Language Model token and the alternative pronunciation dictionary.
Type: Grant
Filed: July 11, 2008
Date of Patent: July 27, 2010
Assignee: Microsoft Corporation
Inventors: David Mowatt, Robert Chambers, Ciprian Chelba, Qiang Wu
-
Patent number: 7624006
Abstract: A statistical classifier is constructed by estimating Naïve Bayes classifiers such that the conditional likelihood of class given word sequence is maximized. The classifier is constructed using a rational function growth transform implemented for Naïve Bayes classifiers. The estimation method tunes the model parameters jointly for all classes such that the classifier discriminates between the correct class and the incorrect ones for a given training sentence or utterance. Optional parameter smoothing and/or convergence speedup can be used to improve model performance. The classifier can be integrated into a speech utterance classification system or other natural language processing system.
Type: Grant
Filed: September 15, 2004
Date of Patent: November 24, 2009
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Alejandro Acero
-
Patent number: 7478038
Abstract: A method and apparatus are provided for adapting a language model. The method and apparatus provide supervised class-based adaptation of the language model utilizing in-domain semantic information.
Type: Grant
Filed: March 31, 2004
Date of Patent: January 13, 2009
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Milind Mahajan, Alejandro Acero, Yik-Cheung Tam
-
Publication number: 20080319749
Abstract: A system and method for creating a mnemonics Language Model for use with a speech recognition software application, wherein the method includes generating an n-gram Language Model containing a predefined large body of characters, wherein the n-gram Language Model includes at least one character from the predefined large body of characters, constructing a new Language Model (LM) token for each of the at least one character, extracting pronunciations for each of the at least one character responsive to a predefined pronunciation dictionary to obtain a character pronunciation representation, creating at least one alternative pronunciation for each of the at least one character responsive to the character pronunciation representation to create an alternative pronunciation dictionary and compiling the n-gram Language Model for use with the speech recognition software application, wherein compiling the Language Model is responsive to the new Language Model token and the alternative pronunciation dictionary.
Type: Application
Filed: July 11, 2008
Publication date: December 25, 2008
Applicant: Microsoft Corporation
Inventors: David Mowatt, Robert Chambers, Ciprian Chelba, Qiang Wu
-
Publication number: 20080215311
Abstract: Methods are disclosed for estimating language models such that the conditional likelihood of a class given a word string, which is very well correlated with classification accuracy, is maximized. The methods comprise tuning statistical language model parameters jointly for all classes such that a classifier discriminates between the correct class and the incorrect ones for a given training sentence or utterance. Specific embodiments of the present invention pertain to implementation of the rational function growth transform in the context of a discriminative training technique for n-gram classifiers.
Type: Application
Filed: April 15, 2008
Publication date: September 4, 2008
Applicant: Microsoft Corporation
Inventors: Ciprian Chelba, Alejandro Acero, Milind Mahajan
-
Patent number: 7418387
Abstract: A system and method for creating a mnemonics Language Model for use with a speech recognition software application, wherein the method includes generating an n-gram Language Model containing a predefined large body of characters, wherein the n-gram Language Model includes at least one character from the predefined large body of characters, constructing a new Language Model (LM) token for each of the at least one character, extracting pronunciations for each of the at least one character responsive to a predefined pronunciation dictionary to obtain a character pronunciation representation, creating at least one alternative pronunciation for each of the at least one character responsive to the character pronunciation representation to create an alternative pronunciation dictionary and compiling the n-gram Language Model for use with the speech recognition software application, wherein compiling the Language Model is responsive to the new Language Model token and the alternative pronunciation dictionary.
Type: Grant
Filed: November 24, 2004
Date of Patent: August 26, 2008
Assignee: Microsoft Corporation
Inventors: David Mowatt, Robert Chambers, Ciprian Chelba, Qiang Wu
-
Patent number: 7406416
Abstract: A method and apparatus are provided for storing parameters of a deleted interpolation language model as parameters of a backoff language model. In particular, the parameters of the deleted interpolation language model are stored in the standard ARPA format. Under one embodiment, the deleted interpolation language model parameters are formed using fractional counts.
Type: Grant
Filed: March 26, 2004
Date of Patent: July 29, 2008
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Milind Mahajan, Alejandro Acero
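The conversion being stored can be sketched with the standard interpolated-to-backoff rewrite: seen n-grams get explicit interpolated probabilities, and each history gets a backoff weight chosen so unseen n-grams fall back to the lower-order model while the whole distribution still sums to one. The counts (fractional, echoing the abstract) and the interpolation weight below are illustrative choices.

```python
import math

# Toy fractional counts and interpolation weight (illustrative values).
unigram = {"the": 4.0, "cat": 2.0}
bigram = {("the", "cat"): 1.5}
lam = 0.7

total = sum(unigram.values())

def p_uni(w):
    return unigram[w] / total

def p_interp(h, w):
    # Deleted interpolation: lambda * ML bigram + (1 - lambda) * unigram.
    return lam * bigram.get((h, w), 0.0) / unigram[h] + (1.0 - lam) * p_uni(w)

def backoff_weight(h):
    # alpha(h) chosen so the backoff model reproduces the interpolated
    # one exactly: seen bigrams use p_interp, unseen ones use
    # alpha(h) * p_uni(w), and the distribution sums to one.
    seen = [w for (hh, w) in bigram if hh == h]
    num = 1.0 - sum(p_interp(h, w) for w in seen)
    den = 1.0 - sum(p_uni(w) for w in seen)
    return num / den

def arpa_lines():
    # Emit the model in the standard ARPA backoff format: log10
    # probabilities, with a log10 backoff weight after each 1-gram
    # that begins some seen bigram.
    lines = ["\\1-grams:"]
    histories = {h for (h, _) in bigram}
    for w in unigram:
        line = f"{math.log10(p_uni(w)):.4f}\t{w}"
        if w in histories:
            line += f"\t{math.log10(backoff_weight(w)):.4f}"
        lines.append(line)
    lines.append("\\2-grams:")
    for (h, w) in bigram:
        lines.append(f"{math.log10(p_interp(h, w)):.4f}\t{h} {w}")
    lines.append("\\end\\")
    return lines
```

Because the backoff weight renormalizes exactly, any decoder that reads ARPA backoff files can consume the interpolated model unchanged, which is the practical payoff described in the abstract.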
-
Patent number: 7379867
Abstract: Methods are disclosed for estimating language models such that the conditional likelihood of a class given a word string, which is very well correlated with classification accuracy, is maximized. The methods comprise tuning statistical language model parameters jointly for all classes such that a classifier discriminates between the correct class and the incorrect ones for a given training sentence or utterance. Specific embodiments of the present invention pertain to implementation of the rational function growth transform in the context of a discriminative training technique for n-gram classifiers.
Type: Grant
Filed: June 3, 2003
Date of Patent: May 27, 2008
Assignee: Microsoft Corporation
Inventors: Ciprian Chelba, Alejandro Acero, Milind Mahajan
-
Publication number: 20070143110
Abstract: A computer-implemented method of indexing a speech lattice for search of audio corresponding to the speech lattice is provided. The method includes identifying at least two speech recognition hypotheses for a word which have time ranges satisfying a criterion. The method further includes merging the at least two speech recognition hypotheses to generate a merged speech recognition hypothesis for the word.
Type: Application
Filed: December 15, 2005
Publication date: June 21, 2007
Applicant: Microsoft Corporation
Inventors: Alejandro Acero, Asela Gunawardana, Ciprian Chelba, Erik Selberg, Frank Torsten Seide, Patrick Nguyen, Roger Yu
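The merging step can be sketched as follows. The abstract leaves the time-range criterion open, so the overlap test here (fraction of the shorter span covered by the intersection) and the greedy merge are illustrative assumptions.

```python
def overlap_fraction(a, b):
    # a, b: (start, end) time ranges. Fraction of the shorter span
    # covered by their intersection (an assumed criterion).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / min(a[1] - a[0], b[1] - b[0])

def merge_hypotheses(hyps, min_overlap=0.5):
    """hyps: list of (start, end, posterior) hypotheses for one word
    taken from a speech lattice. Greedily merge hypotheses whose time
    ranges overlap enough, summing posteriors and widening the span,
    so the index stores one entry per word occurrence."""
    merged = []
    for start, end, p in sorted(hyps):
        for i, (ms, me, mp) in enumerate(merged):
            if overlap_fraction((start, end), (ms, me)) >= min_overlap:
                merged[i] = (min(ms, start), max(me, end), mp + p)
                break
        else:
            merged.append((start, end, p))
    return merged

merged = merge_hypotheses([(1.0, 1.5, 0.4), (1.1, 1.6, 0.3), (3.0, 3.4, 0.2)])
```

Merging matters because a lattice typically contains many near-duplicate hypotheses for the same spoken word; without it the index both bloats and splits the word's evidence across entries.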
-
Publication number: 20070106512
Abstract: A speech segment is indexed by identifying at least two alternative word sequences for the speech segment. For each word in the alternative sequences, information is placed in an entry for the word in the index. Speech units are eliminated from entries in the index based on a comparison of a probability that the word appears in the speech segment and a threshold value.
Type: Application
Filed: November 9, 2005
Publication date: May 10, 2007
Applicant: Microsoft Corporation
Inventors: Alejandro Acero, Ciprian Chelba, Jorge Sanchez
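The index-then-prune scheme can be sketched as follows; the per-hypothesis posteriors and the pruning threshold are illustrative assumptions.

```python
from collections import defaultdict

def build_index(hypotheses):
    """hypotheses: list of (segment_id, word_sequence, posterior),
    i.e. alternative recognitions of each speech segment. Each word
    gets an index entry whose value is the probability that the word
    appears in the segment (posterior mass of hypotheses containing it)."""
    index = defaultdict(lambda: defaultdict(float))
    for seg, words, p in hypotheses:
        for w in set(words):
            index[w][seg] += p
    return index

def prune(index, threshold):
    # Eliminate entries whose appearance probability falls below the
    # threshold, as in the abstract: low-confidence words are dropped,
    # shrinking the index at a controlled cost in recall.
    return {w: {s: p for s, p in segs.items() if p >= threshold}
            for w, segs in index.items()}

idx = build_index([(1, ["play", "music"], 0.6), (1, ["play", "musing"], 0.3)])
pruned = prune(idx, 0.5)
```

With these numbers, "play" (probability 0.9) and "music" (0.6) survive pruning, while the low-confidence "musing" (0.3) is eliminated.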
-
Publication number: 20070106509
Abstract: An index for searching spoken documents having speech data and text meta-data is created by obtaining probabilities of occurrence of words and positional information of the words of the speech data and combining them with at least positional information of the words in the text meta-data. A single index can be created because the speech data and the text meta-data are treated the same, differing only in category.
Type: Application
Filed: November 8, 2005
Publication date: May 10, 2007
Applicant: Microsoft Corporation
Inventors: Alejandro Acero, Ciprian Chelba, Jorge Sanchez