Patents by Inventor Andrew Rosenberg

Andrew Rosenberg has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11990117
    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
    Type: Grant
    Filed: October 20, 2021
    Date of Patent: May 21, 2024
    Assignee: Google LLC
    Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
  • Patent number: 11929060
    Abstract: A method for training a speech recognition model includes receiving a set of training utterance pairs each including a non-synthetic speech representation and a synthetic speech representation of a same corresponding utterance. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the method also includes determining a consistent loss term for the corresponding training utterance pair based on a first probability distribution over possible non-synthetic speech recognition hypotheses generated for the corresponding non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses generated for the corresponding synthetic speech representation. The first and second probability distributions are generated for output by the speech recognition model.
    Type: Grant
    Filed: February 8, 2021
    Date of Patent: March 12, 2024
    Assignee: Google LLC
    Inventors: Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Jose Moreno Mengibar
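The consistency objective described in this abstract can be illustrated with a small sketch. Assuming the two hypothesis distributions are plain probability vectors, a symmetric KL divergence is one plausible choice of consistency term (the abstract does not specify which divergence is used):

```python
import math

def consistency_loss(p_real, p_synth, eps=1e-12):
    """Symmetric KL divergence between the hypothesis distribution for the
    non-synthetic representation and the one for the synthetic representation.
    eps guards against log(0) for zero-probability hypotheses."""
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl(p_real, p_synth) + kl(p_synth, p_real))
```

Identical distributions yield a loss of zero, so minimizing this term pushes the model toward recognizing synthetic and non-synthetic renderings of the same utterance in the same way.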
  • Publication number: 20240029715
    Abstract: A method includes receiving training data that includes unspoken textual utterances in a target language. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. The method also includes generating a corresponding alignment output for each unspoken textual utterance using an alignment model trained on transcribed speech utterances in one or more training languages, each different from the target language. The method also includes generating a corresponding encoded textual representation for each alignment output using a text encoder and training a speech recognition model on the encoded textual representations generated for the alignment outputs. Training the speech recognition model teaches the speech recognition model to learn how to recognize speech in the target language.
    Type: Application
    Filed: July 20, 2023
    Publication date: January 25, 2024
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Zhehuai Chen, Ankur Bapna, Yu Zhang, Bhuvana Ramabhadran
  • Patent number: 11823697
    Abstract: A method for training a speech recognition model includes obtaining sample utterances of synthesized speech in a target domain, obtaining transcribed utterances of non-synthetic speech in the target domain, and pre-training the speech recognition model on the sample utterances of synthesized speech in the target domain to attain an initial state for warm-start training. After pre-training the speech recognition model, the method also includes warm-start training the speech recognition model on the transcribed utterances of non-synthetic speech in the target domain to teach the speech recognition model to learn to recognize real/human speech in the target domain.
    Type: Grant
    Filed: August 20, 2021
    Date of Patent: November 21, 2023
    Assignee: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran
  • Publication number: 20230274727
    Abstract: A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation of a particular word present in a query spoken by a user. The method also includes receiving a TTS pronunciation of the same particular word that is present in a TTS input, where the TTS pronunciation of the particular word is different than the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features and TTS pronunciation-related features associated with the particular word. The method also includes generating a pronunciation decision selecting one of the user pronunciation or the TTS pronunciation of the particular word that is associated with a highest confidence. The method also includes providing TTS audio that includes a synthesized speech representation of the response to the query using the user pronunciation or the TTS pronunciation for the particular word.
    Type: Application
    Filed: May 4, 2023
    Publication date: August 31, 2023
    Applicant: Google LLC
    Inventors: Vijayaditya Peddinti, Bhuvana Ramabhadran, Andrew Rosenberg, Mateusz Golebiewski
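The pronunciation decision above amounts to a confidence comparison. A minimal sketch, with hypothetical confidence scores supplied by the caller (how the model derives these scores is not described in the abstract):

```python
def choose_pronunciation(user_pron, user_conf, tts_pron, tts_conf):
    """Select whichever pronunciation of the word carries the higher
    confidence; ties favor the user's pronunciation (an assumption)."""
    return user_pron if user_conf >= tts_conf else tts_pron
```

For example, `choose_pronunciation("nay-VAH-duh", 0.9, "nuh-VAD-uh", 0.4)` (hypothetical strings and scores) would keep the user's pronunciation in the synthesized response.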
  • Patent number: 11676572
    Abstract: A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation of a particular word present in a query spoken by a user. The method also includes receiving a TTS pronunciation of the same particular word that is present in a TTS input, where the TTS pronunciation of the particular word is different than the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features and TTS pronunciation-related features associated with the particular word. The method also includes generating a pronunciation decision selecting one of the user pronunciation or the TTS pronunciation of the particular word that is associated with a highest confidence. The method also includes providing TTS audio that includes a synthesized speech representation of the response to the query using the user pronunciation or the TTS pronunciation for the particular word.
    Type: Grant
    Filed: March 3, 2021
    Date of Patent: June 13, 2023
    Assignee: Google LLC
    Inventors: Vijayaditya Peddinti, Bhuvana Ramabhadran, Andrew Rosenberg, Mateusz Golebiewski
  • Publication number: 20230103722
    Abstract: A method of guided data selection for masked speech modeling includes obtaining a sequence of encoded representations corresponding to an utterance. For each respective encoded representation, the method includes processing the respective encoded representation to generate a corresponding probability distribution over possible speech recognition hypotheses and assigning, to the respective encoded representation, a confidence score as the highest probability from the corresponding probability distribution over possible speech recognition hypotheses. The method also includes selecting a set of unmasked encoded representations to mask based on the confidence scores assigned to the sequence of encoded representations. The method also includes generating a set of masked encoded representations by masking the selected set of unmasked encoded representations.
    Type: Application
    Filed: August 18, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Murali Karthick Baskar
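The confidence-assignment and mask-selection steps can be sketched as follows. The abstract leaves the exact selection rule open; masking the highest-confidence frames is one plausible reading (assumed here), and `mask_fraction` is an illustrative parameter:

```python
def assign_confidences(distributions):
    """Confidence of each encoded representation = the highest probability in
    its distribution over speech recognition hypotheses."""
    return [max(d) for d in distributions]

def select_mask_indices(confidences, mask_fraction=0.4):
    """Pick indices of the unmasked representations to mask: the top
    mask_fraction of frames ranked by confidence (an assumption; the abstract
    only says selection is guided by the confidence scores)."""
    k = max(1, int(len(confidences) * mask_fraction))
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])
```

The returned indices would then be zeroed out or replaced with a learned mask embedding before the masked-prediction objective is applied.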
  • Publication number: 20230058447
    Abstract: A method for training a speech recognition model includes obtaining sample utterances of synthesized speech in a target domain, obtaining transcribed utterances of non-synthetic speech in the target domain, and pre-training the speech recognition model on the sample utterances of synthesized speech in the target domain to attain an initial state for warm-start training. After pre-training the speech recognition model, the method also includes warm-start training the speech recognition model on the transcribed utterances of non-synthetic speech in the target domain to teach the speech recognition model to learn to recognize real/human speech in the target domain.
    Type: Application
    Filed: August 20, 2021
    Publication date: February 23, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran
  • Publication number: 20230013587
    Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken text utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken text utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
    Type: Application
    Filed: April 15, 2022
    Publication date: January 19, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Gary Wang, Yu Zhang
  • Publication number: 20230009613
    Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
    Type: Application
    Filed: December 13, 2019
    Publication date: January 12, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yu Zhang
  • Patent number: 11475874
    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
    Type: Grant
    Filed: January 29, 2021
    Date of Patent: October 18, 2022
    Assignee: Google LLC
    Inventors: Yu Zhang, Bhuvana Ramabhadran, Andrew Rosenberg, Yonghui Wu, Byungha Chun, Ron Weiss, Yuan Cao
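The step of assigning a quantized embedding to a latent feature is characteristic of vector quantization: each latent is replaced by its nearest codebook entry. A minimal nearest-neighbor sketch (the codebook contents and the squared-L2 metric are assumptions, not taken from the abstract):

```python
def assign_quantized_embedding(latent, codebook):
    """Return the codebook vector closest to the latent feature in
    squared-L2 distance, i.e. a standard VQ assignment."""
    def sq_dist(entry):
        return sum((li - ci) ** 2 for li, ci in zip(latent, entry))
    return min(codebook, key=sq_dist)
```

Because every latent collapses onto one of a fixed set of codebook vectors, sampling different codebook assignments at inference time is one way such a model can produce diverse renditions of the same text.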
  • Publication number: 20220310065
    Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
    Type: Application
    Filed: March 22, 2022
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Zhehuai Chen, Gary Wang, Yu Zhang, Jesse Emond
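The per-utterance consistency loss in this abstract reduces to averaging frame-wise L2 distances between the two projected encoder-output sequences. A sketch, treating each projected output as a plain vector:

```python
import math

def per_utterance_consistency_loss(proj_a, proj_b):
    """Average L2 distance between corresponding projected encoder outputs
    of the two augmented (positive) copies of one utterance."""
    assert len(proj_a) == len(proj_b), "positive examples must align frame-wise"
    dists = [math.dist(x, y) for x, y in zip(proj_a, proj_b)]
    return sum(dists) / len(dists)
```

This term is added to the usual supervised loss, so the model is rewarded both for transcribing correctly and for encoding the two augmented copies of an utterance similarly.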
  • Publication number: 20220284882
    Abstract: A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation of a particular word present in a query spoken by a user. The method also includes receiving a TTS pronunciation of the same particular word that is present in a TTS input, where the TTS pronunciation of the particular word is different than the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features and TTS pronunciation-related features associated with the particular word. The method also includes generating a pronunciation decision selecting one of the user pronunciation or the TTS pronunciation of the particular word that is associated with a highest confidence. The method also includes providing TTS audio that includes a synthesized speech representation of the response to the query using the user pronunciation or the TTS pronunciation for the particular word.
    Type: Application
    Filed: March 3, 2021
    Publication date: September 8, 2022
    Applicant: Google LLC
    Inventors: Vijayaditya Peddinti, Bhuvana Ramabhadran, Andrew Rosenberg, Mateusz Golebiewski
  • Publication number: 20220246132
    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
    Type: Application
    Filed: January 29, 2021
    Publication date: August 4, 2022
    Applicant: Google LLC
    Inventors: Yu Zhang, Bhuvana Ramabhadran, Andrew Rosenberg, Yonghui Wu, Byungha Chun, Ron Weiss, Yuan Cao
  • Patent number: 11335324
    Abstract: A method for training a speech conversion model personalized for a target speaker with atypical speech includes obtaining a plurality of transcriptions in a set of spoken training utterances and obtaining a plurality of unspoken training text utterances. Each spoken training utterance is spoken by a target speaker associated with atypical speech and includes a corresponding transcription paired with a corresponding non-synthetic speech representation. The method also includes adapting, using the set of spoken training utterances, a text-to-speech (TTS) model to synthesize speech in a voice of the target speaker and that captures the atypical speech. For each unspoken training text utterance, the method also includes generating, as output from the adapted TTS model, a synthetic speech representation that includes the voice of the target speaker and that captures the atypical speech. The method also includes training the speech conversion model based on the synthetic speech representations.
    Type: Grant
    Filed: August 31, 2020
    Date of Patent: May 17, 2022
    Assignee: Google LLC
    Inventors: Fadi Biadsy, Liyang Jiang, Pedro J. Moreno Mengibar, Andrew Rosenberg
  • Publication number: 20220122581
    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
    Type: Application
    Filed: October 20, 2021
    Publication date: April 21, 2022
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
  • Publication number: 20220068257
    Abstract: A method for training a speech conversion model personalized for a target speaker with atypical speech includes obtaining a plurality of transcriptions in a set of spoken training utterances and obtaining a plurality of unspoken training text utterances. Each spoken training utterance is spoken by a target speaker associated with atypical speech and includes a corresponding transcription paired with a corresponding non-synthetic speech representation. The method also includes adapting, using the set of spoken training utterances, a text-to-speech (TTS) model to synthesize speech in a voice of the target speaker and that captures the atypical speech. For each unspoken training text utterance, the method also includes generating, as output from the adapted TTS model, a synthetic speech representation that includes the voice of the target speaker and that captures the atypical speech. The method also includes training the speech conversion model based on the synthetic speech representations.
    Type: Application
    Filed: August 31, 2020
    Publication date: March 3, 2022
    Applicant: Google LLC
    Inventors: Fadi Biadsy, Liyang Jiang, Pedro J. Moreno Mengibar, Andrew Rosenberg
  • Publication number: 20210280170
    Abstract: A method for training a speech recognition model includes receiving a set of training utterance pairs each including a non-synthetic speech representation and a synthetic speech representation of a same corresponding utterance. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the method also includes determining a consistent loss term for the corresponding training utterance pair based on a first probability distribution over possible non-synthetic speech recognition hypotheses generated for the corresponding non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses generated for the corresponding synthetic speech representation. The first and second probability distributions are generated for output by the speech recognition model.
    Type: Application
    Filed: February 8, 2021
    Publication date: September 9, 2021
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Jose Moreno Mengibar
  • Patent number: 10726826
    Abstract: A computer system receives a training set of voice data captured from one or more speakers and classified with one or more categorical prosodic labels according to one or more prosodic categories. The computer system transforms the voice data of the training set while preserving one or more portions of the voice data that determine the one or more categorical prosodic labels to automatically form a new training set of voice data automatically classified with the one or more categorical prosodic labels according to the one or more prosodic categories. The computer system augments a database comprising the training set with the new training set for training a speech model of an artificial intelligence system.
    Type: Grant
    Filed: March 4, 2018
    Date of Patent: July 28, 2020
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Raul Fernandez, Andrew Rosenberg, Alexander Sorin
  • Publication number: 20190272818
    Abstract: A computer system receives a training set of voice data captured from one or more speakers and classified with one or more categorical prosodic labels according to one or more prosodic categories. The computer system transforms the voice data of the training set while preserving one or more portions of the voice data that determine the one or more categorical prosodic labels to automatically form a new training set of voice data automatically classified with the one or more categorical prosodic labels according to the one or more prosodic categories. The computer system augments a database comprising the training set with the new training set for training a speech model of an artificial intelligence system.
    Type: Application
    Filed: March 4, 2018
    Publication date: September 5, 2019
    Inventors: Raul Fernandez, Andrew Rosenberg, Alexander Sorin