Patents by Inventor Zhehuai Chen
Zhehuai Chen has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11929060
Abstract: A method for training a speech recognition model includes receiving a set of training utterance pairs each including a non-synthetic speech representation and a synthetic speech representation of a same corresponding utterance. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the method also includes determining a consistent loss term for the corresponding training utterance pair based on a first probability distribution over possible non-synthetic speech recognition hypotheses generated for the corresponding non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses generated for the corresponding synthetic speech representation. The first and second probability distributions are generated for output by the speech recognition model.
Type: Grant
Filed: February 8, 2021
Date of Patent: March 12, 2024
Assignee: Google LLC
Inventors: Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Jose Moreno Mengibar
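The consistency term described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not specify the divergence used, so a symmetric KL divergence between the two per-step hypothesis distributions is assumed here, and all function names are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(real_logits, synth_logits, eps=1e-8):
    # Per-output-step hypothesis distributions for the non-synthetic
    # and synthetic renderings of the same utterance: (steps, vocab).
    p = softmax(real_logits)
    q = softmax(synth_logits)
    # Symmetric KL divergence at each output step, averaged over steps.
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))
```

Minimizing such a term pushes the recognizer to produce the same hypothesis distribution whether it hears real or synthesized audio of an utterance.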
-
Publication number: 20240028829
Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.
Type: Application
Filed: July 1, 2023
Publication date: January 25, 2024
Applicant: Google LLC
Inventors: Tara N. Sainath, Zhouyuan Huo, Zhehuai Chen, Yu Zhang, Weiran Wang, Trevor Strohman, Rohit Prakash Prabhavalkar, Bo Li, Ankur Bapna
-
Publication number: 20240029715
Abstract: A method includes receiving training data that includes unspoken textual utterances in a target language. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. The method also includes generating a corresponding alignment output for each unspoken textual utterance using an alignment model trained on transcribed speech utterances in one or more training languages, each different from the target language. The method also includes generating a corresponding encoded textual representation for each alignment output using a text encoder and training a speech recognition model on the encoded textual representations generated for the alignment outputs. Training the speech recognition model teaches the speech recognition model to learn how to recognize speech in the target language.
Type: Application
Filed: July 20, 2023
Publication date: January 25, 2024
Applicant: Google LLC
Inventors: Andrew Rosenberg, Zhehuai Chen, Ankur Bapna, Yu Zhang, Bhuvana Ramabhadran
-
Publication number: 20240013777
Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, where each spoken utterance includes audio data characterizing that utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method includes training an automatic speech recognition (ASR) model on the subset of utterances.
Type: Application
Filed: May 19, 2023
Publication date: January 11, 2024
Applicant: Google LLC
Inventors: Zhiyun Lu, Yu Zhang, Wei Han, Yongqiang Wang, Parisa Haghani, Zhehuai Chen
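The contrastive selection step can be illustrated with a toy version: score each utterance by the gap between its log-likelihood under a target-domain model and under a background model, then keep the top-scoring subset. Smoothed unigram models stand in here for the publication's (unspecified) contrastive data selection model; all names and parameters are assumptions for illustration.

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size, alpha=1.0):
    # Additively smoothed unigram log-probability of a token sequence.
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab_size))
               for t in tokens)

def select_for_domain(utterances, target_corpus, background_corpus, top_k):
    """Rank utterances by how much more likely they are under a
    target-domain model than under a background model; keep the top_k."""
    vocab_size = len(set(w for u in target_corpus + background_corpus for w in u))
    tc = Counter(w for u in target_corpus for w in u)
    bc = Counter(w for u in background_corpus for w in u)
    tn, bn = sum(tc.values()), sum(bc.values())
    scored = sorted(
        utterances,
        key=lambda u: (unigram_logprob(u, tc, tn, vocab_size)
                       - unigram_logprob(u, bc, bn, vocab_size)),
        reverse=True)
    return scored[:top_k]
```

The selected subset is then what the ASR model would be trained on.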
-
Patent number: 11837216
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Grant
Filed: February 14, 2023
Date of Patent: December 5, 2023
Assignee: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
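The adversarial term in this family of patents can be sketched with a logistic discriminator over speech feature vectors. This is an illustrative stand-in for the GAN discriminator described in the abstract, not the patented architecture; the non-saturating generator loss and all names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adversarial_loss(disc_w, synth_feats, real_feats, eps=1e-8):
    """Losses for a linear logistic discriminator D over speech features.
    d_loss trains D to separate real (non-synthetic) from synthetic
    features; g_loss is the adversarial term that pushes the TTS model
    toward synthetic features that D scores as real."""
    d_real = sigmoid(real_feats @ disc_w)
    d_synth = sigmoid(synth_feats @ disc_w)
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_synth + eps))
    g_loss = -np.mean(np.log(d_synth + eps))  # non-saturating generator loss
    return float(d_loss), float(g_loss)
```

In the patented setup, `g_loss` would be the term used to update the TTS model's parameters at each output step, while `d_loss` would update the discriminator.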
-
Publication number: 20230317059
Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes pre-training an audio encoder on the alignment outputs generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
Type: Application
Filed: February 13, 2023
Publication date: October 5, 2023
Applicant: Google LLC
Inventors: Andrew M Rosenberg, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Publication number: 20230197057
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Application
Filed: February 14, 2023
Publication date: June 22, 2023
Applicant: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Patent number: 11605368
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Grant
Filed: November 11, 2021
Date of Patent: March 14, 2023
Assignee: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Publication number: 20230013587
Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
Type: Application
Filed: April 15, 2022
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Gary Wang, Yu Zhang
-
Publication number: 20230017892
Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
Type: Application
Filed: June 21, 2022
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
-
Publication number: 20220310056
Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.
Type: Application
Filed: March 16, 2022
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Bhuvana Ramabhadran, Zhehuai Chen, Fadi Biadsy, Pedro J. Moreno Mengibar
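The encoder's stack of self-attention blocks can be sketched as below. Single-head attention with residual connections is assumed (the publication does not fix these details in the abstract), and the weight matrices are passed in explicitly for brevity; this is an illustration, not the patented model.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a spectrogram sequence x: (frames, dim)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encode(spectrogram, layers):
    """Stack of self-attention blocks with residual connections,
    mapping an input spectrogram to an encoded spectrogram."""
    h = spectrogram
    for wq, wk, wv in layers:
        h = h + self_attention(h, wq, wk, wv)
    return h
```

A spectrogram decoder would then consume `encode(...)`'s output to produce the converted speech spectrogram.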
-
Publication number: 20220310065
Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
Type: Application
Filed: March 22, 2022
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Zhehuai Chen, Gary Wang, Yu Zhang, Jesse Emond
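Once the two augmented copies of an utterance have been encoded and projected, the per-utterance consistency loss described in this abstract reduces to a short computation. A minimal sketch, with the projected sequences passed in as arrays (names are illustrative):

```python
import numpy as np

def per_utterance_consistency(proj_a, proj_b):
    """Average L2 distance between corresponding projected encoder
    outputs for the two augmented copies of one utterance.
    proj_a, proj_b: (steps, dim) projections in the contrastive space."""
    l2 = np.linalg.norm(proj_a - proj_b, axis=-1)  # one distance per step
    return float(l2.mean())
```

This term would be added to the supervised loss when updating the speech recognition model's parameters.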
-
Publication number: 20220122581
Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
Type: Application
Filed: October 20, 2021
Publication date: April 21, 2022
Applicant: Google LLC
Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
-
Publication number: 20220068255
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Application
Filed: November 11, 2021
Publication date: March 3, 2022
Applicant: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Patent number: 11222620
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Grant
Filed: May 7, 2020
Date of Patent: January 11, 2022
Assignee: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Publication number: 20210350786
Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
Type: Application
Filed: May 7, 2020
Publication date: November 11, 2021
Applicant: Google LLC
Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
-
Publication number: 20210280170
Abstract: A method for training a speech recognition model includes receiving a set of training utterance pairs each including a non-synthetic speech representation and a synthetic speech representation of a same corresponding utterance. At each of a plurality of output steps for each training utterance pair in the set of training utterance pairs, the method also includes determining a consistent loss term for the corresponding training utterance pair based on a first probability distribution over possible non-synthetic speech recognition hypotheses generated for the corresponding non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses generated for the corresponding synthetic speech representation. The first and second probability distributions are generated for output by the speech recognition model.
Type: Application
Filed: February 8, 2021
Publication date: September 9, 2021
Applicant: Google LLC
Inventors: Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Jose Moreno Mengibar
-
Publication number: 20190057685
Abstract: The present disclosure describes a method and device for speech recognition and decoding, pertaining to the field of speech processing. The method comprises: receiving speech information and extracting an acoustic feature; computing information of the acoustic feature according to a connection sequential classification model; and, when a frame in the acoustic feature information is a non-blank model frame, performing linguistic information searching using a weighted finite-state transducer that adapts the acoustic modeling information and stores historical data, or otherwise discarding the frame. By establishing the connection sequential classification model, the acoustic modeling is more accurate. By using the weighted finite-state transducer, model representation is more efficient, and computation and memory consumption are reduced by nearly 50%. By using a phoneme synchronization method during decoding, the amount and number of computations required for model search are effectively reduced.
Type: Application
Filed: May 6, 2016
Publication date: February 21, 2019
Inventors: Kai Yu, Weida Zhou, Zhehuai Chen, Wei Deng, Tao Xu
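The phoneme-synchronous idea above, skipping frames the acoustic model labels as blank so the transducer search only runs on informative frames, can be sketched as follows. The threshold and all names are illustrative assumptions, not the patented algorithm.

```python
import numpy as np

def nonblank_frames(posteriors, blank_id=0, blank_thresh=0.95):
    """Phone-synchronous frame selection: discard frames where the
    blank symbol dominates, keeping only frames that carry phonetic
    evidence for the downstream WFST search.

    posteriors: (frames, symbols) per-frame posterior probabilities.
    Returns the kept frame indices and their posterior rows."""
    keep = posteriors[:, blank_id] < blank_thresh
    return np.nonzero(keep)[0], posteriors[keep]
```

Because typical frame-level posteriors are blank-dominated most of the time, filtering like this is what makes the reported large reduction in search computation plausible.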