Patents by Inventor Bhuvana Ramabhadran

Bhuvana Ramabhadran has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Advancing the use of text and speech in ASR pretraining with consistency and contrastive losses

Patent number: 12272363

Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Type: Grant

Filed: April 15, 2022

Date of Patent: April 8, 2025

Assignee: Google LLC

Inventors: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Yuan Wang, Yu Zhang

Conformer-based speech conversion model

Patent number: 12272348

Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.

Type: Grant

Filed: March 16, 2022

Date of Patent: April 8, 2025

Assignee: Google LLC

Inventors: Bhuvana Ramabhadran, Zhehuai Chen, Fadi Biadsy, Pedro J. Moreno Mengibar

USING NON-PARALLEL VOICE CONVERSION FOR SPEECH CONVERSION MODELS

Publication number: 20250095639

Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation.

Type: Application

Filed: November 27, 2024

Publication date: March 20, 2025

Applicant: Google LLC

Inventors: Andrew M. Rosenberg, Gary Wang, Bhuvana Ramabhadran, Fadi Biadsy

Multilingual re-scoring models for automatic speech recognition

Patent number: 12254875

Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

Type: Grant

Filed: February 27, 2024

Date of Patent: March 18, 2025

Assignee: Google LLC

Inventors: Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Bhuvana Ramabhadran, Parisa Haghani, Pedro J. Moreno Mengibar
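
The second-pass selection described in the abstract of patent 12254875 can be illustrated with a minimal sketch. The abstract only says the overall score is "based on" the three component scores, so the weighted sum below, the weight values, and the names `Hypothesis`, `overall_score`, and `rescore` are assumptions made for illustration, not the patented formulation.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    likelihood_score: float   # un-normalized likelihood score from the second pass
    lm_score: float           # external language model score
    standalone_score: float   # score modeling prior statistics of the hypothesis


def overall_score(h, lm_weight=0.5, prior_weight=0.5):
    # Assumption: a weighted linear combination of the three scores.
    # The weights are hypothetical and may be negative if a term should be discounted.
    return h.likelihood_score + lm_weight * h.lm_score + prior_weight * h.standalone_score


def rescore(hypotheses, lm_weight=0.5, prior_weight=0.5):
    """Return the candidate with the highest overall score among the N candidates."""
    if not hypotheses:
        raise ValueError("need at least one candidate hypothesis")
    return max(hypotheses, key=lambda h: overall_score(h, lm_weight, prior_weight))


candidates = [
    Hypothesis("recognize speech", -1.0, -2.0, -1.0),
    Hypothesis("wreck a nice beach", -1.5, -0.5, -0.5),
]
best = rescore(candidates)
print(best.text)
```

With these made-up scores the second candidate wins, because its better language model and standalone scores outweigh its lower likelihood score.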

Injecting Text in Self-Supervised Speech Pre-training

Publication number: 20250078807

Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Type: Application

Filed: November 18, 2024

Publication date: March 6, 2025

Applicant: Google LLC

Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar

Zero-Shot Task Expansion of ASR Models Using Task Vectors

Publication number: 20250078813

Abstract: A method includes training, using an un-supervised learning technique, an auxiliary ASR model based on a first set of un-transcribed source task speech utterances to determine a first task vector, training, using the un-supervised learning technique, the auxiliary ASR model based on a second set of un-transcribed speech utterances to determine a second task vector, and training, using the un-supervised learning technique, the auxiliary ASR model based on un-transcribed target task speech utterances to determine a target task vector. The method also includes determining a first correlation between the first and target task vectors, determining a second correlation between the second and target task vectors, and adapting parameters of a trained primary ASR model based on the first and second source task vectors and the first and second correlations to teach the primary ASR model to learn how to recognize speech associated with the target task.

Type: Application

Filed: August 27, 2024

Publication date: March 6, 2025

Applicant: Google LLC

Inventors: Kartik Audhkhasi, Gowtham Ramesh, Bhuvana Ramabhadran
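
The adaptation step in the abstract of publication 20250078813 can be pictured with a small sketch. The abstract does not define how a task vector or a correlation is computed, so this example assumes that task vectors are flat lists of parameter deltas, that the correlation is cosine similarity, and that adaptation adds each source task vector scaled by its correlation with the target task vector. The names `cosine` and `adapt_parameters` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed correlation measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adapt_parameters(primary_params, source_task_vectors, target_task_vector):
    """Add each source task vector to the primary parameters, weighted by its
    correlation with the target task vector (an assumed combination rule)."""
    weights = [cosine(s, target_task_vector) for s in source_task_vectors]
    adapted = list(primary_params)
    for weight, source in zip(weights, source_task_vectors):
        for i, delta in enumerate(source):
            adapted[i] += weight * delta
    return adapted, weights


# Toy example: the first source task points the same way as the target task,
# the second is orthogonal to it and therefore contributes nothing.
adapted, weights = adapt_parameters(
    primary_params=[0.0, 0.0],
    source_task_vectors=[[1.0, 0.0], [0.0, 1.0]],
    target_task_vector=[1.0, 0.0],
)
print(weights, adapted)
```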

Scaling Multilingual Speech Synthesis with Zero Supervision of Found Data

Publication number: 20250078805

Abstract: A method includes receiving training data that includes a plurality of sets of training utterances each associated with a respective language. Each training utterance includes a corresponding reference speech representation paired with a corresponding input text sequence. For each training utterance, the method includes generating a corresponding encoded textual representation for the corresponding input text sequence, generating a corresponding speech encoding for the corresponding reference speech representation, generating a shared encoder output, and determining a text-to-speech (TTS) loss based on the corresponding encoded textual representation, the corresponding speech encoding, and the shared encoder output. The method also includes training a TTS model based on the TTS losses determined for the training utterances in each set of the training utterances to teach the TTS model to learn how to synthesize speech in each of the respective languages.

Type: Application

Filed: September 3, 2024

Publication date: March 6, 2025

Applicant: Google LLC

Inventors: Andrew M Rosenberg, Takaaki Saeki, Francoise Beaufays, Bhuvana Ramabhadran

Supervised and unsupervised training with contrastive loss over sequences

Patent number: 12230249

Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive audio data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.

Type: Grant

Filed: March 22, 2022

Date of Patent: February 18, 2025

Assignee: Google LLC

Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Zhehuai Chen, Yuan Wang, Yu Zhang, Jesse Emond
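
The per-utterance consistency loss in the abstract of patent 12230249 is an average of L2 distances between corresponding projected encoder outputs. A minimal sketch follows, assuming the two projected sequences are already available as equal-length lists of equal-dimension vectors. The encoder, the augmentation, and the projection are out of scope here, and the function names are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two vectors of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def per_utterance_consistency_loss(projected_a, projected_b):
    """Average the L2 distances between corresponding projected encoder outputs
    of the two positive (augmented) examples of one utterance."""
    if len(projected_a) != len(projected_b):
        raise ValueError("both projected sequences must have the same length")
    if not projected_a:
        raise ValueError("sequences must not be empty")
    distances = [l2_distance(u, v) for u, v in zip(projected_a, projected_b)]
    return sum(distances) / len(distances)


# Two time steps: the first pair is 5.0 apart, the second pair is identical.
loss = per_utterance_consistency_loss(
    [[0.0, 0.0], [1.0, 1.0]],
    [[3.0, 4.0], [1.0, 1.0]],
)
print(loss)  # 2.5
```

In training, this value would be combined with the supervised loss term before updating the model parameters. How the two terms are weighted is not stated in the abstract.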

Mixture Model Attention for Flexible Streaming and Non-Streaming Automatic Speech Recognition

Publication number: 20250022458

Abstract: A method for an automatic speech recognition (ASR) model for unifying streaming and non-streaming speech recognition includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.

Type: Application

Filed: September 25, 2024

Publication date: January 16, 2025

Applicant: Google LLC

Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar
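
The attention PDF in the abstract of publication 20250022458 is described as a set of mixture components of softmaxes over a context window. The sketch below shows only that combination step, assuming each component already has one score per position in the window and that the mixture weights are non-negative and sum to one. How the scores and weights are produced, and how the window differs between streaming and non-streaming use, is not covered. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_attention_pdf(component_scores, mixture_weights):
    """Combine per-component softmaxes over the context window into one PDF.

    component_scores[k][t] is the score of mixture component k at window position t.
    mixture_weights[k] is the (assumed normalized) weight of component k.
    """
    window = len(component_scores[0])
    pdf = [0.0] * window
    for weight, scores in zip(mixture_weights, component_scores):
        for t, p in enumerate(softmax(scores)):
            pdf[t] += weight * p
    return pdf


# Two components over a two-position window, equally weighted.
pdf = mixture_attention_pdf(
    component_scores=[[0.0, 0.0], [0.0, math.log(3.0)]],
    mixture_weights=[0.5, 0.5],
)
print(pdf)
```

Because each softmax sums to one and the weights sum to one, the combined values also sum to one and can be used as attention weights over the window.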

Using non-parallel voice conversion for speech conversion models

Patent number: 12190862

Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation.

Type: Grant

Filed: April 25, 2022

Date of Patent: January 7, 2025

Assignee: Google LLC

Inventors: Andrew M. Rosenberg, Gary Wang, Bhuvana Ramabhadran, Fadi Biadsy

Multilingual Re-Scoring Models for Automatic Speech Recognition

Publication number: 20240420692

Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

Type: Application

Filed: August 28, 2024

Publication date: December 19, 2024

Applicant: Google LLC

Inventors: Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Bhuvana Ramabhadran, Parisa Haghani, Pedro J. Moreno Mengibar

MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

Publication number: 20240404506

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

Type: Application

Filed: August 8, 2024

Publication date: December 5, 2024

Applicant: Google LLC

Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

Injecting text in self-supervised speech pre-training

Patent number: 12159617

Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Type: Grant

Filed: June 21, 2022

Date of Patent: December 3, 2024

Assignee: Google LLC

Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar

Mixture model attention for flexible streaming and non-streaming automatic speech recognition

Patent number: 12136415

Abstract: A method for an automatic speech recognition (ASR) model for unifying streaming and non-streaming speech recognition includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.

Type: Grant

Filed: December 15, 2021

Date of Patent: November 5, 2024

Assignee: Google LLC

Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar

USING TEXT-INJECTION TO RECOGNIZE SPEECH WITHOUT TRANSCRIPTION

Publication number: 20240304178

Abstract: A method includes receiving training data including transcribed speech utterances spoken in a general domain, modified speech utterances in a target domain, and unspoken textual utterances corresponding to the transcriptions of the modified speech utterances in the target domain. The modified speech utterances include utterances spoken in the target domain that have been modified to obfuscate one or more classes of sensitive information recited in the utterances. The method also includes generating a corresponding alignment output for each unspoken textual utterance of the received training data using an alignment model. The method also includes training a speech recognition model on the alignment outputs generated for the unspoken textual utterances, the un-transcribed speech utterances, and the transcribed speech utterances to teach the speech recognition model to learn to recognize speech in the target domain and phrases within the one or more classes of sensitive information.

Type: Application

Filed: February 12, 2024

Publication date: September 12, 2024

Applicant: Google LLC

Inventors: Andrew M Rosenberg, Yacob Yochai Blau, Bhuvana Ramabhadran, Genady Beryozkin, Gary Wang, Zhehuai Chen, Rohan Agrawal, Parisa Haghani

Multilingual speech synthesis and cross-language voice cloning

Patent number: 12087273

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

Type: Grant

Filed: January 30, 2023

Date of Patent: September 10, 2024

Assignee: Google LLC

Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

Regularizing word segmentation

Patent number: 12087279

Abstract: A method for subword segmentation includes receiving an input word to be segmented into a plurality of subword units. The method also includes executing a subword segmentation routine to segment the input word into a plurality of subword units by accessing a trained vocabulary set of subword units and selecting the plurality of subword units from the input word by greedily finding a longest subword unit from the input word that is present in the trained vocabulary set until an end of the input word is reached.

Type: Grant

Filed: March 23, 2022

Date of Patent: September 10, 2024

Assignee: Google LLC

Inventors: Bhuvana Ramabhadran, Hainan Xu, Kartik Audhkhasi, Yinghui Huang
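
The greedy longest-match routine described in the abstract of patent 12087279 can be sketched in a few lines. The vocabulary here is a plain Python set, and the single-character fallback for spans with no vocabulary match is an assumption added so the loop always terminates. The abstract does not say how out-of-vocabulary spans are handled. The name `segment_word` is hypothetical.

```python
def segment_word(word, vocab):
    """Segment `word` by repeatedly taking the longest prefix of the remaining
    text that is present in `vocab`, until the end of the word is reached."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known subword unit.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Assumption: emit a single character when nothing in vocab matches.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = {"un", "break", "able", "b", "r"}
print(segment_word("unbreakable", vocab))  # ['un', 'break', 'able']
```

This version rescans from the full remaining length at every step, which is quadratic in the word length in the worst case. A trie over the vocabulary would avoid the repeated slicing, at the cost of more code.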

Training speech synthesis to generate distinct speech sounds

Patent number: 12087272

Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.

Type: Grant

Filed: December 13, 2019

Date of Patent: September 10, 2024

Assignee: Google LLC

Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yu Zhang

Self-Training With Oracle And Top-Ranked Hypotheses

Publication number: 20240296832

Abstract: A method includes, for each training sample of a plurality of training samples, processing, using an RNN-T model, a corresponding sequence of acoustic frames to obtain an n-best list of speech recognition hypotheses, and, for each speech recognition hypothesis of the n-best list, determining a corresponding number of word errors relative to a corresponding ground-truth transcription. For a top-ranked hypothesis from the n-best list, the method includes determining a first loss based on the corresponding ground-truth transcription. The method includes identifying, as an oracle hypothesis, the speech recognition hypothesis from the n-best list having the smallest corresponding number of word errors relative to the corresponding ground-truth transcription, and determining a second loss for the oracle hypothesis based on the corresponding ground-truth transcription.

Type: Application

Filed: February 28, 2024

Publication date: September 5, 2024

Applicant: Google LLC

Inventors: Andrew M. Rosenberg, Murali Karthick Baskar, Bhuvana Ramabhadran
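
The hypothesis selection in the abstract of publication 20240296832 can be illustrated without a model. The sketch assumes the n-best list is already ordered by model score (so the first entry is the top-ranked hypothesis) and counts word errors as word-level edit distance, a common choice that the abstract does not spell out. The losses themselves are not computed here, and the function names are hypothetical.

```python
def word_errors(hypothesis, reference):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, 1):
        cur = [i]
        for j, ref_word in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,                           # drop a hypothesis word
                cur[j - 1] + 1,                        # add a missing reference word
                prev[j - 1] + (hyp_word != ref_word),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def pick_top_and_oracle(n_best, reference):
    """Return (top-ranked hypothesis, oracle hypothesis with the fewest word errors)."""
    top_ranked = n_best[0]
    oracle = min(n_best, key=lambda h: word_errors(h, reference))
    return top_ranked, oracle


n_best = ["the cat sat on mat", "the cat sat on the mat", "a cat sat"]
top, oracle = pick_top_and_oracle(n_best, "the cat sat on the mat")
print(top)     # the cat sat on mat
print(oracle)  # the cat sat on the mat
```

When several hypotheses tie for the fewest word errors, `min` keeps the one that appears earliest in the n-best list.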

MASK-CONFORMER AUGMENTING CONFORMER WITH MASK-PREDICT DECODER UNIFYING SPEECH RECOGNITION AND RESCORING

Publication number: 20240296837

Abstract: A method includes receiving a sequence of acoustic frames characterizing an utterance. During a first pass, the method includes generating first-pass audio encodings based on the sequence of acoustic frames using a stack of mask-conformer blocks of an acoustic encoder, generating a first-pass transcription of the utterance based on the first-pass audio encodings using a speech recognition decoder, and generating a first-pass masked output sequence using a mask-predict decoder of the acoustic encoder. During a second pass, the method includes generating second-pass audio encodings by performing cross-attention on the sequence of acoustic frames and the masked first-pass transcription using the stack of mask-conformer blocks of the acoustic encoder and generating a second-pass transcription of the utterance based on the second-pass audio encodings using the speech recognition decoder.

Type: Application

Filed: February 28, 2024

Publication date: September 5, 2024

Applicant: Google LLC

Inventors: Andrew M. Rosenberg, Yosuke Higuchi, Bhuvana Ramabhadran
