Patents by Inventor Bhuvana Ramabhadran

Bhuvana Ramabhadran has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230017892
    Abstract: A method includes receiving training data that includes unspoken textual utterances and un-transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
    Type: Application
    Filed: June 21, 2022
    Publication date: January 19, 2023
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
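
A minimal sketch of the pre-training loop this abstract describes. `TextToSpeechModel` and `AudioEncoder` are hypothetical stand-ins, not the publication's actual models, and a real encoder would apply a learned self-supervised loss where this sketch merely counts updates:

```python
import numpy as np

class TextToSpeechModel:
    """Stand-in for the TTS model that converts unspoken text to speech."""
    def synthesize(self, text: str) -> np.ndarray:
        # Fake (frames x 80 mel-bin) spectrogram, sized by text length.
        return np.random.randn(max(len(text), 1) * 5, 80)

class AudioEncoder:
    """Stand-in for the audio encoder being pre-trained."""
    def __init__(self):
        self.steps = 0
    def update(self, features: np.ndarray) -> None:
        # A real encoder would apply a self-supervised loss here.
        self.steps += 1

def pretrain(encoder, tts, unspoken_texts, untranscribed_speech):
    # Synthetic speech gives the encoder a text-grounded view of speech...
    for text in unspoken_texts:
        encoder.update(tts.synthesize(text))
    # ...while un-transcribed real speech anchors it in natural acoustics,
    # pushing the encoder toward shared speech and text representations.
    for utterance in untranscribed_speech:
        encoder.update(utterance)

encoder = AudioEncoder()
pretrain(encoder, TextToSpeechModel(),
         ["hello world"], [np.random.randn(120, 80)])
print(encoder.steps)  # one update per training utterance
```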
  • Publication number: 20230013587
    Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
    Type: Application
    Filed: April 15, 2022
    Publication date: January 19, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Gary Wang, Yu Zhang
  • Publication number: 20230009613
    Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
    Type: Application
    Filed: December 13, 2019
    Publication date: January 12, 2023
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yu Zhang
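
The phone-label loss in this abstract can be illustrated with a short sketch: at each time step, the predicted phone distribution is scored against the aligned reference phone with cross-entropy. The one-to-one frame alignment and the `phone_label_loss` helper are simplifying assumptions, not the publication's exact formulation:

```python
import numpy as np

def phone_label_loss(predicted_logits: np.ndarray,
                     reference_phones: np.ndarray) -> float:
    """predicted_logits: (time_steps, num_phones); reference_phones: (time_steps,)."""
    # Softmax over the phone inventory at every time step.
    exp = np.exp(predicted_logits - predicted_logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Cross-entropy of the aligned reference phone at each step, averaged.
    step_losses = -np.log(probs[np.arange(len(reference_phones)), reference_phones])
    return float(step_losses.mean())

logits = np.random.randn(4, 40)   # 4 time steps, 40-phone inventory
refs = np.array([3, 3, 17, 25])   # aligned reference phone labels
print(phone_label_loss(logits, refs))
```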
  • Patent number: 11475874
    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
    Type: Grant
    Filed: January 29, 2021
    Date of Patent: October 18, 2022
    Assignee: Google LLC
    Inventors: Yu Zhang, Bhuvana Ramabhadran, Andrew Rosenberg, Yonghui Wu, Byungha Chun, Ron Weiss, Yuan Cao
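
The "assigns a quantized embedding to the latent feature" step is standard vector quantization: a nearest-neighbor lookup in a codebook. A hedged sketch, with a random codebook standing in for the one the training process would learn:

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return the codebook entry closest (in L2) to each latent feature."""
    # distances: (num_latents, codebook_size)
    distances = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=2)
    return codebook[distances.argmin(axis=1)]

codebook = np.random.randn(256, 64)   # 256 quantized embeddings of size 64
latents = np.random.randn(10, 64)     # latent features from spectrogram slices
quantized = quantize(latents, codebook)
print(quantized.shape)                # (10, 64)
```

Per the abstract, the decoder then consumes the concatenation of the speech embeddings and these quantized embeddings to generate the speech sample.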
  • Publication number: 20220310056
    Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.
    Type: Application
    Filed: March 16, 2022
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Bhuvana Ramabhadran, Zhehuai Chen, Fadi Biadsy, Pedro J. Moreno Mengibar
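
A rough sketch of one block from the encoder's self-attention stack applied to an input spectrogram. The single-head design and the absence of learned query/key/value projections are simplifying assumptions; the publication's encoder stacks several richer blocks:

```python
import numpy as np

def self_attention_block(x: np.ndarray) -> np.ndarray:
    """x: (frames, dim). One single-head self-attention pass with a residual."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # frame-to-frame affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over frames
    return x + weights @ x                         # residual connection

spectrogram = np.random.randn(100, 80)             # 100 frames, 80 mel bins
encoded = self_attention_block(self_attention_block(spectrogram))
print(encoded.shape)                               # (100, 80)
```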
  • Publication number: 20220310061
    Abstract: A method for subword segmentation includes receiving an input word to be segmented into a plurality of subword units. The method also includes executing a subword segmentation routine to segment the input word into a plurality of subword units by accessing a trained vocabulary set of subword units and selecting the plurality of subword units from the input word by greedily finding the longest subword unit from the input word that is present in the trained vocabulary set until an end of the input word is reached.
    Type: Application
    Filed: March 23, 2022
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Bhuvana Ramabhadran, Hainan Xu, Kartik Audhkhasi, Yinghui Huang
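
The greedy longest-match routine is simple enough to sketch directly: repeatedly take the longest vocabulary subword that prefixes the remaining input. The single-character fallback for out-of-vocabulary characters is an assumption the abstract does not specify:

```python
def segment(word: str, vocabulary: set[str]) -> list[str]:
    subwords, start = [], 0
    while start < len(word):
        # Greedily find the longest subword unit present in the vocabulary.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                subwords.append(word[start:end])
                start = end
                break
        else:
            subwords.append(word[start])  # fallback: emit a single character
            start += 1
    return subwords

print(segment("unbreakable", {"un", "break", "able", "a", "ble"}))
# ['un', 'break', 'able']
```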
  • Publication number: 20220310065
    Abstract: A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive audio data example into a contrastive loss space. The method also includes determining an L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
    Type: Application
    Filed: March 22, 2022
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Andrew Rosenberg, Bhuvana Ramabhadran, Zhehuai Chen, Gary Wang, Yu Zhang, Jesse Emond
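
A sketch of the per-utterance consistency loss described above: project each positive example's encoder outputs into the contrastive loss space, take per-frame L2 distances, and average. The random projection matrix stands in for the learned projection:

```python
import numpy as np

def per_utterance_consistency_loss(enc_a: np.ndarray, enc_b: np.ndarray,
                                   projection: np.ndarray) -> float:
    """enc_a, enc_b: (frames, dim) encoder outputs for the two positive examples."""
    proj_a = enc_a @ projection   # into the contrastive loss space
    proj_b = enc_b @ projection
    l2 = np.linalg.norm(proj_a - proj_b, axis=1)  # per-frame L2 distances
    return float(l2.mean())                       # average over the utterance

rng = np.random.default_rng(0)
enc_a, enc_b = rng.normal(size=(50, 256)), rng.normal(size=(50, 256))
projection = rng.normal(size=(256, 128))
print(per_utterance_consistency_loss(enc_a, enc_b, projection))
```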
  • Publication number: 20220309340
    Abstract: A method for distilling one or more trained teacher automatic speech recognition (ASR) models into a multilingual student model includes receiving a plurality of teacher training examples and a plurality of student training examples. The method also includes training one or more teacher automatic speech recognition (ASR) models using the plurality of teacher training examples. Each teacher ASR model is configured to output a respective textual representation of a respective audio input. The method further includes generating a multilingual student ASR model by training the multilingual student ASR model using the plurality of student training examples and distilling the trained one or more teacher ASR models into the multilingual student ASR model using a tunable distillation loss weight. The student ASR model is configured to receive an audio input and output a corresponding textual representation of the received audio input.
    Type: Application
    Filed: December 7, 2021
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Bhuvana Ramabhadran, Manasa Prasad, Pedro J. Moreno Mengibar, Yun Zhu
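
The tunable distillation loss weight can be read as interpolating a supervised loss with a teacher-matching loss. A sketch under that assumption, with `alpha` as the tunable weight:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, target_ids, alpha=0.5):
    probs = softmax(student_logits)
    # Supervised cross-entropy against the reference labels.
    supervised = -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()
    # Cross-entropy against the teacher's soft output distribution.
    distill = -(softmax(teacher_logits) * np.log(probs)).sum(axis=-1).mean()
    return alpha * distill + (1.0 - alpha) * supervised

student = np.random.randn(6, 30)   # 6 output steps, 30 output symbols
teacher = np.random.randn(6, 30)
targets = np.random.randint(0, 30, size=6)
print(distillation_loss(student, teacher, targets, alpha=0.7))
```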
  • Publication number: 20220310074
    Abstract: A method for unifying streaming and non-streaming speech recognition in an automatic speech recognition (ASR) model includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
    Type: Application
    Filed: December 15, 2021
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar
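
A loose sketch of the mixture-model attention computation: each mixture component is a softmax over a restricted context window, and the attention PDF is their convex combination. The random component scores and fixed mixture weights stand in for quantities a trained network would produce:

```python
import numpy as np

def mixture_attention_pdf(scores: np.ndarray, mixture_weights: np.ndarray,
                          window: int) -> np.ndarray:
    """scores: (num_components, seq_len); returns one attention PDF (seq_len,)."""
    masked = np.full_like(scores, -np.inf)
    masked[:, :window] = scores[:, :window]        # restrict to the context window
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    softmaxes = e / e.sum(axis=1, keepdims=True)   # one softmax per component
    return mixture_weights @ softmaxes             # convex combination

scores = np.random.randn(4, 20)                    # 4 mixture components
weights = np.array([0.4, 0.3, 0.2, 0.1])           # mixture weights, sum to 1
pdf = mixture_attention_pdf(scores, weights, window=10)
print(pdf.sum())                                   # ~1.0: a valid PDF
```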
  • Publication number: 20220310073
    Abstract: A method for unifying streaming and non-streaming speech recognition in an automatic speech recognition (ASR) model includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
    Type: Application
    Filed: December 15, 2021
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar
  • Publication number: 20220310081
    Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
    Type: Application
    Filed: March 22, 2022
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Bhuvana Ramabhadran, Parisa Haghani, Pedro J. Moreno Mengibar
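
The second-pass arithmetic lends itself to a short sketch. How the three scores are weighted and whether the standalone prior is added or subtracted are assumptions here; some second-pass schemes subtract an internal prior in the log domain:

```python
def rescore(hypotheses):
    """hypotheses: list of (text, likelihood, lm_score, prior_score), log-domain."""
    def overall(h):
        _, likelihood, lm_score, prior = h
        # Log-domain scores combine by summation; subtracting the standalone
        # prior is one common choice, not a detail given in the abstract.
        return likelihood + lm_score - prior
    return max(hypotheses, key=overall)[0]

candidates = [
    ("recognize speech", -12.3, -4.1, -6.0),
    ("wreck a nice beach", -11.9, -7.8, -5.2),
]
print(rescore(candidates))   # "recognize speech"
```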
  • Publication number: 20220284882
    Abstract: A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation of a particular word present in a query spoken by a user. The method also includes receiving a TTS pronunciation of the same particular word that is present in a TTS input where the TTS pronunciation of the particular word is different than the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features and TTS pronunciation-related features associated with the particular word. The method also includes generating a pronunciation decision selecting one of the user pronunciation or the TTS pronunciation of the particular word that is associated with the highest confidence. The method also includes providing the TTS audio that includes a synthesized speech representation of the response to the query using the user pronunciation or the TTS pronunciation for the particular word.
    Type: Application
    Filed: March 3, 2021
    Publication date: September 8, 2022
    Applicant: Google LLC
    Inventors: Vijayaditya Peddinti, Bhuvana Ramabhadran, Andrew Rosenberg, Mateusz Golebiewski
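
A toy sketch of the pronunciation decision: keep whichever pronunciation carries the higher confidence. The phone strings and confidence values below are hypothetical placeholders:

```python
def choose_pronunciation(user_pron, user_confidence, tts_pron, tts_confidence):
    """Return the pronunciation associated with the highest confidence."""
    return user_pron if user_confidence >= tts_confidence else tts_pron

# E.g. the user's pronunciation of "nevada" arrives with high ASR confidence.
print(choose_pronunciation("n ah v ae d ah", 0.92, "n ah v aa d ah", 0.75))
```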
  • Patent number: 11416741
    Abstract: A technique for constructing a model supporting a plurality of domains is disclosed. In the technique, a plurality of teacher models, each of which is specialized for a different one of the plurality of domains, is prepared. A plurality of training data collections, each of which is collected for a different one of the plurality of domains, is obtained. A plurality of soft label sets is generated by inputting each training example in the plurality of training data collections into the corresponding one of the plurality of teacher models. A student model is trained using the plurality of soft label sets.
    Type: Grant
    Filed: June 8, 2018
    Date of Patent: August 16, 2022
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Takashi Fukuda, Osamu Ichikawa, Samuel Thomas, Bhuvana Ramabhadran
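
A sketch of the soft-label generation step: each domain-specialized teacher labels its own domain's training collection, and the pooled soft-label sets then train the student. The random linear teachers are stand-ins for real models:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def generate_soft_label_sets(teachers, data_collections):
    """teachers[i] is specialized for domain i; data_collections[i] matches it."""
    soft_label_sets = []
    for teacher, collection in zip(teachers, data_collections):
        soft_label_sets.append([softmax(teacher(x)) for x in collection])
    return soft_label_sets

# Three toy teachers with distinct random weights (one per domain).
teachers = [lambda x, W=np.random.randn(16, 10): x @ W for _ in range(3)]
data = [[np.random.randn(16) for _ in range(4)] for _ in range(3)]
labels = generate_soft_label_sets(teachers, data)
print(len(labels), len(labels[0]))   # 3 domains, 4 soft labels each
```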
  • Patent number: 11417322
    Abstract: Methods, systems, and apparatus, including computer programs stored on a computer-readable storage medium, for transliteration for speech recognition training and scoring. In some implementations, language examples are accessed, some of which include words in a first script and words in one or more other scripts. At least portions of some of the language examples are transliterated to the first script to generate a training data set. A language model is generated based on occurrences of the different sequences of words in the training data set in the first script. The language model is used to perform speech recognition for an utterance.
    Type: Grant
    Filed: December 12, 2019
    Date of Patent: August 16, 2022
    Assignee: Google LLC
    Inventors: Bhuvana Ramabhadran, Min Ma, Pedro J. Moreno Mengibar, Jesse Emond, Brian E. Roark
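
A rough sketch of the data preparation this abstract describes: transliterate mixed-script examples into the first script, then collect n-gram counts over the unified-script text for the language model. The tiny `TRANSLIT` table is a hypothetical placeholder for a real transliterator:

```python
from collections import Counter

TRANSLIT = {"привет": "privet", "мир": "mir"}   # toy Cyrillic-to-Latin table

def to_first_script(example: str) -> str:
    return " ".join(TRANSLIT.get(word, word) for word in example.split())

examples = ["hello мир", "привет world", "hello world"]
unified = [to_first_script(e) for e in examples]
tokens = [s.split() for s in unified]
bigrams = Counter((t[i], t[i + 1]) for t in tokens for i in range(len(t) - 1))
print(bigrams.most_common(2))   # word-sequence counts in the first script
```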
  • Patent number: 11410029
    Abstract: A technique for generating soft labels for training is disclosed. A teacher model having a teacher side class set is prepared. A collection of class pairs for respective data units is obtained. Each class pair includes classes labelled to a corresponding data unit from the teacher side class set and from a student side class set different from the teacher side class set. A training input is fed into the teacher model to obtain a set of outputs for the teacher side class set. A set of soft labels for the student side class set is calculated from the set of the outputs by using at least an output obtained for a class within a subset of the teacher side class set having relevance to each member of the student side class set, based at least in part on observations in the collection of the class pairs.
    Type: Grant
    Filed: January 2, 2018
    Date of Patent: August 9, 2022
    Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Takashi Fukuda, Samuel Thomas, Bhuvana Ramabhadran
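
A sketch of the teacher-to-student class mapping: each student class aggregates probability mass from the teacher classes deemed relevant to it. The handwritten `relevance` map stands in for the mapping the patent derives from the observed class pairs:

```python
import numpy as np

# relevance[student_class] = teacher classes relevant to that student class
relevance = {0: [0, 1], 1: [2], 2: [3, 4, 5]}

def teacher_to_student_soft_labels(teacher_probs: np.ndarray) -> np.ndarray:
    """Map a distribution over 6 teacher classes onto 3 student classes."""
    soft = np.array([teacher_probs[idx].sum() for idx in relevance.values()])
    return soft / soft.sum()   # renormalize into a student-side distribution

teacher_probs = np.array([0.1, 0.2, 0.3, 0.15, 0.15, 0.1])
print(teacher_to_student_soft_labels(teacher_probs))   # [0.3, 0.3, 0.4]
```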
  • Publication number: 20220246132
    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
    Type: Application
    Filed: January 29, 2021
    Publication date: August 4, 2022
    Applicant: Google LLC
    Inventors: Yu Zhang, Bhuvana Ramabhadran, Andrew Rosenberg, Yonghui Wu, Byungha Chun, Ron Weiss, Yuan Cao
  • Publication number: 20220122581
    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistency loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistency loss term.
    Type: Application
    Filed: October 20, 2021
    Publication date: April 21, 2022
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
  • Publication number: 20220068255
    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
    Type: Application
    Filed: November 11, 2021
    Publication date: March 3, 2022
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
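
A compact sketch of the adversarial terms. The logistic GAN losses below are a standard formulation assumed for illustration, not taken from the publication: the discriminator learns to separate non-synthetic from synthetic speech, and the TTS generator is updated to shrink that disparity:

```python
import numpy as np

def discriminator_loss(d_real: float, d_synth: float) -> float:
    """d_*: discriminator logits for non-synthetic and synthetic speech."""
    # Discriminator: push real logits up, synthetic logits down.
    # log1p(exp(-x)) is softplus(-x), i.e. -log(sigmoid(x)).
    return float(np.log1p(np.exp(-d_real)) + np.log1p(np.exp(d_synth)))

def generator_adversarial_loss(d_synth: float) -> float:
    # TTS generator: make synthetic speech score as real (logit high).
    return float(np.log1p(np.exp(-d_synth)))

print(discriminator_loss(d_real=2.0, d_synth=-1.5))
print(generator_adversarial_loss(d_synth=-1.5))  # large until synthesis improves
```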
  • Patent number: 11222620
    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
    Type: Grant
    Filed: May 7, 2020
    Date of Patent: January 11, 2022
    Assignee: Google LLC
    Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar
  • Publication number: 20210350786
    Abstract: A method for training a generative adversarial network (GAN)-based text-to-speech (TTS) model and a speech recognition model in unison includes obtaining a plurality of training text utterances. At each of a plurality of output steps for each training text utterance, the method also includes generating, for output by the GAN-based TTS model, a synthetic speech representation of the corresponding training text utterance, and determining, using an adversarial discriminator of the GAN, an adversarial loss term indicative of an amount of acoustic noise disparity in one of the non-synthetic speech representations selected from the set of spoken training utterances relative to the corresponding synthetic speech representation of the corresponding training text utterance. The method also includes updating parameters of the GAN-based TTS model based on the adversarial loss term determined at each of the plurality of output steps for each training text utterance of the plurality of training text utterances.
    Type: Application
    Filed: May 7, 2020
    Publication date: November 11, 2021
    Applicant: Google LLC
    Inventors: Zhehuai Chen, Andrew M. Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar