Patents by Inventor Pedro J. Moreno Mengibar
Pedro J. Moreno Mengibar has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20230038343

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for an automated calling system are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance spoken by a user who is having a telephone conversation with a bot. The actions further include determining a context of the telephone conversation. The actions further include determining a user intent of a first previous portion of the telephone conversation spoken by the user and a bot intent of a second previous portion of the telephone conversation outputted by a speech synthesizer of the bot. The actions further include, based on the audio data of the utterance, the context of the telephone conversation, the user intent, and the bot intent, generating synthesized speech of a reply by the bot to the utterance. The actions further include providing, for output, the synthesized speech.

Type: Application
Filed: October 12, 2022
Publication date: February 9, 2023
Inventors: Asaf Aharoni, Arun Narayanan, Nir Shabat, Parisa Haghani, Galen Tsai Chuang, Yaniv Leviathan, Neeraj Gaur, Pedro J. Moreno Mengibar, Rohit Prakash Prabhavalkar, Zhongdi Qu, Austin Severn Waters, Tomer Amiaz, Michiel A.U. Bacchiani
-
Publication number: 20230013587

Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken text utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken text utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Type: Application
Filed: April 15, 2022
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Andrew Rosenberg, Zhehuai Chen, Bhuvana Ramabhadran, Pedro J. Moreno Mengibar, Gary Wang, Yu Zhang
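As a rough illustration of how the three training streams in the abstract above fit together, here is a toy data-flow sketch. The tts() stub and all data items are invented placeholders, not the patented model; only the overall pooling of the three sources follows the abstract.

```python
# Toy sketch of the three training streams described above. The tts()
# stub stands in for a real text-to-speech model; all data are invented.

def tts(text):
    """Placeholder TTS: returns a tag representing synthetic audio."""
    return f"<synthetic audio for '{text}'>"

unspoken_text = ["hello world"]                            # text only, no audio
untranscribed_speech = ["<audio clip 1>"]                  # audio only, no text
transcribed_speech = [("<audio clip 2>", "good morning")]  # paired audio/text

# Synthesize speech for the unspoken text, then pool all three sources
# into a single pre-training corpus for the audio encoder.
synthetic = [tts(t) for t in unspoken_text]
pretraining_corpus = (synthetic
                      + untranscribed_speech
                      + [audio for audio, _ in transcribed_speech])
```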
-
Publication number: 20230017892

Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken text utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken text utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.

Type: Application
Filed: June 21, 2022
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Zhehuai Chen, Bhuvana Ramabhadran, Andrew M. Rosenberg, Yu Zhang, Pedro J. Moreno Mengibar
-
Patent number: 11532299

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for modulating language model biasing. In some implementations, context data is received. A likely context associated with a user is determined based on at least a portion of the context data. One or more language model biasing parameters are selected based at least on the likely context associated with the user. A context confidence score associated with the likely context is determined based on at least a portion of the context data. One or more language model biasing parameters are adjusted based at least on the context confidence score. A baseline language model is biased based at least on one or more of the adjusted language model biasing parameters. The baseline language model is provided for use by an automated speech recognizer (ASR).

Type: Grant
Filed: June 9, 2020
Date of Patent: December 20, 2022
Assignee: Google LLC
Inventors: Pedro J. Moreno Mengibar, Petar Aleksic
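The confidence-modulated biasing described above could be sketched as follows. The linear scaling of per-phrase boosts and the log-domain score adjustment are assumptions for illustration; the patent only states that biasing parameters are selected per likely context and adjusted by a context confidence score.

```python
# Hypothetical sketch of confidence-modulated language model biasing.
# The phrase boosts and the linear modulation rule are invented.

def adjust_biasing_params(base_params, confidence):
    """Scale each context-specific bias boost by the context confidence."""
    return {phrase: boost * confidence for phrase, boost in base_params.items()}

def biased_score(baseline_score, phrase, adjusted_params):
    """Apply the adjusted boost (log-domain) to a baseline LM score."""
    return baseline_score + adjusted_params.get(phrase, 0.0)

# A "navigation" context boosts navigation phrases; a confidence of 0.8
# tempers the boost so an uncertain context biases the model less.
params = adjust_biasing_params({"directions to": 2.0, "navigate": 1.5}, 0.8)
score = biased_score(-5.0, "directions to", params)
```

A low context confidence thus degrades gracefully toward the unbiased baseline model rather than switching biasing on or off.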
-
Publication number: 20220383862

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for cross-lingual speech recognition are disclosed. In one aspect, a method includes the actions of determining a context of a second computing device. The actions further include identifying, by a first computing device, an additional pronunciation for a term of multiple terms. The actions further include adding the additional pronunciation for the term to the lexicon. The actions further include receiving audio data of an utterance. The actions further include generating a transcription of the utterance by using the lexicon that includes the multiple terms, the pronunciation for each of the multiple terms, and the additional pronunciation for the term. The actions further include, after generating the transcription of the utterance, removing the additional pronunciation for the term from the lexicon. The actions further include providing, for output, the transcription.

Type: Application
Filed: August 3, 2022
Publication date: December 1, 2022
Applicant: Google LLC
Inventors: Petar Aleksic, Pedro J Moreno Mengibar
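The add-use-remove lifecycle of the temporary pronunciation can be pictured with a toy lexicon. The Lexicon class, the example word, and the pronunciation strings are all invented; the transcription step, which would use the real recognizer, is elided as a comment.

```python
# Illustrative sketch only: a toy lexicon showing the temporary
# cross-lingual pronunciation being added, used, and removed.

class Lexicon:
    def __init__(self):
        self.prons = {}  # term -> list of pronunciation strings

    def add(self, term, pron):
        self.prons.setdefault(term, []).append(pron)

    def remove(self, term, pron):
        self.prons[term].remove(pron)

lex = Lexicon()
lex.add("Goya", "g OY ah")    # default pronunciation
lex.add("Goya", "g O j a")    # additional, context-derived pronunciation
# ... the recognizer would transcribe the utterance here, with both
# pronunciations available for matching ...
lex.remove("Goya", "g O j a") # removed again after the transcription
```

Keeping the extra pronunciation only for the duration of one utterance avoids permanently polluting the lexicon for contexts where it does not apply.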
-
Patent number: 11495233

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for an automated calling system are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance spoken by a user who is having a telephone conversation with a bot. The actions further include determining a context of the telephone conversation. The actions further include determining a user intent of a first previous portion of the telephone conversation spoken by the user and a bot intent of a second previous portion of the telephone conversation outputted by a speech synthesizer of the bot. The actions further include, based on the audio data of the utterance, the context of the telephone conversation, the user intent, and the bot intent, generating synthesized speech of a reply by the bot to the utterance. The actions further include providing, for output, the synthesized speech.

Type: Grant
Filed: October 20, 2021
Date of Patent: November 8, 2022
Assignee: Google LLC
Inventors: Asaf Aharoni, Arun Narayanan, Nir Shabat, Parisa Haghani, Galen Tsai Chuang, Yaniv Leviathan, Neeraj Gaur, Pedro J. Moreno Mengibar, Rohit Prakash Prabhavalkar, Zhongdi Qu, Austin Severn Waters, Tomer Amiaz, Michiel A.U. Bacchiani
-
Publication number: 20220343915

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for voice recognition. In one aspect, a method includes the actions of receiving a voice input; determining a transcription for the voice input, wherein determining the transcription for the voice input includes, for a plurality of segments of the voice input: obtaining a first candidate transcription for a first segment of the voice input; determining one or more contexts associated with the first candidate transcription; adjusting a respective weight for each of the one or more contexts; and determining a second candidate transcription for a second segment of the voice input based in part on the adjusted weights; and providing the transcription of the plurality of segments of the voice input for output.

Type: Application
Filed: July 11, 2022
Publication date: October 27, 2022
Applicant: Google LLC
Inventors: Petar Aleksic, Pedro J. Moreno Mengibar
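The segment-by-segment context weighting above might look like this in miniature: contexts matched in the first segment's transcription get boosted weights that can then bias recognition of the next segment. The keyword table and the additive update rule are invented for illustration.

```python
# Hypothetical sketch of per-segment context weight adjustment.
# CONTEXT_PHRASES and the additive boost are assumptions.

CONTEXT_PHRASES = {
    "music": {"play", "song", "album"},
    "navigation": {"directions", "traffic"},
}

def matched_contexts(transcript):
    """Return contexts whose keywords appear in the transcript."""
    words = set(transcript.split())
    return [ctx for ctx, keys in CONTEXT_PHRASES.items() if words & keys]

def adjust_weights(weights, first_segment_transcript, boost=0.5):
    """Boost the weight of each context matched by the first segment."""
    for ctx in matched_contexts(first_segment_transcript):
        weights[ctx] = weights.get(ctx, 1.0) + boost
    return weights

# After "play the song" in segment 1, the music context is favored
# when transcribing segment 2.
weights = adjust_weights({"music": 1.0, "navigation": 1.0}, "play the song")
```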
-
Publication number: 20220309340

Abstract: A method for distilling one or more trained teacher automatic speech recognition (ASR) models into a multilingual student model includes receiving a plurality of teacher training examples and a plurality of student training examples. The method also includes training one or more teacher ASR models using the plurality of teacher training examples. Each teacher ASR model is configured to output a respective textual representation of a respective audio input. The method further includes generating a multilingual student ASR model by training the multilingual student ASR model using the plurality of student training examples and distilling the trained one or more teacher ASR models into the multilingual student ASR model using a tunable distillation loss weight. The student ASR model is configured to receive an audio input and output a corresponding textual representation of the received audio input.

Type: Application
Filed: December 7, 2021
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Bhuvana Ramabhadran, Manasa Prasad, Pedro J. Moreno Mengibar, Yun Zhu
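A generic form of the tunable distillation loss can be written down directly. The weighted sum of a teacher-matching term and a label term is the standard knowledge-distillation recipe, shown here with plain-list posteriors; only the tunable weight is taken from the abstract, the rest is an assumption about the specific loss.

```python
# Standard knowledge-distillation loss sketch with a tunable weight.
# Posteriors are plain lists of probabilities over output symbols.

import math

def kl_div(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(labels, q):
    """Cross-entropy of predicted distribution q against one-hot labels."""
    return -sum(li * math.log(qi) for li, qi in zip(labels, q) if li > 0)

def distillation_loss(teacher_post, student_post, labels, weight):
    """weight in [0, 1] trades off mimicking the teacher vs. fitting labels."""
    soft = kl_div(teacher_post, student_post)   # match teacher posteriors
    hard = cross_entropy(labels, student_post)  # fit ground-truth labels
    return weight * soft + (1.0 - weight) * hard
```

Raising the weight pushes the multilingual student toward the monolingual teachers' behavior; lowering it favors the student's own labeled data.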
-
Publication number: 20220310081

Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.

Type: Application
Filed: March 22, 2022
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Neeraj Gaur, Tongzhou Chen, Ehsan Variani, Bhuvana Ramabhadran, Parisa Haghani, Pedro J. Moreno Mengibar
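The second-pass combination above can be sketched as a log-linear rescoring over the N-best list. The abstract only says the overall score is "based on" the three components; the weighted sum, the specific weights, and the example hypotheses here are assumptions.

```python
# Hypothetical log-linear second-pass rescoring over N-best hypotheses.
# The combination weights and example scores are invented.

def overall_score(likelihood, lm_score, prior_score, lm_w=0.3, prior_w=-0.2):
    # A negative prior weight subtracts the hypothesis prior, a common
    # internal-language-model-style correction.
    return likelihood + lm_w * lm_score + prior_w * prior_score

def rescore(hypotheses):
    """hypotheses: list of (text, likelihood, lm_score, prior_score)."""
    return max(hypotheses, key=lambda h: overall_score(*h[1:]))[0]

best = rescore([
    ("recognize speech",    -12.0,  -8.0, -6.0),
    ("wreck a nice beach",  -11.5, -15.0, -9.0),  # higher likelihood, worse LM
])
```

Here the external language model overrules the slightly higher first-pass likelihood of the implausible hypothesis.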
-
Publication number: 20220310074

Abstract: A method for an automated speech recognition (ASR) model that unifies streaming and non-streaming speech recognition includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.

Type: Application
Filed: December 15, 2021
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar
-
Publication number: 20220310056

Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.

Type: Application
Filed: March 16, 2022
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Bhuvana Ramabhadran, Zhehuai Chen, Fadi Biadsy, Pedro J. Moreno Mengibar
-
Publication number: 20220310082

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for implementing contextual grammar selection are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance. The actions include generating a word lattice that includes multiple candidate transcriptions of the utterance and that includes transcription confidence scores. The actions include determining a context of the computing device. The actions include, based on the context of the computing device, identifying grammars that correspond to the multiple candidate transcriptions. The actions include determining, for each of the multiple candidate transcriptions, grammar confidence scores that reflect a likelihood that a respective grammar is a match for a respective candidate transcription. The actions include selecting, from among the candidate transcriptions, a candidate transcription.

Type: Application
Filed: June 16, 2022
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Petar Aleksic, Pedro J. Moreno Mengibar, Leonid Velikovich
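One simple way to picture the selection step above is to combine each lattice candidate's transcription confidence with its grammar-match confidence. Multiplying the two scores is an assumption for illustration; the abstract only says the selection is based on both, and the example candidates are invented.

```python
# Hypothetical selection combining transcription confidence with
# grammar-match confidence; the product rule and data are invented.

def select_transcription(candidates):
    """candidates: list of (text, transcription_conf, grammar_conf)."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]

choice = select_transcription([
    ("set an alarm for 8", 0.70, 0.95),  # strong grammar match
    ("set an alarm 4 8",   0.75, 0.40),  # higher ASR score, weak grammar
])
```

The grammar evidence lets the lower-confidence lattice path win when it fits an expected command pattern for the device's context.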
-
Publication number: 20220310073

Abstract: A method for an automated speech recognition (ASR) model that unifies streaming and non-streaming speech recognition includes receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of the ASR model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypotheses at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.

Type: Application
Filed: December 15, 2021
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Kartik Audhkhasi, Bhuvana Ramabhadran, Tongzhou Chen, Pedro J. Moreno Mengibar
-
Patent number: 11437025

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for cross-lingual speech recognition are disclosed. In one aspect, a method includes the actions of determining a context of a second computing device. The actions further include identifying, by a first computing device, an additional pronunciation for a term of multiple terms. The actions further include adding the additional pronunciation for the term to the lexicon. The actions further include receiving audio data of an utterance. The actions further include generating a transcription of the utterance by using the lexicon that includes the multiple terms, the pronunciation for each of the multiple terms, and the additional pronunciation for the term. The actions further include, after generating the transcription of the utterance, removing the additional pronunciation for the term from the lexicon. The actions further include providing, for output, the transcription.

Type: Grant
Filed: October 4, 2019
Date of Patent: September 6, 2022
Assignee: Google LLC
Inventors: Petar Aleksic, Pedro J. Moreno Mengibar
-
Publication number: 20220277749

Abstract: A method includes receiving a speech input from a user and obtaining context metadata associated with the speech input. The method also includes generating a raw speech recognition result corresponding to the speech input and selecting a list of one or more denormalizers to apply to the generated raw speech recognition result based on the context metadata associated with the speech input. The generated raw speech recognition result includes normalized text. The method also includes denormalizing the generated raw speech recognition result into denormalized text by applying the list of the one or more denormalizers in sequence to the generated raw speech recognition result.

Type: Application
Filed: February 28, 2022
Publication date: September 1, 2022
Applicant: Google LLC
Inventors: Assaf Hurwitz Michaely, Petar Aleksic, Pedro J. Moreno Mengibar
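A context-selected denormalizer pipeline like the one above can be pictured as a chain of plain text-transforming functions chosen by a context key. The specific denormalizers, contexts, and replacement rules here are invented for illustration.

```python
# Toy denormalization pipeline: each denormalizer is a function, and a
# context-keyed table picks which ones apply, in order. All rules invented.

def capitalize(text):
    return text[:1].upper() + text[1:]

def verbalize_numbers(text):
    # Stand-in for real number denormalization ("twenty five" -> "25").
    return text.replace("twenty five", "25")

def add_period(text):
    return text if text.endswith(".") else text + "."

DENORMALIZERS_BY_CONTEXT = {
    "dictation": [verbalize_numbers, capitalize, add_period],
    "voice_search": [verbalize_numbers],  # queries stay lowercase, unpunctuated
}

def denormalize(raw_result, context):
    """Apply the context's denormalizers in sequence to the raw result."""
    for denormalizer in DENORMALIZERS_BY_CONTEXT.get(context, []):
        raw_result = denormalizer(raw_result)
    return raw_result

out = denormalize("i owe you twenty five dollars", "dictation")
```

Choosing the list per context is the point: dictation wants capitalization and punctuation, while a search query should stay close to the raw normalized text.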
-
Patent number: 11417322

Abstract: Methods, systems, and apparatus, including computer programs stored on a computer-readable storage medium, for transliteration for speech recognition training and scoring. In some implementations, language examples are accessed, some of which include words in a first script and words in one or more other scripts. At least portions of some of the language examples are transliterated to the first script to generate a training data set. A language model is generated based on occurrences of the different sequences of words in the training data set in the first script. The language model is used to perform speech recognition for an utterance.

Type: Grant
Filed: December 12, 2019
Date of Patent: August 16, 2022
Assignee: Google LLC
Inventors: Bhuvana Ramabhadran, Min Ma, Pedro J. Moreno Mengibar, Jesse Emond, Brian E. Roark
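The core idea above, unifying a mixed-script corpus into one script before counting word occurrences for the language model, can be sketched in a few lines. The one-entry transliteration table is invented; real systems use full transliteration models rather than a lookup.

```python
# Sketch: transliterate mixed-script training text to a single script,
# then count word occurrences for language modeling. The tiny
# Devanagari-to-Latin table is an invented placeholder.

from collections import Counter

TRANSLIT = {"नमस्ते": "namaste"}  # illustrative single-entry table

def to_latin(sentence):
    """Map each word through the transliteration table where possible."""
    return " ".join(TRANSLIT.get(word, word) for word in sentence.split())

corpus = ["hello namaste", "नमस्ते world"]
unified = [to_latin(s) for s in corpus]
unigrams = Counter(word for s in unified for word in s.split())
```

After transliteration, both spellings of the greeting count toward the same word, so the language model sees its true frequency instead of two fragmented counts.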
-
Patent number: 11410660

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for voice recognition. In one aspect, a method includes the actions of receiving a voice input; determining a transcription for the voice input, wherein determining the transcription for the voice input includes, for a plurality of segments of the voice input: obtaining a first candidate transcription for a first segment of the voice input; determining one or more contexts associated with the first candidate transcription; adjusting a respective weight for each of the one or more contexts; and determining a second candidate transcription for a second segment of the voice input based in part on the adjusted weights; and providing the transcription of the plurality of segments of the voice input for output.

Type: Grant
Filed: April 1, 2020
Date of Patent: August 9, 2022
Assignee: Google LLC
Inventors: Petar Aleksic, Pedro J. Moreno Mengibar
-
Patent number: 11386889

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for implementing contextual grammar selection are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance. The actions include generating a word lattice that includes multiple candidate transcriptions of the utterance and that includes transcription confidence scores. The actions include determining a context of the computing device. The actions include, based on the context of the computing device, identifying grammars that correspond to the multiple candidate transcriptions. The actions include determining, for each of the multiple candidate transcriptions, grammar confidence scores that reflect a likelihood that a respective grammar is a match for a respective candidate transcription. The actions include selecting, from among the candidate transcriptions, a candidate transcription.

Type: Grant
Filed: November 27, 2019
Date of Patent: July 12, 2022
Assignee: Google LLC
Inventors: Petar Aleksic, Pedro J. Moreno Mengibar, Leonid Velikovich
-
Patent number: 11335324

Abstract: A method for training a speech conversion model personalized for a target speaker with atypical speech includes obtaining a plurality of transcriptions in a set of spoken training utterances and obtaining a plurality of unspoken training text utterances. Each spoken training utterance is spoken by a target speaker associated with atypical speech and includes a corresponding transcription paired with a corresponding non-synthetic speech representation. The method also includes adapting, using the set of spoken training utterances, a text-to-speech (TTS) model to synthesize speech in a voice of the target speaker and that captures the atypical speech. For each unspoken training text utterance, the method also includes generating, as output from the adapted TTS model, a synthetic speech representation that includes the voice of the target speaker and that captures the atypical speech. The method also includes training the speech conversion model based on the synthetic speech representations.

Type: Grant
Filed: August 31, 2020
Date of Patent: May 17, 2022
Assignee: Google LLC
Inventors: Fadi Biadsy, Liyang Jiang, Pedro J. Moreno Mengibar, Andrew Rosenberg
-
Publication number: 20220130374

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable medium, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihoods of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.

Type: Application
Filed: January 10, 2022
Publication date: April 28, 2022
Applicant: Google LLC
Inventors: Zhifeng Chen, Bo Li, Eugene Weinstein, Yonghui Wu, Pedro J. Moreno Mengibar, Ron J. Weiss, Khe Chai Sim, Tara N. Sainath, Patrick An Phu Nguyen