Patents by Inventor Tara N. Sainath

Tara N. Sainath has filed for patents to protect the following inventions. This listing includes both pending patent applications and patents already granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12361927
    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
    Type: Grant
    Filed: May 31, 2024
    Date of Patent: July 15, 2025
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang
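    Sketch: a minimal numpy illustration of the constrained-alignment idea above, building a mask that restricts an attention head to frames near each word piece's ground truth alignment. The window size, frame indices, and function names are hypothetical, not from the patent.

      import numpy as np

      def constrained_attention_mask(num_frames, wordpiece_frames, window=2):
          """0/1 mask limiting each word piece's attention to frames near
          its constrained alignment (hypothetical tolerance window)."""
          mask = np.zeros((len(wordpiece_frames), num_frames))
          for i, frame in enumerate(wordpiece_frames):
              lo, hi = max(0, frame - window), min(num_frames, frame + window + 1)
              mask[i, lo:hi] = 1.0  # attention allowed only inside the window
          return mask

      # Toy example: a word spans frames 3..7, so its beginning word piece
      # is constrained near frame 3 and its ending word piece near frame 7.
      print(constrained_attention_mask(num_frames=10, wordpiece_frames=[3, 7]))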
  • Publication number: 20250201236
Abstract: A method for implementing hierarchical recurrent adapters for efficient multi-task adaptation of large speech models includes obtaining an automatic speech recognition (ASR) model pre-trained on an initial training data set, the ASR model including a plurality of layers. The method includes augmenting the ASR model with a recurrent adapter including a controller and a plurality of adapter heads, wherein the controller and the plurality of adapter heads are shared across each layer of the plurality of layers of the ASR model. The method also includes receiving an adaptation training data set including a plurality of spoken utterances, each respective spoken utterance paired with a respective transcription of the respective spoken utterance. The method includes adapting the ASR model augmented with the recurrent adapter to the adaptation training data set while parameters of the ASR model are frozen.
    Type: Application
    Filed: October 28, 2024
    Publication date: June 19, 2025
    Applicant: Google LLC
    Inventors: Tsendsuren Munkhdalai, Khe Chai Sim, Tara N. Sainath, Pedro J. Moreno Mengibar
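    Sketch: a numpy toy of a recurrent adapter whose controller and adapter heads are shared across the layers of a frozen backbone. Shapes, initializations, and the softmax controller are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(0)
      d, num_heads = 8, 4

      W_ctrl = rng.normal(size=(d, num_heads))           # shared controller
      heads = rng.normal(size=(num_heads, d, d)) * 0.01  # shared adapter heads

      def recurrent_adapter(h):
          """Mix the adapter heads with controller weights and apply the
          result residually to a layer activation h (backbone frozen)."""
          gate = np.exp(h @ W_ctrl)
          gate /= gate.sum()                             # softmax over heads
          delta = sum(g * (h @ A) for g, A in zip(gate, heads))
          return h + delta

      x = rng.normal(size=d)
      for _ in range(3):           # the same adapter is reused at each layer
          x = recurrent_adapter(x)
      print(x.shape)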
  • Publication number: 20250182753
Abstract: A method includes, for each respective audio segment in a series of audio segments: generating, using a speech recognition model, multiple candidate speech recognition hypotheses for the respective audio segment and concatenating each respective candidate speech recognition hypothesis from the multiple candidate speech recognition hypotheses with a previously generated transcription corresponding to N prior audio segments. Each respective candidate speech recognition hypothesis includes a corresponding probability. For each respective audio segment, the method also includes rescoring, using a large language model (LLM), the corresponding probability of each respective candidate speech recognition hypothesis based on the concatenation and generating a transcription of the respective audio segment by selecting the candidate speech recognition hypothesis with the highest rescored probability.
    Type: Application
    Filed: November 6, 2024
    Publication date: June 5, 2025
    Applicant: Google LLC
    Inventors: Wenqian Ronny Huang, Cyril Allauzen, Tongzhou Chen, Shuo-yiin Chang, Tara N. Sainath
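    Sketch: a pure-Python outline of the rescoring loop above. llm_log_prob stands in for a real LLM scorer; the toy scoring function and all names here are assumptions for illustration.

      import math

      def rescore(hypotheses, history, llm_log_prob):
          """Concatenate each ASR hypothesis with the prior transcription,
          rescore with the LLM, and keep the highest-scoring one."""
          best, best_score = None, -math.inf
          for text, asr_log_prob in hypotheses:
              score = asr_log_prob + llm_log_prob(history + " " + text)
              if score > best_score:
                  best, best_score = text, score
          return best

      # Toy stand-in for an LLM: favors shorter, more repetitive continuations.
      fake_llm = lambda s: -len(set(s.split()))
      print(rescore([("recognize speech", -1.2), ("wreck a nice beach", -1.0)],
                    "it is easy to recognize", fake_llm))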
  • Patent number: 12315497
    Abstract: A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and decoding, using a speech recognition joint network, the corresponding audio encoding into a probability distribution over possible output labels. At each of the plurality of time steps, the method also includes determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance includes a query intended for a digital assistant.
    Type: Grant
    Filed: March 20, 2023
    Date of Patent: May 27, 2025
    Assignee: Google LLC
    Inventors: Shuo-yiin Chang, Guru Prakash Arumugam, Zelin Wu, Tara N. Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
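    Sketch: the intended-query decision reduced to its simplest form, a binary classifier over the label history representation. The sigmoid weights and threshold are placeholders, not the patent's IQ joint network.

      import numpy as np

      rng = np.random.default_rng(0)
      W_iq = rng.normal(size=8)  # illustrative weights; learned in practice

      def intended_query_decision(label_history_repr, threshold=0.5):
          """Is the utterance a query intended for the digital assistant?"""
          p = 1.0 / (1.0 + np.exp(-label_history_repr @ W_iq))  # sigmoid
          return p >= threshold, p

      decision, p = intended_query_decision(rng.normal(size=8))
      print(decision, round(float(p), 3))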
  • Publication number: 20250140239
Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a probability distribution indicating whether the corresponding time step corresponds to a pause or an end of speech.
    Type: Application
    Filed: January 6, 2025
    Publication date: May 1, 2025
    Applicant: Google LLC
    Inventors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Trevor Strohman, Chao Zhang
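    Sketch: a joint network whose output distribution reserves labels for pause and end of speech, so the same softmax that emits word pieces can also signal turn-taking. The label inventory and weights are illustrative.

      import numpy as np

      rng = np.random.default_rng(0)
      d, vocab = 8, 6                # e.g. 4 word pieces + <pause> + <eos>
      W_joint = rng.normal(size=(2 * d, vocab))

      def joint(audio_enc, label_hidden):
          """Fuse the encoder output and prediction-network hidden state
          into one distribution; the last two labels act as <pause>/<eos>."""
          logits = np.concatenate([audio_enc, label_hidden]) @ W_joint
          probs = np.exp(logits - logits.max())
          return probs / probs.sum()

      p = joint(rng.normal(size=d), rng.normal(size=d))
      print("p(<pause>)=%.3f  p(<eos>)=%.3f" % (p[-2], p[-1]))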
  • Publication number: 20250095637
    Abstract: A method includes receiving a textual prompt in a first language and obtaining a fine-tuned prompt embedding configured to guide a large language model (LLM) to generate text in a target language from textual prompts in the first language. The method also includes processing, using the LLM, the textual prompt conditioned on the fine-tuned prompt embedding to generate output text in the target language and concatenating the textual prompt and the generated output text to provide an unspoken textual utterance. The method also includes training a multilingual automatic speech recognition (ASR) model to learn how to recognize speech in the target language by injecting the unspoken textual utterance into a text encoder associated with the multilingual ASR model.
    Type: Application
    Filed: September 16, 2024
    Publication date: March 20, 2025
    Applicant: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Bo Li, Yu Zhang, Yong Cheng, Tao Wang, Yujing Zhang, Frederick Liu
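    Sketch: prompt tuning in miniature. A fine-tuned soft prompt is prepended to the frozen LLM's input embeddings so generation is steered toward the target language. The embedding table and shapes are made-up stand-ins.

      import numpy as np

      rng = np.random.default_rng(0)
      d, prompt_len = 8, 4
      prompt_emb = rng.normal(size=(prompt_len, d))  # the fine-tuned prompt
      embed = rng.normal(size=(100, d))              # frozen token embeddings

      def condition_on_prompt(token_ids):
          """Prepend the soft prompt to the textual prompt's embeddings
          before the (frozen) LLM processes them."""
          return np.concatenate([prompt_emb, embed[token_ids]], axis=0)

      inputs = condition_on_prompt(np.array([5, 17, 42]))
      print(inputs.shape)  # (prompt_len + 3, d): soft prompt, then text tokens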
  • Publication number: 20250095634
    Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
    Type: Application
    Filed: December 2, 2024
    Publication date: March 20, 2025
    Applicant: Google LLC
    Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
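    Sketch: the per-frame classification step, mapping frame posteriors to the four classes named in the abstract and flagging end of utterance once final silence appears. The decision rule is a deliberate simplification.

      import numpy as np

      CLASSES = ["speech", "initial_silence", "intermediate_silence",
                 "final_silence"]

      def classify_frames(frame_probs):
          """Label each acoustic frame and mark <eou> at the first frame
          classified as final silence (illustrative rule)."""
          labels = [CLASSES[i] for i in frame_probs.argmax(axis=1)]
          eou_at = labels.index("final_silence") if "final_silence" in labels else None
          return labels, eou_at

      rng = np.random.default_rng(0)
      labels, eou_at = classify_frames(rng.random((6, len(CLASSES))))
      print(labels, " <eou> at frame:", eou_at)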
  • Patent number: 12254865
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
    Type: Grant
    Filed: January 20, 2024
    Date of Patent: March 18, 2025
    Assignee: Google LLC
    Inventors: Zhifeng Chen, Bo Li, Eugene Weinstein, Yonghui Wu, Pedro J. Moreno Mengibar, Ron J. Weiss, Khe Chai Sim, Tara N. Sainath, Patrick An Phu Nguyen
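    Sketch: the core of cluster adaptive training, with one set of shared cluster bases interpolated by per-language weights. The mixing weights shown are invented for the example.

      import numpy as np

      rng = np.random.default_rng(0)
      d, num_clusters = 8, 3
      bases = rng.normal(size=(num_clusters, d, d))  # shared cluster bases

      def cluster_adapted_weight(lang_weights):
          """Interpolate the cluster bases with a language's weights to get
          that language's effective layer weight."""
          return np.tensordot(np.asarray(lang_weights), bases, axes=1)

      W_a = cluster_adapted_weight([0.7, 0.2, 0.1])  # hypothetical language A
      W_b = cluster_adapted_weight([0.1, 0.3, 0.6])  # hypothetical language B
      print(W_a.shape, float(np.abs(W_a - W_b).mean()))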
  • Patent number: 12249317
Abstract: A method includes receiving audio features and generating a latent speech representation based on the audio features. The method also includes generating a target quantized vector token and a target token index for a corresponding latent speech representation. The method also includes generating a contrastive context vector for a corresponding unmasked or masked latent speech representation and deriving a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token. The method also includes generating a high-level context vector based on the contrastive context vector and, for each high-level context vector, learning to predict the target token index at the corresponding time step using a cross-entropy loss based on the target token index.
    Type: Grant
    Filed: September 6, 2022
    Date of Patent: March 11, 2025
    Assignee: Google LLC
    Inventors: Bo Li, Junwen Bai, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath
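    Sketch: the two losses in the abstract, a contrastive (InfoNCE-style) term matching each context vector to its quantized target, plus a cross-entropy term on the target token index. Feature values and the vocabulary size are placeholders.

      import numpy as np

      rng = np.random.default_rng(0)
      T, d, vocab = 5, 8, 4
      context = rng.normal(size=(T, d))            # contrastive context vectors
      targets = rng.normal(size=(T, d))            # target quantized vector tokens
      target_ids = rng.integers(0, vocab, size=T)  # target token indices

      def contrastive_loss(c, pos, distractors, temp=0.1):
          """The context vector should match its own target against
          distractors, scored by scaled cosine similarity."""
          cand = np.vstack([pos[None], distractors])
          sims = cand @ c / (np.linalg.norm(cand, axis=1) * np.linalg.norm(c) * temp)
          return -sims[0] + np.log(np.exp(sims).sum())

      l_con = np.mean([contrastive_loss(context[t], targets[t],
                                        np.delete(targets, t, axis=0))
                       for t in range(T)])

      # Cross-entropy on target token indices (logits are random stand-ins
      # for the high-level context vectors' predictions).
      logits = rng.normal(size=(T, vocab))
      logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
      l_ce = -logp[np.arange(T), target_ids].mean()
      print(round(float(l_con), 3), round(float(l_ce), 3))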
  • Publication number: 20250078815
    Abstract: A method includes obtaining a plurality of training samples that each include a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method also includes fine-tuning, using quantization and sparsity aware training with native integer operations, a pre-trained automatic speech recognition (ASR) model on the plurality of training samples. Here, the pre-trained ASR model includes a plurality of weights and the fine-tuning includes pruning one or more weights of the plurality of weights using a sparsity mask and quantizing each weight of the plurality of weights based on an integer with a fixed-bit width. The method also includes providing the fine-tuned ASR model to a user device.
    Type: Application
    Filed: September 5, 2024
    Publication date: March 6, 2025
    Applicant: Google LLC
    Inventors: Shaojin Ding, David Qiu, David Rim, Amir Yazdanbakhsh, Yanzhang He, Zhonglin Han, Rohit Prakash Prabhavalkar, Weiran Wang, Bo Li, Jian Li, Tara N. Sainath, Shivani Agrawal, Oleg Rybakov
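    Sketch: one forward-pass view of quantization- and sparsity-aware fine-tuning: a magnitude mask prunes weights, and a fake-quantization step rounds the survivors onto a fixed-bit integer grid. The 50% sparsity and symmetric per-tensor scale are assumptions, not the patent's exact scheme.

      import numpy as np

      rng = np.random.default_rng(0)
      W = rng.normal(size=(4, 4))

      # Sparsity mask: prune the smallest-magnitude half of the weights.
      mask = (np.abs(W) >= np.median(np.abs(W))).astype(W.dtype)

      def fake_quantize(w, bits=8):
          """Round onto a symmetric fixed-bit integer grid (per-tensor)."""
          scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
          return np.round(w / scale) * scale

      W_eff = fake_quantize(W * mask)    # weights the fine-tuned model uses
      print(float((W_eff == 0).mean()))  # roughly half the entries pruned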
  • Publication number: 20250078830
    Abstract: A method includes receiving a sequence of acoustic frames characterizing a spoken utterance in a particular native language. The method also includes generating a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames by a causal encoder that includes an initial stack of multi-head attention layers. The method also includes generating a second higher order feature representation for a corresponding first higher order feature representation by a non-causal encoder that includes a final stack of multi-head attention layers. The method also includes receiving, as input at each corresponding language-dependent adapter (LDA) module, a language ID vector identifying the particular native language to activate corresponding language-dependent weights specific to the particular native language. The method also includes generating a first probability distribution over possible speech recognition hypotheses by a decoder.
    Type: Application
    Filed: September 6, 2024
    Publication date: March 6, 2025
    Applicant: Google LLC
    Inventors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman
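    Sketch: a language-dependent adapter (LDA) module that uses a one-hot language ID vector to activate that language's weights and apply them residually. The sizes and residual form are illustrative guesses.

      import numpy as np

      rng = np.random.default_rng(0)
      d, num_langs = 8, 3
      lda_weights = rng.normal(size=(num_langs, d, d)) * 0.01  # per-language

      def lda_module(h, lang_id_vector):
          """Select the weights of the identified language and adapt the
          activation h without touching the shared encoder."""
          W = np.tensordot(lang_id_vector, lda_weights, axes=1)
          return h + h @ W

      h = rng.normal(size=d)
      print(lda_module(h, np.array([0.0, 1.0, 0.0])).shape)  # 2nd language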
  • Patent number: 12211509
    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
    Type: Grant
    Filed: August 19, 2022
    Date of Patent: January 28, 2025
    Assignee: Google LLC
    Inventors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang
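    Sketch: fusing the encoder's feature representation with the prediction network's dense representation through a gate followed by bilinear pooling, as the abstract outlines. Weight shapes and the exact stacking order are assumptions.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 8
      W_gate = rng.normal(size=(2 * d, d))
      W_bilinear = rng.normal(size=(d, d, d)) * 0.1

      def fuse(audio_enc, dense_repr):
          """Gate the two representations, then bilinear-pool so every
          pair of coordinates can interact."""
          gate = 1 / (1 + np.exp(-np.concatenate([audio_enc, dense_repr]) @ W_gate))
          gated = gate * audio_enc + (1 - gate) * dense_repr
          return np.einsum("i,ijk,j->k", gated, W_bilinear, dense_repr)

      print(fuse(rng.normal(size=d), rng.normal(size=d)).shape)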
  • Patent number: 12190869
    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
    Type: Grant
    Filed: September 29, 2022
    Date of Patent: January 7, 2025
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
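    Sketch: linear attention in the generic kernel-feature form, computing phi(Q)(phi(K)^T V) so cost grows linearly with sequence length instead of quadratically. This is a textbook formulation with an exponential feature map, not the patent's RNN Attention-Performer module itself.

      import numpy as np

      rng = np.random.default_rng(0)
      T, d = 6, 8

      def linear_attention(Q, K, V):
          """Positive feature map phi lets keys/values be summarized once,
          avoiding the T x T attention matrix."""
          phi = lambda X: np.exp(X - X.max(axis=-1, keepdims=True))
          KV = phi(K).T @ V                    # (d, d) summary of keys/values
          norm = phi(Q) @ phi(K).sum(axis=0)   # per-query normalizer
          return (phi(Q) @ KV) / norm[:, None]

      Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
      print(linear_attention(Q, K, V).shape)   # (T, d)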
  • Patent number: 12183322
    Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
    Type: Grant
    Filed: September 22, 2022
    Date of Patent: December 31, 2024
    Assignee: Google LLC
    Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
  • Publication number: 20240428786
    Abstract: A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first pass transducer decoder, a first pass speech recognition hypothesis for a corresponding first higher order feature representation and generating, by a text encoder, a text encoding for a corresponding first pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a second pass transducer decoder, a second pass speech recognition hypothesis using a corresponding second higher order feature representation and a corresponding text encoding.
    Type: Application
    Filed: September 6, 2024
    Publication date: December 26, 2024
    Applicant: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman
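    Sketch: the two-pass data flow in the abstract, with every network replaced by a placeholder function purely to show how the pieces connect.

      def two_pass(frames, enc1, dec1, text_enc, enc2, dec2):
          h1 = [enc1(f) for f in frames]   # first higher order features
          hyp1 = dec1(h1)                  # first pass transducer hypothesis
          t = text_enc(hyp1)               # text encoding of that hypothesis
          h2 = [enc2(h) for h in h1]       # second higher order features
          return dec2(h2, t)               # second pass uses both

      # Toy stand-ins, just to trace the flow end to end.
      hyp = two_pass([1.0, 2.0], enc1=lambda f: f + 1, dec1=lambda hs: "draft",
                     text_enc=lambda s: len(s), enc2=lambda h: h * 2,
                     dec2=lambda hs, t: "final(%s, %s)" % (sum(hs), t))
      print(hyp)  # final(10.0, 5)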
  • Publication number: 20240420686
    Abstract: A method for performing speech recognition using sequence-to-sequence models includes receiving audio data for an utterance and providing features indicative of acoustic characteristics of the utterance as input to an encoder. The method also includes processing an output of the encoder using an attender to generate a context vector, generating speech recognition scores using the context vector and a decoder trained using a training process, and generating a transcription for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
    Type: Application
    Filed: August 26, 2024
    Publication date: December 19, 2024
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Zhifeng Chen, Bo Li, Chung-Cheng Chiu, Kanury Kanishka Rao, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Michiel A. U. Bacchiani, Tara N. Sainath, Jan Kazimierz Chorowski, Anjuli Patricia Kannan, Ekaterina Gonina, Patrick An Phu Nguyen
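    Sketch: a single attender step from the abstract, scoring encoder outputs against the decoder state and returning the context vector. Dot-product scoring is an assumption; the patent's attender may differ.

      import numpy as np

      rng = np.random.default_rng(0)
      T, d = 6, 8

      def attend(encoder_outputs, decoder_state):
          """Weight encoder frames by softmax-normalized scores and return
          the weighted sum as the context vector."""
          scores = encoder_outputs @ decoder_state
          w = np.exp(scores - scores.max())
          w /= w.sum()
          return w @ encoder_outputs       # context vector, shape (d,)

      print(attend(rng.normal(size=(T, d)), rng.normal(size=d)).shape)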
  • Publication number: 20240420687
Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transducer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.
    Type: Application
    Filed: August 26, 2024
    Publication date: December 19, 2024
    Inventors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Jean Bruguier, Shuo-yiin Chang, Wei Li
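    Sketch: the shared-encoder two-pass shape, streaming RNN-T-style candidates per chunk and then revising the last one with an LAS-style second pass. All functions are placeholders for the learned components.

      def streaming_two_pass(audio_chunks, shared_enc, rnnt_decode, las_rescore):
          encodings, partials = [], []
          for chunk in audio_chunks:
              encodings.append(shared_enc(chunk))      # one encoder, both passes
              partials.append(rnnt_decode(encodings))  # streaming candidates
          return las_rescore(encodings, partials[-1])  # revise with full context

      final = streaming_two_pass(
          ["chunk1", "chunk2"],
          shared_enc=lambda c: c.upper(),
          rnnt_decode=lambda encs: "partial:%d" % len(encs),
          las_rescore=lambda encs, hyp: hyp.replace("partial", "final"))
      print(final)  # final:2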
  • Publication number: 20240379095
Abstract: A method includes receiving audio data encoding an utterance and obtaining a set of bias phrases corresponding to a context of the utterance. Each bias phrase includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data to generate an output from the speech recognition model. The speech recognition model includes a first encoder configured to receive the acoustic features, a first attention module, a bias encoder configured to receive data indicating the obtained set of bias phrases, a bias attention module, and a decoder configured to determine likelihoods of sequences of speech elements based on output of the first attention module and output of the bias attention module. The method also includes determining a transcript for the utterance based on the likelihoods of sequences of speech elements.
    Type: Application
    Filed: July 23, 2024
    Publication date: November 14, 2024
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath
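    Sketch: the bias-attention step, attending over encoded bias phrases so context (e.g. contact names) can shift the decoder's likelihoods. The encodings here are random placeholders for a real bias encoder's output.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 8

      def bias_attention(decoder_state, bias_encodings):
          """Softmax-attend over bias phrase encodings and return the bias
          context vector fed to the decoder."""
          scores = bias_encodings @ decoder_state
          w = np.exp(scores - scores.max())
          w /= w.sum()
          return w @ bias_encodings

      bias_encodings = rng.normal(size=(3, d))  # e.g. three contact names
      print(bias_attention(rng.normal(size=d), bias_encodings).shape)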
  • Patent number: 12118988
    Abstract: A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first pass transducer decoder, a first pass speech recognition hypothesis for a corresponding first higher order feature representation and generating, by a text encoder, a text encoding for a corresponding first pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a second pass transducer decoder, a second pass speech recognition hypothesis using a corresponding second higher order feature representation and a corresponding text encoding.
    Type: Grant
    Filed: September 19, 2022
    Date of Patent: October 15, 2024
    Assignee: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman
  • Patent number: 12106749
    Abstract: A method for performing speech recognition using sequence-to-sequence models includes receiving audio data for an utterance and providing features indicative of acoustic characteristics of the utterance as input to an encoder. The method also includes processing an output of the encoder using an attender to generate a context vector, generating speech recognition scores using the context vector and a decoder trained using a training process, and generating a transcription for the utterance using word elements selected based on the speech recognition scores. The transcription is provided as an output of the ASR system.
    Type: Grant
    Filed: September 20, 2021
    Date of Patent: October 1, 2024
    Assignee: Google LLC
Inventors: Rohit Prakash Prabhavalkar, Zhifeng Chen, Bo Li, Chung-Cheng Chiu, Kanury Kanishka Rao, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Michiel A. U. Bacchiani, Tara N. Sainath, Jan Kazimierz Chorowski, Anjuli Patricia Kannan, Ekaterina Gonina, Patrick An Phu Nguyen