Patents by Inventor Tara N. Sainath

Tara N. Sainath has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Contextual Biasing for Speech Recognition

Publication number: 20230274736

Abstract: A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.

Type: Application

Filed: May 4, 2023

Publication date: August 31, 2023

Applicant: Google LLC

Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath, Antoine Jean Bruguier
Compressed recurrent neural network models

Patent number: 11741366

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for implementing long-short term memory layers with compressed gating functions. One of the systems includes a first long short-term memory (LSTM) layer, wherein the first LSTM layer is configured to, for each of the plurality of time steps, generate a new layer state and a new layer output by applying a plurality of gates to a current layer input, a current layer state, and a current layer output, each of the plurality of gates being configured to, for each of the plurality of time steps, generate a respective intermediate gate output vector by multiplying a gate input vector and a gate parameter matrix. The gate parameter matrix for at least one of the plurality of gates is a structured matrix or is defined by a compressed parameter matrix and a projection matrix.

Type: Grant

Filed: December 23, 2019

Date of Patent: August 29, 2023

Assignee: Google LLC

Inventors: Tara N. Sainath, Vikas Sindhwani
Convolutional, long short-term memory, fully connected deep neural networks

Patent number: 11715486

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.

Type: Grant

Filed: December 31, 2019

Date of Patent: August 1, 2023

Assignee: Google LLC

Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
MINIMUM WORD ERROR RATE TRAINING FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS

Publication number: 20230237995

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.

Type: Application

Filed: March 31, 2023

Publication date: July 27, 2023

Applicant: Google LLC

Inventors: Rohit Prakash Prabhavalkar, Tara N. Sainath, Younghui Wu, Patrick An Phu Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan
Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models

Publication number: 20230237993

Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.

Type: Application

Filed: October 1, 2021

Publication date: July 27, 2023

Inventors: Jiahui Yu, Ruoming Pang, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Hu
Emitting Word Timings with End-to-End Models

Publication number: 20230206907

Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.

Type: Application

Filed: February 9, 2023

Publication date: June 29, 2023

Applicant: Google LLC

Inventors: Tara N Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang
Attention-Based Joint Acoustic and Text On-Device End-to-End Model

Publication number: 20230186901

Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.

Type: Application

Filed: February 10, 2023

Publication date: June 15, 2023

Applicant: Google LLC

Inventors: Tara N. Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman
Deliberation Model-Based Two-Pass End-To-End Speech Recognition

Publication number: 20230186907

Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector.

Type: Application

Filed: February 6, 2023

Publication date: June 15, 2023

Applicant: Google LLC

Inventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
Contextual biasing for speech recognition

Patent number: 11664021

Abstract: A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.

Type: Grant

Filed: December 9, 2021

Date of Patent: May 30, 2023

Assignee: Google LLC

Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath, Antoine Jean Bruguier
Minimum word error rate training for attention-based sequence-to-sequence models

Patent number: 11646019

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.

Type: Grant

Filed: July 27, 2021

Date of Patent: May 9, 2023

Assignee: Google LLC

Inventors: Rohit Prakash Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick An Phu Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Patricia Kannan
Optimizing Inference Performance for Conformer

Publication number: 20230130634

Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.

Type: Application

Filed: September 29, 2022

Publication date: April 27, 2023

Applicant: Google LLC

Inventors: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
Deliberation of Streaming RNN-Transducer by Non-Autoregressive Decoding

Publication number: 20230107248

Abstract: A method includes receiving an initial alignment for a candidate hypothesis generated by a transducer decoder model during a first pass. Here, the candidate hypothesis corresponds to a candidate transcription for an utterance and the initial alignment for the candidate hypothesis includes a sequence of output labels. Each output label corresponds to a blank symbol or a hypothesized sub-word unit. The method also include receiving a subsequent sequence of audio encodings characterizing the utterance. During an initial refinement step, the method also includes generating a new alignment for a rescored sequence of output labels using a non-autoregressive decoder. The non-autoregressive decoder is configured to receive the initial alignment for the candidate hypothesis and the subsequent sequence of audio encodings.

Type: Application

Filed: September 16, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Weiran Wang, Ke Hu, Tara N. Sainath
Disfluency Detection Models for Natural Conversational Voice Systems

Publication number: 20230107450

Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the dense representation generated by the prediction network, a probability distribution that the corresponding time step corresponds to a pause and an end of speech.

Type: Application

Filed: August 26, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Trevor Strohman, Chao Zhang
Transducer-Based Streaming Deliberation for Cascaded Encoders

Publication number: 20230109407

Abstract: A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first pass transducer decoder, a first pass speech recognition hypothesis for a corresponding first higher order feature representation and generating, by a text encoder, a text encoding for a corresponding first pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a second pass transducer decoder, a second pass speech recognition hypothesis using a corresponding second higher order feature representation and a corresponding text encoding.

Type: Application

Filed: September 19, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman
Fusion of Acoustic and Text Representations in RNN-T

Publication number: 20230107695

Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.

Type: Application

Filed: August 19, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang
Joint Unsupervised and Supervised Training for Multilingual ASR

Publication number: 20230104228

Abstract: A method includes receiving audio features and generating a latent speech representation based on the audio features. The method also includes generating a target quantized vector token and a target token index for a corresponding latent speech representation. The method also includes generating a contrastive context vector for a corresponding unmasked or masked latent speech representation and deriving a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token. The method also include generating a high-level context vector based on the contrastive context vector and, for each high-level context vector, learning to predict the target token index at the corresponding time step using a cross-entropy loss based on the target token index.

Type: Application

Filed: September 6, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Bo Li, Junwen Bai, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath
Language Agnostic Multilingual End-To-End Streaming On-Device ASR System

Publication number: 20230108275

Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.

Type: Application

Filed: September 22, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
Predicting Word Boundaries for On-Device Batching of End-To-End Speech Recognition Models

Publication number: 20230107493

Abstract: A method includes receiving a sequence of input audio frames corresponding to an utterance captured by a user device, the utterance including a plurality of words. For each input audio frame, the method includes predicting, using a word boundary detection model configured receive the sequence of input audio frames as input, whether the input audio frame is a word boundary. The method includes batching the input audio frames into a plurality of batches based on the input audio frames predicted as word boundaries, wherein each batch includes a corresponding plurality of batched input audio frames. For each of the plurality of batches, the method includes processing, using a speech recognition model, the corresponding plurality of batched input audio frames in parallel to generate a speech recognition result.

Type: Application

Filed: September 21, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Shaan Jagdeep Patrick Bijwadia, Tara N. Sainath, Jiahui Yu, Shuo-yiin Chang, Yangzhang He
Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Publication number: 20230096821

Abstract: A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples, and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the plurality of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or appear in the transcriptions from the set of training utterances less than a threshold number of times. The method further includes training the external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.

Type: Application

Filed: December 13, 2021

Publication date: March 30, 2023

Inventors: Ronny Huang, Tara N. Sainath
Attention-based joint acoustic and text on-device end-to-end model

Patent number: 11594212

Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.

Type: Grant

Filed: January 21, 2021

Date of Patent: February 28, 2023

Assignee: Google LLC

Inventors: Tara N. Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman

prev 1 2 3 4 5 6 … next