Patents by Inventor Tara N. Sainath

Tara N. Sainath has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Publication number: 20230274736
    Abstract: A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.
    Type: Application
    Filed: May 4, 2023
    Publication date: August 31, 2023
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath, Antoine Jean Bruguier
  • Patent number: 11741366
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for implementing long-short term memory layers with compressed gating functions. One of the systems includes a first long short-term memory (LSTM) layer, wherein the first LSTM layer is configured to, for each of the plurality of time steps, generate a new layer state and a new layer output by applying a plurality of gates to a current layer input, a current layer state, and a current layer output, each of the plurality of gates being configured to, for each of the plurality of time steps, generate a respective intermediate gate output vector by multiplying a gate input vector and a gate parameter matrix. The gate parameter matrix for at least one of the plurality of gates is a structured matrix or is defined by a compressed parameter matrix and a projection matrix.
    Type: Grant
    Filed: December 23, 2019
    Date of Patent: August 29, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Vikas Sindhwani
  • Patent number: 11715486
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
    Type: Grant
    Filed: December 31, 2019
    Date of Patent: August 1, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
  • Publication number: 20230237995
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
    Type: Application
    Filed: March 31, 2023
    Publication date: July 27, 2023
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Tara N. Sainath, Younghui Wu, Patrick An Phu Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan
  • Publication number: 20230237993
    Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
    Type: Application
    Filed: October 1, 2021
    Publication date: July 27, 2023
    Inventors: Jiahui Yu, Ruoming Pang, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Hu
  • Publication number: 20230206907
    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
    Type: Application
    Filed: February 9, 2023
    Publication date: June 29, 2023
    Applicant: Google LLC
    Inventors: Tara N Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang
  • Publication number: 20230186901
    Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
    Type: Application
    Filed: February 10, 2023
    Publication date: June 15, 2023
    Applicant: Google LLC
    Inventors: Tara N. Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman
  • Publication number: 20230186907
    Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector.
    Type: Application
    Filed: February 6, 2023
    Publication date: June 15, 2023
    Applicant: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
  • Patent number: 11664021
    Abstract: A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.
    Type: Grant
    Filed: December 9, 2021
    Date of Patent: May 30, 2023
    Assignee: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath, Antoine Jean Bruguier
  • Patent number: 11646019
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses N-best lists of decoded hypotheses, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
    Type: Grant
    Filed: July 27, 2021
    Date of Patent: May 9, 2023
    Assignee: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick An Phu Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Patricia Kannan
  • Publication number: 20230130634
    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
    Type: Application
    Filed: September 29, 2022
    Publication date: April 27, 2023
    Applicant: Google LLC
    Inventors: Tara N. Sainath, Rami Botros, Anmol Gulati, Krzysztof Choromanski, Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
  • Publication number: 20230107248
    Abstract: A method includes receiving an initial alignment for a candidate hypothesis generated by a transducer decoder model during a first pass. Here, the candidate hypothesis corresponds to a candidate transcription for an utterance and the initial alignment for the candidate hypothesis includes a sequence of output labels. Each output label corresponds to a blank symbol or a hypothesized sub-word unit. The method also include receiving a subsequent sequence of audio encodings characterizing the utterance. During an initial refinement step, the method also includes generating a new alignment for a rescored sequence of output labels using a non-autoregressive decoder. The non-autoregressive decoder is configured to receive the initial alignment for the candidate hypothesis and the subsequent sequence of audio encodings.
    Type: Application
    Filed: September 16, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Weiran Wang, Ke Hu, Tara N. Sainath
  • Publication number: 20230107450
    Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the dense representation generated by the prediction network, a probability distribution that the corresponding time step corresponds to a pause and an end of speech.
    Type: Application
    Filed: August 26, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Shuo-yiin Chang, Bo Li, Tara N. Sainath, Trevor Strohman, Chao Zhang
  • Publication number: 20230109407
    Abstract: A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first pass transducer decoder, a first pass speech recognition hypothesis for a corresponding first higher order feature representation and generating, by a text encoder, a text encoding for a corresponding first pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a second pass transducer decoder, a second pass speech recognition hypothesis using a corresponding second higher order feature representation and a corresponding text encoding.
    Type: Application
    Filed: September 19, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Arun Narayanan, Ruoming Pang, Trevor Strohman
  • Publication number: 20230107695
    Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.
    Type: Application
    Filed: August 19, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang
  • Publication number: 20230104228
    Abstract: A method includes receiving audio features and generating a latent speech representation based on the audio features. The method also includes generating a target quantized vector token and a target token index for a corresponding latent speech representation. The method also includes generating a contrastive context vector for a corresponding unmasked or masked latent speech representation and deriving a contrastive self-supervised loss based on the corresponding contrastive context vector and the corresponding target quantized vector token. The method also include generating a high-level context vector based on the contrastive context vector and, for each high-level context vector, learning to predict the target token index at the corresponding time step using a cross-entropy loss based on the target token index.
    Type: Application
    Filed: September 6, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Bo Li, Junwen Bai, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath
  • Publication number: 20230108275
    Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.
    Type: Application
    Filed: September 22, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
  • Publication number: 20230107493
    Abstract: A method includes receiving a sequence of input audio frames corresponding to an utterance captured by a user device, the utterance including a plurality of words. For each input audio frame, the method includes predicting, using a word boundary detection model configured receive the sequence of input audio frames as input, whether the input audio frame is a word boundary. The method includes batching the input audio frames into a plurality of batches based on the input audio frames predicted as word boundaries, wherein each batch includes a corresponding plurality of batched input audio frames. For each of the plurality of batches, the method includes processing, using a speech recognition model, the corresponding plurality of batched input audio frames in parallel to generate a speech recognition result.
    Type: Application
    Filed: September 21, 2022
    Publication date: April 6, 2023
    Applicant: Google LLC
    Inventors: Shaan Jagdeep Patrick Bijwadia, Tara N. Sainath, Jiahui Yu, Shuo-yiin Chang, Yangzhang He
  • Publication number: 20230096821
    Abstract: A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples, and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the plurality of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or appear in the transcriptions from the set of training utterances less than a threshold number of times. The method further includes training the external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.
    Type: Application
    Filed: December 13, 2021
    Publication date: March 30, 2023
    Inventors: Ronny Huang, Tara N. Sainath
  • Patent number: 11594212
    Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
    Type: Grant
    Filed: January 21, 2021
    Date of Patent: February 28, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Ruoming Pang, Ron Weiss, Yanzhang He, Chung-Cheng Chiu, Trevor Strohman