Patents by Inventor Tara N. Sainath

Tara N. Sainath has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11908461
    Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.
    Type: Grant
    Filed: January 14, 2021
    Date of Patent: February 20, 2024
    Assignee: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
  • Patent number: 11900915
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different language or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
    Type: Grant
    Filed: January 10, 2022
    Date of Patent: February 13, 2024
    Assignee: Google LLC
    Inventors: Zhifeng Chen, Bo Li, Eugene Weinstein, Yonghui Wu, Pedro J. Moreno Mengibar, Ron J. Weiss, Khe Chai Sim, Tara N. Sainath, Patrick An Phu Nguyen
  • Publication number: 20240028829
    Abstract: A method includes receiving training data that includes a set of unspoken textual utterances. For each respective unspoken textual utterance, the method includes, tokenizing the respective textual utterance into a sequence of sub-word units, generating a first higher order textual feature representation for a corresponding sub-word unit tokenized from the respective unspoken textual utterance, receiving the first higher order textual feature representation generated by a text encoder, and generating a first probability distribution over possible text units. The method also includes training an encoder based on the first probability distribution over possible text units generated by a first-pass decoder for each respective unspoken textual utterance in the set of unspoken textual utterances.
    Type: Application
    Filed: July 1, 2023
    Publication date: January 25, 2024
    Applicant: Google LLC
    Inventors: Tara N. Sainath, Zhouyuan Huo, Zhehuai Chen, Yu Zhang, Weiran Wang, Trevor Strohman, Rohit Prakash Prabhavalkar, Bo Li, Ankur Bapna
  • Publication number: 20240029719
    Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate between a VAD mode and an EOQ detection mode. During the VAD mode, the endpointer model receives input audio frames, and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder, and determines, for each of the latent representation, whether the latent representation includes final silence.
    Type: Application
    Filed: June 23, 2023
    Publication date: January 25, 2024
    Applicant: Google LLC
    Inventors: Shaan Jagdeep Patrick Bijwadia, Shuo-yiin Chang, Bo Li, Yanzhang He, Tara N. Sainath, Chao Zhang
  • Publication number: 20230377564
    Abstract: A method for training a speech recognition model with a minimum word error rate loss function includes receiving a training example comprising a proper noun and generating a plurality of hypotheses corresponding to the training example. Each hypothesis of the plurality of hypotheses represents the proper noun and includes a corresponding probability that indicates a likelihood that the hypothesis represents the proper noun. The method also includes determining that the corresponding probability associated with one of the plurality of hypotheses satisfies a penalty criteria. The penalty criteria indicating that the corresponding probability satisfies a probability threshold, and the associated hypothesis incorrectly represents the proper noun. The method also includes applying a penalty to the minimum word error rate loss function.
    Type: Application
    Filed: July 31, 2023
    Publication date: November 23, 2023
    Applicant: Google LLC
    Inventors: Charles Caleb Peyser, Tara N. Sainath, Golan Pundak
  • Publication number: 20230343332
    Abstract: A joint segmenting and ASR model includes an encoder and decoder. The encoder configured to: receive a sequence of acoustic frames characterizing one or more utterances; and generate, at each output step, a higher order feature representation for a corresponding acoustic frame. The decoder configured to: receive the higher order feature representation and generate, at each output step: a probability distribution over possible speech recognition hypotheses, and an indication of whether the corresponding output step corresponds to an end of speech segment. The j oint segmenting and ASR model trained on a set of training samples, each training sample including: audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.
    Type: Application
    Filed: April 20, 2023
    Publication date: October 26, 2023
    Applicant: Google LLC
    Inventors: Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prakash Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Charles Caleb Peyser, Zhiyun Lu
  • Publication number: 20230335117
    Abstract: A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and decoding, using a speech recognition joint network, the corresponding audio encoding into a probability distribution over possible output labels. At each of the plurality of time steps, the method also includes determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance includes a query intended for a digital assistant.
    Type: Application
    Filed: March 20, 2023
    Publication date: October 19, 2023
    Applicant: Google LLC
    Inventors: Shuo-yiin Chang, Guru Prakash Arumugam, Zelin Wu, Tara N. Sainath, Bo LI, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
  • Publication number: 20230326461
    Abstract: An automated speech recognition (ASR) model includes a first encoder, a first encoder, a second encoder, and a second decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The first decoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a first probability distribution over possible speech recognition hypotheses. The second encoder receives, as input, the first higher order feature representation generated by the first encoder, and generates a second higher order feature representation for a corresponding first higher order feature frame. The second decoder receives, as input, the second higher order feature representation generated by the second encoder, and generates a second probability distribution over possible speech recognition hypotheses.
    Type: Application
    Filed: March 13, 2023
    Publication date: October 12, 2023
    Applicant: Google LLC
    Inventors: Shaojin Ding, Yangzhang He, Xin Wang, Weiran Wang, Trevor Strohman, Tara N. Sainath, Rohit Parkash Prabhavalkar, Robert David, Rina Panigrahy, Rami Botros, Qiao Liang, Ian Mcgraw, Ding Zhao, Dongseong Hwang
  • Patent number: 11783849
    Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
    Type: Grant
    Filed: June 8, 2021
    Date of Patent: October 10, 2023
    Assignee: Google LLC
    Inventors: Ehsan Variani, Kevin William Wilson, Ron J. Weiss, Tara N. Sainath, Arun Narayanan
  • Publication number: 20230306958
    Abstract: A method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. The method also includes generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a language identification (ID) predictor, a language prediction representation based on a concatenation of the first higher order feature representation and the second higher order feature representation. The method also includes generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on a concatenation of the second higher order feature representation and the language prediction representation.
    Type: Application
    Filed: March 23, 2023
    Publication date: September 28, 2023
    Applicant: Google LLC
    Inventors: Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani
  • Publication number: 20230298563
    Abstract: A method of text-only and semi-supervised training for deliberation includes receiving training data including unspoken textual utterances that are each not paired with any corresponding spoken utterance of non-synthetic speech, and training a deliberation model that includes a text encoder and a deliberation decoder on the unspoken textual utterances. The method also includes receiving, at the trained deliberation model, first-pass hypotheses and non-causal acoustic embeddings. The first-pass hypotheses is generated by a recurrent neural network-transducer (RNN-T) decoder for the non-causal acoustic embeddings encoded by a non-causal encoder. The method also includes encoding, using the text encoder, the first-pass hypotheses generated by the RNN-T decoder, and generating, using the deliberation decoder attending to both the first-pass hypotheses and the non-causal acoustic embeddings, second-pass hypotheses.
    Type: Application
    Filed: March 18, 2023
    Publication date: September 21, 2023
    Applicant: Google LLC
    Inventors: Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Sepand Mavandadi, Weiran Wang, Trevor Strohman
  • Publication number: 20230298570
    Abstract: A method includes generating, using an audio encoder, a higher-order feature representation for each acoustic frame in a sequence of acoustic frames; generating, using a decoder, based on the higher-order feature representation, a plurality of speech recognition hypotheses, each hypotheses corresponding to a candidate transcription of an utterance and having an associated first likelihood score; generating, using an external language model, for each speech recognition hypothesis, a second likelihood score; determining, using a learnable fusion module, for each speech recognition hypothesis, a set of fusion weights based on the higher-order feature representation and the speech recognition hypothesis; and generating, using the learnable fusion module, for each speech recognition hypothesis, a third likelihood score based on the first likelihood score, the second likelihood score, and the set of fusion weights, the audio encoder and decoder trained using minimum additive error rate training in the presence of t
    Type: Application
    Filed: March 21, 2023
    Publication date: September 21, 2023
    Applicant: Google LLC
    Inventors: Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prakash Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Charles Caleb Peyser, Trevor Strohman, Yangzhang He, David Rybach
  • Patent number: 11756534
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
    Type: Grant
    Filed: January 26, 2022
    Date of Patent: September 12, 2023
    Assignee: Google LLC
    Inventors: Bo Li, Ron Weiss, Michiel A. U. Bacchiani, Tara N. Sainath, Kevin William Wilson
  • Patent number: 11749259
    Abstract: A method for training a speech recognition model with a minimum word error rate loss function includes receiving a training example comprising a proper noun and generating a plurality of hypotheses corresponding to the training example. Each hypothesis of the plurality of hypotheses represents the proper noun and includes a corresponding probability that indicates a likelihood that the hypothesis represents the proper noun. The method also includes determining that the corresponding probability associated with one of the plurality of hypotheses satisfies a penalty criteria. The penalty criteria indicating that the corresponding probability satisfies a probability threshold, and the associated hypothesis incorrectly represents the proper noun. The method also includes applying a penalty to the minimum word error rate loss function.
    Type: Grant
    Filed: January 15, 2021
    Date of Patent: September 5, 2023
    Assignee: Google LLC
    Inventors: Charles Caleb Peyser, Tara N. Sainath, Golan Pundak
  • Publication number: 20230274736
    Abstract: A method of biasing speech recognition includes receiving audio data encoding an utterance and obtaining a set of one or more biasing phrases corresponding to a context of the utterance. Each biasing phrase in the set of one or more biasing phrases includes one or more words. The method also includes processing, using a speech recognition model, acoustic features derived from the audio data and grapheme and phoneme data derived from the set of one or more biasing phrases to generate an output of the speech recognition model. The method also includes determining a transcription for the utterance based on the output of the speech recognition model.
    Type: Application
    Filed: May 4, 2023
    Publication date: August 31, 2023
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Golan Pundak, Tara N. Sainath, Antoine Jean Bruguier
  • Patent number: 11741366
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for implementing long-short term memory layers with compressed gating functions. One of the systems includes a first long short-term memory (LSTM) layer, wherein the first LSTM layer is configured to, for each of the plurality of time steps, generate a new layer state and a new layer output by applying a plurality of gates to a current layer input, a current layer state, and a current layer output, each of the plurality of gates being configured to, for each of the plurality of time steps, generate a respective intermediate gate output vector by multiplying a gate input vector and a gate parameter matrix. The gate parameter matrix for at least one of the plurality of gates is a structured matrix or is defined by a compressed parameter matrix and a projection matrix.
    Type: Grant
    Filed: December 23, 2019
    Date of Patent: August 29, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Vikas Sindhwani
  • Patent number: 11715486
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
    Type: Grant
    Filed: December 31, 2019
    Date of Patent: August 1, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
  • Publication number: 20230237995
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
    Type: Application
    Filed: March 31, 2023
    Publication date: July 27, 2023
    Applicant: Google LLC
    Inventors: Rohit Prakash Prabhavalkar, Tara N. Sainath, Younghui Wu, Patrick An Phu Nguyen, Zhifeng Chen, Chung-Cheng Chiu, Anjuli Kannan
  • Publication number: 20230237993
    Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
    Type: Application
    Filed: October 1, 2021
    Publication date: July 27, 2023
    Inventors: Jiahui Yu, Ruoming Pang, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Hu
  • Publication number: 20230206907
    Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.
    Type: Application
    Filed: February 9, 2023
    Publication date: June 29, 2023
    Applicant: Google LLC
    Inventors: Tara N Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang