Patents by Inventor Yanzhang He

Yanzhang He has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

4-bit conformer with accurate quantization training for speech recognition

Patent number: 12374323

Abstract: A method for training a model includes obtaining a plurality of training samples. Each respective training sample of the plurality of training samples includes a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method includes training, using quantization aware training with native integer operations, an automatic speech recognition (ASR) model on the plurality of training samples. The method also includes quantizing the trained ASR model to an integer target fixed-bit width. The quantized trained ASR model includes a plurality of weights. Each weight of the plurality of weights includes an integer with the target fixed-bit width. The method includes providing the quantized trained ASR model to a user device.

Type: Grant

Filed: March 20, 2023

Date of Patent: July 29, 2025

Assignee: Google LLC

Inventors: Shaojin Ding, Oleg Rybakov, Phoenix Meadowlark, Shivani Agrawal, Yanzhang He, Lukasz Lew
Optimizing personal VAD for on-device speech recognition

Patent number: 12347438

Abstract: A computer-implemented method includes receiving a sequence of acoustic frames corresponding to an utterance and generating a reference speaker embedding for the utterance. The method also includes receiving a target speaker embedding for a target speaker and generating feature-wise linear modulation (FiLM) parameters including a scaling vector and a shifting vector based on the target speaker embedding. The method also includes generating an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters. The method also includes generating a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.

Type: Grant

Filed: March 17, 2023

Date of Patent: July 1, 2025

Assignee: Google LLC

Inventors: Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'malley, Ian Mcgraw
Language Agnostic Multilingual End-To-End Streaming On-Device ASR System

Publication number: 20250095634

Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.

Type: Application

Filed: December 2, 2024

Publication date: March 20, 2025

Applicant: Google LLC

Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
QUANTIZATION AND SPARSITY AWARE FINE-TUNING FOR SPEECH RECOGNITION WITH UNIVERSAL SPEECH MODELS

Publication number: 20250078815

Abstract: A method includes obtaining a plurality of training samples that each include a respective speech utterance and a respective textual utterance representing a transcription of the respective speech utterance. The method also includes fine-tuning, using quantization and sparsity aware training with native integer operations, a pre-trained automatic speech recognition (ASR) model on the plurality of training samples. Here, the pre-trained ASR model includes a plurality of weights and the fine-tuning includes pruning one or more weights of the plurality of weights using a sparsity mask and quantizing each weight of the plurality of weights based on an integer with a fixed-bit width. The method also includes providing the fine-tuned ASR model to a user device.

Type: Application

Filed: September 5, 2024

Publication date: March 6, 2025

Applicant: Google LLC

Inventors: Shaojin Ding, David Qiu, David Rim, Amir Yazdanbakhsh, Yanzhang He, Zhonglin Han, Rohit Prakash Prabhavalkar, Weiran Wang, Bo Li, Jian Li, Tara N. Sainath, Shivani Agrawal, Oleg Rybakov
Joint Acoustic Echo Cancelation, Speech Enhancement, and Voice Separation for Automatic Speech Recognition

Publication number: 20250029624

Abstract: A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features.

Type: Application

Filed: October 4, 2024

Publication date: January 23, 2025

Applicant: Google LLC

Inventors: Arun Narayanan, Tom O'malley, Quan Wang, Alex Park, James Walker, Nathan David Howard, Yanzhang He, Chung-Cheng Chiu
Language agnostic multilingual end-to-end streaming on-device ASR system

Patent number: 12183322

Abstract: A method includes receiving a sequence of acoustic frames characterizing one or more utterances as input to a multilingual automated speech recognition (ASR) model. The method also includes generating a higher order feature representation for a corresponding acoustic frame. The method also includes generating a hidden representation based on a sequence of non-blank symbols output by a final softmax layer. The method also includes generating a probability distribution over possible speech recognition hypotheses based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps. The method also includes predicting an end of utterance (EOU) token at an end of each utterance. The method also includes classifying each acoustic frame as either speech, initial silence, intermediate silence, or final silence.

Type: Grant

Filed: September 22, 2022

Date of Patent: December 31, 2024

Assignee: Google LLC

Inventors: Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
TWO-PASS END TO END SPEECH RECOGNITION

Publication number: 20240420687

Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.

Type: Application

Filed: August 26, 2024

Publication date: December 19, 2024

Inventors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Jean Bruguier, Shuo-yiin Chang, Wei Li
EFFICIENT STREAMING NON-RECURRENT ON-DEVICE END-TO-END MODEL

Publication number: 20240371363

Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.

Type: Application

Filed: July 15, 2024

Publication date: November 7, 2024

Applicant: Google LLC

Inventors: Tara Sainath, Arun Narayanan, Rami Botros, Yanzhang He, Ehsan Variani, Cyril Allauzen, David Rybach, Ruoming Pang, Trevor Strohman
VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION

Publication number: 20240363122

Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance.

Type: Application

Filed: July 5, 2024

Publication date: October 31, 2024

Inventors: Rajeev Rikhye, Quan Wang, Yanzhang He, Qiao Liang, Ian C. McGraw
Robustness Aware Norm Decay for Quantization Aware Training and Generalization

Publication number: 20240347043

Abstract: A method includes obtaining a plurality of training samples, determining a minimum integer fixed-bit width representing a maximum quantization of an automatic speech recognition (ASR) model, and training the ASR model on the plurality of training samples using a quantity of random noise. The ASR model includes a plurality of weights that each include a respective float value. The quantity of random noise is based on the minimum integer fixed-bit value. After training the ASR model, the method also includes selecting a target integer fixed-bit width greater than or equal to the minimum integer fixed-bit width, and for each respective weight of the plurality of weights, quantizing the respective weight from the respective float value to a respective integer associated with a value of the selected target integer fixed-bit width. The operations also include providing the quantized trained ASR model to a user device.

Type: Application

Filed: April 10, 2024

Publication date: October 17, 2024

Applicant: Google LLC

Inventors: David Qiu, David Rim, Shaojin Ding, Yanzhang He
Joint acoustic echo cancelation, speech enhancement, and voice separation for automatic speech recognition

Patent number: 12119014

Abstract: A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features.

Type: Grant

Filed: December 14, 2021

Date of Patent: October 15, 2024

Assignee: Google LLC

Inventors: Arun Narayanan, Tom O'malley, Quan Wang, Alex Park, James Walker, Nathan David Howard, Yanzhang He, Chung-Cheng Chiu
Fast emit low-latency streaming ASR with sequence-level emission regularization utilizing forward and backward probabilities between nodes of an alignment lattice

Patent number: 12094453

Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.

Type: Grant

Filed: September 9, 2021

Date of Patent: September 17, 2024

Assignee: Google LLC

Inventors: Jiahui Yu, Chung-cheng Chiu, Bo Li, Shuo-yiin Chang, Tara Sainath, Wei Han, Anmol Gulati, Yanzhang He, Arun Narayanan, Yonghui Wu, Ruoming Pang
Two-pass end to end speech recognition

Patent number: 12073824

Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.

Type: Grant

Filed: December 3, 2020

Date of Patent: August 27, 2024

Assignee: GOOGLE LLC

Inventors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Jean Bruguier, Shuo-Yiin Chang, Wei Li
Efficient streaming non-recurrent on-device end-to-end model

Patent number: 12051404

Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.

Type: Grant

Filed: June 16, 2023

Date of Patent: July 30, 2024

Assignee: Google LLC

Inventors: Tara Sainath, Arun Narayanan, Rami Botros, Yanzhang He, Ehsan Variani, Cyril Allauzen, David Rybach, Ruoming Pang, Trevor Strohman
Voice shortcut detection with speaker verification

Patent number: 12033641

Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance.

Type: Grant

Filed: January 30, 2023

Date of Patent: July 9, 2024

Assignee: GOOGLE LLC

Inventors: Rajeev Rikhye, Quan Wang, Yanzhang He, Qiao Liang, Ian C. McGraw
KEY PHRASE SPOTTING

Publication number: 20240221750

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for detecting utterances of a key phrase in an audio signal. One of the methods includes receiving, by a key phrase spotting system, an audio signal encoding one or more utterances; while continuing to receive the audio signal, generating, by the key phrase spotting system, an attention output using an attention mechanism that is configured to compute the attention output based on a series of encodings generated by an encoder comprising one or more neural network layers; generating, by the key phrase spotting system and using attention output, output that indicates whether the audio signal likely encodes the key phrase; and providing, by the key phrase spotting system, the output that indicates whether the audio signal likely encodes the key phrase.

Type: Application

Filed: March 19, 2024

Publication date: July 4, 2024

Applicant: Google LLC

Inventors: Wei Li, Rohit Prakash Prabhavalkar, Kanury Kanishka Rao, Yanzhang He, Ian C. McGraw, Anton Bakhtin
End-To-End Segmentation in a Two-Pass Cascaded Encoder Automatic Speech Recognition Model

Publication number: 20240169981

Abstract: A unified end-to-end segmenter and two-pass automatic speech recognition (ASR) model includes a first encoder, a first decoder, a second encoder, and a second decoder. The first encoder is configured to receive a sequence of acoustic frames and generate a first higher order feature representation. The first decoder is configured to receive the first higher order feature representation and generate, at each of a plurality of output steps, a first probability distribution and an indication of whether the output step corresponds to an end of speech segment, and emit an end of speech timestamp. The second encoder is configured to receive the first higher order feature representation and the end of speech timestamp, and generate a second higher order feature representation. The second decoder is configured to receive the second higher order feature representation and generate a second probability distribution.

Type: Application

Filed: November 17, 2023

Publication date: May 23, 2024

Applicant: Google LLC

Inventors: Wenqian Ronny Huang, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He
Multi-Output Decoders for Multi-Task Learning of ASR and Auxiliary Tasks

Publication number: 20240153495

Abstract: A method includes receiving a training dataset that includes one or more spoken training utterances for training an automatic speech recognition (ASR) model. Each spoken training utterance in the training dataset paired with a corresponding transcription and a corresponding target sequence of auxiliary tokens. For each spoken training utterance, the method includes generating a speech recognition hypothesis for a corresponding spoken training utterance, determining a speech recognition loss based on the speech recognition hypothesis and the corresponding transcription, generating a predicted auxiliary token for the corresponding spoken training utterance, and determining an auxiliary task loss based on the predicted auxiliary token and the corresponding target sequence of auxiliary tokens. The method also includes the ASR model jointly on the speech recognition loss and the auxiliary task loss determined for each spoken training utterance.

Type: Application

Filed: October 26, 2023

Publication date: May 9, 2024

Applicant: Google LLC

Inventors: Weiran Wang, Ding Zhao, Shaojin Ding, Hao Zhang, Shuo-yiin Chang, David Johannes Rybach, Tara N. Sainath, Yanzhang He, Ian McGraw, Shankar Kumar
Key phrase spotting

Patent number: 11948570

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for detecting utterances of a key phrase in an audio signal. One of the methods includes receiving, by a key phrase spotting system, an audio signal encoding one or more utterances; while continuing to receive the audio signal, generating, by the key phrase spotting system, an attention output using an attention mechanism that is configured to compute the attention output based on a series of encodings generated by an encoder comprising one or more neural network layers; generating, by the key phrase spotting system and using attention output, output that indicates whether the audio signal likely encodes the key phrase; and providing, by the key phrase spotting system, the output that indicates whether the audio signal likely encodes the key phrase.

Type: Grant

Filed: March 9, 2022

Date of Patent: April 2, 2024

Assignee: Google LLC

Inventors: Wei Li, Rohit Prakash Prabhavalkar, Kanury Kanishka Rao, Yanzhang He, Ian C. Mcgraw, Anton Bakhtin
Unified End-To-End Speech Recognition And Endpointing Using A Switch Connection

Publication number: 20240029719

Abstract: A single E2E multitask model includes a speech recognition model and an endpointer model. The speech recognition model includes an audio encoder configured to encode a sequence of audio frames into corresponding higher-order feature representations, and a decoder configured to generate probability distributions over possible speech recognition hypotheses for the sequence of audio frames based on the higher-order feature representations. The endpointer model is configured to operate between a VAD mode and an EOQ detection mode. During the VAD mode, the endpointer model receives input audio frames, and determines, for each input audio frame, whether the input audio frame includes speech. During the EOQ detection mode, the endpointer model receives latent representations for the sequence of audio frames output from the audio encoder, and determines, for each of the latent representation, whether the latent representation includes final silence.

Type: Application

Filed: June 23, 2023

Publication date: January 25, 2024

Applicant: Google LLC

Inventors: Shaan Jagdeep Patrick Bijwadia, Shuo-yiin Chang, Bo Li, Yanzhang He, Tara N. Sainath, Chao Zhang

1 2 3 next