Patents by Inventor Zhiyun Lu

Zhiyun Lu has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Fusion of acoustic and text representations in RNN-T

Patent number: 12211509

Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.

Type: Grant

Filed: August 19, 2022

Date of Patent: January 28, 2025

Assignee: Google LLC

Inventors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang
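
The abstract above names gating and bilinear pooling as the two operations the joint network uses to fuse the acoustic and text representations, but it does not give formulas. The following is a minimal sketch of one common reading of those terms, not the patented implementation: an element-wise sigmoid gate that interpolates between the two vectors, and a low-rank bilinear pooling computed as the element-wise product of two linear projections. All function names, weight matrices, and the final summation used to "stack" the two results are assumptions made for illustration.

```python
import math


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(acoustic, text, w_gate_a, w_gate_t):
    """Assumed gating form: g * acoustic + (1 - g) * text, element-wise."""
    pre = [a + t for a, t in zip(matvec(w_gate_a, acoustic), matvec(w_gate_t, text))]
    gate = [sigmoid(p) for p in pre]
    return [g * a + (1.0 - g) * t for g, a, t in zip(gate, acoustic, text)]


def bilinear_pooling(acoustic, text, u_proj, v_proj):
    """Assumed low-rank bilinear pooling: (U @ acoustic) * (V @ text), element-wise."""
    return [a * t for a, t in zip(matvec(u_proj, acoustic), matvec(v_proj, text))]


def fuse(acoustic, text, w_gate_a, w_gate_t, u_proj, v_proj):
    """Hypothetical combination: sum of the gated and bilinear terms."""
    gated = gated_fusion(acoustic, text, w_gate_a, w_gate_t)
    pooled = bilinear_pooling(acoustic, text, u_proj, v_proj)
    return [g + p for g, p in zip(gated, pooled)]


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]      # zero gate weights give a gate of 0.5
    identity = [[1.0, 0.0], [0.0, 1.0]]   # identity projections for the bilinear term
    print(fuse([1.0, 2.0], [3.0, 4.0], zeros, zeros, identity, identity))
```

In this sketch both input vectors are assumed to have the same dimension; a real joint network would typically project the encoder and prediction network outputs to a shared size first and pass the fused vector to the final Softmax layer.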

Streaming Automatic Speech Recognition With Non-Streaming Model Distillation

Publication number: 20240029716

Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.

Type: Application

Filed: October 4, 2023

Publication date: January 25, 2024

Applicant: Google LLC

Inventors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

Unsupervised Data Selection via Discrete Speech Representation for Automatic Speech Recognition

Publication number: 20240013777

Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, each corresponding spoken utterance of the plurality of spoken utterances includes audio data characterizing the corresponding spoken utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method includes training an automatic speech recognition (ASR) model on the subset of utterances.

Type: Application

Filed: May 19, 2023

Publication date: January 11, 2024

Applicant: Google LLC

Inventors: Zhiyun Lu, Yu Zhang, Wei Han, Yongqiang Wang, Parisa Haghani, Zhehuai Chen

Streaming automatic speech recognition with non-streaming model distillation

Patent number: 11804212

Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.

Type: Grant

Filed: June 15, 2021

Date of Patent: October 31, 2023

Assignee: Google LLC

Inventors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

Joint Segmenting and Automatic Speech Recognition

Publication number: 20230343332

Abstract: A joint segmenting and ASR model includes an encoder and decoder. The encoder configured to: receive a sequence of acoustic frames characterizing one or more utterances; and generate, at each output step, a higher order feature representation for a corresponding acoustic frame. The decoder configured to: receive the higher order feature representation and generate, at each output step: a probability distribution over possible speech recognition hypotheses, and an indication of whether the corresponding output step corresponds to an end of speech segment. The joint segmenting and ASR model trained on a set of training samples, each training sample including: audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.

Type: Application

Filed: April 20, 2023

Publication date: October 26, 2023

Applicant: Google LLC

Inventors: Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prakash Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Charles Caleb Peyser, Zhiyun Lu

Fusion of Acoustic and Text Representations in RNN-T

Publication number: 20230107695

Abstract: A speech recognition model includes an encoder network, a prediction network, and a joint network. The encoder network is configured to receive a sequence of acoustic frames characterizing an input utterance; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The prediction network is configured to: receive a sequence of non-blank symbols output by a final Softmax layer; and generate, at each of the plurality of output steps, a dense representation. The joint network is configured to generate, at each of the plurality of output steps based on the higher order feature representation and the dense representation, a probability distribution over possible speech recognition hypotheses. The joint network includes a stack of gating and bilinear pooling to fuse the dense representation and the higher order feature representation.

Type: Application

Filed: August 19, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang

TRAINING FOR LONG-FORM SPEECH RECOGNITION

Publication number: 20230103382

Abstract: A method includes obtaining a set of training samples, wherein each training sample includes a corresponding sequence of speech segments corresponding to a training utterance and a corresponding sequence of ground-truth transcriptions for the sequence of speech segments, and wherein each ground-truth transcription includes a start time and an end time of a corresponding speech segment. For each training sample in the set of training samples, the method includes processing, using a speech recognition model, the corresponding sequence of speech segments to obtain one or more speech recognition hypotheses for the training utterance; and, for each speech recognition hypothesis obtained for the training utterance, identifying a respective number of word errors relative to the corresponding sequence of ground-truth transcriptions.

Type: Application

Filed: September 27, 2022

Publication date: April 6, 2023

Applicant: Google LLC

Inventors: Zhiyun Lu, Thibault Doutre, Yanwei Pan, Liangliang Cao, Rohit Prabhavalkar, Trevor Strohman, Chao Zhang
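
The abstract above describes identifying, for each speech recognition hypothesis, a number of word errors relative to the sequence of ground-truth transcriptions. The abstract does not say how that count is computed; the sketch below assumes the conventional approach of a word-level Levenshtein (edit) distance, with the segment transcriptions joined by spaces and words split on whitespace. The function names are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def errors_per_hypothesis(segment_transcripts, hypotheses):
    """Count word errors for each hypothesis against the joined segment transcripts."""
    reference = " ".join(segment_transcripts)  # assumed way to combine segments
    return [word_errors(reference, hyp) for hyp in hypotheses]


if __name__ == "__main__":
    segments = ["the cat sat", "on the mat"]
    candidates = ["the cat sat on the mat", "the cat sat on mat", "a cat sat on the hat"]
    print(errors_per_hypothesis(segments, candidates))
```

Per-hypothesis error counts of this kind are the usual input to word-error-based training objectives, although the abstract itself stops at the counting step.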

Streaming Automatic Speech Recognition With Non-Streaming Model Distillation

Publication number: 20220343894

Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.

Type: Application

Filed: June 15, 2021

Publication date: October 27, 2022

Applicant: Google LLC

Inventors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao