Patents by Inventor Ruoming Pang
Ruoming Pang has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12079703Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.Type: GrantFiled: December 31, 2020Date of Patent: September 3, 2024Assignee: GOOGLE LLCInventors: Anmol Gulati, Ruoming Pang, Niki Parmar, Jiahui Yu, Wei Han, Chung-Cheng Chiu, Yu Zhang, Yonghui Wu, Shibo Wang, Weikeng Qin, Zhengdong Zhang
-
Patent number: 12073824Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.Type: GrantFiled: December 3, 2020Date of Patent: August 27, 2024Assignee: GOOGLE LLCInventors: Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Jean Bruguier, Shuo-Yiin Chang, Wei Li
-
Publication number: 20240273336Abstract: The present disclosure is directed to an automated neural architecture search approach for designing new neural network architectures such as, for example, resource-constrained mobile CNN models. In particular, the present disclosure provides systems and methods to perform neural architecture search using a novel factorized hierarchical search space that permits layer diversity throughout the network, thereby striking the right balance between flexibility and search space size. The resulting neural architectures are able to be run relatively faster and using relatively fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.), all while remaining competitive with or even exceeding the performance (e.g., accuracy) of current state-of-the-art mobile-optimized models.Type: ApplicationFiled: February 1, 2024Publication date: August 15, 2024Inventors: Mingxing Tan, Quoc Le, Bo Chen, Vijay Vasudevan, Ruoming Pang
-
Patent number: 12051404Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.Type: GrantFiled: June 16, 2023Date of Patent: July 30, 2024Assignee: Google LLCInventors: Tara Sainath, Arun Narayanan, Rami Botros, Yanzhang He, Ehsan Variani, Cyril Allauzen, David Rybach, Ruoming Pang, Trevor Strohman
-
Patent number: 12027158Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.Type: GrantFiled: February 6, 2023Date of Patent: July 2, 2024Assignee: Google LLCInventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
-
Patent number: 12027154Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.Type: GrantFiled: February 9, 2023Date of Patent: July 2, 2024Assignee: Google LLCInventors: Tara N. Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang
-
Patent number: 12027151Abstract: A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.Type: GrantFiled: November 18, 2021Date of Patent: July 2, 2024Assignee: Google LLCInventors: Ruoming Pang, Andros Tjandra, Yu Zhang, Shigeki Karita
-
Publication number: 20240112667Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.Type: ApplicationFiled: November 30, 2023Publication date: April 4, 2024Applicant: Google LLCInventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick An Phu Nguyen
-
Publication number: 20240104352Abstract: Provided are improved end-to-end self-supervised pre-training frameworks that leverage a combination of contrastive and masked modeling loss terms. In particular, the present disclosure provides framework that combines contrastive learning and masked modeling, where the former trains the model to discretize input data (e.g., continuous signals such as continuous speech signals) into a finite set of discriminative tokens, and the latter trains the model to learn contextualized representations via solving a masked prediction task consuming the discretized tokens. In contrast to certain existing masked modeling-based pre-training frameworks which rely on an iterative re-clustering and re-training process or other existing frameworks which concatenate two separately trained modules, the proposed framework can enable a model to be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and masked modeling) simultaneously.Type: ApplicationFiled: July 28, 2022Publication date: March 28, 2024Inventors: Yu Zhang, Yu-An Chung, Wei Han, Chung-Cheng Chiu, Weikeng Qin, Ruoming Pang, Yonghui Wu
-
Patent number: 11928574Abstract: The present disclosure is directed to an automated neural architecture search approach for designing new neural network architectures such as, for example, resource-constrained mobile CNN models. In particular, the present disclosure provides systems and methods to perform neural architecture search using a novel factorized hierarchical search space that permits layer diversity throughout the network, thereby striking the right balance between flexibility and search space size. The resulting neural architectures are able to be run relatively faster and using relatively fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.), all while remaining competitive with or even exceeding the performance (e.g., accuracy) of current state-of-the-art mobile-optimized models.Type: GrantFiled: January 13, 2023Date of Patent: March 12, 2024Assignee: GOOGLE LLCInventors: Mingxing Tan, Quoc Le, Bo Chen, Vijay Vasudevan, Ruoming Pang
-
Patent number: 11908461Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector. The method also includes decoding the first context vector and the second context vector at a context vector decoder to form a second-pass hypothesis.Type: GrantFiled: January 14, 2021Date of Patent: February 20, 2024Assignee: Google LLCInventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar
-
Publication number: 20240029716Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.Type: ApplicationFiled: October 4, 2023Publication date: January 25, 2024Applicant: Google LLCInventors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
-
Patent number: 11848002Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.Type: GrantFiled: July 19, 2022Date of Patent: December 19, 2023Assignee: Google LLCInventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick An Phu Nguyen
-
Patent number: 11804212Abstract: A method for training a streaming automatic speech recognition student model includes receiving a plurality of unlabeled student training utterances. The method also includes, for each unlabeled student training utterance, generating a transcription corresponding to the respective unlabeled student training utterance using a plurality of non-streaming automated speech recognition (ASR) teacher models. The method further includes distilling a streaming ASR student model from the plurality of non-streaming ASR teacher models by training the streaming ASR student model using the plurality of unlabeled student training utterances paired with the corresponding transcriptions generated by the plurality of non-streaming ASR teacher models.Type: GrantFiled: June 15, 2021Date of Patent: October 31, 2023Assignee: Google LLCInventors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
-
Publication number: 20230343328Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.Type: ApplicationFiled: June 16, 2023Publication date: October 26, 2023Applicant: Google LLCInventors: Tara Sainath, Arun Narayanan, Rami Botros, Yanzhang He, Ehsan Variani, Cyril Allauzen, David Rybach, Ruoming Pang, Trevor Strohman
-
Publication number: 20230244904Abstract: The present disclosure is directed to an automated neural architecture search approach for designing new neural network architectures such as, for example, resource-constrained mobile CNN models. In particular, the present disclosure provides systems and methods to perform neural architecture search using a novel factorized hierarchical search space that permits layer diversity throughout the network, thereby striking the right balance between flexibility and search space size. The resulting neural architectures are able to be run relatively faster and using relatively fewer computing resources (e.g., less processing power, less memory usage, less power consumption, etc.), all while remaining competitive with or even exceeding the performance (e.g., accuracy) of current state-of-the-art mobile-optimized models.Type: ApplicationFiled: January 13, 2023Publication date: August 3, 2023Inventors: Mingxing Tan, Quoc Le, Bo Chen, Vijay Vasudevan, Ruoming Pang
-
Patent number: 11715458Abstract: An ASR model includes a first encoder configured to receive a sequence of acoustic frames and generate a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The ASR model also includes a second encoder configured to receive the first higher order feature representation generated by the first encoder at each of the plurality of output steps and generate a second higher order feature representation for a corresponding first higher order feature frame. The ASR model also includes a decoder configured to receive the second higher order feature representation generated by the second encoder at each of the plurality of output steps and generate a first probability distribution over possible speech recognition hypothesis. The ASR model also includes a language model configured to receive the first probability distribution over possible speech hypothesis and generate a rescored probability distribution.Type: GrantFiled: May 10, 2021Date of Patent: August 1, 2023Assignee: Google LLCInventors: Tara Sainath, Arun Narayanan, Rami Botros, Yanzhang He, Ehsan Variani, Cyril Allauzen, David Rybach, Ruoming Pang, Trevor Strohman
-
Publication number: 20230237993Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label and the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.Type: ApplicationFiled: October 1, 2021Publication date: July 27, 2023Inventors: Jiahui Yu, Ruoming Pang, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Hu
-
Publication number: 20230206907Abstract: A method includes receiving a training example that includes audio data representing a spoken utterance and a ground truth transcription. For each word in the spoken utterance, the method also includes inserting a placeholder symbol before the respective word identifying a respective ground truth alignment for a beginning and an end of the respective word, determining a beginning word piece and an ending word piece, and generating a first constrained alignment for the beginning word piece and a second constrained alignment for the ending word piece. The first constrained alignment is aligned with the ground truth alignment for the beginning of the respective word and the second constrained alignment is aligned with the ground truth alignment for the ending of the respective word. The method also includes constraining an attention head of a second pass decoder by applying the first and second constrained alignments.Type: ApplicationFiled: February 9, 2023Publication date: June 29, 2023Applicant: Google LLCInventors: Tara N Sainath, Basilio Garcia Castillo, David Rybach, Trevor Strohman, Ruoming Pang
-
Publication number: 20230186907Abstract: A method of performing speech recognition using a two-pass deliberation architecture includes receiving a first-pass hypothesis and an encoded acoustic frame and encoding the first-pass hypothesis at a hypothesis encoder. The first-pass hypothesis is generated by a recurrent neural network (RNN) decoder model for the encoded acoustic frame. The method also includes generating, using a first attention mechanism attending to the encoded acoustic frame, a first context vector, and generating, using a second attention mechanism attending to the encoded first-pass hypothesis, a second context vector.Type: ApplicationFiled: February 6, 2023Publication date: June 15, 2023Applicant: Google LLCInventors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prakash Prabhavalkar