Patents by Inventor Anshuman Tripathi
Anshuman Tripathi has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12266347
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Grant
Filed: November 15, 2022
Date of Patent: April 1, 2025
Assignee: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
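As a rough illustration of the masking-loss idea in this abstract, the sketch below applies an L2 penalty to a speaker's masked embedding outside the region where that speaker is known to be talking. It assumes frame-level embeddings and a fixed frame rate; the function name, signature, and the specific L2 form are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def masking_loss(masked_embedding, overlap_start, overlap_end,
                 speaker_active_before_overlap, frame_rate_hz=100):
    """Illustrative masking loss for one speaker's masked audio embedding.

    masked_embedding: (num_frames, dim) array of masked audio embeddings.
    overlap_start / overlap_end: known overlap boundaries in seconds.
    speaker_active_before_overlap: True if this speaker spoke before the
        overlap started, so the mask should silence them after it ends.
    """
    start_frame = int(overlap_start * frame_rate_hz)
    end_frame = int(overlap_end * frame_rate_hz)
    if speaker_active_before_overlap:
        # Penalize residual energy after the known end time.
        region = masked_embedding[end_frame:]
    else:
        # Speaker only spoke after the overlap: penalize energy before the start.
        region = masked_embedding[:start_frame]
    if region.size == 0:
        return 0.0
    # Simple L2 penalty encouraging the mask to zero out the non-speaking region.
    return float(np.mean(region ** 2))
```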
-
Publication number: 20250105726
Abstract: A system comprises one or more sensors for determining sensor data, the sensor data including ambient parameters internal to the enclosure; a controller comprising one or more sensor interfaces configured to communicate with one or more sensors to receive the sensor data; one or more processors; and memory storing computer instructions configured to perform: determining, based on the sensor data, an existence or probability of condensation within the enclosure; and decreasing a dead time of the one or more soft switching mechanisms based on the existence or probability of condensation within the enclosure, the decreasing the dead time increasing heat in the converter circuitry to assist in addressing the existence or probability of condensation.
Type: Application
Filed: February 23, 2024
Publication date: March 27, 2025
Inventors: Gil Lampong OPINA, JR., Howe Li YEO, Anshuman TRIPATHI
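A hedged sketch of the described control behavior: the soft-switching dead time is reduced as the estimated condensation probability rises, which increases switching losses and therefore heat in the converter circuitry. The linear mapping, limits, and names below are assumptions for illustration only.

```python
def adjust_dead_time(base_dead_time_ns, condensation_probability,
                     min_dead_time_ns=100.0):
    """Reduce soft-switching dead time as condensation risk rises.

    A shorter dead time increases switching losses, which raises heat inside
    the converter enclosure and helps address condensation. The 50% maximum
    reduction and the floor value are illustrative, not from the filing.
    """
    p = max(0.0, min(1.0, condensation_probability))
    dead_time = base_dead_time_ns * (1.0 - 0.5 * p)
    return max(min_dead_time_ns, dead_time)
```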
-
Patent number: 12254869
Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
Type: Grant
Filed: July 24, 2023
Date of Patent: March 18, 2025
Assignee: Google LLC
Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
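The split between a zero look-ahead initial stack and a variable look-ahead final stack can be pictured with attention masks. The NumPy sketch below builds such masks; it is a simplified illustration of the look-ahead idea, not the model's actual implementation.

```python
import numpy as np

def attention_mask(num_frames, look_ahead):
    """Boolean mask where entry (i, j) is True if frame i may attend to frame j."""
    idx = np.arange(num_frames)
    return idx[None, :] <= (idx[:, None] + look_ahead)

# Initial stack: zero look-ahead (causal attention, streaming-friendly).
initial_stack_mask = attention_mask(num_frames=8, look_ahead=0)

# Final stack: a variable look-ahead sampled during training, so the same
# model can run with small look-ahead (streaming) or large look-ahead
# (non-streaming) at inference time.
final_stack_mask = attention_mask(num_frames=8, look_ahead=np.random.randint(0, 8))
```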
-
Publication number: 20240371379
Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
Type: Application
Filed: July 17, 2024
Publication date: November 7, 2024
Applicant: Google LLC
Inventors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
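One way to picture the self-alignment idea is as a regularizer that rewards emitting each label one frame earlier than the model's own reference alignment. The sketch below is an assumed, simplified form of such a penalty; the exact loss in the filing may differ.

```python
import numpy as np

def self_alignment_penalty(log_probs, reference_alignment_frames, labels, weight=0.05):
    """Illustrative self-alignment regularizer (not the exact patented loss).

    log_probs: (num_frames, vocab) emission log-probabilities.
    reference_alignment_frames: frame index at which each label is emitted on
        the model's own forced-alignment path.
    labels: the reference label sequence.
    Encourages each label to also be probable one frame earlier, pushing the
    model toward lower-latency emissions.
    """
    penalty = 0.0
    for label, frame in zip(labels, reference_alignment_frames):
        earlier = max(frame - 1, 0)  # one frame left of the reference alignment
        penalty += -log_probs[earlier, label]
    return float(weight * penalty)
```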
-
Publication number: 20240290322
Abstract: A method of training an accent recognition model includes receiving a corpus of training utterances spoken across various accents, each training utterance in the corpus including training audio features characterizing the training utterance, and executing a training process to train the accent recognition model on the corpus of training utterances to teach the accent recognition model to learn how to predict accent representations from the training audio features. The accent recognition model includes one or more strided convolution layers, a stack of multi-headed attention layers, and a pooling layer configured to generate a corresponding accent representation.
Type: Application
Filed: February 26, 2024
Publication date: August 29, 2024
Applicant: Google LLC
Inventors: Jaeyoung Kim, Han Lu, Soheil Khorram, Anshuman Tripathi, Qian Zhang, Hasim Sak
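A minimal PyTorch sketch of the described architecture: strided convolutions, a stack of multi-headed attention layers, and a pooling layer that produces an accent representation. All layer sizes, the mean pooling, and the classification head are illustrative assumptions.

```python
import torch.nn as nn

class AccentRecognizer(nn.Module):
    """Sketch of an accent recognition model with strided convolutions,
    multi-headed self-attention, and time pooling. Sizes are illustrative."""

    def __init__(self, feat_dim=80, model_dim=256, num_heads=4, num_layers=4,
                 num_accents=16):
        super().__init__()
        # Strided convolutions downsample the input feature sequence.
        self.subsample = nn.Sequential(
            nn.Conv1d(feat_dim, model_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(model_dim, model_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.attention_stack = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(model_dim, num_accents)

    def forward(self, features):            # features: (batch, time, feat_dim)
        x = self.subsample(features.transpose(1, 2)).transpose(1, 2)
        x = self.attention_stack(x)
        accent_representation = x.mean(dim=1)   # pooling over time
        return self.classifier(accent_representation)
```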
-
Patent number: 12057124
Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
Type: Grant
Filed: December 15, 2021
Date of Patent: August 6, 2024
Assignee: Google LLC
Inventors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
-
Publication number: 20240242712
Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on the target branch outputs and the predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
Type: Application
Filed: March 28, 2024
Publication date: July 18, 2024
Applicant: Google LLC
Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
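A compact sketch of the two-branch setup: the target branch encodes clean audio to produce targets, while the augmentation branch encodes augmented audio and predicts those targets. The encoder, augmentation, and predictor below are stand-ins, and the target-branch time-characteristic modification and stop-gradient details are omitted for brevity.

```python
import numpy as np

def augment(audio):
    """Placeholder augmentation (additive noise); the actual augmentations
    used in the filing are not specified here."""
    return audio + 0.01 * np.random.randn(*audio.shape)

def encode(audio, encoder_params):
    """Stand-in for the shared audio encoder; a linear map for illustration."""
    return audio @ encoder_params

def unsupervised_loss(audio, encoder_params, predictor_params):
    # Target branch: encode clean audio and treat the outputs as fixed targets
    # (the real setup also modifies their time characteristics).
    target_outputs = encode(audio, encoder_params)
    # Augmentation branch: encode augmented audio and predict the targets.
    augmented_outputs = encode(augment(audio), encoder_params)
    predictions = augmented_outputs @ predictor_params
    # Unsupervised loss between the predictions and the target branch outputs,
    # which would drive the audio encoder update.
    return float(np.mean((predictions - target_outputs) ** 2))
```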
-
Publication number: 20240203406
Abstract: A method includes receiving a sequence of acoustic frames extracted from unlabeled audio samples that correspond to spoken utterances not paired with any corresponding transcriptions. The method also includes generating, using a supervised audio encoder, a target higher order feature representation for a corresponding acoustic frame. The method also includes augmenting the sequence of acoustic frames and generating, as output from an unsupervised audio encoder, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames. The method also includes determining an unsupervised loss term based on the target higher order feature representation and the predicted higher order feature representation and updating parameters of the speech recognition model based on the unsupervised loss term.
Type: Application
Filed: December 14, 2022
Publication date: June 20, 2024
Applicant: Google LLC
Inventors: Soheil Khorram, Anshuman Tripathi, Kim Jaeyoung, Han Lu, Qian Zhang, Hasim Sak
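Read as a distillation-style objective, the loss can be sketched as follows: a supervised (teacher) encoder produces target features from clean frames, and an unsupervised (student) encoder must reproduce them from augmented frames. The mean-squared form and all function names are assumptions for illustration.

```python
import numpy as np

def unsupervised_loss_term(acoustic_frames, supervised_encoder,
                           unsupervised_encoder, augment_fn):
    """Sketch of the described unsupervised loss.

    supervised_encoder / unsupervised_encoder / augment_fn are placeholders
    for whatever encoders and augmentation are actually used.
    """
    targets = supervised_encoder(acoustic_frames)                # target features
    predictions = unsupervised_encoder(augment_fn(acoustic_frames))
    # The resulting term would be used to update the speech recognition model.
    return float(np.mean((predictions - targets) ** 2))
```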
-
Publication number: 20240177706
Abstract: A method for training a sequence transduction model includes receiving a sequence of unlabeled input features extracted from unlabeled input samples. Using a teacher branch of an unsupervised subnetwork, the method includes processing the sequence of input features to predict probability distributions over possible teacher branch output labels, sampling one or more sequences of teacher branch output labels, and determining a sequence of pseudo output labels based on the one or more sequences of teacher branch output labels. Using a student branch that includes a student encoder of the unsupervised subnetwork, the method includes processing the sequence of input features to predict probability distributions over possible student branch output labels, determining a negative log likelihood term based on the predicted probability distributions over possible student branch output labels and the sequence of pseudo output labels, and updating parameters of the student encoder.
Type: Application
Filed: November 20, 2023
Publication date: May 30, 2024
Applicant: Google LLC
Inventors: Anshuman Tripathi, Soheil Khorram, Hasim Sak, Han Lu, Jaeyoung Kim, Qian Zhang
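A simplified sketch of the teacher/student loop on unlabeled input: the teacher predicts per-step distributions, label sequences are sampled and reduced to pseudo labels, and the student incurs a negative log likelihood against those pseudo labels. The majority-vote reduction and all function names are illustrative assumptions.

```python
import numpy as np

def teacher_student_step(input_features, teacher_probs_fn, student_probs_fn,
                         num_samples=4):
    """Illustrative teacher/student update for one unlabeled sequence.

    teacher_probs_fn / student_probs_fn map a feature sequence to per-step
    probability distributions over output labels; both are placeholders.
    """
    # Teacher branch: predict distributions and sample candidate label sequences.
    teacher_probs = teacher_probs_fn(input_features)        # (steps, vocab)
    sampled = [
        [np.random.choice(len(p), p=p) for p in teacher_probs]
        for _ in range(num_samples)
    ]
    # Reduce the sampled sequences to one pseudo-label sequence
    # (majority vote here; the abstract leaves the exact reduction open).
    pseudo_labels = [
        np.bincount([seq[t] for seq in sampled]).argmax()
        for t in range(len(teacher_probs))
    ]
    # Student branch: negative log likelihood of the pseudo labels, which
    # would drive the student encoder update.
    student_probs = student_probs_fn(input_features)
    nll = -sum(np.log(student_probs[t][lab] + 1e-9)
               for t, lab in enumerate(pseudo_labels))
    return pseudo_labels, float(nll)
```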
-
Patent number: 11961515
Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on the target branch outputs and the predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
Type: Grant
Filed: December 14, 2021
Date of Patent: April 16, 2024
Assignee: Google LLC
Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
-
Publication number: 20230402936
Abstract: Disclosed herein is a system for controlling a solid state transformer (SST), the SST comprising an AC-to-DC stage, a DC-to-AC stage, and a DC-to-DC stage coupled between the AC-to-DC stage and the DC-to-AC stage, the DC-to-DC stage comprising one or more DC-to-DC converters. The system comprises a stored energy controller coupled to the AC-to-DC stage, the stored energy controller configured to control the total amount of stored energy within the capacitors of the SST; a power flow controller coupled to the DC-to-AC stage, the power flow controller configured to control power flow in the SST; and one or more energy balancing controllers each coupled to a corresponding DC-to-DC converter, each energy balancing controller configured to balance energy in the corresponding DC-to-DC converter. In some embodiments, the stored energy controller, the power flow controller, and the one or more energy balancing controllers are decoupled from one another.
Type: Application
Filed: November 3, 2021
Publication date: December 14, 2023
Inventors: Glen Ghias FARIVAR, Howe Li YEO, Radhika SARDA, Fengjiao CUI, Abishek SETHUPANDI, Haonan TIAN, Madasamy Palvesha THEVAR, Brihadeeswara Sriram VAISAMBHAYANA, Anshuman TRIPATHI
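The three decoupled controllers can be pictured as independent loops, each acting on its own stage without sharing state. The PI form, gains, and interfaces in the sketch below are illustrative assumptions, not the filing's actual control laws.

```python
class StoredEnergyController:
    """Regulates total energy stored in the SST capacitors (illustrative PI loop
    acting on the AC-to-DC stage)."""
    def __init__(self, target_energy_j, kp=0.5, ki=0.05):
        self.target, self.kp, self.ki, self.integral = target_energy_j, kp, ki, 0.0

    def command(self, measured_energy_j, dt_s):
        error = self.target - measured_energy_j
        self.integral += error * dt_s
        return self.kp * error + self.ki * self.integral

class PowerFlowController:
    """Sets the DC-to-AC stage reference to track a requested power flow."""
    def command(self, requested_power_w):
        return requested_power_w

class EnergyBalancingController:
    """Balances energy inside one DC-to-DC converter by trimming its reference
    based on the primary/secondary energy imbalance."""
    def command(self, primary_energy_j, secondary_energy_j, gain=0.1):
        return gain * (primary_energy_j - secondary_energy_j)
```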
-
Publication number: 20230368779
Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
Type: Application
Filed: July 24, 2023
Publication date: November 16, 2023
Applicant: Google LLC
Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
-
Patent number: 11741947
Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
Type: Grant
Filed: March 23, 2021
Date of Patent: August 29, 2023
Assignee: Google LLC
Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
-
Publication number: 20230096805
Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on the target branch outputs and the predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
Type: Application
Filed: December 14, 2021
Publication date: March 30, 2023
Applicant: Google LLC
Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
-
Publication number: 20230089308
Abstract: A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio to generate a transcription of the utterances and a sequence of speaker turn tokens each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning a respective speaker label to each speaker segment clustered into the respective class that is different than the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
Type: Application
Filed: December 14, 2021
Publication date: March 23, 2023
Applicant: Google LLC
Inventors: Quan Wang, Han Lu, Evan Clark, Ignacio Lopez Moreno, Hasim Sak, Wei Xia, Taral Joglekar, Anshuman Tripathi
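The pipeline described above (speaker-turn tokens, then segments, then speaker-discriminative embeddings, then spectral clustering) can be sketched as follows, assuming per-frame embeddings, a token-to-frame alignment, and a "<turn>" marker token; scikit-learn's SpectralClustering stands in for the clustering step, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(transcript_tokens, frame_embeddings, token_frame_index, k):
    """Illustrative diarization post-processing.

    transcript_tokens: decoded tokens, with "<turn>" marking speaker turns.
    frame_embeddings: (num_frames, dim) speaker-discriminative frame vectors.
    token_frame_index: frame index aligned to each decoded token.
    k: number of speaker classes.
    """
    # Segment boundaries are the frames where speaker-turn tokens occur.
    boundaries = [0] + [token_frame_index[i]
                        for i, tok in enumerate(transcript_tokens) if tok == "<turn>"]
    boundaries.append(len(frame_embeddings))

    # One pooled speaker-discriminative embedding per speaker segment.
    segments = np.stack([frame_embeddings[s:e].mean(axis=0)
                         for s, e in zip(boundaries[:-1], boundaries[1:]) if e > s])

    # Cosine-similarity affinity followed by spectral clustering into k classes.
    normed = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)
    labels = SpectralClustering(n_clusters=k,
                                affinity="precomputed").fit_predict(affinity)
    return labels   # one speaker label per segment
```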
-
Publication number: 20230084758
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Application
Filed: November 15, 2022
Publication date: March 16, 2023
Applicant: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
-
Patent number: 11521595
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Grant
Filed: May 1, 2020
Date of Patent: December 6, 2022
Assignee: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
-
Publication number: 20220310097
Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
Type: Application
Filed: December 15, 2021
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
-
Publication number: 20220108689
Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
Type: Application
Filed: March 23, 2021
Publication date: April 7, 2022
Applicant: Google LLC
Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
-
Publication number: 20210343273
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Application
Filed: May 1, 2020
Publication date: November 4, 2021
Applicant: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak