Patents by Inventor Hasim Sak
Hasim Sak has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20230096805
Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on the target branch outputs and the predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
Type: Application
Filed: December 14, 2021
Publication date: March 30, 2023
Applicant: Google LLC
Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
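For readers who want a concrete picture, the sketch below mocks up the two branches in PyTorch: a shared encoder produces target-branch outputs that are held fixed with a stop-gradient, while the augmentation branch encodes a perturbed copy of the audio and learns to predict those targets. The GRU encoder, additive-noise augmentation, temporal subsampling, and cosine loss are illustrative assumptions, not the patented design.

```python
# Illustrative sketch only: module choices, shapes, and the augmentation are
# assumptions, not the patented architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBranches(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        # One shared audio encoder serves both branches.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # The augmentation branch predicts the target-branch outputs.
        self.predictor = nn.Linear(hidden, hidden)

    def forward(self, audio):
        # Target branch: encode the clean samples, modify time characteristics
        # (here, simple 2x temporal subsampling), and stop gradients.
        target, _ = self.encoder(audio)
        target = target[:, ::2, :].detach()
        # Augmentation branch: encode an augmented view and predict the targets.
        augmented = audio + 0.01 * torch.randn_like(audio)  # placeholder augmentation
        aug_enc, _ = self.encoder(augmented)
        pred = self.predictor(aug_enc[:, ::2, :])
        return target, pred

def unsupervised_loss(target, pred):
    # Negative cosine similarity between predictions and target-branch outputs.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

model = SiameseBranches()
audio = torch.randn(4, 100, 80)        # a batch of unlabeled log-mel sequences
loss = unsupervised_loss(*model(audio))
loss.backward()                        # updates the shared audio encoder
```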
-
Publication number: 20230089308
Abstract: A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio signal to generate a transcription of the utterances and a sequence of speaker turn tokens each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning a respective speaker label to each speaker segment clustered into the respective class that is different from the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
Type: Application
Filed: December 14, 2021
Publication date: March 23, 2023
Applicant: Google LLC
Inventors: Quan Wang, Han Lu, Evan Clark, Ignacio Lopez Moreno, Hasim Sak, Wei Xia, Taral Joglekar, Anshuman Tripathi
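Below is a minimal sketch of the segmentation-and-clustering stage, assuming the upstream ASR model has already emitted speaker turn tokens that split the signal into segments; the cosine affinity measure and the embedding dimensionality are placeholders.

```python
# Illustrative only: the turn-token ASR front end is mocked; embedding size and
# the cosine affinity are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings, k):
    """Spectral-cluster per-segment speaker embeddings into k speaker classes."""
    affinity = np.clip(np.inner(embeddings, embeddings), 0.0, None)  # non-negative cosine
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(affinity)

# Pretend the speaker turn tokens split the audio into five segments, each with
# a (hypothetical) 256-dim speaker-discriminative embedding.
segments = np.random.randn(5, 256)
segments /= np.linalg.norm(segments, axis=1, keepdims=True)
print(cluster_segments(segments, k=2))  # one speaker label per segment
```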
-
Publication number: 20230084758
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Application
Filed: November 15, 2022
Publication date: March 16, 2023
Applicant: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
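One way to read the masking loss: outside the window where a speaker is known to be active, that speaker's masked embedding should carry no energy. The sketch below penalizes the squared magnitude of the embedding in the disallowed region; the frame indexing and the L2 penalty are assumptions made for illustration.

```python
# Illustrative only: frame rate, shapes, and the L2 penalty are assumptions.
import torch

def masking_loss(masked_emb, start_frame, end_frame, spoke_before_overlap):
    """masked_emb: (T, D) masked audio embedding for one speaker."""
    if spoke_before_overlap:
        # Speaker was talking before the known start time, so their embedding
        # should vanish after the known end time.
        region = masked_emb[end_frame:]
    else:
        # Speaker started after the known end time, so mask before the start.
        region = masked_emb[:start_frame]
    return region.pow(2).mean()

emb = torch.randn(200, 128, requires_grad=True)  # 200 frames, 128-dim embedding
loss = masking_loss(emb, start_frame=80, end_frame=120, spoke_before_overlap=True)
loss.backward()
```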
-
Patent number: 11521595
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Grant
Filed: May 1, 2020
Date of Patent: December 6, 2022
Assignee: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
-
Publication number: 20220310097
Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
Type: Application
Filed: December 15, 2021
Publication date: September 29, 2022
Applicant: Google LLC
Inventors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
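The abstract names the three standard transducer components, which the following PyTorch sketch wires together; the layer choices and sizes are placeholders, and the self-alignment training objective itself is not shown.

```python
# Illustrative only: layer types and sizes are placeholders, and the
# self-alignment objective is not implemented here.
import torch
import torch.nn as nn

class Transducer(nn.Module):
    def __init__(self, feat=80, vocab=32, hidden=256):
        super().__init__()
        self.audio_encoder = nn.LSTM(feat, hidden, batch_first=True)
        self.label_encoder = nn.Embedding(vocab, hidden)  # stand-in label encoder
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab + 1))                 # +1 for the blank symbol

    def forward(self, frames, labels):
        enc, _ = self.audio_encoder(frames)   # (B, T, H) higher order features
        dec = self.label_encoder(labels)      # (B, U, H) dense representation
        # The joint network scores every (frame, label-history) pair.
        t, u = enc.size(1), dec.size(1)
        pairs = torch.cat([enc.unsqueeze(2).expand(-1, -1, u, -1),
                           dec.unsqueeze(1).expand(-1, t, -1, -1)], dim=-1)
        return self.joint(pairs)              # (B, T, U, vocab + 1) logits

logits = Transducer()(torch.randn(2, 50, 80), torch.randint(0, 32, (2, 10)))
print(logits.shape)  # torch.Size([2, 50, 10, 33])
```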
-
Publication number: 20220262350
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
Type: Application
Filed: May 3, 2022
Publication date: August 18, 2022
Applicant: Google LLC
Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
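A hedged sketch of training a CTC acoustic model with PyTorch's built-in CTC loss follows; the two-stage, context-dependent state inventory described in the abstract is not reproduced, and the state count is a placeholder.

```python
# Illustrative only: the state count is a placeholder and the context-dependent
# inventory construction from approximate alignments is not shown.
import torch
import torch.nn as nn

num_states = 42                                          # hypothetical CD-state inventory size
model = nn.LSTM(80, num_states + 1, batch_first=True)    # +1 output for the CTC blank
ctc = nn.CTCLoss(blank=num_states)

frames = torch.randn(1, 120, 80)                         # one utterance, 120 frames
log_probs = model(frames)[0].log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTCLoss
targets = torch.randint(0, num_states, (1, 30))          # CD-state targets, no fixed alignment
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([120]), target_lengths=torch.tensor([30]))
loss.backward()
```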
-
Patent number: 11341958
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
Type: Grant
Filed: September 16, 2020
Date of Patent: May 24, 2022
Assignee: Google LLC
Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
-
Publication number: 20220108689
Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
Type: Application
Filed: March 23, 2021
Publication date: April 7, 2022
Applicant: Google LLC
Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
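The streaming/non-streaming trade-off hinges on how much future audio each transformer layer may attend to. A small sketch of such a look-ahead attention mask follows, where the initial stack would use zero look-ahead and the final stack a variable window; the mask convention (True means "may attend") and the window sizes are assumptions.

```python
# Illustrative only: mask convention (True = may attend) and window sizes are
# assumptions.
import torch

def look_ahead_mask(num_frames, look_ahead):
    """Allow frame i to attend to frames j with j <= i + look_ahead."""
    i = torch.arange(num_frames).unsqueeze(1)
    j = torch.arange(num_frames).unsqueeze(0)
    return j <= i + look_ahead

print(look_ahead_mask(6, look_ahead=0).int())  # initial stack: zero look-ahead
print(look_ahead_mask(6, look_ahead=2).int())  # final stack: variable look-ahead
```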
-
Publication number: 20210343273
Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
Type: Application
Filed: May 1, 2020
Publication date: November 4, 2021
Applicant: Google LLC
Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
-
Publication number: 20210134275
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representations of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
Type: Application
Filed: January 8, 2021
Publication date: May 6, 2021
Applicant: Google LLC
Inventors: Hasim Sak, Andrew W. Senior
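The distinctive step here is feeding the network's output for the preceding time step back in as part of the next input. Below is a minimal PyTorch sketch of that loop, with concatenation as the (assumed) way of forming the modified input.

```python
# Illustrative only: concatenation as the "modified input" and all sizes are
# assumptions.
import torch
import torch.nn as nn

feat, out_dim, hidden = 80, 40, 128
cell = nn.LSTMCell(feat + out_dim, hidden)
readout = nn.Linear(hidden, out_dim)

features = torch.randn(50, feat)          # one acoustic sequence, 50 time steps
prev_out = torch.zeros(out_dim)           # no previous output at the initial step
h = c = torch.zeros(1, hidden)
outputs = []
for t in range(features.size(0)):
    # Modified input: current acoustic features + output for the preceding step.
    modified = torch.cat([features[t], prev_out]).unsqueeze(0)
    h, c = cell(modified, (h, c))
    prev_out = readout(h).squeeze(0)
    outputs.append(prev_out)
phoneme_scores = torch.stack(outputs)     # basis for the phoneme representation
```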
-
Patent number: 10923112
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representations of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
Type: Grant
Filed: December 5, 2019
Date of Patent: February 16, 2021
Assignee: Google LLC
Inventors: Hasim Sak, Andrew W. Senior
-
Publication number: 20210005184
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
Type: Application
Filed: September 16, 2020
Publication date: January 7, 2021
Applicant: Google LLC
Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
-
Publication number: 20200365142
Abstract: Methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step. The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.
Type: Application
Filed: May 28, 2020
Publication date: November 19, 2020
Inventors: Hasim Sak, Sean Matthew Shannon
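The placeholder label acts like a "no new output" symbol: steps that emit it contribute nothing to the transcription. A toy illustration with made-up labels:

```python
# Toy illustration: the labels and the placeholder symbol are made up.
PLACEHOLDER = "<ph>"  # a label that does not classify any acoustic data
step_outputs = ["<ph>", "h", "<ph>", "e", "l", "<ph>", "l", "o", "<ph>"]

# Steps that emit the placeholder add nothing to the transcription.
transcription = "".join(label for label in step_outputs if label != PLACEHOLDER)
print(transcription)  # -> "hello"
```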
-
Publication number: 20200335093
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for acoustic modeling of audio data. One method includes receiving audio data representing a portion of an utterance, providing the audio data to a trained recurrent neural network that has been trained to indicate the occurrence of a phone at any of multiple time frames within a maximum delay of receiving audio data corresponding to the phone, receiving, within the predetermined maximum delay of providing the audio data to the trained recurrent neural network, output of the trained neural network indicating a phone corresponding to the provided audio data, using output of the trained neural network to determine a transcription for the utterance, and providing the transcription for the utterance.
Type: Application
Filed: July 1, 2020
Publication date: October 22, 2020
Applicant: Google LLC
Inventors: Andrew W. Senior, Hasim Sak, Kanury Kanishka Rao
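The delay constraint can be pictured as a window of frames in which the model is still allowed to flag a phone. A small illustration follows, with made-up frame indices:

```python
# Toy illustration: frame indices and the delay value are made up.
def allowed_emission_frames(phone_end_frame, max_delay, num_frames):
    """Frames at which flagging the phone still satisfies the delay bound."""
    return list(range(phone_end_frame,
                      min(phone_end_frame + max_delay + 1, num_frames)))

print(allowed_emission_frames(phone_end_frame=12, max_delay=3, num_frames=100))
# -> [12, 13, 14, 15]; emitting after frame 15 would exceed the maximum delay
```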
-
Patent number: 10803855
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
Type: Grant
Filed: January 25, 2019
Date of Patent: October 13, 2020
Assignee: Google LLC
Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
-
Patent number: 10783900
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
Type: Grant
Filed: September 8, 2015
Date of Patent: September 22, 2020
Assignee: Google LLC
Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
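Below is a compact sketch of the CNN + LSTM + fully connected stack the abstract names (often called a CLDNN); the kernel size, layer counts, and dimensions are assumptions.

```python
# Illustrative only: kernel size, layer counts, and dimensions are assumptions.
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    def __init__(self, feat=40, hidden=128, out_dim=42):
        super().__init__()
        self.conv = nn.Conv1d(feat, hidden, kernel_size=3, padding=1)  # CNN layer
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)          # LSTM layer
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, out_dim))            # fully connected

    def forward(self, x):                                  # x: (B, T, feat)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # convolve over time
        x, _ = self.lstm(x)
        return self.fc(x)                                  # per-frame output scores

scores = CLDNN()(torch.randn(2, 100, 40))
print(scores.shape)  # torch.Size([2, 100, 42])
```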
-
Patent number: 10733979
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for acoustic modeling of audio data. One method includes receiving audio data representing a portion of an utterance, providing the audio data to a trained recurrent neural network that has been trained to indicate the occurrence of a phone at any of multiple time frames within a maximum delay of receiving audio data corresponding to the phone, receiving, within the predetermined maximum delay of providing the audio data to the trained recurrent neural network, output of the trained neural network indicating a phone corresponding to the provided audio data, using output of the trained neural network to determine a transcription for the utterance, and providing the transcription for the utterance.
Type: Grant
Filed: October 9, 2015
Date of Patent: August 4, 2020
Assignee: Google LLC
Inventors: Andrew W. Senior, Hasim Sak, Kanury Kanishka Rao
-
Patent number: 10706840
Abstract: Methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step. The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.
Type: Grant
Filed: December 19, 2017
Date of Patent: July 7, 2020
Assignee: Google LLC
Inventors: Hasim Sak, Sean Matthew Shannon
-
Publication number: 20200135227
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
Type: Application
Filed: December 31, 2019
Publication date: April 30, 2020
Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
-
Publication number: 20200118552
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representations of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
Type: Application
Filed: December 5, 2019
Publication date: April 16, 2020
Applicant: Google LLC
Inventors: Hasim Sak, Andrew W. Senior