Patents by Inventor Hasim Sak

Hasim Sak has filed for patents to protect the following inventions. This listing includes both pending patent applications and patents already granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 11961515
    Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on target branch outputs and predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
    Type: Grant
    Filed: December 14, 2021
    Date of Patent: April 16, 2024
    Assignee: Google LLC
    Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
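
The abstract above describes a two-branch (Siamese) unsupervised training setup. The following is a minimal, illustrative numpy sketch of that flow; the toy encoder, noise augmentation, time pooling, predictor, and contrastive loss are stand-in assumptions, not the patented implementation.

```python
# Hypothetical, minimal numpy sketch of the two-branch (Siamese) pretraining loop
# described in the abstract above. Encoder, augmentation, and loss are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def encoder(frames, W):
    # Toy "audio encoder": a single linear projection per frame.
    return frames @ W

def augment(frames):
    # Toy augmentation: additive noise (the patent does not specify this choice).
    return frames + 0.05 * rng.standard_normal(frames.shape)

def modify_time(enc):
    # Toy time modification on the target branch: average-pool pairs of frames.
    T = enc.shape[0] // 2 * 2
    return enc[:T].reshape(-1, 2, enc.shape[1]).mean(axis=1)

def contrastive_loss(targets, predictions):
    # Frame-wise contrastive loss: each prediction should match its own target
    # (cosine similarity) more closely than any other target in the sequence.
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    p = predictions / np.linalg.norm(predictions, axis=1, keepdims=True)
    sims = p @ t.T                                    # (T, T) similarity matrix
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Unlabeled audio samples: 20 frames of 40-dim features, no transcriptions.
frames = rng.standard_normal((20, 40))
W = rng.standard_normal((40, 16)) * 0.1               # encoder parameters

# Target branch: encode, then modify time characteristics.
target_outputs = modify_time(encoder(frames, W))

# Augmentation branch: augment, encode, then predict the target-branch outputs.
aug_outputs = modify_time(encoder(augment(frames), W))
predictions = aug_outputs @ (rng.standard_normal((16, 16)) * 0.1 + np.eye(16))

loss = contrastive_loss(target_outputs, predictions)
print(f"unsupervised loss term: {loss:.3f}")          # would drive encoder updates
```
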
  • Publication number: 20230410796
    Abstract: Methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step. The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.
    Type: Application
    Filed: September 1, 2023
    Publication date: December 21, 2023
    Applicant: GOOGLE LLC
    Inventors: Hasim Sak, Sean Matthew Shannon
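
The method above centers on a recurrent network that conditions each time step on the label it emitted at the previous step, with a placeholder ("blank") label in the output set. Below is a minimal greedy-decoding sketch of that loop in numpy; the toy RNN cell, label set, and random weights are assumptions for illustration only.

```python
# Hypothetical numpy sketch of the decoding loop in the abstract above: a recurrent
# network that, at each time step, conditions on the label emitted at the previous
# step and chooses from a label set that includes a placeholder ("blank") label.
import numpy as np

rng = np.random.default_rng(1)

LABELS = ["<blank>", "a", "b", "c"]          # placeholder label plus linguistic units
H, D, V = 8, 5, len(LABELS)

# Toy parameters (random stand-ins for a trained model).
Wx = rng.standard_normal((D, H)) * 0.3       # acoustic scores -> hidden
Wy = rng.standard_normal((V, H)) * 0.3       # previous label (one-hot) -> hidden
Wh = rng.standard_normal((H, H)) * 0.3       # recurrence
Wo = rng.standard_normal((H, V)) * 0.3       # hidden -> label scores

def step(x, prev_label, h):
    y = np.zeros(V); y[prev_label] = 1.0     # embed previous output label
    h = np.tanh(x @ Wx + y @ Wy + h @ Wh)
    return h, int(np.argmax(h @ Wo))         # greedy label for this time step

# Acoustic data (scores) for a series of time steps.
acoustic = rng.standard_normal((12, D))

h, prev = np.zeros(H), 0                     # start from the placeholder label
outputs = []
for x in acoustic:
    h, prev = step(x, prev, h)
    outputs.append(prev)

# Post-process the output sequence: drop placeholder labels to form a transcription.
transcription = "".join(LABELS[i] for i in outputs if i != 0)
print("raw outputs:   ", [LABELS[i] for i in outputs])
print("transcription: ", transcription)
```
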
  • Publication number: 20230368779
    Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
    Type: Application
    Filed: July 24, 2023
    Publication date: November 16, 2023
    Applicant: Google LLC
    Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
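
The key architectural detail in this abstract is the split audio encoder: an initial stack of transformer layers with zero look-ahead and a final stack with variable look-ahead. The sketch below shows the corresponding attention masks under assumed frame counts and context sizes; the actual layer split and context values are not taken from the patent.

```python
# Hypothetical sketch of the audio-encoder attention masking described above: an
# initial stack of transformer layers sees no future frames (zero look-ahead) while
# a final stack sees a configurable amount of right context (variable look-ahead).
import numpy as np

def attention_mask(num_frames: int, look_ahead: int) -> np.ndarray:
    """True where frame i may attend to frame j."""
    idx = np.arange(num_frames)
    return idx[None, :] <= idx[:, None] + look_ahead

T = 6
zero_lookahead_mask = attention_mask(T, look_ahead=0)   # initial transformer stack
variable_mask = attention_mask(T, look_ahead=2)          # final stack, e.g. 2 frames

print(zero_lookahead_mask.astype(int))   # strictly causal: streaming-friendly
print(variable_mask.astype(int))         # limited future context: better accuracy

# The same weights can then run in streaming mode (look_ahead=0 everywhere) or
# non-streaming mode (larger look_ahead in the final stack), which is the
# "unifying" idea in the abstract; the exact configuration is an assumption here.
```
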
  • Patent number: 11776531
    Abstract: Methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step. The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.
    Type: Grant
    Filed: May 28, 2020
    Date of Patent: October 3, 2023
    Assignee: Google LLC
    Inventors: Hasim Sak, Sean Matthew Shannon
  • Patent number: 11769493
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
    Type: Grant
    Filed: May 3, 2022
    Date of Patent: September 26, 2023
    Assignee: Google LLC
    Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
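
This abstract builds on CTC acoustic models; the concrete algorithmic piece underlying them is the CTC label-collapsing rule used when turning per-frame outputs into a transcription. A minimal greedy-decoding sketch of that rule follows; the label inventory and posteriors are toy stand-ins, and the context-dependent state bootstrapping described in the abstract is not reproduced.

```python
# Hypothetical sketch of greedy CTC decoding: pick the best label per frame,
# collapse consecutive repeats, then drop the blank label.
import numpy as np

def ctc_greedy_decode(frame_posteriors, labels, blank=0):
    """Best label per frame -> collapse repeats -> remove blanks."""
    best = frame_posteriors.argmax(axis=1)
    collapsed = [int(l) for i, l in enumerate(best) if i == 0 or l != best[i - 1]]
    return [labels[l] for l in collapsed if l != blank]

labels = ["<blank>", "k", "ae", "t"]          # toy phonetic inventory
rng = np.random.default_rng(2)
posteriors = rng.random((10, len(labels)))    # stand-in for CTC model outputs
print(ctc_greedy_decode(posteriors, labels))
```
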
  • Patent number: 11741947
    Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
    Type: Grant
    Filed: March 23, 2021
    Date of Patent: August 29, 2023
    Assignee: Google LLC
    Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
  • Patent number: 11721327
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representation of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
    Type: Grant
    Filed: January 8, 2021
    Date of Patent: August 8, 2023
    Assignee: Google LLC
    Inventors: Hasim Sak, Andrew W. Senior
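
This abstract describes forming each step's input from the current acoustic features together with the network's output for the preceding step. A minimal numpy sketch of that feedback loop is below; the single-layer model, dimensions, and zero initial output are illustrative assumptions.

```python
# Hypothetical numpy sketch of the feedback loop in the abstract above: at every time
# step after the first, the network input is formed from the current acoustic feature
# vector together with the network's output for the preceding time step.
import numpy as np

rng = np.random.default_rng(3)
D, P = 6, 4                                   # feature dim, phoneme-score dim

W = rng.standard_normal((D + P, P)) * 0.3     # toy acoustic modeling network

def model(modified_input):
    return np.tanh(modified_input @ W)        # stand-in for a deep network

features = rng.standard_normal((8, D))        # acoustic feature representations

outputs = []
prev_output = np.zeros(P)                     # no previous output at the initial step
for x in features:
    modified_input = np.concatenate([x, prev_output])
    prev_output = model(modified_input)
    outputs.append(prev_output)

# The per-step outputs would be combined into a phoneme representation of the utterance.
phoneme_representation = np.stack(outputs)
print(phoneme_representation.shape)           # (time steps, phoneme scores)
```
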
  • Patent number: 11715486
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance.
    Type: Grant
    Filed: December 31, 2019
    Date of Patent: August 1, 2023
    Assignee: Google LLC
    Inventors: Tara N. Sainath, Andrew W. Senior, Oriol Vinyals, Hasim Sak
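
The layer ordering named in this abstract, convolutional layers followed by LSTM layers followed by fully connected layers, can be sketched directly. The PyTorch module below is an illustrative stand-in with assumed layer sizes, not the model from the patent.

```python
# Hypothetical PyTorch sketch of a CNN -> LSTM -> fully-connected acoustic model,
# mirroring the layer ordering named in the abstract above. Sizes are illustrative.
import torch
import torch.nn as nn

class CLDNNAcousticModel(nn.Module):
    def __init__(self, feature_dim=40, num_outputs=42):
        super().__init__()
        self.conv = nn.Sequential(                      # convolutional layer(s)
            nn.Conv1d(feature_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(                        # fully connected layers
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_outputs),                # per-frame output scores
        )

    def forward(self, features):                        # (batch, time, feature_dim)
        x = self.conv(features.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.fc(x)

model = CLDNNAcousticModel()
frames = torch.randn(1, 100, 40)                        # 100 frames of 40-dim features
print(model(frames).shape)                              # torch.Size([1, 100, 42])
```
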
  • Publication number: 20230096805
    Abstract: A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of the contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on target branch outputs and predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
    Type: Application
    Filed: December 14, 2021
    Publication date: March 30, 2023
    Applicant: Google LLC
    Inventors: Jaeyoung Kim, Soheil Khorram, Hasim Sak, Anshuman Tripathi, Han Lu, Qian Zhang
  • Publication number: 20230089308
    Abstract: A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio signal to generate a transcription of the utterances and a sequence of speaker turn tokens each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning a respective speaker label to each speaker segment clustered into the respective class that is different from the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
    Type: Application
    Filed: December 14, 2021
    Publication date: March 23, 2023
    Applicant: Google LLC
    Inventors: Quan Wang, Han Lu, Evan Clark, Ignacio Lopez Moreno, Hasim Sak, Wei Xia, Taral Joglekar, Anshuman Tripathi
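
The pipeline in this abstract (speaker-turn tokens to segment the audio, an embedding per segment, spectral clustering into k speakers) can be outlined as follows. The transcript, the random embedding function, the "<st>" token name, and k=2 are assumptions for illustration; scikit-learn's SpectralClustering stands in for the clustering step.

```python
# Hypothetical sketch of the diarization flow in the abstract above: speaker-turn
# tokens split the recognized audio into segments, a speaker-discriminative embedding
# is extracted per segment, and spectral clustering groups the segments into k speakers.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(4)

# Transcript with speaker-turn tokens (<st>) as produced by the recognizer (assumed).
words = ["hi", "there", "<st>", "hello", "how", "are", "you", "<st>", "fine", "thanks"]

# Segment boundaries come from the speaker-turn token positions.
segments, current = [], []
for w in words:
    if w == "<st>":
        segments.append(current); current = []
    else:
        current.append(w)
segments.append(current)

def speaker_embedding(segment):
    # Stand-in for a speaker-discriminative embedding model (e.g., a d-vector network).
    return rng.standard_normal(16)

embeddings = np.stack([speaker_embedding(s) for s in segments])

k = 2                                           # assumed number of speakers
labels = SpectralClustering(n_clusters=k, random_state=0).fit_predict(embeddings)
for seg, spk in zip(segments, labels):
    print(f"speaker {spk}: {' '.join(seg)}")
```
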
  • Publication number: 20230084758
    Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
    Type: Application
    Filed: November 15, 2022
    Publication date: March 16, 2023
    Applicant: Google LLC
    Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
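
The conditional placement of the masking loss relative to the known overlap boundaries is the core of this abstract (and of the related granted patent 11521595 below). The numpy sketch that follows illustrates that rule with toy embeddings and an assumed L2 penalty; the actual loss form is not specified here.

```python
# Hypothetical numpy sketch of the masking loss described above: given a known overlap
# region [start, end), the masked embedding for the first speaker is penalised after
# the overlap ends if that speaker spoke before the overlap, and before the overlap
# starts otherwise. The embeddings and the L2 penalty are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)

T, D = 20, 8
masked_embedding_spk1 = rng.standard_normal((T, D))   # per-frame masked embedding

overlap_start, overlap_end = 8, 12                     # known start/end of the overlap
spk1_spoke_before_overlap = True

if spk1_spoke_before_overlap:
    region = masked_embedding_spk1[overlap_end:]       # apply loss after the known end
else:
    region = masked_embedding_spk1[:overlap_start]     # apply loss before the known start

masking_loss = float(np.mean(region ** 2))             # push the masked region toward zero
print(f"masking loss for speaker 1: {masking_loss:.3f}")
```
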
  • Patent number: 11521595
    Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
    Type: Grant
    Filed: May 1, 2020
    Date of Patent: December 6, 2022
    Assignee: Google LLC
    Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
  • Publication number: 20220310097
    Abstract: A streaming speech recognition model includes an audio encoder configured to receive a sequence of acoustic frames and generate a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder configured to receive a sequence of non-blank symbols output by a final softmax layer and generate a dense representation. The streaming speech recognition model also includes a joint network configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame left from a reference forced-alignment frame.
    Type: Application
    Filed: December 15, 2021
    Publication date: September 29, 2022
    Applicant: Google LLC
    Inventors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak
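
The training trick described here encourages an alignment one frame to the left of the model's own reference forced alignment. The numpy sketch below illustrates that shift with toy per-frame label probabilities; the integration with the full transducer loss is omitted, and the loss form shown is an assumption.

```python
# Hypothetical numpy sketch of the self-alignment idea described above: take the model's
# own best alignment of labels to frames, shift each emission one frame to the left, and
# add a loss term that raises the probability of that earlier path, nudging the streaming
# model toward lower prediction delay. Probabilities and the alignment are toy values.
import numpy as np

rng = np.random.default_rng(6)

T, U = 10, 4                                   # frames, output labels
log_probs = np.log(rng.dirichlet(np.ones(U + 1), size=T))   # per-frame label log-probs

# Reference forced alignment from the model itself: frame at which each label is emitted.
self_alignment = np.array([2, 5, 7, 9])

# Encourage the path that emits every label one frame earlier (clipped at frame 0).
shifted = np.clip(self_alignment - 1, 0, T - 1)
delay_loss = -np.sum(log_probs[shifted, np.arange(U)])

print(f"self-alignment delay loss: {delay_loss:.3f}")   # added to the transducer loss
```
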
  • Publication number: 20220262350
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
    Type: Application
    Filed: May 3, 2022
    Publication date: August 18, 2022
    Applicant: Google LLC
    Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
  • Patent number: 11341958
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
    Type: Grant
    Filed: September 16, 2020
    Date of Patent: May 24, 2022
    Assignee: Google LLC
    Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak
  • Publication number: 20220108689
    Abstract: A transformer-transducer model for unifying streaming and non-streaming speech recognition includes an audio encoder, a label encoder, and a joint network. The audio encoder receives a sequence of acoustic frames, and generates, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame. The label encoder receives a sequence of non-blank symbols output by a final softmax layer, and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack of transformer layers trained with zero look ahead audio context, and a final stack of transformer layers trained with a variable look ahead audio context.
    Type: Application
    Filed: March 23, 2021
    Publication date: April 7, 2022
    Applicant: Google LLC
    Inventors: Anshuman Tripathi, Hasim Sak, Han Lu, Qian Zhang, Jaeyoung Kim
  • Publication number: 20210343273
    Abstract: A method for training a speech recognition model with a loss function includes receiving an audio signal including a first segment corresponding to audio spoken by a first speaker, a second segment corresponding to audio spoken by a second speaker, and an overlapping region where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding for each of the first and second speakers. The method also includes applying a masking loss after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
    Type: Application
    Filed: May 1, 2020
    Publication date: November 4, 2021
    Applicant: Google LLC
    Inventors: Anshuman Tripathi, Han Lu, Hasim Sak
  • Publication number: 20210134275
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representation of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
    Type: Application
    Filed: January 8, 2021
    Publication date: May 6, 2021
    Applicant: Google LLC
    Inventors: Hasim Sak, Andrew W. Senior
  • Patent number: 10923112
    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating representation of acoustic sequences. One of the methods includes: receiving an acoustic sequence, the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; processing the acoustic feature representation at an initial time step using an acoustic modeling neural network; for each subsequent time step of the plurality of time steps: receiving an output generated by the acoustic modeling neural network for a preceding time step, generating a modified input from the output generated by the acoustic modeling neural network for the preceding time step and the acoustic representation for the time step, and processing the modified input using the acoustic modeling neural network to generate an output for the time step; and generating a phoneme representation for the utterance from the outputs for each of the time steps.
    Type: Grant
    Filed: December 5, 2019
    Date of Patent: February 16, 2021
    Assignee: Google LLC
    Inventors: Hasim Sak, Andrew W. Senior
  • Publication number: 20210005184
    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
    Type: Application
    Filed: September 16, 2020
    Publication date: January 7, 2021
    Applicant: Google LLC
    Inventors: Kanury Kanishka Rao, Andrew W. Senior, Hasim Sak