Patents by Inventor Hank Liao
Hank Liao has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20250252960
Abstract: A method includes obtaining a series of segmented labeled training samples. Each respective segmented labeled training sample includes one or more spoken terms spoken during a conversation by multiple speakers. Each respective spoken term is characterized by a corresponding sequence of acoustic frames and is paired with a corresponding transcription and a corresponding speaker label. For each respective segmented labeled training sample, the method includes obtaining a corresponding dynamic audio cohort associated with an immediately prior segmented labeled training sample, generating diarization results that include a corresponding speech recognition result having one or more predicted terms, and training a joint speech recognition and speaker diarization model based on a loss derived from the generated diarization results, the corresponding transcriptions, and the corresponding speaker labels.
Type: Application
Filed: January 31, 2025
Publication date: August 7, 2025
Applicant: Google LLC
Inventors: Quan Wang, Yiling Huang, Guanlong Zhao, Hank Liao, Weiran Wang
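As a rough illustration of the joint training objective described in this abstract, the sketch below wires a toy encoder to a term head and a speaker head and combines the two losses. The module names, tensor shapes, the way the dynamic audio cohort is injected, and the 0.5 loss weight are assumptions made for the example, not details disclosed in the application.

```python
# Toy sketch of jointly training a speech recognition + speaker diarization
# model on segmented labeled samples. Shapes, module names, the cohort
# handling, and the 0.5 loss weight are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SPEAKERS, FRAME_DIM, HIDDEN = 100, 4, 80, 128

class JointAsrDiarizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FRAME_DIM, HIDDEN, batch_first=True)
        self.term_head = nn.Linear(HIDDEN, VOCAB)        # predicted terms
        self.speaker_head = nn.Linear(HIDDEN, SPEAKERS)  # speaker labels

    def forward(self, frames, cohort):
        # Condition the sample on an embedding carried over from the
        # immediately prior segment (stand-in for the dynamic audio cohort).
        enc, _ = self.encoder(frames)
        enc = enc + cohort.unsqueeze(1)
        return self.term_head(enc), self.speaker_head(enc)

model = JointAsrDiarizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One segmented labeled training sample: acoustic frames paired with a
# transcription (term ids) and per-frame speaker labels.
frames = torch.randn(2, 50, FRAME_DIM)
terms = torch.randint(0, VOCAB, (2, 50))
speakers = torch.randint(0, SPEAKERS, (2, 50))
cohort = torch.randn(2, HIDDEN)

opt.zero_grad()
term_logits, spk_logits = model(frames, cohort)
loss = (F.cross_entropy(term_logits.reshape(-1, VOCAB), terms.reshape(-1))
        + 0.5 * F.cross_entropy(spk_logits.reshape(-1, SPEAKERS),
                                speakers.reshape(-1)))
loss.backward()
opt.step()
```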
-
Patent number: 12334054
Abstract: A method (400) includes receiving audio data (112) corresponding to an utterance (101) spoken by a user (10), receiving video data (114) representing motion of lips of the user while the user was speaking the utterance, and obtaining multiple candidate transcriptions (135) for the utterance based on the audio data. For each candidate transcription of the multiple candidate transcriptions, the method also includes generating a synthesized speech representation (145) of the corresponding candidate transcription and determining an agreement score (155) indicating a likelihood that the synthesized speech representation matches the motion of the lips of the user while the user speaks the utterance. The method also includes selecting one of the multiple candidate transcriptions for the utterance as a speech recognition output (175) based on the agreement scores determined for the multiple candidate transcriptions for the utterance.
Type: Grant
Filed: November 18, 2019
Date of Patent: June 17, 2025
Assignee: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
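The selection step in this abstract, picking the candidate whose synthesized speech best agrees with the observed lip motion, can be pictured as a simple rescoring loop. The synthesize() and lip_agreement() callables below are hypothetical placeholders for a text-to-speech system and an audio-visual matching model; they are not APIs from the patent.

```python
# Illustrative rescoring of candidate transcriptions against lip motion.
from typing import Any, Callable, Sequence

def select_transcription(
    candidates: Sequence[str],
    lip_motion: Any,
    synthesize: Callable[[str], Any],
    lip_agreement: Callable[[Any, Any], float],
) -> str:
    """Return the candidate whose synthesized speech best matches the lips."""
    best_text, best_score = "", float("-inf")
    for text in candidates:
        synth = synthesize(text)                  # synthesized speech representation
        score = lip_agreement(synth, lip_motion)  # agreement score
        if score > best_score:
            best_text, best_score = text, score
    return best_text                              # speech recognition output

# Toy usage with dummy stand-ins: here the longer candidate scores higher.
print(select_transcription(
    ["turn on the lights", "turn on the light"],
    lip_motion=None,
    synthesize=lambda text: text,
    lip_agreement=lambda synth, lips: float(len(synth)),
))
```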
-
Publication number: 20250173461
Abstract: A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
Type: Application
Filed: January 30, 2025
Publication date: May 29, 2025
Applicant: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
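A minimal sketch of the pipeline this abstract describes is shown below, assuming hypothetical helpers for image-based speaker identification, redaction, and transcription. How the privacy condition is actually enforced is an assumption for illustration, not taken from the patent.

```python
# Sketch of privacy-aware transcription: per-segment speaker identification
# (using image data) gates whether a privacy condition is applied before
# transcription. identify_speaker(), apply_privacy(), and transcribe() are
# hypothetical stand-ins, not an API from the patent.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    audio: bytes
    image: bytes  # synchronized image data used for face/lip based speaker ID

def privacy_aware_transcript(
    segments: List[Segment],
    private_participant: str,
    identify_speaker: Callable[[Segment], str],
    apply_privacy: Callable[[Segment], Segment],
    transcribe: Callable[[Segment], str],
) -> str:
    lines = []
    for seg in segments:
        speaker = identify_speaker(seg)       # image-assisted identity
        if speaker == private_participant:    # participant requested privacy
            seg = apply_privacy(seg)          # e.g. mute or redact the segment
        lines.append(f"{speaker}: {transcribe(seg)}")
    return "\n".join(lines)

# Toy usage with placeholder callables.
segs = [Segment(audio=b"abc", image=b"a"), Segment(audio=b"xyz", image=b"b")]
print(privacy_aware_transcript(
    segs,
    private_participant="alice",
    identify_speaker=lambda s: "alice" if s.image == b"a" else "bob",
    apply_privacy=lambda s: Segment(audio=b"", image=s.image),  # drop audio
    transcribe=lambda s: "[redacted]" if not s.audio else "hello there",
))
```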
-
Publication number: 20250118292
Abstract: A method includes obtaining labeled training data including a plurality of spoken terms spoken during a conversation. For each respective spoken term, the method includes generating a corresponding sequence of intermediate audio encodings from a corresponding sequence of acoustic frames, generating a corresponding sequence of final audio encodings from the corresponding sequence of intermediate audio encodings, generating a corresponding speech recognition result, and generating a respective speaker token representing a predicted identity of a speaker for each corresponding speech recognition result. The method also includes training the joint speech recognition and speaker diarization model jointly based on a first loss derived from the generated speech recognition results and the corresponding transcriptions and a second loss derived from the generated speaker tokens and the corresponding speaker labels.
Type: Application
Filed: September 20, 2024
Publication date: April 10, 2025
Applicant: Google LLC
Inventors: Yiling Huang, Weiran Wang, Quan Wang, Guanlong Zhao, Hank Liao, Han Lu
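The two-term objective in this abstract can be illustrated with a small sketch: a first cross-entropy loss over speech recognition results versus reference transcriptions, and a second over predicted speaker tokens versus speaker labels. The combining weight lam and the tensor shapes are assumptions made for the example.

```python
# Minimal sketch of the two-loss joint objective: the first loss compares
# speech recognition results with reference transcriptions, the second
# compares predicted speaker tokens with reference speaker labels.
import torch
import torch.nn.functional as F

def joint_loss(term_logits, ref_terms, speaker_logits, ref_speakers, lam=1.0):
    asr_loss = F.cross_entropy(term_logits, ref_terms)        # first loss
    spk_loss = F.cross_entropy(speaker_logits, ref_speakers)  # second loss
    return asr_loss + lam * spk_loss

# One training sample: 10 spoken terms, a 50-word vocabulary, 4 speakers.
term_logits = torch.randn(10, 50)
ref_terms = torch.randint(0, 50, (10,))
speaker_logits = torch.randn(10, 4)
ref_speakers = torch.randint(0, 4, (10,))
print(joint_loss(term_logits, ref_terms, speaker_logits, ref_speakers))
```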
-
Patent number: 12242648
Abstract: A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
Type: Grant
Filed: December 11, 2023
Date of Patent: March 4, 2025
Assignee: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
-
Patent number: 12118123
Abstract: A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
Type: Grant
Filed: November 18, 2019
Date of Patent: October 15, 2024
Assignee: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
-
Publication number: 20240104247
Abstract: A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
Type: Application
Filed: December 11, 2023
Publication date: March 28, 2024
Applicant: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
-
Publication number: 20220392439
Abstract: A method (400) includes receiving audio data (112) corresponding to an utterance (101) spoken by a user (10), receiving video data (114) representing motion of lips of the user while the user was speaking the utterance, and obtaining multiple candidate transcriptions (135) for the utterance based on the audio data. For each candidate transcription of the multiple candidate transcriptions, the method also includes generating a synthesized speech representation (145) of the corresponding candidate transcription and determining an agreement score (155) indicating a likelihood that the synthesized speech representation matches the motion of the lips of the user while the user speaks the utterance. The method also includes selecting one of the multiple candidate transcriptions for the utterance as a speech recognition output (175) based on the agreement scores determined for the multiple candidate transcriptions for the utterance.
Type: Application
Filed: November 18, 2019
Publication date: December 8, 2022
Applicant: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
-
Publication number: 20220382907
Abstract: A method for a privacy-aware transcription includes receiving an audio-visual signal including audio data and image data for a speech environment and a privacy request from a participant in the speech environment, where the privacy request indicates a privacy condition of the participant. The method further includes segmenting the audio data into a plurality of segments. For each segment, the method includes determining an identity of a speaker of a corresponding segment of the audio data based on the image data and determining whether the identity of the speaker of the corresponding segment includes the participant associated with the privacy condition. When the identity of the speaker of the corresponding segment includes the participant, the method includes applying the privacy condition to the corresponding segment. The method also includes processing the plurality of segments of the audio data to determine a transcript for the audio data.
Type: Application
Filed: November 18, 2019
Publication date: December 1, 2022
Applicant: Google LLC
Inventors: Olivier Siohan, Takaki Makino, Richard Rose, Otavio Braga, Hank Liao, Basilio Garcia Castillo
-
Publication number: 20180174576
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for large vocabulary continuous speech recognition. One method includes receiving audio data representing an utterance of a speaker. Acoustic features of the audio data are provided to a recurrent neural network trained using connectionist temporal classification to estimate likelihoods of occurrence of whole words based on acoustic feature input. Output of the recurrent neural network generated in response to the acoustic features is received. The output indicates a likelihood of occurrence for each of multiple different words in a vocabulary. A transcription for the utterance is generated based on the output of the recurrent neural network. The transcription is provided as output of the automated speech recognition system.
Type: Application
Filed: December 7, 2017
Publication date: June 21, 2018
Inventors: Hagen Soltau, Hasim Sak, Hank Liao
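As a rough sketch of the whole-word approach described here, the example below runs a recurrent network over acoustic features, emits per-frame likelihoods over a small word vocabulary plus a CTC blank, and greedily collapses them into a transcription. The toy vocabulary, model, and shapes are assumptions; an untrained model will of course produce arbitrary output.

```python
# Greedy whole-word CTC decoding sketch: an RNN emits per-frame likelihoods
# over a word vocabulary plus a blank symbol; repeats are collapsed and
# blanks removed to form the transcription. All names and sizes are
# illustrative assumptions, not the system described in the application.
import torch
import torch.nn as nn

WORDS = ["<blank>", "hello", "world", "speech", "recognition"]
FRAME_DIM, HIDDEN = 40, 64

class WordCtcRnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FRAME_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, len(WORDS))  # whole-word outputs + blank

    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h)  # per-frame word likelihoods

def greedy_ctc_decode(logits):
    ids = logits.argmax(dim=-1).tolist()
    words, prev = [], None
    for i in ids:
        if i != prev and i != 0:   # collapse repeats, drop blank (index 0)
            words.append(WORDS[i])
        prev = i
    return " ".join(words)

model = WordCtcRnn()                         # untrained, so output is arbitrary
feats = torch.randn(1, 20, FRAME_DIM)        # acoustic features for an utterance
print(greedy_ctc_decode(model(feats)[0]))    # transcription from whole-word output
```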