Patents by Inventor Ye Jia

Ye Jia has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Robust direct speech-to-speech translation

Patent number: 11960852

Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the different second language.

Type: Grant

Filed: December 15, 2021

Date of Patent: April 16, 2024

Assignee: Google LLC

Inventors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Publication number: 20240112667

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Type: Application

Filed: November 30, 2023

Publication date: April 4, 2024

Applicant: Google LLC

Inventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick An Phu Nguyen

COMMUNICATION SYSTEM FOR IMPROVING SPEECH

Publication number: 20240079024

Abstract: The invention provides a communication system and method including a processor; a computer-readable medium connected to the processor, and a set of instructions on the computer-readable medium. A speech reception unit is executable by the processor to receive input speech in the form of an input signal derived from a sound wave generated by a microphone that includes input language. A speech processing system connected to the speech reception unit is executable by the processor to modify the input signal to an output signal wherein the input speech in the input signal is modified to output speech in the output signal. A speech output unit connected to the speech processing system is executable by the processor to provide an output of the output signal.

Type: Application

Filed: September 4, 2023

Publication date: March 7, 2024

Applicant: Tomato AI, Inc.

Inventors: Ye JIA, James J. FAN, Ofer RONEN

ACOUSTIC FENCE

Publication number: 20240071356

Abstract: For online audio/video conferencing applications deployed in an open office environment, using shared conference devices, it can be advantageous to define an acoustic fence. A non-participant audio received from outside the acoustic fence can be considered noise and filtered out before transmission of an audio signal to a far end recipient. Three suppression stages are used to filter the non-participant audio. The first suppression stage uses beamformers for suppression. The second suppression stage is mask-based, and the third suppression stage is reference-based. The three suppression stages filter out non-participant audio signals, having a wide range of frequencies.

Type: Application

Filed: August 29, 2022

Publication date: February 29, 2024

Inventors: Zhenghang Gu, Zhaofeng Jia, Qiyong Liu, Ye Wang, Zexian Wu, Chunyu Zhang

Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

Publication number: 20240062743

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence.

Type: Application

Filed: October 31, 2023

Publication date: February 22, 2024

Applicant: Google LLC

Inventors: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

Parallel tacotron non-autoregressive and controllable TTS

Patent number: 11908448

Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.

Type: Grant

Filed: May 21, 2021

Date of Patent: February 20, 2024

Assignee: Google LLC

Inventors: Isaac Elias, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu, Byungha Chun
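
The training objective described in the abstract above (a spectrogram loss plus a phoneme duration loss) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which distance is used or how the two terms are weighted, so the L1 distance, the averaging over predicted sequences, and the `duration_weight` parameter are all hypothetical choices, as are the function names.

```python
from typing import Sequence

Frames = Sequence[Sequence[float]]  # one mel spectrogram: frames x mel bins


def mean_abs_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute (L1) difference between two equal-length sequences."""
    if len(pred) != len(ref):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def spectrogram_loss(pred: Frames, ref: Frames) -> float:
    """L1 loss between a predicted and a reference spectrogram (assumed metric)."""
    if len(pred) != len(ref):
        raise ValueError("spectrograms must have the same number of frames")
    return sum(mean_abs_error(p, r) for p, r in zip(pred, ref)) / len(pred)


def training_loss(
    predicted_mels: Sequence[Frames],
    reference_mel: Frames,
    predicted_durations: Sequence[float],
    reference_durations: Sequence[float],
    duration_weight: float = 1.0,  # hypothetical weighting, not from the abstract
) -> float:
    """Combine the spectrogram loss with the phoneme duration loss."""
    # Average the loss over the "one or more" predicted spectrogram sequences.
    spec = sum(spectrogram_loss(m, reference_mel) for m in predicted_mels)
    spec /= len(predicted_mels)
    dur = mean_abs_error(predicted_durations, reference_durations)
    return spec + duration_weight * dur


# Toy example: one predicted spectrogram with 2 frames and 2 mel bins.
predicted_mels = [[[1.0, 2.0], [3.0, 4.0]]]
reference_mel = [[1.0, 2.5], [3.0, 3.5]]
total = training_loss(predicted_mels, reference_mel, [2.0, 3.0], [2.0, 4.0])
print(total)
```

In a real system both terms would be computed on tensors inside an automatic-differentiation framework; plain Python lists are used here only to keep the sketch self-contained.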

Synthesis of speech from text in a voice of a target speaker using neural networks

Patent number: 11848002

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Type: Grant

Filed: July 19, 2022

Date of Patent: December 19, 2023

Assignee: Google LLC

Inventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick An Phu Nguyen

Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

Patent number: 11823656

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence.

Type: Grant

Filed: May 21, 2021

Date of Patent: November 21, 2023

Assignee: Google LLC

Inventors: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

MULTILINGUAL SPEECH SYNTHESIS AND CROSS-LANGUAGE VOICE CLONING

Publication number: 20230178068

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

Type: Application

Filed: January 30, 2023

Publication date: June 8, 2023

Applicant: Google LLC

Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

Two-Level Speech Prosody Transfer

Publication number: 20230064749

Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

Type: Application

Filed: November 11, 2022

Publication date: March 2, 2023

Applicant: Google LLC

Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan

Multilingual speech synthesis and cross-language voice cloning

Patent number: 11580952

Abstract: A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying specific voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones a voice of the target speaker. The target speaker includes a native speaker of a second language different than the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.

Type: Grant

Filed: April 22, 2020

Date of Patent: February 14, 2023

Assignee: Google LLC

Inventors: Yu Zhang, Ron J. Weiss, Byungha Chun, Yonghui Wu, Zhifeng Chen, Russell John Wyatt Skerry-Ryan, Ye Jia, Andrew M. Rosenberg, Bhuvana Ramabhadran

Robust Direct Speech-to-Speech Translation

Publication number: 20230013777

Abstract: A direct speech-to-speech translation (S2ST) model includes an encoder configured to receive an input speech representation that corresponds to an utterance spoken by a source speaker in a first language and encode the input speech representation into a hidden feature representation. The S2ST model also includes an attention module configured to generate a context vector that attends to the hidden representation encoded by the encoder. The S2ST model also includes a decoder configured to receive the context vector generated by the attention module and predict a phoneme representation that corresponds to a translation of the utterance in a second different language. The S2ST model also includes a synthesizer configured to receive the context vector and the phoneme representation and generate a translated synthesized speech representation that corresponds to a translation of the utterance spoken in the different second language.

Type: Application

Filed: December 15, 2021

Publication date: January 19, 2023

Applicant: Google LLC

Inventors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

Two-level speech prosody transfer

Patent number: 11514888

Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

Type: Grant

Filed: August 13, 2020

Date of Patent: November 29, 2022

Assignee: Google LLC

Inventors: Lev Finkelstein, Chun-An Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan

Synthesis of Speech from Text in a Voice of a Target Speaker Using Neural Networks

Publication number: 20220351713

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Type: Application

Filed: July 19, 2022

Publication date: November 3, 2022

Applicant: Google LLC

Inventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick An Phu Nguyen

Synthesis of speech from text in a voice of a target speaker using neural networks

Patent number: 11488575

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Type: Grant

Filed: May 17, 2019

Date of Patent: November 1, 2022

Assignee: Google LLC

Inventors: Ye Jia, Zhifeng Chen, Yonghui Wu, Jonathan Shen, Ruoming Pang, Ron J. Weiss, Ignacio Lopez Moreno, Fei Ren, Yu Zhang, Quan Wang, Patrick Nguyen

Phonemes And Graphemes for Neural Text-to-Speech

Publication number: 20220310059

Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.

Type: Application

Filed: December 10, 2021

Publication date: September 29, 2022

Applicant: Google LLC

Inventors: Ye Jia, Byungha Chun, Yu Zhang, Jonathan Shen, Yonghui Wu
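
The word-level pairing of phoneme tokens with grapheme tokens described in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that every token carries the index of the word it came from; the `Token` layout and the function name are hypothetical and are not taken from the application, which does not say how the correspondence is stored.

```python
from typing import Dict, List, Tuple

# A token is (symbol, word_index). The word index is an assumed bookkeeping
# field that records which word of the input text the token belongs to.
Token = Tuple[str, int]


def align_phonemes_to_graphemes(
    grapheme_tokens: List[Token], phoneme_tokens: List[Token]
) -> List[int]:
    """For each phoneme token, return the index of the grapheme token
    that represents the same word."""
    grapheme_for_word: Dict[int, int] = {}
    for index, (_, word) in enumerate(grapheme_tokens):
        # Keep the first grapheme token seen for each word.
        grapheme_for_word.setdefault(word, index)
    return [grapheme_for_word[word] for _, word in phoneme_tokens]


# Toy input: "hello world" as two grapheme tokens and eight phoneme tokens.
grapheme_tokens = [("hello", 0), ("world", 1)]
phoneme_tokens = [
    ("HH", 0), ("AH", 0), ("L", 0), ("OW", 0),
    ("W", 1), ("ER", 1), ("L", 1), ("D", 1),
]
alignment = align_phonemes_to_graphemes(grapheme_tokens, phoneme_tokens)
print(alignment)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

An encoder could then use such an alignment to relate each phoneme token to its grapheme token when producing the output encoder embedding; how that relationship is modeled is not specified in the abstract.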

Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

Publication number: 20220301543

Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence.

Type: Application

Filed: May 21, 2021

Publication date: September 22, 2022

Applicant: Google LLC

Inventors: Isaac Elias, Byungha Chun, Jonathan Shen, Ye Jia, Yu Zhang, Yonghui Wu

Building a text-to-speech system from a small amount of speech data

Patent number: 11335321

Abstract: A method of building a text-to-speech (TTS) system from a small amount of speech data includes receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training a TTS model using the first plurality of recorded speech samples from the assortment of speakers. Here, the trained TTS model is configured to output synthetic speech as an audible representation of a text input. The method also includes re-training the trained TTS model using the second plurality of recorded speech samples from the target speaker combined with the first plurality of recorded speech samples from the assortment of speakers. Here, the re-trained TTS model is configured to output synthetic speech resembling speaking characteristics of the target speaker.

Type: Grant

Filed: August 28, 2020

Date of Patent: May 17, 2022

Assignee: Google LLC

Inventors: Ye Jia, Byungha Chun, Yusuke Oda, Norman Casagrande, Tejas Iyer, Fan Luo, Russell John Wyatt Skerry-Ryan, Jonathan Shen, Yonghui Wu, Yu Zhang

Parallel Tacotron Non-Autoregressive and Controllable TTS

Publication number: 20220122582

Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.

Type: Application

Filed: May 21, 2021

Publication date: April 21, 2022

Applicant: Google LLC

Inventors: Isaac Elias, Jonathan Shen, Yu Zhang, Ye Jia, Ron J. Weiss, Yonghui Wu, Byungha Chun

TEXT-TO-SPEECH USING DURATION PREDICTION

Publication number: 20220108680

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.

Type: Application

Filed: October 1, 2021

Publication date: April 7, 2022

Inventors: Yu Zhang, Isaac Elias, Byungha Chun, Ye Jia, Yonghui Wu, Mike Chrzanowski, Jonathan Shen
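
The upsampling step in the abstract above (expanding per-element representations to intermediate time steps according to predicted durations) can be sketched in its simplest form, which repeats each representation for its predicted number of frames. This is an assumed simplification for illustration only: the abstract does not say how the upsampling is performed, the durations are taken here as already rounded to whole frames, and the function name is hypothetical.

```python
from typing import List, Sequence


def upsample_by_duration(
    encoded: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each element's representation for its predicted number of frames."""
    if len(encoded) != len(durations):
        raise ValueError("need exactly one duration per encoded element")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encoded, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that output frames do not share list objects.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three text elements with 2-dimensional representations.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # predicted frames per element (already integers)
frames = upsample_by_duration(encoded, durations)
print(len(frames))  # 5
```

The length of the intermediate sequence equals the sum of the predicted durations, which is what lets a non-autoregressive model fix the output length before generating audio.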