Patents by Inventor Vincent Wan
Vincent Wan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 12327544
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Grant
Filed: November 11, 2022
Date of Patent: June 10, 2025
Assignee: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
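The two-model pipeline in this abstract can be illustrated with a minimal sketch. The class names, tensor shapes, and dimensions below (ProsodyEncoder, ProsodyDecoder, MEL_DIM, EMB_DIM, and so on) are assumptions made for illustration only, not the patented architecture; the sketch shows just the data flow: a first TTS model's intermediate output is encoded into an utterance embedding, which then conditions a decoder alongside the input text and a target-voice speaker embedding.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the patent does not specify sizes.
MEL_DIM, EMB_DIM, TEXT_DIM, SPK_DIM = 80, 128, 256, 64

class ProsodyEncoder(nn.Module):
    """Encodes the intermediate synthesized speech (e.g. mel frames)
    into a fixed-length utterance embedding carrying the intended prosody."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(MEL_DIM, EMB_DIM, batch_first=True)

    def forward(self, intermediate_mels):            # (B, T, MEL_DIM)
        _, h = self.rnn(intermediate_mels)           # h: (1, B, EMB_DIM)
        return h.squeeze(0)                          # utterance embedding (B, EMB_DIM)

class ProsodyDecoder(nn.Module):
    """Conditions on the input text, the utterance embedding (prosody),
    and a target-speaker embedding to produce output audio features."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(TEXT_DIM + EMB_DIM + SPK_DIM, 256, batch_first=True)
        self.to_mel = nn.Linear(256, MEL_DIM)

    def forward(self, text_enc, utt_emb, spk_emb):   # text_enc: (B, T_text, TEXT_DIM)
        T = text_enc.size(1)
        cond = torch.cat([utt_emb, spk_emb], dim=-1).unsqueeze(1).expand(-1, T, -1)
        out, _ = self.rnn(torch.cat([text_enc, cond], dim=-1))
        return self.to_mel(out)                      # (B, T_text, MEL_DIM)

# Data flow only: a first TTS model (not shown) renders the text with the
# intended prosody; the second model transfers that prosody to the target voice.
text_enc = torch.randn(1, 50, TEXT_DIM)             # encoded input text utterance
intermediate = torch.randn(1, 200, MEL_DIM)          # output of the first TTS model
spk_emb = torch.randn(1, SPK_DIM)                     # target-voice speaker embedding

utt_emb = ProsodyEncoder()(intermediate)
audio_features = ProsodyDecoder()(text_enc, utt_emb, spk_emb)
```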
-
Patent number: 12272349
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
Type: Grant
Filed: October 16, 2023
Date of Patent: April 8, 2025
Assignee: Google LLC
Inventors: Robert Clark, Chun-An Chan, Vincent Wan
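The claimed flow of this abstract (decode a prosodic syllable embedding with attention over the syllable's phoneme-level linguistic features, predict a duration, then unroll that duration into fixed-length frames) might be sketched as follows. All module names, feature sizes, and the 5 ms frame length are assumptions for illustration, not details from the patent.

```python
import torch
import torch.nn as nn

PHON_DIM, SYL_DIM, FRAME_MS = 32, 64, 5   # assumed sizes; not from the patent

class SyllableDurationDecoder(nn.Module):
    """Attends from a prosodic syllable embedding to the linguistic features
    of the syllable's phonemes, then predicts the syllable duration."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(SYL_DIM, num_heads=1, kdim=PHON_DIM,
                                          vdim=PHON_DIM, batch_first=True)
        self.duration = nn.Linear(SYL_DIM, 1)

    def forward(self, syl_emb, phoneme_feats):
        # syl_emb: (B, 1, SYL_DIM); phoneme_feats: (B, n_phonemes, PHON_DIM)
        context, _ = self.attn(syl_emb, phoneme_feats, phoneme_feats)
        dur_ms = torch.relu(self.duration(context)).squeeze(-1)   # (B, 1), in ms
        return dur_ms

def frames_for_duration(dur_ms: torch.Tensor, frame_ms: int = FRAME_MS) -> int:
    """Number of fixed-length frames implied by a predicted syllable duration."""
    return max(1, int(torch.round(dur_ms / frame_ms).item()))

# One syllable with three phonemes, conditioned on the selected utterance embedding
# (here folded into the prosodic syllable embedding for brevity).
syl_emb = torch.randn(1, 1, SYL_DIM)         # prosodic syllable embedding
phonemes = torch.randn(1, 3, PHON_DIM)       # phoneme-level linguistic features
dur = SyllableDurationDecoder()(syl_emb, phonemes)
n_frames = frames_for_duration(dur.squeeze())
```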
-
Patent number: 12260851
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
Type: Grant
Filed: July 14, 2021
Date of Patent: March 25, 2025
Assignee: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
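A rough sketch of the conditioning and training setup described here follows. The model, feature sizes, number of accents, and loss are illustrative assumptions only; the point is that the TTS system takes a speaker embedding and an accent/dialect identifier as conditioning inputs and is trained against synthesized speech of the target speaker in the second accent/dialect.

```python
import torch
import torch.nn as nn

SPK_DIM, ACC_DIM, TEXT_DIM, MEL_DIM = 64, 16, 256, 80   # assumed sizes
NUM_ACCENTS = 4                                          # hypothetical accent inventory

class ConditionedTTS(nn.Module):
    """Toy TTS stand-in conditioned on a speaker embedding and an
    accent/dialect identifier, as in the abstract."""
    def __init__(self):
        super().__init__()
        self.accent_table = nn.Embedding(NUM_ACCENTS, ACC_DIM)
        self.decoder = nn.GRU(TEXT_DIM + SPK_DIM + ACC_DIM, 256, batch_first=True)
        self.to_mel = nn.Linear(256, MEL_DIM)

    def forward(self, text_enc, speaker_emb, accent_id):
        T = text_enc.size(1)
        acc = self.accent_table(accent_id)                         # (B, ACC_DIM)
        cond = torch.cat([speaker_emb, acc], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, T, -1)
        out, _ = self.decoder(torch.cat([text_enc, cond], dim=-1))
        return self.to_mel(out)

# Training skeleton: targets are synthesized speech representations of the
# target speaker rendered in the second accent/dialect (produced elsewhere).
model = ConditionedTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
text_enc = torch.randn(8, 50, TEXT_DIM)             # encoded transcripts
speaker_emb = torch.randn(8, SPK_DIM)                # target-speaker embedding
accent_id = torch.full((8,), 1, dtype=torch.long)    # identifier of the second accent
target_mels = torch.randn(8, 50, MEL_DIM)            # synthesized cross-accent targets

pred = model(text_enc, speaker_emb, accent_id)
loss = nn.functional.l1_loss(pred, target_mels)
loss.backward()
opt.step()
```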
-
Publication number: 20250078808
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
Type: Application
Filed: November 15, 2024
Publication date: March 6, 2025
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Patent number: 12125469
Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
Type: Grant
Filed: October 17, 2023
Date of Patent: October 22, 2024
Assignee: Google LLC
Inventors: Rakesh Iyer, Vincent Wan
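The input assembly this abstract describes (frame-aligned prosodic features plus a hierarchical linguistic specification mapped to vocoder parameters) can be sketched as below. The feature dimensions, network shape, and output size are assumptions for illustration; the patent does not fix any of them.

```python
import torch
import torch.nn as nn

# Assumed feature sizes for illustration; not specified by the patent.
SENT_F, WORD_F, SYL_F, PHON_F = 8, 16, 16, 32     # linguistic specification levels
PROS_F = 3                                         # duration, pitch, energy per frame
VOC_OUT = 62                                       # e.g. spectral + aperiodicity + f0 params

class VocoderParamPredictor(nn.Module):
    """Maps frame-aligned prosodic features plus the broadcast linguistic
    specification to parametric vocoder parameters for each frame."""
    def __init__(self):
        super().__init__()
        in_dim = PROS_F + SENT_F + WORD_F + SYL_F + PHON_F
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, VOC_OUT),
        )

    def forward(self, prosody, sent, word, syl, phon):
        # All inputs are (B, T_frames, dim): sentence/word/syllable/phoneme
        # features are assumed to have been repeated onto their frames upstream.
        x = torch.cat([prosody, sent, word, syl, phon], dim=-1)
        return self.net(x)

frames = 120
prosody = torch.randn(1, frames, PROS_F)   # duration, pitch contour, energy contour
sent = torch.randn(1, frames, SENT_F)
word = torch.randn(1, frames, WORD_F)
syl = torch.randn(1, frames, SYL_F)
phon = torch.randn(1, frames, PHON_F)
vocoder_params = VocoderParamPredictor()(prosody, sent, word, syl, phon)
```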
-
Patent number: 12080272
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
Type: Grant
Filed: December 10, 2019
Date of Patent: September 3, 2024
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Publication number: 20240046915
Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
Type: Application
Filed: October 17, 2023
Publication date: February 8, 2024
Applicant: Google LLC
Inventors: Rakesh Iyer, Vincent Wan
-
Publication number: 20240038214
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
Type: Application
Filed: October 16, 2023
Publication date: February 1, 2024
Applicant: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 11830474
Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
Type: Grant
Filed: January 6, 2022
Date of Patent: November 28, 2023
Assignee: Google LLC
Inventors: Rakesh Iyer, Vincent Wan
-
Publication number: 20230064749
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Application
Filed: November 11, 2022
Publication date: March 2, 2023
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Publication number: 20230018384
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
Type: Application
Filed: July 14, 2021
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Publication number: 20220415306
Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
Type: Application
Filed: December 10, 2019
Publication date: December 29, 2022
Applicant: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 11514888
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Grant
Filed: August 13, 2020
Date of Patent: November 29, 2022
Assignee: Google LLC
Inventors: Lev Finkelstein, Chun-An Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Patent number: 11393453
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Grant
Filed: January 13, 2021
Date of Patent: July 19, 2022
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
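This variant of the prosody method predicts a syllable duration by encoding phoneme linguistic features together with the prosodic syllable embedding, then derives a pitch contour as fixed-length predicted pitch frames. The sketch below is a loose illustration of that flow; the module names, sizes, and 5 ms frame length are assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

PHON_DIM, SYL_DIM, FRAME_MS = 32, 64, 5    # assumed sizes; not from the patent

class SyllableProsodyPredictor(nn.Module):
    """Encodes a syllable's phoneme linguistic features together with its
    prosodic syllable embedding, predicts a duration, then emits one pitch
    value per fixed-length frame implied by that duration."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(PHON_DIM + SYL_DIM, SYL_DIM, batch_first=True)
        self.duration = nn.Linear(SYL_DIM, 1)
        self.pitch = nn.GRU(SYL_DIM, 1, batch_first=True)

    def forward(self, phoneme_feats, syl_emb):
        # phoneme_feats: (B, n_phonemes, PHON_DIM); syl_emb: (B, SYL_DIM)
        n = phoneme_feats.size(1)
        emb = syl_emb.unsqueeze(1).expand(-1, n, -1)
        _, h = self.encoder(torch.cat([phoneme_feats, emb], dim=-1))
        state = h.squeeze(0)                                    # (B, SYL_DIM)
        dur_ms = torch.relu(self.duration(state))               # predicted duration
        n_frames = max(1, int(torch.round(dur_ms / FRAME_MS).max().item()))
        # Unroll the syllable state into fixed-length predicted pitch frames.
        pitch_frames, _ = self.pitch(state.unsqueeze(1).repeat(1, n_frames, 1))
        return dur_ms, pitch_frames.squeeze(-1)                 # (B, 1), (B, n_frames)

phonemes = torch.randn(1, 3, PHON_DIM)      # phoneme-level linguistic features
syl_emb = torch.randn(1, SYL_DIM)            # prosodic syllable embedding
duration, pitch = SyllableProsodyPredictor()(phonemes, syl_emb)
```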
-
Publication number: 20220130371
Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
Type: Application
Filed: January 6, 2022
Publication date: April 28, 2022
Applicant: Google LLC
Inventors: Rakesh Iyer, Vincent Wan
-
Publication number: 20220051654
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Application
Filed: August 13, 2020
Publication date: February 17, 2022
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Patent number: 11232780
Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
Type: Grant
Filed: September 26, 2020
Date of Patent: January 25, 2022
Assignee: Google LLC
Inventors: Rakesh Iyer, Vincent Wan
-
Publication number: 20210134266
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Application
Filed: January 13, 2021
Publication date: May 6, 2021
Applicant: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Patent number: 10923107
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Grant
Filed: April 12, 2019
Date of Patent: February 16, 2021
Assignee: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan
-
Publication number: 20190348020
Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
Type: Application
Filed: April 12, 2019
Publication date: November 14, 2019
Applicant: Google LLC
Inventors: Robert Clark, Chun-an Chan, Vincent Wan