Patents by Inventor Vincent Wan

Vincent Wan has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

  • Patent number: 12327544
    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
    Type: Grant
    Filed: November 11, 2022
    Date of Patent: June 10, 2025
    Assignee: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
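The two-stage pipeline in this abstract (a first TTS model that realizes the intended prosody, then a second TTS model whose encoder distills that prosody into an utterance embedding and whose decoder renders the target voice) can be sketched as follows. This is a minimal illustrative mock, not the patented implementation; all function names, dimensions, and the mean-pooling "encoder" are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_tts(text: str) -> np.ndarray:
    """Stand-in for the first TTS model: maps text to an intermediate
    synthesized speech representation carrying the intended prosody
    (faked here as an 80-dim mel-like frame sequence)."""
    n_frames = max(1, len(text) // 2)
    return rng.standard_normal((n_frames, 80))

def prosody_encoder(intermediate: np.ndarray) -> np.ndarray:
    """Encoder portion of the second TTS model: compresses the
    intermediate representation into a fixed-size utterance embedding
    that specifies the intended prosody (mean-pooling placeholder)."""
    return intermediate.mean(axis=0)

def decoder(text: str, utterance_embedding: np.ndarray,
            speaker_embedding: np.ndarray) -> np.ndarray:
    """Decoder portion: conditions on the input text, the prosody
    embedding, and target-voice speaker characteristics to produce an
    output audio signal (a dummy waveform here)."""
    n_samples = 160 * len(text)
    bias = utterance_embedding.mean() + speaker_embedding.mean()
    return bias + 0.01 * rng.standard_normal(n_samples)

text = "Hello there!"
intermediate = first_tts(text)              # stage 1: prosody realized
u_emb = prosody_encoder(intermediate)       # stage 2a: encode prosody
audio = decoder(text, u_emb, np.zeros(64))  # stage 2b: render target voice
```

The key design point the abstract describes is the decoupling: prosody is transferred through the utterance embedding alone, so the decoder can pair it with a different voice's speaker characteristics.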
  • Patent number: 12272349
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
    Type: Grant
    Filed: October 16, 2023
    Date of Patent: April 8, 2025
    Assignee: Google LLC
    Inventors: Robert Clark, Chun-An Chan, Vincent Wan
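The per-syllable procedure in this abstract (attend over each phoneme's linguistic features using the prosodic syllable embedding, decode a duration, then emit fixed-length frames covering that duration) can be sketched roughly as below. All names, feature sizes, and the dot-product attention form are illustrative assumptions, not details from the patent.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_syllable_duration(phoneme_feats, syllable_emb, w):
    """Attend over per-phoneme linguistic features with the prosodic
    syllable embedding as the query, then decode a scalar duration."""
    scores = phoneme_feats @ syllable_emb       # (n_phonemes,)
    attn = softmax(scores)                      # attention weights
    context = attn @ phoneme_feats              # weighted phoneme summary
    return float(np.exp(context @ w))           # positive duration (s)

def frames_for_syllable(duration_s, frame_len_s=0.005):
    """Generate a plurality of fixed-length frames for the duration."""
    n = max(1, int(round(duration_s / frame_len_s)))
    return np.zeros((n, 1))  # placeholder frame contents

rng = np.random.default_rng(1)
feats = rng.standard_normal((3, 8))   # 3 phonemes, 8 linguistic features
syl_emb = rng.standard_normal(8)      # prosodic syllable embedding
w = rng.standard_normal(8) * 0.1
dur = predict_syllable_duration(feats, syl_emb, w)
frames = frames_for_syllable(dur)
```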
  • Patent number: 12260851
    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
    Type: Grant
    Filed: July 14, 2021
    Date of Patent: March 25, 2025
    Assignee: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
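The training loop this abstract describes (for each training signal in the first accent/dialect, generate a synthesized version in the second accent/dialect, then train the TTS system on the transcript paired with that synthesized speech) might be outlined as follows. The accent-conversion and training steps are empty placeholders; every name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Example:
    audio: list          # training audio signal (target speaker, accent A)
    transcript: str      # corresponding transcript

def synthesize_in_accent(audio, target_accent: str):
    """Stand-in for cross-accent synthesis: produce the utterance as if
    spoken by the same speaker in the second accent/dialect."""
    return list(audio)   # placeholder: identity transform

def train_step(tts_params, transcript, synthesized):
    """Placeholder training step pairing the transcript with the
    accent-converted synthesized speech representation."""
    tts_params["steps"] = tts_params.get("steps", 0) + 1
    return tts_params

data = [Example([0.0] * 100, "hello"), Example([0.0] * 80, "world")]
tts = {}
for ex in data:
    synth = synthesize_in_accent(ex.audio, target_accent="en-GB")
    tts = train_step(tts, ex.transcript, synth)

# At inference, conditioning inputs select the voice and accent:
conditioning = {"speaker_embedding": [0.0] * 64, "accent_id": "en-GB"}
```

The point of the scheme is data augmentation: the trained system never needs real recordings of the target speaker in the second accent, only synthesized ones.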
  • Publication number: 20250078808
    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
    Type: Application
    Filed: November 15, 2024
    Publication date: March 6, 2025
    Applicant: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
  • Patent number: 12125469
    Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
    Type: Grant
    Filed: October 17, 2023
    Date of Patent: October 22, 2024
    Assignee: Google LLC
    Inventors: Rakesh Iyer, Vincent Wan
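The input structure this abstract describes (prosodic features plus a hierarchical linguistic specification, fed to a vocoder model that predicts vocoder parameters) can be sketched with a single linear layer standing in for the model. The feature dimensions, the 13-parameter output, and all names are illustrative assumptions, not the patent's actual architecture.

```python
import numpy as np

def predict_vocoder_params(prosody, spec, W, b):
    """Concatenate prosodic features (duration, pitch contour, energy
    contour) with sentence/word/syllable/phoneme-level linguistic
    features, then map to vocoder parameters with one linear layer
    (a stand-in for the vocoder model)."""
    x = np.concatenate([
        [prosody["duration"]],
        prosody["pitch_contour"],
        prosody["energy_contour"],
        spec["sentence"], spec["word"], spec["syllable"], spec["phoneme"],
    ])
    return W @ x + b

rng = np.random.default_rng(2)
prosody = {"duration": 1.2,
           "pitch_contour": rng.standard_normal(10),
           "energy_contour": rng.standard_normal(10)}
spec = {"sentence": rng.standard_normal(4),
        "word": rng.standard_normal(4),
        "syllable": rng.standard_normal(4),
        "phoneme": rng.standard_normal(4)}
dim_in = 1 + 10 + 10 + 16                    # 37 input features
W = rng.standard_normal((13, dim_in)) * 0.1  # e.g. 13 vocoder parameters
b = np.zeros(13)
params = predict_vocoder_params(prosody, spec, W, b)
```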
  • Patent number: 12080272
    Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
    Type: Grant
    Filed: December 10, 2019
    Date of Patent: September 3, 2024
    Assignee: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
  • Publication number: 20240046915
    Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
    Type: Application
    Filed: October 17, 2023
    Publication date: February 8, 2024
    Applicant: Google LLC
    Inventors: Rakesh Iyer, Vincent Wan
  • Publication number: 20240038214
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by decoding a prosodic syllable embedding for the syllable based on attention by an attention mechanism to linguistic features of each phoneme of the syllable and generating a plurality of fixed-length predicted frames based on the predicted duration for the syllable.
    Type: Application
    Filed: October 16, 2023
    Publication date: February 1, 2024
    Applicant: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
  • Patent number: 11830474
    Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
    Type: Grant
    Filed: January 6, 2022
    Date of Patent: November 28, 2023
    Assignee: Google LLC
    Inventors: Rakesh Iyer, Vincent Wan
  • Publication number: 20230064749
    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
    Type: Application
    Filed: November 11, 2022
    Publication date: March 2, 2023
    Applicant: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
  • Publication number: 20230018384
    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
    Type: Application
    Filed: July 14, 2021
    Publication date: January 19, 2023
    Applicant: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
  • Publication number: 20220415306
    Abstract: A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
    Type: Application
    Filed: December 10, 2019
    Publication date: December 29, 2022
    Applicant: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
  • Patent number: 11514888
    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
    Type: Grant
    Filed: August 13, 2020
    Date of Patent: November 29, 2022
    Assignee: Google LLC
    Inventors: Lev Finkelstein, Chun-An Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
  • Patent number: 11393453
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
    Type: Grant
    Filed: January 13, 2021
    Date of Patent: July 19, 2022
    Assignee: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
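This variant of the prosody method predicts a duration, then a pitch contour conditioned on that duration, then fixed-length pitch frames that each represent part of the contour. A rough placeholder sketch, with every function body an illustrative stand-in rather than the patented model:

```python
import numpy as np

def predict_duration(phoneme_feats, syllable_emb):
    """Encode phoneme linguistic features together with the prosodic
    syllable embedding (sum-then-average placeholder)."""
    enc = phoneme_feats.sum(axis=0) + syllable_emb
    return float(abs(enc.mean())) + 0.05     # keep duration positive

def predict_pitch_contour(duration_s, base_hz=120.0):
    """Placeholder pitch contour conditioned on the predicted duration."""
    t = np.linspace(0.0, duration_s, num=max(10, int(duration_s / 0.005)))
    return base_hz + 10.0 * np.sin(2 * np.pi * t / max(duration_s, 1e-6))

def pitch_frames(contour, frame_len=5):
    """Chop the contour into fixed-length predicted pitch frames; each
    frame represents part of the predicted pitch contour."""
    n = len(contour) // frame_len
    return contour[: n * frame_len].reshape(n, frame_len)

rng = np.random.default_rng(3)
feats = rng.standard_normal((3, 8))   # 3 phonemes, 8 linguistic features
syl_emb = rng.standard_normal(8)      # prosodic syllable embedding
dur = predict_duration(feats, syl_emb)
contour = predict_pitch_contour(dur)
frames = pitch_frames(contour)
```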
  • Publication number: 20220130371
    Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
    Type: Application
    Filed: January 6, 2022
    Publication date: April 28, 2022
    Applicant: Google LLC
    Inventors: Rakesh Iyer, Vincent Wan
  • Publication number: 20220051654
    Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
    Type: Application
    Filed: August 13, 2020
    Publication date: February 17, 2022
    Applicant: Google LLC
    Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
  • Patent number: 11232780
    Abstract: A method for predicting parametric vocoder parameters includes receiving a text utterance having one or more words, each word having one or more syllables, and each syllable having one or more phonemes. The method also includes receiving, as input to a vocoder model, prosodic features that represent an intended prosody for the text utterance and a linguistic specification. The prosodic features include a duration, pitch contour, and energy contour for the text utterance, while the linguistic specification includes sentence-level linguistic features, word-level linguistic features for each word, syllable-level linguistic features for each syllable, and phoneme-level linguistic features for each phoneme. The method also includes predicting vocoder parameters based on the prosodic features and the linguistic specification.
    Type: Grant
    Filed: September 26, 2020
    Date of Patent: January 25, 2022
    Assignee: Google LLC
    Inventors: Rakesh Iyer, Vincent Wan
  • Publication number: 20210134266
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
    Type: Application
    Filed: January 13, 2021
    Publication date: May 6, 2021
    Applicant: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
  • Patent number: 10923107
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
    Type: Grant
    Filed: April 12, 2019
    Date of Patent: February 16, 2021
    Assignee: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan
  • Publication number: 20190348020
    Abstract: A method for representing an intended prosody in synthesized speech includes receiving a text utterance having at least one word, and selecting an utterance embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with a corresponding prosodic syllable embedding for the syllable; predicting a pitch contour of the syllable based on the predicted duration for the syllable; and generating a plurality of fixed-length predicted pitch frames based on the predicted duration for the syllable. Each fixed-length predicted pitch frame represents part of the predicted pitch contour of the syllable.
    Type: Application
    Filed: April 12, 2019
    Publication date: November 14, 2019
    Applicant: Google LLC
    Inventors: Robert Clark, Chun-an Chan, Vincent Wan