Patents by Inventor Robert Andrew James Clark
Robert Andrew James Clark has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20240127791
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Application
Filed: November 21, 2023
Publication date: April 18, 2024
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao
-
Patent number: 11881210
Abstract: A method for generating a prosodic representation includes receiving a text utterance having one or more words. Each word has at least one syllable having at least one phoneme. The method also includes generating, using a Bidirectional Encoder Representations from Transformers (BERT) model, a sequence of wordpiece embeddings and selecting an utterance embedding for the text utterance, the utterance embedding representing an intended prosody. Each wordpiece embedding is associated with one of the one or more words of the text utterance. For each syllable, using the selected utterance embedding and a prosody model that incorporates the BERT model, the method also includes generating a corresponding prosodic syllable embedding for the syllable based on the wordpiece embedding associated with the word that includes the syllable and predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with the corresponding prosodic syllable embedding for the syllable.
Type: Grant
Filed: May 5, 2020
Date of Patent: January 23, 2024
Assignee: Google LLC
Inventors: Tom Marius Kenter, Manish Kumar Sharma, Robert Andrew James Clark, Aliaksei Severyn
-
Patent number: 11862142
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Grant
Filed: August 2, 2021
Date of Patent: January 2, 2024
Assignee: Google LLC
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao
-
Patent number: 11790274
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
Type: Grant
Filed: October 26, 2022
Date of Patent: October 17, 2023
Assignee: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
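Illustrative sketch (not taken from the patent): the abstract above describes an embedding split into a sequence of partitions, with an initial training stage for the first partition and an incremental stage for each later one. The Python below only shows the bookkeeping such a scheme might need. The function names, the choice of consecutive dimension ranges, and the idea of zeroing out later partitions during an earlier stage are all assumptions made for illustration; the abstract does not specify them.

```python
def partition_dims(embedding_size, partition_sizes):
    """Split dimension indices 0..embedding_size-1 into consecutive partitions.

    Consecutive ranges are an assumption; the abstract only says each
    partition includes one or more dimensions of the embedding.
    """
    if sum(partition_sizes) != embedding_size:
        raise ValueError("partition sizes must sum to the embedding size")
    parts, start = [], 0
    for size in partition_sizes:
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def mask_after(embedding, parts, k):
    """Keep the dimensions of partitions 0..k and zero the rest.

    Hypothetical helper: masking later partitions is one plausible way to
    restrict a training stage to the partitions seen so far.
    """
    keep = {d for part in parts[: k + 1] for d in part}
    return [v if i in keep else 0.0 for i, v in enumerate(embedding)]


def training_schedule(parts):
    """Label stage 0 as 'initial' and every later stage as 'incremental'."""
    return [("initial" if k == 0 else "incremental", k) for k in range(len(parts))]


if __name__ == "__main__":
    parts = partition_dims(4, [1, 3])
    for stage, k in training_schedule(parts):
        print(stage, k, mask_after([1.0, 2.0, 3.0, 4.0], parts, k))
```

In this sketch each stage would train the encoder together with the decoder replica for partition k; the actual training loop is omitted because the abstract gives no detail on losses or optimizers.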
-
Patent number: 11664011
Abstract: A method of providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word and selecting a mel spectral embedding for the text utterance. Each word has at least one syllable and each syllable has at least one phoneme. For each phoneme, the method further includes using the selected mel spectral embedding to: (i) predict a duration of the corresponding phoneme based on corresponding linguistic features associated with the word that includes the corresponding phoneme and corresponding linguistic features associated with the syllable that includes the corresponding phoneme; and (ii) generate a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
Type: Grant
Filed: February 9, 2022
Date of Patent: May 30, 2023
Assignee: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
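Illustrative sketch (not taken from the patent): step (ii) of the abstract above turns a predicted phoneme duration into a number of fixed-length spectrogram frames. The Python below shows only that arithmetic, assuming durations in milliseconds, a hypothetical 5 ms frame length, rounding up, and at least one frame per phoneme. None of these choices comes from the abstract, and the prediction of the frame contents themselves is not modeled.

```python
import math


def frames_for_durations(durations_ms, frame_ms=5.0):
    """Number of fixed-length frames needed to cover each predicted duration.

    Rounding up and the minimum of one frame are assumptions.
    """
    return [max(1, math.ceil(d / frame_ms)) for d in durations_ms]


def frame_alignment(phonemes, durations_ms, frame_ms=5.0):
    """Repeat each phoneme label once per frame, giving a frame-level alignment."""
    alignment = []
    for phoneme, count in zip(phonemes, frames_for_durations(durations_ms, frame_ms)):
        alignment.extend([phoneme] * count)
    return alignment


if __name__ == "__main__":
    print(frames_for_durations([10, 12, 3]))      # [2, 3, 1]
    print(frame_alignment(["h", "i"], [10, 5]))   # ['h', 'h', 'i']
```

A frame-level alignment of this kind is what lets a model emit one spectrogram frame per fixed time step while still knowing which phoneme each frame belongs to.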
-
Publication number: 20230064749
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Application
Filed: November 11, 2022
Publication date: March 2, 2023
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Publication number: 20230060886
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
Type: Application
Filed: October 26, 2022
Publication date: March 2, 2023
Applicant: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Publication number: 20230018384
Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect.
Type: Application
Filed: July 14, 2021
Publication date: January 19, 2023
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Norman Casagrande, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Patent number: 11514888
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Grant
Filed: August 13, 2020
Date of Patent: November 29, 2022
Assignee: Google LLC
Inventors: Lev Finkelstein, Chun-An Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Patent number: 11494695
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
Type: Grant
Filed: September 27, 2019
Date of Patent: November 8, 2022
Assignee: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Publication number: 20220172705
Abstract: A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
Type: Application
Filed: February 9, 2022
Publication date: June 2, 2022
Applicant: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Patent number: 11264010
Abstract: A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
Type: Grant
Filed: November 8, 2019
Date of Patent: March 1, 2022
Assignee: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Publication number: 20220051654
Abstract: A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.
Type: Application
Filed: August 13, 2020
Publication date: February 17, 2022
Applicant: Google LLC
Inventors: Lev Finkelstein, Chun-an Chan, Byungha Chun, Ye Jia, Yu Zhang, Robert Andrew James Clark, Vincent Wan
-
Publication number: 20210366463
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Application
Filed: August 2, 2021
Publication date: November 25, 2021
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao
-
Publication number: 20210350795
Abstract: A method for generating a prosodic representation includes receiving a text utterance having one or more words. Each word has at least one syllable having at least one phoneme. The method also includes generating, using a Bidirectional Encoder Representations from Transformers (BERT) model, a sequence of wordpiece embeddings and selecting an utterance embedding for the text utterance, the utterance embedding representing an intended prosody. Each wordpiece embedding is associated with one of the one or more words of the text utterance. For each syllable, using the selected utterance embedding and a prosody model that incorporates the BERT model, the method also includes generating a corresponding prosodic syllable embedding for the syllable based on the wordpiece embedding associated with the word that includes the syllable and predicting a duration of the syllable by encoding linguistic features of each phoneme of the syllable with the corresponding prosodic syllable embedding for the syllable.
Type: Application
Filed: May 5, 2020
Publication date: November 11, 2021
Applicant: Google LLC
Inventors: Tom Marius Kenter, Manish Kumar Sharma, Robert Andrew James Clark, Aliaksei Severyn
-
Patent number: 11107457
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Grant
Filed: November 26, 2019
Date of Patent: August 31, 2021
Assignee: Google LLC
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao
-
Publication number: 20210097427
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model to generate embeddings of inputs to the machine learning model, the machine learning model having an encoder that generates the embeddings from the inputs and a decoder that generates outputs from the generated embeddings, wherein the embedding is partitioned into a sequence of embedding partitions that each includes one or more dimensions of the embedding, the operations comprising: for a first embedding partition in the sequence of embedding partitions: performing initial training to train the encoder and a decoder replica corresponding to the first embedding partition; for each particular embedding partition that is after the first embedding partition in the sequence of embedding partitions: performing incremental training to train the encoder and a decoder replica corresponding to the particular partition.
Type: Application
Filed: September 27, 2019
Publication date: April 1, 2021
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Publication number: 20200098350
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Application
Filed: November 26, 2019
Publication date: March 26, 2020
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao
-
Publication number: 20200074985
Abstract: A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
Type: Application
Filed: November 8, 2019
Publication date: March 5, 2020
Applicant: Google LLC
Inventors: Robert Andrew James Clark, Chun-an Chan, Vincent Ping Leung Wan
-
Patent number: 10573293
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating speech from text. One of the systems includes one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to implement: a sequence-to-sequence recurrent neural network configured to: receive a sequence of characters in a particular natural language, and process the sequence of characters to generate a spectrogram of a verbal utterance of the sequence of characters in the particular natural language; and a subsystem configured to: receive the sequence of characters in the particular natural language, and provide the sequence of characters as input to the sequence-to-sequence recurrent neural network to obtain as output the spectrogram of the verbal utterance of the sequence of characters in the particular natural language.
Type: Grant
Filed: June 20, 2019
Date of Patent: February 25, 2020
Assignee: Google LLC
Inventors: Samuel Bengio, Yuxuan Wang, Zongheng Yang, Zhifeng Chen, Yonghui Wu, Ioannis Agiomyrgiannakis, Ron J. Weiss, Navdeep Jaitly, Ryan M. Rifkin, Robert Andrew James Clark, Quoc V. Le, Russell J. Ryan, Ying Xiao