Patents by Inventor Jonathan RAIMAN

Jonathan RAIMAN has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

Real-time neural text-to-speech

Patent number: 11705107

Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

Type: Grant

Filed: October 1, 2020

Date of Patent: July 18, 2023

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhahrata Sengupta, Mohammad Shoeybi
Multi-speaker neural text-to-speech

Patent number: 11651763

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Grant

Filed: November 2, 2020

Date of Patent: May 16, 2023

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
DATA PATH CIRCUIT DESIGN USING REINFORCEMENT LEARNING

Publication number: 20230139623

Abstract: Apparatuses, systems, and techniques for designing a data path circuit such as a parallel prefix circuit with reinforcement learning are described. A method can include receiving a first design state of a data path circuit, inputting the first design state of the data path circuit into a machine learning model, and performing reinforcement learning using the machine learning model to output a final design state of the data path circuit, wherein the final design state of the data path circuit has decreased area, power consumption and/or delay as compared to conventionally designed data path circuits.

Type: Application

Filed: November 2, 2021

Publication date: May 4, 2023

Inventors: Rajarshi Roy, Saad Godil, Jonathan Raiman, Neel Kant, Ilyas Elkin, Ming Y. Siu, Robert Kirby, Stuart Oberman, Bryan Catanzaro
MULTI-SPEAKER NEURAL TEXT-TO-SPEECH

Publication number: 20210049999

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Application

Filed: November 2, 2020

Publication date: February 18, 2021

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU
REAL-TIME NEURAL TEXT-TO-SPEECH

Publication number: 20210027762

Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

Type: Application

Filed: October 1, 2020

Publication date: January 28, 2021

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Mike CHRZANOWSKI, Adam COATES, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Andrew NG, Jonathan RAIMAN, Shubhahrata SENGUPTA, Mohammad SHOEYBI
Lazy compilation and kernel fusion in dynamic computation graphs

Patent number: 10901715

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for lazy compilation and kernel fusion in dynamic computation graphs. One of the operations is performed by generating an input graph based on translation of user code into an expression graph. The expression graph represents control flow dependencies of operations of the generated input graph. Optimization of the input graph is then performed by iterative application of optimization rules to the input graph. An optimized version of the input graph results from the application of the optimization rules. A transformation graph then is generated by comparing changes made from the original input graph to the final optimized version of the input graph. The transformation graph provides a blueprint such that the system may recreate the optimization of a similarly structured later generated input graph without having to reapply the optimization rules.

Type: Grant

Filed: September 26, 2019

Date of Patent: January 26, 2021

Inventor: Jonathan Raiman
Systems and methods for multi-speaker neural text-to-speech

Patent number: 10896669

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Grant

Filed: May 8, 2018

Date of Patent: January 19, 2021

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
Systems and methods for real-time neural text-to-speech

Patent number: 10872598

Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

Type: Grant

Filed: January 29, 2018

Date of Patent: December 22, 2020

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhahrata Sengupta, Mohammad Shoeybi
Systems and methods for neural text-to-speech using convolutional sequence learning

Patent number: 10796686

Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, which various embodiments may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.

Type: Grant

Filed: August 8, 2018

Date of Patent: October 6, 2020

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, John Miller
Global normalized reader systems and methods

Patent number: 10572595

Abstract: Presented herein are systems and methods for question answering (QA). In embodiments, extractive question answering (QA) is cast as an iterative search problem through the document's structure: select the answer's sentence, start word, and end word. This representation reduces the space of each search step and allows computation to be conditionally allocated to promising search paths. In embodiments, globally normalizing the decision process and back-propagating through beam search makes this representation viable and learning efficient. Various model embodiments, referred to as Globally Normalized Readers (GNR), achieve excellent performance. Also introduced are embodiments of data-augmentation to produce semantically valid examples by aligning named entities to a knowledge base and performing swaps new entities of the same type. This methodology also improved the performance of GNR models and is of independent interest for a variety of natural language processing (NLP) tasks.

Type: Grant

Filed: September 15, 2017

Date of Patent: February 25, 2020

Assignee: Baidu USA LLC

Inventors: Jonathan Raiman, John Miller
SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING

Publication number: 20190122651

Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, which various embodiments may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.

Type: Application

Filed: August 8, 2018

Publication date: April 25, 2019

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Wei PING, Kainan PENG, Sharan NARANG, Ajay KANNAN, Andrew GIBIANSKY, Jonathan RAIMAN, John MILLER
SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH

Publication number: 20180336880

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Application

Filed: May 8, 2018

Publication date: November 22, 2018

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU
GLOBAL NORMALIZED READER SYSTEMS AND METHODS

Publication number: 20180300312

Abstract: Presented herein are systems and methods for question answering (QA). In embodiments, extractive question answering (QA) is cast as an iterative search problem through the document's structure: select the answer's sentence, start word, and end word. This representation reduces the space of each search step and allows computation to be conditionally allocated to promising search paths. In embodiments, globally normalizing the decision process and back-propagating through beam search makes this representation viable and learning efficient. Various model embodiments, referred to as Globally Normalized Readers (GNR), achieve excellent performance. Also introduced are embodiments of data-augmentation to produce semantically valid examples by aligning named entities to a knowledge base and performing swaps new entities of the same type. This methodology also improved the performance of GNR models and is of independent interest for a variety of natural language processing (NLP) tasks.

Type: Application

Filed: September 15, 2017

Publication date: October 18, 2018

Applicant: Baidu USA LLC

Inventors: Jonathan RAIMAN, John MILLER
SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH

Publication number: 20180247636

Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

Type: Application

Filed: January 29, 2018

Publication date: August 30, 2018

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Mike CHRZANOWSKI, Adam COATES, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Andrew NG, Jonathan RAIMAN, Shubhahrata SENGUPTA, Mohammad SHOEYBI