Patents by Inventor Kainan PENG

Kainan PENG has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

METHOD FOR TESTING INTERNAL FORCE INCREMENT OF ARCH BRIDGE SUSPENDER BY INERTIAL MEASUREMENT

Publication number: 20240118150

Abstract: The present disclosure provides a method for testing an internal force increment of an arch bridge suspender by inertial measurement, including the following steps: (1) selecting a suspender whose internal force increment is to be tested, and mounting an acceleration sensing device or a speed sensing device at a lower edge of that suspender; (2) setting an appropriate sampling frequency and collecting signals; (3) processing the data collected in step (2) by using formulas; and (4) recording the result of the data processing and obtaining the internal force increment of the suspender. The method obtains the internal force increment of the suspender by collecting acceleration or speed signals at the lower edge of the suspender and performing calculations on those signals. The method has the advantages of simple and convenient testing, high replicability, and low test cost.

Type: Application

Filed: October 8, 2021

Publication date: April 11, 2024

Inventors: Hua Wang, Longlin Wang, Tianzhi Hao, Zehua Xie, Mengsheng Yu, Xiaoli Zhuo, Yuhou Yang, Jiejun Ning, Xirui Wang, Xi Peng, Kainan Huang, Junhong Wu

Multi-speaker neural text-to-speech

Patent number: 11651763

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Grant

Filed: November 2, 2020

Date of Patent: May 16, 2023

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou

Small-footprint flow-based models for raw audio

Patent number: 11521592

Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.

Type: Grant

Filed: August 5, 2020

Date of Patent: December 6, 2022

Assignee: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

Waveform generation using end-to-end text-to-waveform system

Patent number: 11482207

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Grant

Filed: December 21, 2020

Date of Patent: October 25, 2022

Assignee: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Jitong Chen

Systems and methods for neural voice cloning with a few samples

Patent number: 11238843

Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.

Type: Grant

Filed: September 26, 2018

Date of Patent: February 1, 2022

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou

Parallel neural text-to-speech

Patent number: 11017761

Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiments are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.

Type: Grant

Filed: October 16, 2019

Date of Patent: May 25, 2021

Assignee: Baidu USA LLC

Inventors: Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

WAVEFORM GENERATION USING END-TO-END TEXT-TO-WAVEFORM SYSTEM

Publication number: 20210110810

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Application

Filed: December 21, 2020

Publication date: April 15, 2021

Applicant: Baidu USA LLC

Inventors: Wei PING, Kainan PENG, Jitong CHEN

SMALL-FOOTPRINT FLOW-BASED MODELS FOR RAW AUDIO

Publication number: 20210090547

Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.

Type: Application

Filed: August 5, 2020

Publication date: March 25, 2021

Applicant: Baidu USA LLC

Inventors: Wei PING, Kainan PENG, Kexin ZHAO, Zhao SONG

MULTI-SPEAKER NEURAL TEXT-TO-SPEECH

Publication number: 20210049999

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Application

Filed: November 2, 2020

Publication date: February 18, 2021

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU

Systems and methods for multi-speaker neural text-to-speech

Patent number: 10896669

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Grant

Filed: May 8, 2018

Date of Patent: January 19, 2021

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou

Systems and methods for parallel wave generation in end-to-end text-to-speech

Patent number: 10872596

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Grant

Filed: February 15, 2019

Date of Patent: December 22, 2020

Assignee: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Jitong Chen

Systems and methods for neural text-to-speech using convolutional sequence learning

Patent number: 10796686

Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, various embodiments of which may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.

Type: Grant

Filed: August 8, 2018

Date of Patent: October 6, 2020

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, John Miller

PARALLEL NEURAL TEXT-TO-SPEECH

Publication number: 20200066253

Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiments are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.

Type: Application

Filed: October 16, 2019

Publication date: February 27, 2020

Applicant: Baidu USA LLC

Inventors: Kainan PENG, Wei PING, Zhao SONG, Kexin ZHAO

SYSTEMS AND METHODS FOR NEURAL VOICE CLONING WITH A FEW SAMPLES

Publication number: 20190251952

Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.

Type: Application

Filed: September 26, 2018

Publication date: August 15, 2019

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Jitong CHEN, Kainan PENG, Wei PING, Yanqi ZHOU

SYSTEMS AND METHODS FOR PARALLEL WAVE GENERATION IN END-TO-END TEXT-TO-SPEECH

Publication number: 20190180732

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Application

Filed: February 15, 2019

Publication date: June 13, 2019

Applicant: Baidu USA LLC

Inventors: Wei PING, Kainan PENG, Jitong CHEN

SYSTEMS AND METHODS FOR NEURAL TEXT-TO-SPEECH USING CONVOLUTIONAL SEQUENCE LEARNING

Publication number: 20190122651

Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, various embodiments of which may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.

Type: Application

Filed: August 8, 2018

Publication date: April 25, 2019

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Wei PING, Kainan PENG, Sharan NARANG, Ajay KANNAN, Andrew GIBIANSKY, Jonathan RAIMAN, John MILLER

SYSTEMS AND METHODS FOR MULTI-SPEAKER NEURAL TEXT-TO-SPEECH

Publication number: 20180336880

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Application

Filed: May 8, 2018

Publication date: November 22, 2018

Applicant: Baidu USA LLC

Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU