Patents by Inventor Kainan PENG
Kainan PENG has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20240118150Abstract: The present disclosure provides a method for testing an internal force increment of an arch bridge suspender by inertial measurement, including the following steps: (1) selecting a suspender to be tested with internal force increment, and mounting an acceleration sensing device or a speed sensing device at a lower edge of the suspender to be tested; (2) setting an appropriate sampling frequency and collecting signals; (3) processing information data collected in step (2) by using Formulas; and (4) recording a result of the information data processing and obtaining the internal force increment of the suspender. The method can obtain the internal force increment of the suspender by collecting acceleration or speed signals of the lower edge of the suspender and performing calculation from the signals. This method has the advantages of simple and convenient testing, high replicability and low test cost.Type: ApplicationFiled: October 8, 2021Publication date: April 11, 2024Inventors: Hua Wang, Longlin Wang, Tianzhi Hao, Zehua Xie, Mengsheng Yu, Xiaoli Zhuo, Yuhou Yang, Jiejun Ning, Xirui Wang, Xi Peng, Kainan Huang, Junhong Wu
-
Patent number: 11651763Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.Type: GrantFiled: November 2, 2020Date of Patent: May 16, 2023Assignee: Baidu USA LLCInventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
-
Patent number: 11521592Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.Type: GrantFiled: August 5, 2020Date of Patent: December 6, 2022Assignee: Baidu USA LLCInventors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
-
Patent number: 11482207Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model were successfully distilled.Type: GrantFiled: December 21, 2020Date of Patent: October 25, 2022Assignee: Baidu USA LLCInventors: Wei Ping, Kainan Peng, Jitong Chen
-
Patent number: 11238843Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.Type: GrantFiled: September 26, 2018Date of Patent: February 1, 2022Assignee: Baidu USA LLCInventors: Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou
-
Patent number: 11017761Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiment are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.Type: GrantFiled: October 16, 2019Date of Patent: May 25, 2021Assignee: Baidu USA LLCInventors: Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
-
Publication number: 20210110810Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model were successfully distilled.Type: ApplicationFiled: December 21, 2020Publication date: April 15, 2021Applicant: Baidu USA LLCInventors: Wei PING, Kainan PENG, Jitong CHEN
-
Publication number: 20210090547Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.Type: ApplicationFiled: August 5, 2020Publication date: March 25, 2021Applicant: Baidu USA LLCInventors: Wei PING, Kainan PENG, Kexin ZHAO, Zhao SONG
-
Publication number: 20210049999Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.Type: ApplicationFiled: November 2, 2020Publication date: February 18, 2021Applicant: Baidu USA LLCInventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU
-
Patent number: 10896669Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.Type: GrantFiled: May 8, 2018Date of Patent: January 19, 2021Assignee: Baidu USA LLCInventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
-
Patent number: 10872596Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model were successfully distilled.Type: GrantFiled: February 15, 2019Date of Patent: December 22, 2020Assignee: Baidu USA LLCInventors: Wei Ping, Kainan Peng, Jitong Chen
-
Patent number: 10796686Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, which various embodiments may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.Type: GrantFiled: August 8, 2018Date of Patent: October 6, 2020Assignee: Baidu USA LLCInventors: Sercan O. Arik, Wei Ping, Kainan Peng, Sharan Narang, Ajay Kannan, Andrew Gibiansky, Jonathan Raiman, John Miller
-
Publication number: 20200066253Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiment are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.Type: ApplicationFiled: October 16, 2019Publication date: February 27, 2020Applicant: Baidu USA LLCInventors: Kainan PENG, Wei PING, Zhao SONG, Kexin ZHAO
-
Publication number: 20190251952Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.Type: ApplicationFiled: September 26, 2018Publication date: August 15, 2019Applicant: Baidu USA LLCInventors: Sercan O. ARIK, Jitong CHEN, Kainan PENG, Wei PING, Yanqi ZHOU
-
Publication number: 20190180732Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology computes the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model were successfully distilled.Type: ApplicationFiled: February 15, 2019Publication date: June 13, 2019Applicant: Baidu USA LLCInventors: Wei PING, Kainan PENG, Jitong CHEN
-
Publication number: 20190122651Abstract: Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, which various embodiments may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.Type: ApplicationFiled: August 8, 2018Publication date: April 25, 2019Applicant: Baidu USA LLCInventors: Sercan O. ARIK, Wei PING, Kainan PENG, Sharan NARANG, Ajay KANNAN, Andrew GIBIANSKY, Jonathan RAIMAN, John MILLER
-
Publication number: 20180336880Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.Type: ApplicationFiled: May 8, 2018Publication date: November 22, 2018Applicant: Baidu USA LLCInventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU