Patents by Inventor Wei Ping
Wei Ping has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Publication number: 20250342346
Abstract: Apparatuses, systems, and techniques for cross-modality alignment for large language models (LLMs), enabling enhanced multi-modal interaction. In at least one embodiment, a textual embedding is obtained by encoding a multi-modal input and aligning the encoded results into a textual embedding space. A visual embedding is obtained based on features extracted from visual data in the multi-modal input using visual encoders. A multi-modal output is generated based on the textual embedding and the visual embedding.
Type: Application
Filed: December 30, 2024
Publication date: November 6, 2025
Inventors: Hongxu Yin, Pavlo Molchanov, Song Han, Yao Lu, Hanrong Ye, Andrew Tao, Wei Ping, Jan Kautz, De-An Huang, Zhiding Yu
-
Publication number: 20240095447
Abstract: Apparatuses, systems, and techniques are presented to identify and prevent generation of restricted content. In at least one embodiment, one or more neural networks are used to identify restricted content based only on the restricted content.
Type: Application
Filed: June 22, 2022
Publication date: March 21, 2024
Inventors: Wei Ping, Boxin Wang, Chaowei Xiao, Mohammad Shoeybi, Mostofa Patwary, Anima Anandkumar, Bryan Catanzaro
-
Patent number: 11875809
Abstract: Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, the phonetic contents for the autoregressive generative model are learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other methods on test datasets, embodiments achieve competitive performance and can be trained from scratch.
Type: Grant
Filed: October 1, 2020
Date of Patent: January 16, 2024
Assignee: Baidu USA LLC
Inventors: Zhao Song, Wei Ping
-
Patent number: 11869483
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Grant
Filed: October 7, 2021
Date of Patent: January 9, 2024
Assignee: Nvidia Corporation
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Publication number: 20230419947
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Application
Filed: August 15, 2023
Publication date: December 28, 2023
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Publication number: 20230402028
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Application
Filed: August 28, 2023
Publication date: December 14, 2023
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Patent number: 11769481
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Grant
Filed: October 7, 2021
Date of Patent: September 26, 2023
Assignee: Nvidia Corporation
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Patent number: 11651763
Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
Type: Grant
Filed: November 2, 2020
Date of Patent: May 16, 2023
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
-
Publication number: 20230110905
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Application
Filed: October 7, 2021
Publication date: April 13, 2023
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Publication number: 20230113950
Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
Type: Application
Filed: October 7, 2021
Publication date: April 13, 2023
Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
-
Publication number: 20230035306
Abstract: Apparatuses, systems, and techniques are presented to generate media content.
Type: Application
Filed: July 21, 2021
Publication date: February 2, 2023
Inventors: Ming-Yu Liu, Koki Nagano, Yeongho Seol, Jose Rafael Valle Gomes da Costa, Jaewoo Seo, Ting-Chun Wang, Arun Mallya, Sameh Khamis, Wei Ping, Rohan Badlani, Kevin Jonathan Shih, Bryan Catanzaro, Simon Yuen, Jan Kautz
-
Patent number: 11521592
Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.
Type: Grant
Filed: August 5, 2020
Date of Patent: December 6, 2022
Assignee: Baidu USA LLC
Inventors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
-
Patent number: 11482207
Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in a closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.
Type: Grant
Filed: December 21, 2020
Date of Patent: October 25, 2022
Assignee: Baidu USA LLC
Inventors: Wei Ping, Kainan Peng, Jitong Chen
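Illustrative note (not part of the patent record): the abstract above refers to a KL divergence between Gaussian output distributions that can be computed in closed form. The sketch below shows only the standard textbook closed form for two univariate Gaussians. The function name `gaussian_kl` and its parameters are hypothetical, and the regularization term mentioned in the abstract is not reproduced because the listing does not describe it.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2).

    Standard closed form; hypothetical helper, not the patented loss.
    """
    if sigma_q <= 0 or sigma_p <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )


# Identical distributions have zero divergence; shifting the mean by one
# standard deviation (with unit variances) gives exactly 0.5.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))
print(gaussian_kl(0.0, 1.0, 1.0, 1.0))
```

Because the expression involves only means and standard deviations, it needs no sampling to evaluate, which is the general reason a closed form can simplify training.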
-
Publication number: 20220108712
Abstract: Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, the phonetic contents for the autoregressive generative model are learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other methods on test datasets, embodiments achieve competitive performance and can be trained from scratch.
Type: Application
Filed: October 1, 2020
Publication date: April 7, 2022
Applicant: Baidu USA LLC
Inventors: Zhao SONG, Wei PING
-
Patent number: 11238843
Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to the original speaker, even with very few cloning audios.
Type: Grant
Filed: September 26, 2018
Date of Patent: February 1, 2022
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou
-
Patent number: 11138964
Abstract: According to various embodiments, an end-to-end TTS framework can integrate a watermarking process into the training of the TTS framework, which enables watermarks to be imperceptible within a synthesized/cloned audio segment generated by the TTS framework. The watermarks added in such a manner are statistically undetectable to prevent unauthorized removal. According to an exemplary method of training the TTS framework, a TTS neural network model and a watermarking neural network model in the TTS framework are trained in an end-to-end manner, with the watermarking being part of the optimization process of the TTS framework. During the training, neuron values of the TTS neural network model are adjusted based on training data to prepare one or more spaces for adding a watermark in a synthesized audio segment to be generated by the TTS framework.
Type: Grant
Filed: October 21, 2019
Date of Patent: October 5, 2021
Assignee: BAIDU USA LLC
Inventors: Wei Ping, Zhenyu Zhong, Yueqiang Cheng, Xing Li, Tao Wei
-
Patent number: 11017761
Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiments are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.
Type: Grant
Filed: October 16, 2019
Date of Patent: May 25, 2021
Assignee: Baidu USA LLC
Inventors: Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
-
Publication number: 20210118423
Abstract: According to various embodiments, an end-to-end TTS framework can integrate a watermarking process into the training of the TTS framework, which enables watermarks to be imperceptible within a synthesized/cloned audio segment generated by the TTS framework. The watermarks added in such a manner are statistically undetectable to prevent unauthorized removal. According to an exemplary method of training the TTS framework, a TTS neural network model and a watermarking neural network model in the TTS framework are trained in an end-to-end manner, with the watermarking being part of the optimization process of the TTS framework. During the training, neuron values of the TTS neural network model are adjusted based on training data to prepare one or more spaces for adding a watermark in a synthesized audio segment to be generated by the TTS framework.
Type: Application
Filed: October 21, 2019
Publication date: April 22, 2021
Inventors: Wei PING, Zhenyu ZHONG, Yueqiang CHENG, Xing LI, Tao WEI
-
Publication number: 20210110810
Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in a closed form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.
Type: Application
Filed: December 21, 2020
Publication date: April 15, 2021
Applicant: Baidu USA LLC
Inventors: Wei PING, Kainan PENG, Jitong CHEN
-
Publication number: 20210090547
Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.
Type: Application
Filed: August 5, 2020
Publication date: March 25, 2021
Applicant: Baidu USA LLC
Inventors: Wei PING, Kainan PENG, Kexin ZHAO, Zhao SONG