Patents by Inventor Wei Ping

Wei Ping has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).

CROSS-MODALITY ALIGNMENT FOR LARGE LANGUAGE MODELS

Publication number: 20250342346

Abstract: Apparatuses, systems, and techniques for cross-modality alignment for large language models (LLMs), enabling enhanced multi-modal interaction. In at least one embodiment, a textual embedding is obtained by encoding a multi-modal input and aligning the encoded results into a textual embedding space. A visual embedding is obtained based on features extracted from visual data in the multi-modal input using visual encoders. A multi-modal output is generated based on the textual embedding and the visual embedding.

Type: Application

Filed: December 30, 2024

Publication date: November 6, 2025

Inventors: Hongxu Yin, Pavlo Molchanov, Song Han, Yao Lu, Hanrong Ye, Andrew Tao, Wei Ping, Jan Kautz, De-An Huang, Zhiding Yu
NEURAL NETWORK-BASED LANGUAGE RESTRICTION

Publication number: 20240095447

Abstract: Apparatuses, systems, and techniques are presented to identify and prevent generation of restricted content. In at least one embodiment, one or more neural networks are used to identify restricted content based only on the restricted content.

Type: Application

Filed: June 22, 2022

Publication date: March 21, 2024

Inventors: Wei Ping, Boxin Wang, Chaowei Xiao, Mohammad Shoeybi, Mostofa Patwary, Anima Anandkumar, Bryan Catanzaro
Speech denoising via discrete representation learning

Patent number: 11875809

Abstract: Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregressive generative model, it is learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other methods on test datasets, embodiments achieve competitive performance and can be trained from scratch.

Type: Grant

Filed: October 1, 2020

Date of Patent: January 16, 2024

Assignee: Baidu USA LLC

Inventors: Zhao Song, Wei Ping
Unsupervised alignment for text to speech synthesis using neural networks

Patent number: 11869483

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Grant

Filed: October 7, 2021

Date of Patent: January 9, 2024

Assignee: Nvidia Corporation

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Publication number: 20230419947

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Application

Filed: August 15, 2023

Publication date: December 28, 2023

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Publication number: 20230402028

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Application

Filed: August 28, 2023

Publication date: December 14, 2023

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
Unsupervised alignment for text to speech synthesis using neural networks

Patent number: 11769481

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Grant

Filed: October 7, 2021

Date of Patent: September 26, 2023

Assignee: Nvidia Corporation

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
Multi-speaker neural text-to-speech

Patent number: 11651763

Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Type: Grant

Filed: November 2, 2020

Date of Patent: May 16, 2023

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Publication number: 20230110905

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Application

Filed: October 7, 2021

Publication date: April 13, 2023

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS

Publication number: 20230113950

Abstract: Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.

Type: Application

Filed: October 7, 2021

Publication date: April 13, 2023

Inventors: Kevin Shih, Jose Rafael Valle Gomes da Costa, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro
SYNTHESIZING VIDEO FROM AUDIO USING ONE OR MORE NEURAL NETWORKS

Publication number: 20230035306

Abstract: Apparatuses, systems, and techniques are presented to generate media content.

Type: Application

Filed: July 21, 2021

Publication date: February 2, 2023

Inventors: Ming-Yu Liu, Koki Nagano, Yeongho Seol, Jose Rafael Valle Gomes da Costa, Jaewoo Seo, Ting-Chun Wang, Arun Mallya, Sameh Khamis, Wei Ping, Rohan Badlani, Kevin Jonathan Shih, Bryan Catanzaro, Simon Yuen, Jan Kautz
Small-footprint flow-based models for raw audio

Patent number: 11521592

Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.

Type: Grant

Filed: August 5, 2020

Date of Patent: December 6, 2022

Assignee: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
Waveform generation using end-to-end text-to-waveform system

Patent number: 11482207

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Grant

Filed: December 21, 2020

Date of Patent: October 25, 2022

Assignee: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Jitong Chen
SPEECH DENOISING VIA DISCRETE REPRESENTATION LEARNING

Publication number: 20220108712

Abstract: Developed and presented herein are embodiments of a new end-to-end approach for audio denoising, from a synthesis perspective. Instead of explicitly modelling the noise component in the input signal, embodiments directly synthesize the denoised audio from a generative model (or vocoder), as in text-to-speech systems. In one or more embodiments, to generate the phonetic contents for the autoregressive generative model, it is learned via a variational autoencoder with discrete latent representations. Furthermore, in one or more embodiments, a new matching loss is presented for the denoising purpose, which is masked on when the corresponding latent codes differ. As compared against other methods on test datasets, embodiments achieve competitive performance and can be trained from scratch.

Type: Application

Filed: October 1, 2020

Publication date: April 7, 2022

Applicant: Baidu USA LLC

Inventors: Zhao Song, Wei Ping
Systems and methods for neural voice cloning with a few samples

Patent number: 11238843

Abstract: Voice cloning is a highly desired capability for personalized speech interfaces. Neural network-based speech synthesis has been shown to generate high quality speech for a large number of speakers. Neural voice cloning systems that take a few audio samples as input are presented herein. Two approaches, speaker adaptation and speaker encoding, are disclosed. Speaker adaptation embodiments are based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding embodiments are based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in or with a multi-speaker generative model. Both approaches achieve good performance in terms of naturalness of the speech and its similarity to original speaker—even with very few cloning audios.

Type: Grant

Filed: September 26, 2018

Date of Patent: February 1, 2022

Assignee: Baidu USA LLC

Inventors: Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou
Inaudible watermark enabled text-to-speech framework

Patent number: 11138964

Abstract: According to various embodiments, an end-to-end TTS framework can integrate a watermarking process into the training of the TTS framework, which enables watermarks to be imperceptible within a synthesized/cloned audio segment generated by the TTS framework. The watermarks added in such a manner are statistically undetectable to prevent unauthorized removal. According to an exemplary method of training the TTS framework, a TTS neural network model and a watermarking neural network model in the TTS framework are trained in an end-to-end manner, with the watermarking being part of the optimization process of the TTS framework. During the training, neuron values of the TTS neural network model are adjusted based on training data to prepare one or more spaces for adding a watermark in a synthesized audio segment to be generated by the TTS framework.

Type: Grant

Filed: October 21, 2019

Date of Patent: October 5, 2021

Assignee: Baidu USA LLC

Inventors: Wei Ping, Zhenyu Zhong, Yueqiang Cheng, Xing Li, Tao Wei
Parallel neural text-to-speech

Patent number: 11017761

Abstract: Presented herein are embodiments of a non-autoregressive sequence-to-sequence model that converts text to an audio representation. Embodiments are fully convolutional, and a tested embodiment obtained about 46.7 times speed-up over a prior model at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, a tested embodiment also has fewer attention errors than the autoregressive model on challenging test sentences. In one or more embodiments, the first fully parallel neural text-to-speech system was built by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. System embodiments can synthesize speech from text through a single feed-forward pass. Also disclosed herein are embodiments of a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.

Type: Grant

Filed: October 16, 2019

Date of Patent: May 25, 2021

Assignee: Baidu USA LLC

Inventors: Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
INAUDIBLE WATERMARK ENABLED TEXT-TO-SPEECH FRAMEWORK

Publication number: 20210118423

Abstract: According to various embodiments, an end-to-end TTS framework can integrate a watermarking process into the training of the TTS framework, which enables watermarks to be imperceptible within a synthesized/cloned audio segment generated by the TTS framework. The watermarks added in such a manner are statistically undetectable to prevent unauthorized removal. According to an exemplary method of training the TTS framework, a TTS neural network model and a watermarking neural network model in the TTS framework are trained in an end-to-end manner, with the watermarking being part of the optimization process of the TTS framework. During the training, neuron values of the TTS neural network model are adjusted based on training data to prepare one or more spaces for adding a watermark in a synthesized audio segment to be generated by the TTS framework.

Type: Application

Filed: October 21, 2019

Publication date: April 22, 2021

Inventors: Wei Ping, Zhenyu Zhong, Yueqiang Cheng, Xing Li, Tao Wei
WAVEFORM GENERATION USING END-TO-END TEXT-TO-WAVEFORM SYSTEM

Publication number: 20210110810

Abstract: Described herein are embodiments of an end-to-end text-to-speech (TTS) system with parallel wave generation. In one or more embodiments, a Gaussian inverse autoregressive flow is distilled from an autoregressive WaveNet by minimizing a novel regularized Kullback-Leibler (KL) divergence between their highly-peaked output distributions. Embodiments of the methodology compute the KL divergence in a closed-form, which simplifies the training process and provides very efficient distillation. Embodiments of a novel text-to-wave neural architecture for speech synthesis are also described, which are fully convolutional and enable fast end-to-end training from scratch. These embodiments significantly outperform the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet. Also, a parallel waveform synthesizer embodiment conditioned on the hidden representation in an embodiment of this end-to-end model was successfully distilled.

Type: Application

Filed: December 21, 2020

Publication date: April 15, 2021

Applicant: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Jitong Chen
SMALL-FOOTPRINT FLOW-BASED MODELS FOR RAW AUDIO

Publication number: 20210090547

Abstract: WaveFlow is a small-footprint generative flow for raw audio, which may be directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveform with a dilated two-dimensional (2D) convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow may provide a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow, which may be considered special cases. It generates high-fidelity speech, while synthesizing several orders of magnitude faster than existing systems since it uses only a few sequential steps to generate relatively long waveforms. WaveFlow significantly reduces the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Its small footprint with 5.91M parameters makes it 15 times smaller than some existing models. WaveFlow can generate 22.05 kHz high-fidelity audio 42.

Type: Application

Filed: August 5, 2020

Publication date: March 25, 2021

Applicant: Baidu USA LLC

Inventors: Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
