Patents by Inventor Gregory DIAMOS
Gregory DIAMOS has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
-
Patent number: 11741342
Abstract: Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS mainly targets improving accuracy but lacks consideration of computational resource use. Presented herein are embodiments of a Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS using reinforcement learning with network embedding. RENA embodiments use a policy network to process the network embeddings to generate new configurations. Example demonstrations of RENA embodiments on image recognition and keyword spotting (KWS) problems are also presented herein. RENA embodiments can find novel architectures that achieve high performance even with tight resource constraints. For the CIFAR10 dataset, the tested embodiment achieved 2.95% test error when compute intensity was greater than 100 FLOPs/byte, and 3.87% test error when model size was less than 3M parameters.
Type: Grant
Filed: March 8, 2019
Date of Patent: August 29, 2023
Assignee: Baidu USA LLC
Inventors: Yanqi Zhou, Siavash Ebrahimi, Sercan Arik, Haonan Yu, Hairong Liu, Gregory Diamos
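Below is a minimal sketch of the kind of search loop the abstract describes: a policy network scores edits to a network embedding and is updated with a REINFORCE-style rule against a resource-penalized reward. The class names, layer sizes, and reward shaping are illustrative assumptions, not the patented implementation.

```python
# Sketch only: names, sizes, and reward shaping are assumptions.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a network embedding to a distribution over discrete architecture edits."""
    def __init__(self, embed_dim: int = 32, num_actions: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, num_actions)
        )

    def forward(self, network_embedding: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.body(network_embedding))

def resource_constrained_reward(accuracy: float, model_size: float,
                                size_budget: float) -> float:
    # Hypothetical shaping: full accuracy reward inside the budget,
    # an increasing penalty once the budget is exceeded.
    penalty = max(0.0, model_size / size_budget - 1.0)
    return accuracy - penalty

policy = PolicyNetwork()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

embedding = torch.randn(32)        # stand-in for a real network embedding
dist = policy(embedding)
action = dist.sample()             # proposed architecture edit
reward = resource_constrained_reward(accuracy=0.95, model_size=4e6, size_budget=3e6)

loss = -dist.log_prob(action) * reward   # REINFORCE update
optimizer.zero_grad(); loss.backward(); optimizer.step()
```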
-
Patent number: 11705107
Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
Type: Grant
Filed: October 1, 2020
Date of Patent: July 18, 2023
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhabrata Sengupta, Mohammad Shoeybi
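As an illustration of how the five building blocks fit together, here is a minimal inference-time sketch with stub functions standing in for the trained models; the function names and interfaces are assumptions for exposition, not the patent's API.

```python
# Sketch only: each stub stands in for a trained neural network.
from typing import List

def grapheme_to_phoneme(text: str) -> List[str]:
    # Stub: a trained model converts characters to phonemes.
    return list(text.replace(" ", ""))

def predict_durations(phonemes: List[str]) -> List[int]:
    # Stub: a trained model predicts per-phoneme durations in frames.
    return [5] * len(phonemes)

def predict_f0(durations: List[int]) -> List[float]:
    # Stub: a trained model predicts a fundamental-frequency contour.
    return [120.0] * sum(durations)

def synthesize_audio(f0: List[float]) -> List[float]:
    # Stub: a WaveNet-style vocoder renders waveform samples
    # conditioned on the linguistic features and F0 contour.
    return [0.0] * (len(f0) * 80)

def tts(text: str) -> List[float]:
    # The segmentation model is the fifth block: it locates phoneme
    # boundaries to provide training targets, so it does not run here.
    phonemes = grapheme_to_phoneme(text)
    durations = predict_durations(phonemes)
    f0 = predict_f0(durations)
    return synthesize_audio(f0)

samples = tts("hello world")
```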
-
Patent number: 11651763
Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech in different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were evaluated for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
Type: Grant
Filed: November 2, 2020
Date of Patent: May 16, 2023
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
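A minimal sketch of the central idea follows: a trainable per-speaker embedding table conditions a single synthesis network. The conditioning scheme (simple concatenation over time) and all layer sizes are illustrative assumptions.

```python
# Sketch only: the conditioning scheme and sizes are assumptions.
import torch
import torch.nn as nn

class MultiSpeakerSynthesizer(nn.Module):
    def __init__(self, num_speakers: int, speaker_dim: int = 16,
                 feature_dim: int = 64, out_dim: int = 80):
        super().__init__()
        # One low-dimensional trainable embedding per speaker,
        # learned jointly with the rest of the model.
        self.speaker_table = nn.Embedding(num_speakers, speaker_dim)
        self.net = nn.GRU(feature_dim + speaker_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, out_dim)

    def forward(self, features: torch.Tensor, speaker_id: torch.Tensor):
        # features: (batch, time, feature_dim); speaker_id: (batch,)
        spk = self.speaker_table(speaker_id)                     # (batch, speaker_dim)
        spk = spk.unsqueeze(1).expand(-1, features.size(1), -1)  # broadcast over time
        hidden, _ = self.net(torch.cat([features, spk], dim=-1))
        return self.proj(hidden)                                 # e.g. mel frames

model = MultiSpeakerSynthesizer(num_speakers=108)
mels = model(torch.randn(2, 50, 64), torch.tensor([3, 7]))
```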
-
Patent number: 11651223
Abstract: Described herein are systems and methods to prune deep neural network models, reducing the overall memory and compute requirements of these models. It is demonstrated that, using block pruning and group lasso combined with pruning during training, block-sparse recurrent neural networks (RNNs) may be built that are as accurate as dense baseline models. Two different approaches are disclosed to induce block sparsity in neural network models: pruning blocks of weights in a layer and using group lasso regularization to create blocks of weights with zeros. Using these techniques, it is demonstrated that block-sparse RNNs with high sparsity can be created with small loss in accuracy. Block-sparse RNNs eliminate overheads related to data storage and irregular memory accesses while increasing hardware efficiency compared to unstructured sparsity.
Type: Grant
Filed: October 4, 2018
Date of Patent: May 16, 2023
Assignee: Baidu USA LLC
Inventors: Sharan Narang, Eric Undersander, Gregory Diamos
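The two disclosed mechanisms can be sketched directly: a group-lasso penalty over fixed-size weight blocks during training, and pruning of blocks whose norm falls below a threshold. The block size and threshold used here are illustrative.

```python
# Sketch only: block size and threshold are illustrative.
import torch

def block_norms(weight: torch.Tensor, block: int = 4) -> torch.Tensor:
    """L2 norm of each (block x block) tile of a 2-D weight matrix."""
    rows, cols = weight.shape
    tiles = weight.reshape(rows // block, block, cols // block, block)
    return tiles.permute(0, 2, 1, 3).reshape(rows // block, cols // block, -1).norm(dim=-1)

def group_lasso_penalty(weight: torch.Tensor, block: int = 4) -> torch.Tensor:
    # Sum of block norms: drives entire blocks toward zero during training.
    return block_norms(weight, block).sum()

def prune_blocks(weight: torch.Tensor, threshold: float, block: int = 4) -> torch.Tensor:
    # Zero out every block whose norm is below the threshold.
    norms = block_norms(weight, block)
    mask = (norms >= threshold).repeat_interleave(block, 0).repeat_interleave(block, 1)
    return weight * mask

w = torch.randn(16, 16, requires_grad=True)
loss = (w ** 2).sum() + 1e-2 * group_lasso_penalty(w)  # task loss + regularizer
loss.backward()
sparse_w = prune_blocks(w.detach(), threshold=2.0)
```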
-
Publication number: 20230136672
Abstract: A model management system performs error analysis on results predicted by a machine learning model. The model management system identifies an incorrectly classified image output by a machine learning model and, using the Neural Template Matching (NTM) algorithm, identifies an additional image correlated to the selected image. The system outputs correlated images based on a given image and a user's selection, through a user interface, of a region of interest (ROI) of the given image. The region is defined by a bounding polygon input, and the correlated images include features correlated to the features within the ROI. The system prompts a task associated with the additional image, receives a response that includes an indication that the additional image is incorrectly labeled along with a replacement label, and instructs that the machine learning model be retrained using an updated training dataset that includes the replacement label.
Type: Application
Filed: October 21, 2022
Publication date: May 4, 2023
Inventors: Mark William Sabini, Kai Yang, Andrew Yan-Tak Ng, Daniel Bibireata, Dillon Laird, Whitney Blodgett, Yan Liu, Yazhou Cao, Yuxiang Zhang, Gregory Diamos, YuQing Zhou, Sanjay Boddhu, Quinn Killough, Shankaranand Jagadeesan, Camilo Zapata, Sebastian Rodriguez
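A minimal sketch of the review-and-retrain workflow described above, with a stub standing in for Neural Template Matching; every name and data structure here is hypothetical, since the application does not publish this API.

```python
# Sketch only: all names and structures are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LabeledImage:
    path: str
    label: str

def find_correlated(image: LabeledImage, roi: Tuple[int, int, int, int],
                    dataset: List[LabeledImage]) -> List[LabeledImage]:
    # Stand-in for Neural Template Matching: return images whose
    # features correlate with those inside the ROI of the given image.
    return dataset[:3]

def review_loop(misclassified: LabeledImage, roi, dataset, train_fn) -> None:
    updated = False
    for image in find_correlated(misclassified, roi, dataset):
        # A labeling "task" would be prompted to a reviewer here;
        # the response is simulated for the sketch.
        response = {"mislabeled": True, "replacement": "defect"}
        if response["mislabeled"]:
            image.label = response["replacement"]   # apply the replacement label
            updated = True
    if updated:
        train_fn(dataset)   # retrain on the corrected training set

dataset = [LabeledImage(f"img_{i}.png", "ok") for i in range(10)]
review_loop(dataset[0], roi=(10, 10, 50, 50), dataset=dataset,
            train_fn=lambda d: print(f"retraining on {len(d)} images"))
```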
-
Patent number: 11593655
Abstract: As deep learning application domains grow, a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements is extremely beneficial. Presented herein is a large-scale empirical study of error and model size growth as training sets grow. Embodiments of a methodology for this measurement are introduced herein, as well as embodiments for predicting other metrics, such as compute-related metrics. It is shown herein that a power law may be used to represent deep model relationships, such as that between error and training data size. It is also shown that model size scales sublinearly with data size. These scaling relationships have significant implications for deep learning research, practice, and systems. They can assist with model debugging, setting accuracy targets, and making decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
Type: Grant
Filed: November 30, 2018
Date of Patent: February 28, 2023
Assignee: Baidu USA LLC
Inventors: Joel Hestness, Gregory Diamos, Hee Woo Jun, Sharan Narang, Newsha Ardalani, Md Mostofa Ali Patwary, Yanqi Zhou
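The power-law relationship at the heart of the study can be fit in a few lines: regress log error against log training-set size. The data points below are synthetic, purely to show the mechanics.

```python
# Sketch only: synthetic data points, illustrative fit.
import numpy as np

train_sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors      = np.array([0.30, 0.24, 0.18, 0.14, 0.11])   # synthetic

# log(error) = log(a) + b * log(size)  =>  fit a line in log-log space.
b, log_a = np.polyfit(np.log(train_sizes), np.log(errors), deg=1)
a = np.exp(log_a)

predicted = a * (3e6) ** b   # extrapolate to a larger training set
print(f"error ~ {a:.3f} * m^{b:.3f}; predicted at m=3e6: {predicted:.3f}")
```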
-
Patent number: 11562733
Abstract: Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows a large amount of varied training data to be obtained efficiently.
Type: Grant
Filed: August 15, 2019
Date of Patent: January 24, 2023
Assignee: BAIDU USA LLC
Inventors: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Gregory Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubhabrata Sengupta, Adam Coates, Andrew Ng
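A minimal sketch of the training objective such an end-to-end system rests on: a recurrent network over spectrogram frames trained with CTC loss, so no phoneme dictionary or frame-level alignment is required. All sizes are illustrative.

```python
# Sketch only: layer sizes and alphabet size are illustrative.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_features: int = 161, n_chars: int = 29):
        super().__init__()
        self.rnn = nn.GRU(n_features, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, n_chars)    # characters + CTC blank

    def forward(self, x):                     # x: (batch, time, n_features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)    # (batch, time, n_chars)

model = SpeechRNN()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(2, 100, 161)                  # spectrogram frames
targets = torch.randint(1, 29, (2, 20))       # character indices (0 = blank)
log_probs = model(x).transpose(0, 1)          # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100),
           target_lengths=torch.full((2,), 20))
loss.backward()
```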
-
Patent number: 11462209
Abstract: For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture based on transposed convolutions to achieve high compute intensity and fast inference. In one or more embodiments, for training of the convolutional vocoder architecture, losses related to perceptual audio quality are used, along with a GAN framework in which a critic discerns unrealistic waveforms. While yielding high-quality audio, embodiments of the model can synthesize audio more than 500 times faster than real time. Multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from spectrograms are also disclosed. MCNN embodiments enable significantly better utilization of modern multi-core processors than commonly used iterative algorithms like Griffin-Lim and yield very fast (more than 300× real-time) waveform synthesis.
Type: Grant
Filed: March 27, 2019
Date of Patent: October 4, 2022
Assignee: Baidu USA LLC
Inventors: Sercan Arik, Hee Woo Jun, Eric Undersander, Gregory Diamos
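A minimal sketch of the transposed-convolution idea: several convolutional heads each upsample the spectrogram to audio rate, and their outputs are averaged. The head count, strides, and channel widths are assumptions, not the patent's configuration.

```python
# Sketch only: head count, strides, and channels are assumptions.
import torch
import torch.nn as nn

class ConvHead(nn.Module):
    """Upsamples mel frames to waveform samples (256x total here)."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 64, kernel_size=8, stride=4, padding=2),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 16, kernel_size=8, stride=4, padding=2),
            nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=32, stride=16, padding=8),
        )

    def forward(self, spec):            # spec: (batch, n_mels, frames)
        return self.net(spec)           # (batch, 1, frames * 256)

class MultiHeadVocoder(nn.Module):
    def __init__(self, heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(ConvHead() for _ in range(heads))

    def forward(self, spec):
        # Average the heads, then squash into a valid waveform range.
        return torch.tanh(torch.stack([h(spec) for h in self.heads]).mean(0))

waveform = MultiHeadVocoder()(torch.randn(1, 80, 100))   # -> (1, 1, 25600)
```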
-
Publication number: 20210049999
Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech in different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were evaluated for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
Type: Application
Filed: November 2, 2020
Publication date: February 18, 2021
Applicant: Baidu USA LLC
Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU
-
Publication number: 20210027762
Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
Type: Application
Filed: October 1, 2020
Publication date: January 28, 2021
Applicant: Baidu USA LLC
Inventors: Sercan O. ARIK, Mike CHRZANOWSKI, Adam COATES, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Andrew NG, Jonathan RAIMAN, Shubhabrata SENGUPTA, Mohammad SHOEYBI
-
Patent number: 10896669
Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech in different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were evaluated for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
Type: Grant
Filed: May 8, 2018
Date of Patent: January 19, 2021
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
-
Patent number: 10872598
Abstract: Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
Type: Grant
Filed: January 29, 2018
Date of Patent: December 22, 2020
Assignee: Baidu USA LLC
Inventors: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhabrata Sengupta, Mohammad Shoeybi
-
Patent number: 10832120
Abstract: Systems and methods for a multi-core optimized Recurrent Neural Network (RNN) architecture are disclosed. The various architectures affect communication and synchronization operations according to the Multi-Bulk-Synchronous-Parallel (MBSP) model for a given processor. The resulting family of network architectures, referred to as MBSP-RNNs, performs similarly to conventional RNNs having the same number of parameters, but is substantially more efficient when mapped onto a modern general-purpose processor. Due to the large gain in computational efficiency, for a fixed computational budget, MBSP-RNNs outperform RNNs at applications such as end-to-end speech recognition.
Type: Grant
Filed: April 5, 2016
Date of Patent: November 10, 2020
Assignee: Baidu USA LLC
Inventors: Gregory Diamos, Awni Hannun, Bryan Catanzaro, Dario Amodei, Erich Elsen, Jesse Engel, Shubhabrata Sengupta
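The locality idea can be sketched as a recurrent cell whose recurrence matrix is block-diagonal, so each group of hidden units updates from local state only; the group count and the local-only recurrence are illustrative assumptions about the architecture family, not the patented design.

```python
# Sketch only: group count and local-only recurrence are assumptions.
import torch
import torch.nn as nn

class BlockLocalRNNCell(nn.Module):
    """Recurrent cell with a block-diagonal recurrence matrix."""
    def __init__(self, n_in: int = 64, n_hidden: int = 128, groups: int = 4):
        super().__init__()
        self.groups, self.block = groups, n_hidden // groups
        self.w_in = nn.Linear(n_in, n_hidden)
        # One small recurrent matrix per group instead of one dense matrix,
        # so each group's update reads only its own slice of the state.
        self.w_rec = nn.Parameter(torch.randn(groups, self.block, self.block) * 0.1)

    def forward(self, x, h):
        # h: (batch, n_hidden) split into per-group chunks of local state.
        hg = h.reshape(h.size(0), self.groups, self.block)
        local = torch.einsum("bgi,gij->bgj", hg, self.w_rec).reshape_as(h)
        return torch.tanh(self.w_in(x) + local)

cell = BlockLocalRNNCell()
h = torch.zeros(2, 128)
for t in range(10):                     # unroll over a short sequence
    h = cell(torch.randn(2, 64), h)
```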
-
Publication number: 20200175374
Abstract: As deep learning application domains grow, a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements is extremely beneficial. Presented herein is a large-scale empirical study of error and model size growth as training sets grow. Embodiments of a methodology for this measurement are introduced herein, as well as embodiments for predicting other metrics, such as compute-related metrics. It is shown herein that a power law may be used to represent deep model relationships, such as that between error and training data size. It is also shown that model size scales sublinearly with data size. These scaling relationships have significant implications for deep learning research, practice, and systems. They can assist with model debugging, setting accuracy targets, and making decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
Type: Application
Filed: November 30, 2018
Publication date: June 4, 2020
Applicant: Baidu USA LLC
Inventors: Joel HESTNESS, Gregory DIAMOS, Hee Woo JUN, Sharan NARANG, Newsha ARDALANI, Md Mostofa Ali PATWARY, Yanqi ZHOU
-
Patent number: 10540957
Abstract: Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows a large amount of varied training data to be obtained efficiently.
Type: Grant
Filed: June 9, 2015
Date of Patent: January 21, 2020
Assignee: BAIDU USA LLC
Inventors: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Gregory Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubhabrata Sengupta, Adam Coates, Andrew Y. Ng
-
Publication number: 20190371298
Abstract: Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows a large amount of varied training data to be obtained efficiently.
Type: Application
Filed: August 15, 2019
Publication date: December 5, 2019
Applicant: BAIDU USA LLC
Inventors: Awni HANNUN, Carl CASE, Jared Casper, Bryan Catanzaro, Gregory Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubhabrata Sengupta, Adam Coates, Andrew Ng
-
Publication number: 20190355347
Abstract: For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture based on transposed convolutions to achieve high compute intensity and fast inference. In one or more embodiments, for training of the convolutional vocoder architecture, losses related to perceptual audio quality are used, along with a GAN framework in which a critic discerns unrealistic waveforms. While yielding high-quality audio, embodiments of the model can synthesize audio more than 500 times faster than real time. Multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from spectrograms are also disclosed. MCNN embodiments enable significantly better utilization of modern multi-core processors than commonly used iterative algorithms like Griffin-Lim and yield very fast (more than 300× real-time) waveform synthesis.
Type: Application
Filed: March 27, 2019
Publication date: November 21, 2019
Applicant: Baidu USA LLC
Inventors: Sercan ARIK, Hee Woo JUN, Eric UNDERSANDER, Gregory DIAMOS
-
Publication number: 20190354837
Abstract: Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS mainly targets improving accuracy but lacks consideration of computational resource use. Presented herein are embodiments of a Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS using reinforcement learning with network embedding. RENA embodiments use a policy network to process the network embeddings to generate new configurations. Example demonstrations of RENA embodiments on image recognition and keyword spotting (KWS) problems are also presented herein. RENA embodiments can find novel architectures that achieve high performance even with tight resource constraints. For the CIFAR10 dataset, the tested embodiment achieved 2.95% test error when compute intensity was greater than 100 FLOPs/byte, and 3.87% test error when model size was less than 3M parameters.
Type: Application
Filed: March 8, 2019
Publication date: November 21, 2019
Applicant: Baidu USA LLC
Inventors: Yanqi ZHOU, Siavash EBRAHIMI, Sercan ARIK, Haonan YU, Hairong LIU, Gregory DIAMOS
-
Publication number: 20190130271
Abstract: Described herein are systems and methods to prune deep neural network models, reducing the overall memory and compute requirements of these models. It is demonstrated that, using block pruning and group lasso combined with pruning during training, block-sparse recurrent neural networks (RNNs) may be built that are as accurate as dense baseline models. Two different approaches are disclosed to induce block sparsity in neural network models: pruning blocks of weights in a layer and using group lasso regularization to create blocks of weights with zeros. Using these techniques, it is demonstrated that block-sparse RNNs with high sparsity can be created with small loss in accuracy. Block-sparse RNNs eliminate overheads related to data storage and irregular memory accesses while increasing hardware efficiency compared to unstructured sparsity.
Type: Application
Filed: October 4, 2018
Publication date: May 2, 2019
Applicant: Baidu USA LLC
Inventors: Sharan NARANG, Eric UNDERSANDER, Gregory DIAMOS
-
Publication number: 20180336880
Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech in different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were evaluated for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
Type: Application
Filed: May 8, 2018
Publication date: November 22, 2018
Applicant: Baidu USA LLC
Inventors: Sercan O. ARIK, Gregory DIAMOS, Andrew GIBIANSKY, John MILLER, Kainan PENG, Wei PING, Jonathan RAIMAN, Yanqi ZHOU