Prosody Rules Derived From Text (epo) Patents (Class 704/E13.013)
  • Patent number: 11651763
    Abstract: Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech in different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were evaluated for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets, showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
    Type: Grant
    Filed: November 2, 2020
    Date of Patent: May 16, 2023
    Assignee: Baidu USA LLC
    Inventors: Sercan O. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
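The augmentation described above amounts to conditioning the synthesis network on a learned per-speaker vector. A minimal PyTorch-style sketch of that conditioning (module structure, dimensions, and names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Toy text-to-spectrogram model conditioned on a trainable speaker embedding."""

    def __init__(self, vocab_size=64, num_speakers=108, spk_dim=16, hidden=128, n_mels=80):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, hidden)
        # Low-dimensional speaker embeddings, learned jointly with the rest of the model.
        self.spk_embed = nn.Embedding(num_speakers, spk_dim)
        self.rnn = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, chars, speaker_ids):
        x = self.char_embed(chars)                    # (B, T, hidden)
        s = self.spk_embed(speaker_ids)               # (B, spk_dim)
        s = s.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time steps
        h, _ = self.rnn(torch.cat([x, s], dim=-1))
        return self.to_mel(h)                         # predicted mel frames

model = MultiSpeakerTTS()
mels = model(torch.randint(0, 64, (2, 20)), torch.tensor([3, 42]))
print(mels.shape)  # torch.Size([2, 20, 80])
```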
  • Patent number: 8886538
    Abstract: Systems and methods for speech synthesis and, in particular, text-to-speech systems and methods for converting a text input to a synthetic waveform by processing the prosodic and phonetic content of a spoken example of the text input to accurately mimic the input speech style and pronunciation. Systems and methods provide an interface to a TTS system that allows a user to input a text string and a spoken utterance of the text string, extract prosodic parameters from the spoken input, and process those parameters to derive corresponding markup for the text input, enabling more natural-sounding synthesized speech.
    Type: Grant
    Filed: September 26, 2003
    Date of Patent: November 11, 2014
    Assignee: Nuance Communications, Inc.
    Inventors: Andy Aaron, Raimo Bakis, Ellen M. Eide, Wael M. Hamza
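The flow above is: record the user speaking the text, extract prosodic parameters, and turn them into markup the TTS engine can consume. A hedged Python sketch with hand-written measurements standing in for real pitch tracking and forced alignment (the markup format shown is SSML-like and purely illustrative):

```python
# Hypothetical per-word prosody measurements extracted from the spoken example.
# In a real system these would come from pitch tracking and forced alignment.
measured = [
    {"word": "hello", "pitch_hz": 180.0, "duration_ms": 320},
    {"word": "world", "pitch_hz": 140.0, "duration_ms": 450},
]

def to_markup(words, base_pitch_hz=150.0):
    """Render extracted prosody as SSML-like <prosody> tags (format illustrative)."""
    parts = []
    for w in words:
        # Express pitch as a percentage offset from the voice's base pitch.
        pitch_pct = 100.0 * (w["pitch_hz"] - base_pitch_hz) / base_pitch_hz
        parts.append(
            f'<prosody pitch="{pitch_pct:+.0f}%" duration="{w["duration_ms"]}ms">'
            f'{w["word"]}</prosody>'
        )
    return " ".join(parts)

print(to_markup(measured))
```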
  • Patent number: 7962341
    Abstract: A method for the prosodic labelling of speech, including performing a first analysis step using data from an audio file, wherein the audio file is analysed as a plurality of frames positioned at fixed time intervals in said audio file; and performing a second analysis step on said data from said audio file using results of said first analysis step, wherein analysis is performed using a plurality of analysis windows and wherein the positions of the analysis windows are determined by segmental information.
    Type: Grant
    Filed: December 8, 2006
    Date of Patent: June 14, 2011
    Assignee: Kabushiki Kaisha Toshiba
    Inventor: Norbert Braunschweiler
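The two analysis passes differ only in window placement: fixed-interval frames first, then windows positioned from the segmental information recovered in pass one. A toy NumPy sketch of that two-pass structure (frame size, the energy feature, and the boundary rule are all illustrative):

```python
import numpy as np

def pass_one(audio, sr=16000, hop_s=0.010):
    """Fixed-interval analysis: one energy value per 10 ms frame."""
    hop = int(sr * hop_s)
    frames = [audio[i:i + hop] for i in range(0, len(audio) - hop, hop)]
    return np.array([float(np.mean(f ** 2)) for f in frames])

def pass_two(audio, energies, sr=16000, hop_s=0.010, thresh=0.01):
    """Segment-aligned analysis: windows span runs of high-energy frames
    found in pass one, instead of sitting at fixed positions."""
    hop = int(sr * hop_s)
    segments, start = [], None
    for i, e in enumerate(energies):
        if e >= thresh and start is None:
            start = i
        elif e < thresh and start is not None:
            segments.append((start * hop, i * hop))
            start = None
    if start is not None:
        segments.append((start * hop, len(energies) * hop))
    # One analysis window per segment, e.g. mean energy over the whole segment.
    return [(s, e, float(np.mean(audio[s:e] ** 2))) for s, e in segments]

audio = np.concatenate([np.zeros(1600), 0.5 * np.random.randn(4800), np.zeros(1600)])
print(pass_two(audio, pass_one(audio)))
```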
  • Publication number: 20100312563
    Abstract: Techniques to create and share custom voice fonts are described. An apparatus may include a preprocessing component to receive voice audio data and a corresponding text script from a client and to process the voice audio data to produce prosody labels and a rich script. The apparatus may further include a verification component to automatically verify the voice audio data against the text script. The apparatus may further include a training component to train a custom voice font from the verified voice audio data and rich script and to generate custom voice font data usable by a text-to-speech (TTS) component. Other embodiments are described and claimed.
    Type: Application
    Filed: June 4, 2009
    Publication date: December 9, 2010
    Applicant: MICROSOFT CORPORATION
    Inventors: Sheng Zhao, Zhi Li, Shenghao Qin, Chiwei Che, Jingyang Xu, Binggong Ding
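The pipeline has three stages: preprocessing into prosody labels plus a rich script, automatic verification of the audio against the script, and voice-font training. A skeletal Python sketch of the stage interfaces, with every function body stubbed (all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RichScript:
    text: str
    prosody_labels: list  # e.g. per-word stress/boundary labels

def preprocess(audio_files, script_lines):
    """Produce prosody labels and a rich script from raw recordings (stubbed)."""
    return [RichScript(text=line, prosody_labels=["H*"] * len(line.split()))
            for line in script_lines]

def verify(audio_files, rich_scripts):
    """Automatically check that each recording matches its script line.
    A real verifier would run speech recognition and compare transcripts."""
    return [(a, s) for a, s in zip(audio_files, rich_scripts) if s.text.strip()]

def train_voice_font(verified_pairs):
    """Train a custom voice font usable by a TTS component (stubbed)."""
    return {"num_utterances": len(verified_pairs)}

scripts = preprocess(["u1.wav", "u2.wav"], ["Hello there.", "Good morning."])
font = train_voice_font(verify(["u1.wav", "u2.wav"], scripts))
print(font)
```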
  • Publication number: 20090300041
    Abstract: A method and system are disclosed that train a text-to-speech synthesis system for use in speech synthesis. The method includes generating a speech database of audio files comprising domain-specific voices having various prosodies, and training a text-to-speech synthesis system using the speech database by selecting audio segments having a prosody based on at least one dialog state. The system includes a processor, a speech database of audio files, and modules for implementing the method.
    Type: Application
    Filed: August 13, 2009
    Publication date: December 3, 2009
    Applicant: AT&T Corp.
    Inventor: Horst Juergen Schroeter
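The training idea is to index recorded segments by their prosody and select among them per dialog state at synthesis time. A minimal sketch of such a selection table (dialog states, prosody tags, and file names are illustrative):

```python
# Speech database: audio segments tagged with the prosody they carry.
speech_db = [
    {"file": "greet_up.wav",   "text": "hello", "prosody": "rising"},
    {"file": "greet_flat.wav", "text": "hello", "prosody": "neutral"},
    {"file": "sorry_low.wav",  "text": "sorry", "prosody": "falling"},
]

# Hypothetical mapping from dialog state to the prosody it calls for.
STATE_TO_PROSODY = {"greeting": "rising", "apology": "falling"}

def select_segment(text, dialog_state):
    """Pick an audio segment whose prosody matches the current dialog state,
    falling back to any segment with the right text."""
    want = STATE_TO_PROSODY.get(dialog_state, "neutral")
    candidates = [s for s in speech_db if s["text"] == text]
    for s in candidates:
        if s["prosody"] == want:
            return s
    return candidates[0] if candidates else None

print(select_segment("hello", "greeting"))  # -> greet_up.wav
```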
  • Publication number: 20090254349
    Abstract: A speech synthesizer can execute speech content editing at high speed and generate speech content easily. The speech synthesizer includes a small speech element DB (101), a small speech element selection unit (102), a small speech element concatenation unit (103), a prosody modification unit (104), a large speech element DB (105), a correspondence DB (106) that associates the small speech element DB (101) with the large speech element DB (105), a speech element candidate obtainment unit (107), a large speech element selection unit (108), and a large speech element concatenation unit (109). By editing synthetic speech using the small speech element DB (101) and performing quality enhancement on an editing result using the large speech element DB (105), speech content can be generated easily on a mobile terminal.
    Type: Application
    Filed: May 11, 2007
    Publication date: October 8, 2009
    Inventors: Yoshifumi Hirose, Yumiko Kato, Takahiro Kamai
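Editing runs against the compact small-element DB, and the correspondence DB then maps each small element to a higher-quality large-DB element for the final render. A toy sketch of that two-stage substitution (all identifiers and data are illustrative):

```python
# Small DB: compact units used while interactively editing on the device.
small_db = {0: "ko_lo", 1: "ni_lo", 2: "chi_lo"}
# Large DB: high-quality units used only for the final rendering pass.
large_db = {100: "ko_hi", 101: "ni_hi", 102: "chi_hi"}
# Correspondence DB: which large element replaces which small element.
correspondence = {0: 100, 1: 101, 2: 102}

def edit_with_small_db(element_ids):
    """Fast concatenation of small elements during editing."""
    return [small_db[i] for i in element_ids]

def enhance_with_large_db(element_ids):
    """Re-render the edited sequence by swapping in corresponding large elements."""
    return [large_db[correspondence[i]] for i in element_ids]

draft = edit_with_small_db([0, 1, 2])
final = enhance_with_large_db([0, 1, 2])
print(draft, "->", final)
```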
  • Publication number: 20090055188
    Abstract: The prosody control unit pattern generation module generates pitch patterns for the respective prosody control units based on language attribute information, phoneme duration, and emphasis degree information. The modification method decision module decides, based at least on the emphasis degree information, a smoothing-based modification method for the pitch pattern at the connection portion between a prosody control unit and at least one of the previous and next prosody control units, and generates modification method information. The pattern connection module then modifies the pitch patterns generated for the respective prosody control units by smoothing according to the modification method information and connects them to generate a sentence pitch pattern for the text to be synthesized.
    Type: Application
    Filed: February 22, 2008
    Publication date: February 26, 2009
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventors: Gou Hirabayashi, Takehiko Kagoshima
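At each junction between prosody control units, the pitch contour is smoothed by an amount derived from the emphasis degree before the units are concatenated. A numeric sketch of one junction (the smoothing rule and window size are illustrative, not the patent's):

```python
import numpy as np

def smooth_junction(left, right, emphasis, max_window=3):
    """Blend the end of `left` into the start of `right`. Higher emphasis
    means less smoothing, so emphasized units keep their pitch shape."""
    window = max(1, int(max_window * (1.0 - emphasis)))
    joined = np.concatenate([left, right])
    j = len(left)  # index of the connection portion
    lo, hi = max(0, j - window), min(len(joined), j + window)
    # Replace the contour around the junction with a straight interpolation.
    joined[lo:hi] = np.linspace(joined[lo], joined[hi - 1], hi - lo)
    return joined

unit_a = np.array([120.0, 125.0, 130.0, 140.0])  # pitch in Hz per frame
unit_b = np.array([180.0, 175.0, 170.0, 160.0])
sentence_pitch = smooth_junction(unit_a, unit_b, emphasis=0.2)
print(np.round(sentence_pitch, 1))  # junction ramp: 130 -> 145 -> 160 -> 175
```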
  • Publication number: 20090048821
    Abstract: Embodiments are directed towards a language learning environment accessible from within virtually any website that enables a user to practice a language using tools such as translators, and text to speech capabilities. In one embodiment, the user may access a webpage in one language, and employ the language widget to select portions of content on the webpage, perform translation of the content, or perform a text to audio (speech) conversion of the selected portions. The text to speech conversion may be performed independent of translation, thereby allowing the user to hear a pronunciation of text within the website in a language associated with the website. The user may download an audio file of the converted text for use in later replay for mobile learning.
    Type: Application
    Filed: June 2, 2008
    Publication date: February 19, 2009
    Applicant: Yahoo! Inc.
    Inventors: Shuk Yin Yam, Jeong Sik Jang
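One possible reading of the widget's server side is a single handler that dispatches a text selection to either translation or synthesis, keeping the two independent as the abstract notes. A hedged sketch (all function names and the response shape are hypothetical):

```python
# Hypothetical handler behind the language widget: the user selects text on a
# page and requests either translation or speech, independently of each other.
def handle_selection(text, action, page_lang="en", target_lang="es"):
    if action == "translate":
        return {"type": "text", "body": translate(text, page_lang, target_lang)}
    if action == "speak":
        # TTS runs in the page's own language; no translation step required.
        return {"type": "audio", "body": synthesize(text, page_lang)}
    raise ValueError(f"unknown action: {action}")

def translate(text, src, dst):  # stub standing in for a translation service
    return f"[{src}->{dst}] {text}"

def synthesize(text, lang):     # stub standing in for a downloadable audio blob
    return f"<{lang} audio for: {text!r}>"

print(handle_selection("good morning", "speak"))
```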
  • Publication number: 20080167875
    Abstract: An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. Facilities are provided to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users interact with the software tool through a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats.
    Type: Application
    Filed: January 9, 2007
    Publication date: July 10, 2008
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Raimo Bakis, Ellen M. Eide, Roberto Pieraccini, Maria E. Smith, Jie Zeng
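The tool's edits, pitch and duration targets in particular, map naturally onto the SSML <prosody> element. A small sketch that assembles an SSML document from such edits (speaking type and paralinguistic events would need vendor extensions, stubbed here as a comment):

```python
def build_ssml(text, pitch=None, rate=None, style_comment=None):
    """Wrap text in SSML <prosody> using the user's pitch/rate edits.
    Speaking style and paralinguistic events would use extended SSML."""
    attrs = []
    if pitch is not None:
        attrs.append(f'pitch="{pitch}"')
    if rate is not None:
        attrs.append(f'rate="{rate}"')
    body = f'<prosody {" ".join(attrs)}>{text}</prosody>' if attrs else text
    if style_comment:
        # Placeholder for an extended-SSML style/paralinguistic annotation.
        body = f"<!-- style: {style_comment} -->{body}"
    return f'<speak version="1.0">{body}</speak>'

print(build_ssml("See you tomorrow.", pitch="+15%", rate="slow",
                 style_comment="cheerful"))
```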
  • Publication number: 20080147405
    Abstract: The present invention provides a method and apparatus for forming Chinese prosodic words. The method comprises the steps of: inputting Chinese text; performing word segmentation and part-of-speech annotation on the input text to generate an initial prosodic word sequence; inserting grids representing prosodic word boundaries for all the words in the initial sequence to generate a grid prosodic word sequence; annotating, based on the prosodic word forming means, the grids that are candidates for deletion; judging which of those candidate grids actually need to be deleted; and deleting those grids from the grid prosodic word sequence, so that the words between every two adjacent remaining grids are formed into prosodic words.
    Type: Application
    Filed: December 10, 2007
    Publication date: June 19, 2008
    Applicant: FUJITSU LIMITED
    Inventors: Guo Qing, Nobuyuki Katae
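The method places a boundary grid between every pair of segmented words and then deletes the grids where neighbors should fuse into one prosodic word. A toy sketch in which a simple length heuristic stands in for the patent's prosodic word forming means:

```python
def form_prosodic_words(words, max_len=2):
    """Insert a boundary grid after every word, then delete grids between
    short neighbors so they merge into a single prosodic word."""
    # Start with a grid after each word: w0 | w1 | w2 | ...
    grids = [True] * (len(words) - 1)
    for i in range(len(words) - 1):
        # Heuristic stand-in for the patent's deletion criteria:
        # merge when the fused word would still be short.
        if len(words[i]) + len(words[i + 1]) <= max_len:
            grids[i] = False
    prosodic, current = [], words[0]
    for i, keep in enumerate(grids):
        if keep:
            prosodic.append(current)
            current = words[i + 1]
        else:
            current += words[i + 1]
    prosodic.append(current)
    return prosodic

# Segmented input (illustrative): single characters and two-character words.
print(form_prosodic_words(["我", "们", "今天", "去", "北京"]))
# -> ['我们', '今天', '去', '北京']
```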
  • Publication number: 20080147402
    Abstract: Disclosed are apparatus and methods that employ a modified version of a computational model of the human peripheral and central auditory system, and that provide for automatic pattern recognition using category-dependent feature selection. The validity of the model's output is examined by deriving feature vectors from the dimension-expanded cortical response of the central auditory system for use in a conventional phoneme recognition task. In addition, the cortical response may be a place-coded data set in which sounds are categorized according to the regions containing their most distinguishing features. This provides for novel category-dependent feature selection apparatus and methods in which this mechanism may be utilized to better simulate robust human pattern (speech) recognition.
    Type: Application
    Filed: November 29, 2007
    Publication date: June 19, 2008
    Inventors: Woojay Jeon, Biing-Hwang Juang
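Category-dependent feature selection keeps, per category, the response regions that best distinguish it, and scores each category only on its own regions. A small NumPy sketch of that mechanism (the cortical response model itself is not reproduced; random vectors stand in for it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "cortical responses": 60-dim vectors for two phoneme categories,
# each category lighting up a different region of the response.
class_a = rng.normal(0.0, 1.0, (50, 60))
class_a[:, :10] += 2.0        # category A's distinguishing region: dims 0-9
class_b = rng.normal(0.0, 1.0, (50, 60))
class_b[:, 30:40] += 2.0      # category B's distinguishing region: dims 30-39

def top_regions(own, other, k=10):
    """Category-dependent selection: keep the k dimensions where this
    category's mean response most exceeds the other's."""
    return np.argsort(own.mean(0) - other.mean(0))[-k:]

regions = {"A": top_regions(class_a, class_b), "B": top_regions(class_b, class_a)}
means = {"A": class_a.mean(0), "B": class_b.mean(0)}

def classify(x):
    """Score each category only on its own selected regions."""
    score = {c: -np.linalg.norm(x[r] - means[c][r]) for c, r in regions.items()}
    return max(score, key=score.get)

test = rng.normal(0.0, 1.0, 60)
test[30:40] += 2.0
print(classify(test))  # expected: 'B'
```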
  • Publication number: 20080109225
    Abstract: A speech piece editing section (5) retrieves, from a speech piece database (7), speech piece data whose reading matches that of a speech piece in a fixed message, and converts the speech piece to match the speed specified by utterance speed data. The speech piece editing section (5) predicts the prosody of the fixed message and, according to the prosody prediction results, selects the item of retrieved speech piece data that best matches each speech piece of the fixed message. However, if the degree of match for the speech piece corresponding to the selected item of speech piece data does not reach a predetermined value, the selection is cancelled. For each speech piece for which no selection is made, waveform data representing the waveform of each unit speech is supplied to a sound processing section (41). The selected speech piece data and the supplied waveform data are interconnected to create data representing a synthesized speech.
    Type: Application
    Filed: March 10, 2006
    Publication date: May 8, 2008
    Applicant: KABUSHIKI KAISHA KENWOOD
    Inventor: Yasushi Sato
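Selection proceeds piece by piece under the predicted prosody; when the best candidate falls below the threshold, the piece falls back to unit-by-unit waveform synthesis. A compact sketch of that decision (scores, threshold, and the synthesis stub are illustrative):

```python
# Retrieved candidates per speech piece in the fixed message, with a match
# score against the predicted prosody (illustrative values).
candidates = {
    "good":    [("good_01.pcm", 0.91), ("good_02.pcm", 0.55)],
    "morning": [("morning_01.pcm", 0.42)],  # best match is below threshold
}

THRESHOLD = 0.6

def synthesize_units(piece):
    """Fallback: per-unit waveform synthesis for unmatched pieces (stubbed)."""
    return f"<synth:{piece}>"

def assemble(message_pieces):
    out = []
    for piece in message_pieces:
        best = max(candidates.get(piece, []), key=lambda c: c[1], default=None)
        if best and best[1] >= THRESHOLD:
            out.append(best[0])                   # use the recorded speech piece
        else:
            out.append(synthesize_units(piece))   # selection cancelled
    return out

print(assemble(["good", "morning"]))  # ['good_01.pcm', '<synth:morning>']
```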