Elementary Speech Units Used In Speech Synthesizers; Concatenation Rules (epo) Patents (Class 704/E13.009)
  • Publication number: 20140142946
    Abstract: The present invention is a method and system to convert a speech signal into a parametric representation in terms of timbre vectors, and to recover the speech signal therefrom. The speech signal is first segmented into non-overlapping frames using glottal closure instant information; each frame is converted into an amplitude spectrum using a Fourier analyzer, and Laguerre functions are then used to generate a set of coefficients which constitute a timbre vector. A sequence of timbre vectors can be subject to a variety of manipulations. The new timbre vectors are converted back into voice signals by first transforming into amplitude spectra using Laguerre functions, then generating phase spectra from the amplitude spectra using the Kramers-Kronig relations. A Fourier transformer converts the amplitude spectra and phase spectra into elementary waveforms, which are then superposed to become the output voice. The method and system can be used for voice transformation, speech synthesis, and automatic speech recognition.
    Type: Application
    Filed: September 24, 2012
    Publication date: May 22, 2014
    Inventor: Chengjun Julian Chen
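The phase-reconstruction step in this abstract (amplitude spectrum to phase spectrum via the Kramers-Kronig relations) can be illustrated with a minimum-phase sketch. The cepstrum folding below is a standard discrete form of that relation, not the patent's implementation, and the Hanning-window frame is a made-up input:

```python
import numpy as np

def minimum_phase(amplitude):
    """Derive a minimum-phase complex spectrum from an amplitude spectrum,
    using the cepstrum form of the Kramers-Kronig (Hilbert) relation
    between log-magnitude and phase."""
    n = len(amplitude)
    log_mag = np.log(np.maximum(amplitude, 1e-12))
    cep = np.fft.ifft(log_mag).real          # real cepstrum
    # Fold the anticausal part onto the causal part (minimum-phase window).
    w = np.zeros(n)
    w[0] = 1.0
    w[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        w[n // 2] = 1.0
    phase = np.fft.fft(cep * w).imag
    return amplitude * np.exp(1j * phase)

# One frame: amplitude spectrum -> complex spectrum -> elementary waveform.
amp = np.abs(np.fft.fft(np.hanning(64)))
spec = minimum_phase(amp)
waveform = np.fft.ifft(spec).real
```

Superposing such frame waveforms (overlap-add) would then yield the output voice.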
  • Patent number: 8731931
    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch.
    Type: Grant
    Filed: June 18, 2010
    Date of Patent: May 20, 2014
    Assignee: AT&T Intellectual Property I, L.P.
    Inventor: Alistair D. Conkie
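The cost-based path search described above is essentially a Viterbi-style dynamic program over the ordered candidate lists. A minimal sketch, with hypothetical `target_cost` and `concat_cost` functions standing in for the patent's cost analysis:

```python
def lowest_cost_path(lists, target_cost, concat_cost):
    """Find the lowest-cost path of one unit per ordered candidate list,
    summing per-unit target costs and pairwise concatenation costs."""
    best = [(target_cost(u), [u]) for u in lists[0]]
    for candidates in lists[1:]:
        new_best = []
        for u in candidates:
            cost, path = min((c + concat_cost(p[-1], u), p) for c, p in best)
            new_best.append((cost + target_cost(u), path + [u]))
        best = new_best
    return min(best)

# Toy example: units represented by pitch values, echoing the abstract's
# note that the lists can be ordered by pitch; prefer smooth pitch near 100.
lists = [[90, 110], [95, 140], [100, 80]]
cost, path = lowest_cost_path(
    lists,
    target_cost=lambda u: abs(u - 100),
    concat_cost=lambda a, b: abs(a - b),
)
```

Building, for each unit, a sublist of suitable units from the next list (as in the claim) simply prunes the inner `min` to fewer predecessors.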
  • Publication number: 20120310651
    Abstract: A voice signal is synthesized using a plurality of phonetic piece data each indicating a phonetic piece containing at least two phoneme sections corresponding to different phonemes. In the apparatus, a phonetic piece adjustor forms a target section from first and second phonetic pieces so as to connect the first and second phonetic pieces to each other such that the target section includes a rear phoneme section of the first piece and a front phoneme section of the second piece, and expands the target section by a target time length to form an adjustment section such that a central part is expanded at an expansion rate higher than that of front and rear parts of the target section, to thereby create synthesized phonetic piece data having the target time length. A voice synthesizer creates a voice signal from the synthesized phonetic piece data.
    Type: Application
    Filed: May 31, 2012
    Publication date: December 6, 2012
    Applicant: Yamaha Corporation
    Inventor: Keijiro SAINO
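The non-uniform expansion described (central part stretched at a higher rate than the front and rear parts) can be sketched with a monotone time warp; the sinusoidal warp shape and `depth` parameter are illustrative choices, not the patent's adjustment rule:

```python
import numpy as np

def expand_section(samples, target_len, depth=0.1):
    """Stretch `samples` to target_len so the central part is expanded at a
    higher rate than the edges (linear-interpolation sketch)."""
    n = len(samples)
    t = np.linspace(0.0, 1.0, target_len)
    # Monotone warp: derivative < 1 near the centre, so the input's central
    # region is spread over more output samples than the front/rear parts.
    warp = t + depth * np.sin(2 * np.pi * t)
    return np.interp(warp * (n - 1), np.arange(n), samples)

section = np.arange(100, dtype=float)   # stand-in for a target section
stretched = expand_section(section, 150)
```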
  • Publication number: 20110218810
    Abstract: A system for controlling digital effects in live performances with vocal improvisation is described. The system features a complex controller that in one embodiment utilizes several magnetically activated electronic switches attached to a glove that is worn by an artist during a live performance. The switches are activated by a permanent magnet that is also attached to the switch bearing glove and a second magnet attached to a glove worn on the opposite hand. Furthermore, the switches are wirelessly connected by a miniature, battery-operated wireless data communications unit to a digital vocal processor unit that provides a dual mode, multi-channel phrase looping capability wherein individual channels can be selected for re-recording and selected banks of channels can be deleted during the performance. This combination of features allows a complex sequence of digital effects to be controlled by the artist during a performance while maintaining the freedom of movement desired to enhance the performance.
    Type: Application
    Filed: February 28, 2011
    Publication date: September 8, 2011
    Inventor: Momilani Ramstrum
  • Publication number: 20110125493
    Abstract: The voice quality conversion apparatus includes: low-frequency harmonic level calculating units and a harmonic level mixing unit for calculating a low-frequency sound source spectrum by mixing a level of a harmonic of an input sound source waveform and a level of a harmonic of a target sound source waveform at a predetermined conversion ratio for each order of harmonics including fundamental, in a frequency range equal to or lower than a boundary frequency; a high-frequency spectral envelope mixing unit that calculates a high-frequency sound source spectrum by mixing the input sound source spectrum and the target sound source spectrum at the predetermined conversion ratio in a frequency range larger than the boundary frequency; and a spectrum combining unit that combines the low-frequency sound source spectrum with the high-frequency sound source spectrum at the boundary frequency to generate a sound source spectrum for an entire frequency range.
    Type: Application
    Filed: January 31, 2011
    Publication date: May 26, 2011
    Inventors: Yoshifumi Hirose, Takahiro Kamai
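The split mixing scheme (harmonic levels below a boundary frequency, spectral envelope above it, both at one conversion ratio) can be sketched per frequency bin. Real harmonic analysis is simplified away here, and the log-domain mixing for the low band is an assumption meant to mimic level mixing:

```python
import numpy as np

def mix_source_spectra(input_spec, target_spec, ratio, boundary_bin):
    """Blend two sound-source spectra at a conversion ratio (0=input,
    1=target): bins below the boundary are mixed as levels (log domain),
    bins above it as an envelope (linear domain), then combined."""
    inp = np.asarray(input_spec, dtype=float)
    tgt = np.asarray(target_spec, dtype=float)
    low = np.exp((1 - ratio) * np.log(inp[:boundary_bin])
                 + ratio * np.log(tgt[:boundary_bin]))
    high = (1 - ratio) * inp[boundary_bin:] + ratio * tgt[boundary_bin:]
    return np.concatenate([low, high])

a = np.linspace(1.0, 2.0, 8)   # stand-in input source spectrum
b = np.linspace(2.0, 1.0, 8)   # stand-in target source spectrum
mixed = mix_source_spectra(a, b, ratio=0.5, boundary_bin=4)
```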
  • Publication number: 20110087488
    Abstract: According to an embodiment, a speech synthesis apparatus includes a selecting unit configured to select speaker's parameters one by one for respective speakers and obtain a plurality of speakers' parameters, the speaker's parameters being prepared for respective pitch waveforms corresponding to speaker's speech sounds, the speaker's parameters including formant frequencies, formant phases, formant powers, and window functions concerning respective formants that are contained in the respective pitch waveforms. The apparatus includes a mapping unit configured to make formants correspond to each other between the plurality of speakers' parameters using a cost function based on the formant frequencies and the formant powers. The apparatus includes a generating unit configured to generate an interpolated speaker's parameter by interpolating, at desired interpolation ratios, the formant frequencies, formant phases, formant powers, and window functions of formants which are made to correspond to each other.
    Type: Application
    Filed: December 16, 2010
    Publication date: April 14, 2011
    Inventors: Ryo Morinaka, Takehiko Kagoshima
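The mapping step (making formants correspond between speakers via a cost on formant frequencies and powers) and the interpolation step can be sketched as follows; the greedy matcher and the log-frequency cost are illustrative assumptions, not the patented cost function:

```python
import math

def map_formants(f_a, p_a, f_b, p_b, w_freq=1.0, w_pow=1.0):
    """Greedily pair formants of two speakers by a cost on log-frequency
    and power differences (hypothetical cost form)."""
    pairs, used = [], set()
    for i, (fa, pa) in enumerate(zip(f_a, p_a)):
        j = min((j for j in range(len(f_b)) if j not in used),
                key=lambda j: w_freq * abs(math.log(fa / f_b[j]))
                              + w_pow * abs(pa - p_b[j]))
        used.add(j)
        pairs.append((i, j))
    return pairs

def interpolate_formants(pairs, f_a, f_b, ratio):
    """Interpolate paired formant frequencies at the desired ratio."""
    return [(1 - ratio) * f_a[i] + ratio * f_b[j] for i, j in pairs]

f_a, p_a = [700.0, 1200.0, 2600.0], [0.0, -6.0, -12.0]   # speaker A
f_b, p_b = [650.0, 1100.0, 2500.0], [0.0, -5.0, -14.0]   # speaker B
pairs = map_formants(f_a, p_a, f_b, p_b)
interp = interpolate_formants(pairs, f_a, f_b, ratio=0.5)
```

In the patent, phases, powers, and window functions are interpolated the same way as the frequencies shown here.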
  • Publication number: 20110054903
    Abstract: Embodiments of rich text modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.
    Type: Application
    Filed: December 2, 2009
    Publication date: March 3, 2011
    Applicant: MICROSOFT CORPORATION
    Inventors: Zhi-Jie Yan, Yao Qian, Frank Kao-Ping Soong
  • Publication number: 20110046958
    Abstract: The present invention discloses a method and an apparatus for extracting a prosodic feature of a speech signal, the method including: dividing the speech signal into speech frames; transforming the speech frames from the time domain to the frequency domain; and extracting respective prosodic features for different frequency ranges. According to the above technical solution of the present invention, it is possible to effectively extract prosodic features which can be combined with traditional acoustic features without difficulty.
    Type: Application
    Filed: August 16, 2010
    Publication date: February 24, 2011
    Applicant: Sony Corporation
    Inventors: Kun LIU, Weiguo Wu
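The frame-and-transform pipeline in this abstract can be sketched as band-wise log energies; the frame size, hop, and band edges are arbitrary illustrative values, not the patent's feature definition:

```python
import numpy as np

def band_features(signal, sr, frame_len=256, hop=128,
                  bands=((0, 500), (500, 2000), (2000, 8000))):
    """Divide a signal into frames, FFT each frame, and extract a per-band
    log-energy as a simple frequency-range prosodic feature."""
    feats = []
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    window = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        feats.append([
            np.log(np.sum(mag[(freqs >= lo) & (freqs < hi)] ** 2) + 1e-12)
            for lo, hi in bands
        ])
    return np.array(feats)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)   # 200 Hz tone: energy in the low band
feats = band_features(tone, sr)
```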
  • Publication number: 20110004476
    Abstract: Variation over time in fundamental frequency in singing voices is separated into a melody-dependent component and a phoneme-dependent component, modeled for each of the components and stored into a singing synthesizing database. In execution of singing synthesis, a pitch curve indicative of variation over time in fundamental frequency of the melody is synthesized in accordance with an arrangement of notes represented by a singing synthesizing score and the melody-dependent component, and the pitch curve is corrected, for each of the pitch curve sections corresponding to phonemes constituting lyrics, using a phoneme-dependent component model corresponding to the phoneme. Such arrangements can accurately model a singing expression, unique to a singing person and appearing in the person's melody singing style, while taking into account phoneme-dependent pitch variation, and thereby permit synthesis of singing voices that sound more natural.
    Type: Application
    Filed: July 1, 2010
    Publication date: January 6, 2011
    Applicant: Yamaha Corporation
    Inventors: Keijiro Saino, Jordi Bonada
  • Publication number: 20100223058
    Abstract: A speech synthesis device includes a pitch pattern generation unit (104) which generates a pitch pattern by combining, based on pitch pattern target data including phonemic information formed from at least syllables, phonemes, and words, a standard pattern which approximately expresses the rough shape of the pitch pattern and an original utterance pattern which expresses the pitch pattern of a recorded speech, a unit waveform selection unit (106) which selects unit waveform data based on the generated pitch pattern and upon selection, selects original utterance unit waveform data corresponding to the original utterance pattern in a section where the original utterance pattern is used, and a speech waveform generation unit (107) which generates a synthetic speech by editing the selected unit waveform data so as to reproduce prosody represented by the generated pitch pattern.
    Type: Application
    Filed: August 28, 2008
    Publication date: September 2, 2010
    Inventors: Yasuyuki Mitsui, Reishi Kondo
  • Publication number: 20100211393
    Abstract: A speech synthesis device is provided with: a central segment selection unit for selecting a central segment from among a plurality of speech segments; a prosody generation unit for generating prosody information based on the central segment; a non-central segment selection unit for selecting a non-central segment, which is a segment outside of a central segment section, based on the central segment and the prosody information; and a waveform generation unit for generating a synthesized speech waveform based on the prosody information, the central segment, and the non-central segment. The speech synthesis device first selects a central segment that forms a basis for prosody generation and generates prosody information based on the central segment so that it is possible to sufficiently reduce both concatenation distortion and sound quality degradation accompanying prosody control in the section of the central segment.
    Type: Application
    Filed: April 28, 2008
    Publication date: August 19, 2010
    Inventors: Masanori Kato, Yasuyuki Mitsui, Reishi Kondo
  • Publication number: 20100049523
    Abstract: Systems and methods for providing synthesized speech in a manner that takes into account the environment where the speech is presented. A method embodiment includes: based on a listening environment and at least one other parameter associated with the listening environment, selecting an approach from a plurality of approaches for presenting synthesized speech in that environment; presenting synthesized speech according to the selected approach; and, based on natural language input received from a user indicating an inability to understand the presented synthesized speech, selecting a second approach from the plurality of approaches and presenting subsequent synthesized speech using the second approach.
    Type: Application
    Filed: October 28, 2009
    Publication date: February 25, 2010
    Applicant: AT&T Corp.
    Inventors: Kenneth H. Rosen, Carroll W. Creswell, Jeffrey J. Farah, Pradeep K. Bansal, Ann K. Syrdal
  • Publication number: 20090326951
    Abstract: Ratios between the powers at the peaks of the respective formants of the spectrum of a pitch-cycle waveform and the powers at the boundaries between the formants are obtained. When the ratios are large, the bandwidths of the window functions are widened, and the formant waveforms are generated by multiplying sinusoidal waveforms, generated from the formant parameter sets on the basis of pitch-cycle waveform generating data, by the window functions of the widened bandwidths, whereby a pitch-cycle waveform is generated as the sum of these formant waveforms.
    Type: Application
    Filed: April 14, 2009
    Publication date: December 31, 2009
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventors: Ryo Morinaka, Takehiko Kagoshima
  • Publication number: 20090313025
    Abstract: A method and system are disclosed that automatically segment speech to generate a speech inventory. The method includes initializing a Hidden Markov Model (HMM) using seed input data, performing a segmentation of the HMM into speech units to generate phone labels, correcting the segmentation of the speech units. Correcting the segmentation of the speech units includes re-estimating the HMM based on a current version of the phone labels, embedded re-estimating of the HMM, and updating the current version of the phone labels using spectral boundary correction. The system includes modules configured to control a processor to perform steps of the method.
    Type: Application
    Filed: August 20, 2009
    Publication date: December 17, 2009
    Applicant: AT&T Corp.
    Inventors: Alistair D. CONKIE, Yeon-Jun KIM
  • Publication number: 20090281807
    Abstract: A voice quality conversion device converts the voice quality of input speech using information contained in the speech.
    Type: Application
    Filed: May 8, 2008
    Publication date: November 12, 2009
    Inventors: Yoshifumi Hirose, Takahiro Kamai, Yumiko Kato
  • Publication number: 20090248417
    Abstract: A method to generate a pitch contour for speech synthesis is proposed. The method is based on finding the pitch contour that maximizes a total likelihood function created by the combination of all the statistical models of the pitch contour segments of an utterance, at one or multiple linguistic levels. These statistical models are trained from a database of spoken speech by means of a decision tree that, for each linguistic level, clusters the parametric representation of the pitch segments extracted from the spoken speech data together with features obtained from the text associated with that speech data. The parameterization of the pitch segments is performed in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, thus allowing the maximization to be calculated with respect to the parameters of that level.
    Type: Application
    Filed: March 17, 2009
    Publication date: October 1, 2009
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventors: Javier Latorre, Masami Akamine
  • Publication number: 20090204405
    Abstract: Apparatus and method for generating high quality synthesized speech having smooth waveform concatenation. The apparatus includes a pitch frequency calculation section, a pitch synchronization position calculation section, a unit waveform storage, a unit waveform selection section, a unit waveform generation section, and a waveform synthesis section. The unit waveform generation section includes a conversion ratio calculation section, a sampling rate conversion section, and a unit waveform re-selection section. The conversion ratio calculation section calculates a sampling rate conversion ratio from the pitch information and the position of pitch synchronization, and the sampling rate conversion section converts the sampling rate of the unit waveform, delivered as input, based on the sampling rate conversion ratio.
    Type: Application
    Filed: September 4, 2006
    Publication date: August 13, 2009
    Applicant: NEC CORPORATION
    Inventors: Masanori Kato, Satoshi Tsukada
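The conversion-ratio idea (a sampling rate conversion ratio derived from the pitch information) can be sketched with a linear-interpolation resampler; the patent's pitch-synchronization position logic is omitted:

```python
import numpy as np

def retune_unit(unit, unit_f0, target_f0):
    """Change a unit waveform's pitch by resampling: the conversion ratio
    is the target pitch over the unit's own pitch. Playing the shorter
    (or longer) result at the original rate shifts the pitch."""
    ratio = target_f0 / unit_f0
    n_out = int(round(len(unit) / ratio))
    x_out = np.arange(n_out) * ratio
    return np.interp(x_out, np.arange(len(unit)), unit)

unit = np.sin(2 * np.pi * 100 * np.arange(400) / 8000)  # 100 Hz at 8 kHz
retuned = retune_unit(unit, unit_f0=100.0, target_f0=125.0)
```

The retuned waveform repeats every 64 samples, i.e. 125 Hz at the 8 kHz rate.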
  • Publication number: 20080319754
    Abstract: According to an aspect of an embodiment, an apparatus for converting text data into a sound signal comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into the sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively, in accordance with a speed of the sound signal and selectively adjusting the length of at least one of the phonemes which is a fricative in the text data so that the fricative phoneme is relatively extended timewise as compared to other phonemes; and an output unit for outputting a sound signal on the basis of the phoneme data and pause data adjusted by the phoneme length adjuster.
    Type: Application
    Filed: June 13, 2008
    Publication date: December 25, 2008
    Applicant: FUJITSU LIMITED
    Inventors: Rika Nishiike, Hitoshi Sasaki
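The selective fricative extension can be sketched as a duration rule; the fricative set and the 1.3 stretch factor are made-up illustration values, not the patent's adjustment:

```python
def adjust_lengths(phonemes, speed, fricative_stretch=1.3):
    """Scale phoneme lengths (ms) for a speech-rate factor, extending
    fricatives relatively more so they stay intelligible at high speed."""
    fricatives = {"s", "sh", "f", "th", "z", "h"}
    out = []
    for name, ms in phonemes:
        length = ms / speed                  # base rate scaling
        if name in fricatives:
            length *= fricative_stretch      # relative timewise extension
        out.append((name, length))
    return out

# Double-speed output: "s" shrinks less than "a" and "t" do.
adjusted = adjust_lengths([("s", 80.0), ("a", 100.0), ("t", 60.0)], speed=2.0)
```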
  • Publication number: 20080312931
    Abstract: A speech synthesis system stores a group of speech units in a memory, selects a plurality of speech units from the group based on prosodic information of target speech, the speech units selected corresponding to each of segments which are obtained by segmenting a phoneme string of the target speech and minimizing distortion of synthetic speech generated from the speech units selected to the target speech, generates a new speech unit corresponding to the each of the segments, by fusing the speech units selected, to obtain a plurality of new speech units corresponding to the segments respectively, and generates synthetic speech by concatenating the new speech units.
    Type: Application
    Filed: August 18, 2008
    Publication date: December 18, 2008
    Inventors: Tatsuya MIZUTANI, Takehiko Kagoshima
  • Publication number: 20080243511
    Abstract: The present invention is a speech synthesizer that generates speech data for text including a fixed part and a variable part, combining recorded speech and rule-based synthetic speech. The speech synthesizer is a high-quality one in which recorded speech and synthetic speech are concatenated without perceptible discontinuity of timbre and prosody.
    Type: Application
    Filed: October 22, 2007
    Publication date: October 2, 2008
    Inventors: Yusuke Fujita, Ryota Kamoshida, Kenji Nagamatsu
  • Publication number: 20080172234
    Abstract: Systems and methods for dynamically selecting among text-to-speech (TTS) systems. Exemplary embodiments of the systems and methods include identifying text for converting into a speech waveform, synthesizing said text by three TTS systems, generating a candidate waveform from each of the three systems, generating a score from each of the three systems, comparing the three scores, selecting a score based on a criterion, and selecting one of the three waveforms based on the selected score.
    Type: Application
    Filed: January 12, 2007
    Publication date: July 17, 2008
    Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
    Inventors: Ellen M. Eide, Raul Fernandez, Wael M. Hamza, Michael A. Picheny
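The compare-and-select step can be sketched directly; the system names, scores, and the higher-is-better criterion are assumptions for illustration:

```python
def select_waveform(candidates, prefer_highest=True):
    """Pick one (name, score, waveform) candidate from several TTS
    systems by comparing their scores."""
    key = lambda c: c[1]
    return max(candidates, key=key) if prefer_highest else min(candidates, key=key)

# Three hypothetical TTS systems, each producing a waveform and a score.
systems = [("unit-selection", 0.82, b"wave-a"),
           ("hmm", 0.74, b"wave-b"),
           ("hybrid", 0.91, b"wave-c")]
name, score, waveform = select_waveform(systems)
```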
  • Publication number: 20080126093
    Abstract: An apparatus for providing a language based interactive multimedia system includes a selection element, a comparison element and a processing element. The selection element may be configured to select a phoneme graph based on a type of speech processing associated with an input sequence of phonemes. The comparison element may be configured to compare the input sequence of phonemes to the selected phoneme graph. The processing element may be in communication with the comparison element and configured to process the input sequence of phonemes based on the comparison.
    Type: Application
    Filed: November 28, 2006
    Publication date: May 29, 2008
    Inventor: Sunil Sivadas
  • Publication number: 20080027727
    Abstract: A speech unit corpus stores a group of speech units. A selection unit divides a phoneme sequence of target speech into a plurality of segments, and selects a combination of speech units for each segment from the speech unit corpus. An estimation unit estimates a distortion between the target speech and synthesized speech generated by fusing each speech unit of the combination for each segment. The selection unit recursively selects the combination of speech units for each segment based on the distortion. A fusion unit generates a new speech unit for each segment by fusing each speech unit of the combination selected for each segment. A concatenation unit generates synthesized speech by concatenating the new speech unit for each segment.
    Type: Application
    Filed: July 23, 2007
    Publication date: January 31, 2008
    Applicant: KABUSHIKI KAISHA TOSHIBA
    Inventors: Masahiro MORITA, Takehiko Kagoshima
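The fuse-then-concatenate structure shared by this entry (and publication 20080312931 above) can be sketched with simple averaging as the fusion operator; real systems align units before fusing, which is omitted here:

```python
import numpy as np

def fuse_units(units):
    """Fuse several candidate waveforms for one segment into a single new
    speech unit by truncating to a common length and averaging."""
    n = min(len(u) for u in units)
    return np.mean([u[:n] for u in units], axis=0)

def synthesize(segment_candidates):
    """Fuse each segment's selected candidates, then concatenate the
    fused units into synthesized speech."""
    return np.concatenate([fuse_units(c) for c in segment_candidates])

# Two segments, each with its combination of selected candidate units.
seg1 = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0, 6.0])]
seg2 = [np.array([0.0, 0.0]), np.array([2.0, 4.0])]
speech = synthesize([seg1, seg2])
```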