Abstract: A system and method are disclosed for generating customized text-to-speech (TTS) voices for a particular application. The method comprises generating a custom text-to-speech voice associated with a domain by selecting a voice, collecting text data associated with the domain from a pre-existing text data source, and, using the collected text data, generating an in-domain inventory of synthesis speech units, either by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units or by recording the minimal inventory required for a selected level of synthesis quality. The custom text-to-speech voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases, so that only a few minutes of recorded data are necessary to deliver a high-quality custom TTS voice.
Type:
Grant
Filed:
July 31, 2017
Date of Patent:
April 27, 2021
Assignee:
Cerence Operating Company
Inventors:
Srinivas Bangalore, Junlan Feng, Mazin Gilbert, Juergen Schroeter, Ann K. Syrdal, David Schulz
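As a loose illustration of the in-domain inventory search the abstract describes (not the patented method itself), one might filter a pre-existing unit inventory by the units a domain's text actually requires. Here a "unit" is approximated by a letter bigram; real systems use phones or diphones with prosodic context, and all names below are hypothetical.

```python
# Minimal sketch: select an in-domain subset of a pre-existing
# inventory of synthesis units, and report which units are missing
# (candidates for minimal new recording).

def units_needed(domain_text):
    """Collect the set of letter-bigram 'units' that domain text requires."""
    units = set()
    for w in domain_text.lower().split():
        padded = "#" + w + "#"          # word-boundary markers
        for i in range(len(padded) - 1):
            units.add(padded[i:i + 2])
    return units

def select_in_domain_inventory(pre_existing, domain_text):
    """Search the pre-existing inventory for units the domain needs."""
    needed = units_needed(domain_text)
    found = {u: pre_existing[u] for u in needed if u in pre_existing}
    missing = needed - set(found)
    return found, missing
```

The `missing` set corresponds to the abstract's fallback: units not covered by the pre-existing inventory would be recorded to reach the selected quality level.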
Abstract: A system and/or method receives speech input including an accent. The accent is classified with an accent classifier to yield an accent classification. Automatic speech recognition is performed based on the speech input and the accent classification to yield a speech recognition output. Natural language understanding is performed on the speech recognition output and the accent classification, determining an intent of the speech recognition output. Natural language generation generates an output based on the speech recognition output, the intent, and the accent classification. An output is rendered using text-to-speech based on the natural language generation output and the accent classification.
Type:
Application
Filed:
September 13, 2019
Publication date:
March 18, 2021
Applicant:
Cerence Operating Company
Inventors:
Yang SUN, Junho PARK, Goujin WEI, Daniel WILLETT
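A minimal sketch of the accent-conditioned pipeline this abstract outlines, with the accent classification threaded through every stage. The stage internals (the `ACCENT_MODELS` table, the keyword NLU, the hint-based classifier) are placeholder assumptions, not the patented implementation.

```python
# Hypothetical accent-aware pipeline skeleton: classify once, then
# condition ASR model choice, NLU, and NLG on the classification.

ACCENT_MODELS = {"en-GB": "asr_gb", "en-US": "asr_us"}  # assumed table

def classify_accent(speech_input):
    # Placeholder classifier; real systems use acoustic features.
    return speech_input.get("accent_hint", "en-US")

def run_pipeline(speech_input):
    accent = classify_accent(speech_input)
    asr_model = ACCENT_MODELS.get(accent, "asr_generic")
    text = speech_input["utterance"]                        # stand-in ASR decode
    intent = "play_music" if "play" in text else "unknown"  # stand-in NLU
    response = f"[{accent}] intent={intent}"                # stand-in NLG
    return {"asr_model": asr_model, "intent": intent, "response": response}
```

The design point the abstract emphasizes is that the same accent classification conditions every downstream stage, not just the recognizer.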
Abstract: Some embodiments described herein relate to a multimodal user interface for use in an automobile. The multimodal user interface may display information on a windshield of the automobile, such as by projecting information on the windshield, and may accept input from a user via multiple modalities, which may include a speech interface as well as other interfaces. The other interfaces may include interfaces allowing a user to provide geometric input by indicating an angle. In some embodiments, a user may define a task to be performed using multiple different input modalities. For example, the user may provide via the speech interface speech input describing a task that the user is requesting be performed, and may provide via one or more other interfaces geometric parameters regarding the task. The multimodal user interface may determine the task and the geometric parameters from the inputs.
Abstract: There is provided an automated speech recognition system that applies weights to grapheme-to-phoneme models, and interpolates pronunciations from combinations of the models, to recognize utterances of foreign named entities for naive, informed, and in-between pronunciations.
Type:
Application
Filed:
August 6, 2019
Publication date:
February 11, 2021
Applicant:
Cerence Operating Company
Inventors:
Stefan Christof HAHN, Efthymia GEORGALA, Olivier Stéphane Jérôme DIVAY, Eric Joseph MARSHALL
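The weighted interpolation of grapheme-to-phoneme models described above can be sketched as a linear mixture of pronunciation distributions, one per model (e.g. a "naive" target-language model and an "informed" source-language model). The weights and distributions below are illustrative assumptions.

```python
def interpolate_pronunciations(model_probs, weights):
    """Linearly interpolate pronunciation distributions from several
    grapheme-to-phoneme models. model_probs: list of dicts mapping a
    pronunciation to its probability; weights: one weight per model,
    summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    combined = {}
    for probs, w in zip(model_probs, weights):
        for pron, p in probs.items():
            combined[pron] = combined.get(pron, 0.0) + w * p
    return combined

def best_pronunciation(model_probs, weights):
    combined = interpolate_pronunciations(model_probs, weights)
    return max(combined, key=combined.get)
```

Shifting the weights toward one model moves the interpolated lexicon between naive, informed, and in-between pronunciations of a foreign named entity.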
Abstract: Systems and methods are introduced to perform noise suppression of an audio signal. The audio signal includes foreground speech components and background noise. The foreground speech components correspond to speech from a user speaking into an audio receiving device. The background noise includes babble noise that includes speech from one or more interfering speakers. A soft speech detector determines, dynamically, a speech detection result indicating a likelihood of a presence of the foreground speech components in the audio signal. The speech detection result is employed to control, dynamically, an amount of attenuation of the noise suppression to reduce the babble noise in the audio signal. Further processing achieves a more stationary background and reduction of musical tones in the audio signal.
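The core coupling the abstract describes, where a soft (probabilistic) speech detection result controls the attenuation depth, can be sketched as follows. The logistic SNR mapping and the 15 dB maximum attenuation are illustrative assumptions, not values from the patent.

```python
import math

def speech_likelihood(frame_energy, noise_floor):
    """Soft speech detector stand-in: map a frame's SNR to a 0..1
    likelihood with a logistic curve (real detectors use richer features)."""
    snr_db = 10.0 * math.log10(max(frame_energy, 1e-12) / max(noise_floor, 1e-12))
    return 1.0 / (1.0 + math.exp(-(snr_db - 6.0) / 2.0))

def attenuation_gain(speech_prob, max_atten_db=15.0):
    """More attenuation when foreground speech is unlikely (babble-
    dominated frames), less when it is probably present."""
    atten_db = max_atten_db * (1.0 - speech_prob)
    return 10.0 ** (-atten_db / 20.0)
```

Because the detector output is soft rather than binary, the attenuation varies smoothly between frames, which is what helps suppress musical-tone artifacts.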
Abstract: A method, computer program product, and computer system for transforming, by a computing device, a speech signal into a speech signal representation. A regression deep neural network may be trained with a cost function to minimize a mean squared error between actual values of the speech signal representation and estimated values of the speech signal representation, wherein the cost function may include one or more discriminative terms. Bandwidth of the speech signal may be extended by extending the speech signal representation of the speech signal using the regression deep neural network trained with the cost function that includes the one or more discriminative terms.
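A toy version of the cost function this abstract describes: a mean squared error term augmented with a discriminative term. The specific form of the discriminative term (a weighted penalty rewarding distance from a competing representation) and the weight `alpha` are assumptions for illustration only.

```python
def discriminative_mse(actual, estimated, competing, alpha=0.1):
    """MSE between actual and estimated speech-signal representations,
    minus a discriminative term that rewards estimates lying farther
    from a competing (wrong) representation. alpha is a hypothetical
    weight on the discriminative term."""
    n = len(actual)
    mse = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n
    sep = sum((c - e) ** 2 for c, e in zip(competing, estimated)) / n
    return mse - alpha * sep
```

In the patent's setting, a regression deep neural network trained to minimize such a cost estimates the missing high-band portion of the representation, extending the signal's bandwidth.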
Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch.
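The lowest-cost-path search over ordered candidate lists described above is essentially a Viterbi-style dynamic program over a unit lattice. A minimal sketch, assuming simple per-unit target costs and pairwise concatenation (join) costs supplied by the caller:

```python
def lowest_cost_path(ordered_lists, target_cost, join_cost):
    """Dynamic programming over a lattice of candidate speech units,
    one ordered list of candidates per target position. Returns the
    minimum-cost sequence of units."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(u), None) for u in ordered_lists[0]]]
    for i in range(1, len(ordered_lists)):
        row = []
        for u in ordered_lists[i]:
            c, k = min((best[i - 1][k][0] + join_cost(prev, u), k)
                       for k, prev in enumerate(ordered_lists[i - 1]))
            row.append((c + target_cost(u), k))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(ordered_lists) - 1, -1, -1):
        path.append(ordered_lists[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

With lists ordered by pitch, as the abstract suggests, restricting each unit's sublist to nearby-pitch candidates in the next list would prune the inner `min` without changing the recurrence.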
Abstract: Methods and apparatus for frequency selective signal mixing for speech enhancement. In one embodiment, frequency-based channel selection is performed for signal magnitude, signal energy, and noise estimate using speaker activity detection information, signal-to-noise ratio, and/or signal level. Frequency-based channel selection is performed for a dynamic spectral floor to adjust the noise estimate using speaker dominance information. Noise reduction is performed on the signal for the selected channel.
Type:
Grant
Filed:
October 30, 2013
Date of Patent:
January 14, 2020
Assignee:
Cerence Operating Company
Inventors:
Timo Matheja, Markus Buck, Julien Premont
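As a loose illustration of per-frequency channel selection (not the patented embodiment), each frequency bin can independently take its value from whichever microphone channel has the best magnitude-to-noise ratio in that bin. The data layout below is an assumption.

```python
def mix_channels(channel_mags, channel_noise):
    """Per-frequency-bin channel selection: for each bin, pick the
    channel with the highest magnitude-to-noise ratio and take its
    magnitude. channel_mags / channel_noise: one list of per-bin
    values per channel."""
    n_bins = len(channel_mags[0])
    mixed, picked = [], []
    for b in range(n_bins):
        snrs = [channel_mags[c][b] / max(channel_noise[c][b], 1e-12)
                for c in range(len(channel_mags))]
        best = max(range(len(snrs)), key=snrs.__getitem__)
        picked.append(best)
        mixed.append(channel_mags[best][b])
    return mixed, picked
```

The abstract's dynamic spectral floor would additionally cap how far the noise estimate can drop in bins where speaker dominance information is uncertain; that refinement is omitted here.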
Abstract: The present invention relates to a method for selecting and downloading content, accessible via an IP/DNS/URL address from a content provider, to a mobile device, the content being any text information data; for converting the text information data to at least one audio message; and for storing the at least one audio message as at least one audio file on the mobile device, wherein the at least one audio file is playable and discernible as a music file. Implemented on a mobile phone, the method enables controlling and playing the audio messages as music files: a title associated with each audio message is determined using word underline and size attributes of the text and presented for user selection on the phone's graphical user interface, for instance in a car environment where a car kit enables control and selection of one or more of the audio files for playing from the mobile phone.
Abstract: A method or associated system for motion adaptive speech processing includes dynamically estimating a motion profile that is representative of a user's motion based on data from one or more resources, such as sensors and non-speech resources, associated with the user. The method includes effecting processing of a speech signal received from the user, for example, while the user is in motion, the processing taking into account the estimated motion profile to produce an interpretation of the speech signal. Dynamically estimating the motion profile can include computing a motion weight vector using the data from the one or more resources associated with the user, and can further include interpolating a plurality of models using the motion weight vector to generate a motion adaptive model. The motion adaptive model can be used to enhance voice destination entry for the user and re-used for other users who do not provide motion profiles.
Type:
Grant
Filed:
June 10, 2015
Date of Patent:
December 10, 2019
Assignee:
Cerence Operating Company
Inventors:
Munir Nikolai Alexander Georges, Josef Damianus Anastasiadis, Oliver Bender
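The model interpolation step of the motion-adaptive method above can be sketched as a weighted combination of per-condition model parameter vectors using the motion weight vector. The flat parameter-vector representation and the condition set (e.g. stationary / walking / driving) are illustrative assumptions.

```python
def motion_adaptive_model(models, motion_weights):
    """Interpolate the parameter vectors of several speech models using
    a motion weight vector (one weight per motion condition, summing
    to 1) to produce a single motion-adaptive model."""
    assert abs(sum(motion_weights) - 1.0) < 1e-9
    n = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, motion_weights))
            for i in range(n)]
```

Because the adapted model depends only on the weight vector, it can be cached and, as the abstract notes, re-used for other users whose motion profiles are unavailable.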