CONFIGURABLE NEURAL SPEECH SYNTHESIS
A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include speaker voice parameters, accents, and attitudes, among others. Training can be done by transfer learning from an existing neural speech synthesis model or such a model can be trained with a loss function that considers speech and parameter values. A graphical user interface can allow voice designers for products to synthesize speech with a desired voice or generate a speech synthesis engine with frozen voice parameters. A vector of parameters can be used for comparison to previously registered voices in databases such as ones for trademark registration.
This application is a continuation application of U.S. Non-Provisional patent application Ser. No. 17/341,082, filed Jun. 7, 2021, which claims the benefit of U.S. Provisional Patent Application No. 62/705,127, entitled “CONFIGURABLE NEURAL SPEECH SYNTHESIS,” filed Jun. 12, 2020, which is incorporated herein by reference for all purposes.
BACKGROUND

As people are increasingly utilizing a variety of computing devices, including portable devices such as tablet computers and smart phones, it can be advantageous to adapt the ways in which people interact with these devices. For example, different voice data may be desirable for a variety of applications. In one example, it may be desirable to generate text-to-speech (TTS) voices for video game characters to provide a more interactive and immersive gaming experience. In another example, a user may desire a TTS voice that represents their qualities, such as gender, age, regional accent, etc. However, conventional TTS voices for speech synthesis, using, e.g., concatenative or other approaches, are trained on a single speaker. As such, the playback sound is configurable only along typical digital signal processing (DSP) parameters such as pitch and speed. As a result, machines using such voices all sound the same, or, for machines to have unique-sounding voices, a large or expensive effort is required to collect training data. This is often not practical for voice-enabling large numbers of diverse devices, including ones from small companies or developers with financial or time-to-market constraints. Accordingly, it is desirable to provide improved techniques for text-to-speech.
The accompanying drawings illustrate several embodiments and, together with the description, serve to explain the principles of the invention according to the embodiments. It will be appreciated by one skilled in the art that the particular arrangements illustrated in the drawings are merely exemplary and are not to be considered as limiting of the scope of the invention or the claims herein in any way.
Systems and methods in accordance with various embodiments of the present disclosure may overcome one or more of the aforementioned and other deficiencies experienced in conventional approaches to speech synthesis. In particular, various embodiments described herein provide for configurable neural speech synthesis that may be used separately or in combinations within devices, systems, processes, and methods.
In an embodiment, one example includes a computerized process of training a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. In this example, source samples of speech audio (e.g., voice data from an individual such as a voice donor, or machine-generated voice data from a TTS system or other audio generation system) are obtained. The source samples are labeled with discrete values of a voice property, including, for example, a gender voice property, an age voice property, an accent voice property, or a timbre voice property. Other voice properties may indicate the attitude of the speaker, such as whether the speaker appears happy, sad, calm, excited, formal, or casual.
A discriminator is trained from the source samples and labels. The discriminator is configured to generate a probability value that quantifies the likelihood of the voice property from a sample of speech audio.
A model (e.g., a neural speech synthesis model or synthesis model) is trained by synthesizing a multiplicity of synthesized speech samples using the model with a diverse set of voice property values. Corresponding probabilities are generated for the synthesized speech samples using the discriminator. A property-learning weight adjustment is generated by back-propagating changes to minimize a loss function that depends on differences between the voice property values and the corresponding probabilities.
In certain embodiments, synthesizing the multiplicity of synthesized speech samples uses a transcription of source samples, and the process further comprises computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech. Such a process allows for the simultaneous training of the neural speech synthesis model for the conversion of text to speech and for the ability to provide different voice sounds. The process also prevents the synthesis model from learning to generate an output signal that satisfies the discriminator but is inauthentic or otherwise undesirable (e.g., an output that does not sound like real or expected speech). Simultaneous training is an alternative to first training a general speech synthesis model and then augmenting the training to be able to create variations in voices.
Thereafter, in response to receiving a string of text and at least one voice property value at the model (e.g., the neural speech synthesis model or synthesis model), the model evaluates the string of text and the voice property value to convert the text to speech audio in a voice based on the voice property value. Said another way, the model synthesizes speech audio corresponding to the text based on the voice property value. The at least one voice property can be one that is meaningful to a user, such as gender. This allows a user to quickly and easily try different voice sounds and thereby find a voice that meets the needs of their product or use. Further, it allows for saving the property values and comparing them to others to ensure that they are different enough that different products' voices will be distinct. For example, users can adjust the sound of the synthesized voice by making it more male or younger or giving it a stronger Texas accent. Such configurability has the benefit of enabling rapid experimentation and testing of voices, which can affect the perception and relatability of machines that employ speech synthesis as configured.
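For illustration, the following is a minimal sketch, in Python with PyTorch, of the input/output contract of such a model: text tokens plus continuous voice property values in, audio samples out. The module, its dimensions, and the byte-level tokenization are illustrative assumptions, not the architecture described herein.

```python
import torch
import torch.nn as nn

class ToyConfigurableTTS(nn.Module):
    """Toy stand-in for a configurable neural speech synthesis model.

    Shows only the input/output contract: text tokens plus continuous
    voice property values in, audio samples out."""
    def __init__(self, vocab=256, n_props=3, hidden=64, samples_per_token=200):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.prop_proj = nn.Linear(n_props, hidden)  # conditions on voice properties
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.to_audio = nn.Linear(hidden, samples_per_token)

    def forward(self, token_ids, voice_props):
        # Add the projected property vector to every token embedding.
        x = self.embed(token_ids) + self.prop_proj(voice_props).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.to_audio(h).flatten(1)  # (batch, n_tokens * samples_per_token)

model = ToyConfigurableTTS()
text = torch.tensor([[ord(c) for c in "hello"]])  # naive byte-level "tokenizer"
props = torch.tensor([[0.8, 0.3, 0.9]])           # e.g., gender, age, Texas accent
audio = model(text, props)                        # untrained, so noise; shows the flow
print(audio.shape)                                # torch.Size([1, 1000])
```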
Instructions for causing a computer system to configure a speech synthesizer in accordance with the present disclosure may be embodied on a computer-readable medium. For example, in accordance with an embodiment, a backend system can receive at least one voice property value. The backend system can generate code for execution by a computer, the code implementing a neural network wherein a node in a hidden layer includes, in its summation, a constant term derived from the product of the voice property value and a weight learned from a training process. The backend system can output the code, wherein the code implements a speech synthesis function within the speech synthesizer.
Embodiments provide a variety of advantages. For example, in accordance with various embodiments, computer-based approaches for configuring a speech synthesizer can be utilized by content providers, device manufacturers, etc., and by their customers. The speech synthesizer systems and approaches can improve the operation and performance of the computing devices on which they are implemented by, among other advantages, generating computer code for a speech synthesizer in which the TTS voice is frozen as configured by the at least one voice property value. This allows for creating embedded system devices or other systems that have a specific voice. Such systems can integrate the computer code in a modular way that simplifies the design of such systems. Further, it becomes impractical to change the voice: once a user chooses and pays for a voice, they cannot change it without performing the method a second time.
The speech synthesizer system and approaches can also optimize the utilization of various resources, for example, by generating code in a binary format. This improves modularity and further frustrates attempts at reverse engineering or changing the sound of the synthesized voice.
Further, because a set of voice property values may constitute a voice property vector, the speech synthesizer system and approaches allow for reading at least one stored voice property vector from a brand database and computing a distance between the stored voice property vector and the received voice property vector. This advantageously allows for a measurable comparison of the similarity of any two voices. For example, in response to the computed distance being closer than a threshold distance, an error message can be generated, which can be used to alert and/or prevent users from configuring a voice that is too similar to another voice. This avoids having different products in the marketplace with voices so similar that users could be confused about which product is producing synthesized voices. In another example, in response to the computed distance being farther than a threshold distance, the received voice property vector can be stored in the brand database. This allows for creating a database that is useful for comparing to future voice configurations to ensure branded voice differentiation.
Further still, the speech synthesizer system and approaches allow for examining trademarks. For example, the speech synthesizer system and approaches comprise receiving a specimen of speech audio with an application for a trademark registration; applying a discriminator of a plurality of voice property values to the specimen to compute a voice property vector; computing distances between the computed voice property vector and other voice property vectors stored in a database; and determining allowability of the application in dependence upon the smallest computed distance being greater than a threshold. Such an approach enables a government to examine voice trademark registration applications quickly and effectively to allow registrants to prevent the use of synthesized voices that could cause confusion as to the source of goods and services.
Various other functions and advantages are described and suggested below as may be provided in accordance with the various embodiments.
As described, speech synthesis is starting to become commonplace in computers, smartphones, and embedded systems such as smart speakers, robots, automobiles, mobile, portable, and wearable devices, computer terminal interfaces, telephone interactive voice response systems, public address systems, and others.
Certain companies and brands have invested in creating identifiable and sometimes trademarked sounds. For example, the roar of the lion at the beginning of Metro Goldwyn Mayer movies, the sound of a lightsaber in Star Wars, the jingle of T-Mobile phones, the DaDaDa DaDaDa sound of the ESPN sports entertainment network, the bloop of a Tivo remote control operation, and Homer Simpson's D'oh annoyed grunt. Huge variations of human voices are possible and yet some are clearly identifiable. For example, many people can recognize the voices of James Earl Jones, Jack Nicholson, or Kathleen Turner even without seeing their image.
As ever more systems synthesize speech, it is increasingly common for different systems to have similar-sounding voices, which is undesirable in part because it can create confusion among users and in part because it means that the systems associated with brands do not have a unique identity. Though synthesized speech can say essentially any words, people can recognize the sound of a voice no matter what words it says. To create recognizable brands, makers of voice-enabled systems want their systems to have voices that are both distinctive and endowed with certain properties. It is also desirable for the providers of neural speech synthesis and related technologies to be able to provide such unique voices.
Voice designers want to be able to configure voices by making changes and adjustments in ways that they expect. For example, they might want a voice that sounds a little bit younger or a little bit more like it has a New York accent. In another example, it may be desirable for user 102 to interact with game characters having different and varying voices. Accordingly, in an embodiment, a speech synthesis system should take as input voice property values along dimensions that are perceptibly meaningful, such as gender, age, and accent.
Accordingly, various embodiments provide for configurable neural speech synthesis, which is parametric speech synthesis that uses a neural network architecture to generate speech audio features. Configurable neural speech synthesis may be configurable by parameters, the values (e.g., gender, age, and accent) of which relate to voice properties in a way that has perceptible meaning. In an embodiment, TTS voice properties include natural voice characteristics, accents, and attitudes. Voice characteristics relate to physiological attributes of a voice, such as ones that vary distinguishably with gender and age. Accent relates to learned ways of producing phonemes, such as the variations between regions and ethnicities. Attitudes relate to feelings such as happiness, calmness, and formality.
This is in contrast to voices defined by voice embeddings in a machine-learned space such as X-vectors. The combined configurable range of each voice property parameter enables the speech synthesizer to synthesize a wide range of human-sounding voices. Furthermore, configurable neural speech synthesis may be language-specific or universal.
In various embodiments, beyond merely configuring voice properties as input parameters to speech synthesis, tags within the text to synthesize, in a format such as speech synthesis markup language (SSML), can indicate dynamic voice parameter values along dimensions learned by a neural network.
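For illustration, the following sketch parses a hypothetical SSML-like markup in which per-segment attributes carry voice property values; the <voice-props> tag and its attributes are invented for this example and are not part of the SSML standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical SSML-like markup; the <voice-props> tag and its attributes
# are invented for illustration and are not part of the SSML standard.
markup = """<speak>
  <voice-props gender="0.2" age="0.7">Good evening.</voice-props>
  <voice-props gender="0.2" age="0.7" excitement="0.9">We have a winner!</voice-props>
</speak>"""

for segment in ET.fromstring(markup):
    props = {name: float(value) for name, value in segment.attrib.items()}
    print(props, "->", segment.text)  # each segment gets its own property values
```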
The resource provider environment 206 can provide speech synthesis services. These services can, for example, train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. This allows a user to quickly and easily try different voice sounds and thereby find a voice that meets the needs of their product or use. Further, it allows for saving the property values and comparing them to others to ensure that they are different enough that different products' voices will be distinct. In various embodiments, the speech synthesis services can be performed in hardware and software, or in combination thereof.
The network(s) 204 can include any appropriate network, including an intranet, the Internet, a cellular network, a local area network (LAN), or any other such network or combination, and communication over the network can be enabled via wired and/or wireless connections.
The resource provider environment 206 can include any appropriate components for training a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property, receiving speech data, presenting interfaces, etc. It should be noted that although the techniques described herein may be used for a wide variety of applications, for clarity of presentation, examples relate to speech synthesizing applications. The techniques described herein, however, are not limited to speech synthesizing applications, and approaches may be applied to other situations where managing voice data is desirable, such as creating voice banks, verifying voice data, trademarks, etc.
The resource provider environment 206 might include Web servers and/or application servers for obtaining and processing voice data to train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. While this example is discussed with respect to the internet, web services, and internet-based technology, it should be understood that aspects of the various embodiments can be used with any appropriate services available or offered over a network in an electronic environment, or devices otherwise not connected or intermittently connected to the internet.
In various embodiments, resource provider environment 206 may include various types of resources 214 that can be used to facilitate speech synthesis services. The resources can include, for example, custom voice system 222, voice training system 224, application servers operable to process instructions provided by a user, or database servers operable to process data stored in one or more data stores 216 in response to a user request.
Custom voice system 222 is operable to receive a string of text and at least one voice property value. Custom voice system 222 evaluates the string of text and the voice property value to convert the text to speech audio in a voice based on the voice property value. Custom voice system 222 is described in greater detail below.
Voice training system 224 is operable to train a model (e.g., a neural speech synthesis model or a speech synthesis model) that can generate speech audio (also referred to as voice data) conditioned on a value of a voice property. For example, source samples of speech audio (e.g., voice data from an individual such as a voice donor, or machine-generated voice data from a TTS system or other audio generation system) are obtained and the source samples are labeled with discrete values of a voice property. Voice training system 224 trains a discriminator from the source samples and labels. Voice training system 224 trains a model (e.g., a neural speech synthesis model or synthesis model) by synthesizing a multiplicity of synthesized speech samples using the model with a diverse set of voice property values. Corresponding probabilities are generated for the synthesized speech samples using the discriminator. Voice training system 224 computes a property-learning weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the voice property values and the corresponding probabilities.
In at least some embodiments, an application executing on the client device 202 that needs to access resources of the provider environment 206, for example, to initiate an instance of custom voice system 222, can submit a request that is received at interface layer 208 of the provider environment 206. The interface layer 208 can include application programming interfaces (APIs) or other exposed interfaces, enabling a user to submit requests, such as Web service requests, to the provider environment 206. Interface layer 208 in this example can also include other components, such as at least one Web server, routing components, load balancers, and the like.
When a request to access a resource is received at the interface layer 208, in some embodiments, information for the request can be directed to resource manager 210 or another such system, service, or component configured to manage user accounts and information, resource provisioning and usage, and other such aspects. Resource manager 210 can perform tasks such as communicating the request to a management component or other control component used to manage one or more instances of a custom voice system, as well as other information for host machines, servers, or other such computing devices or assets in a network environment. Resource manager 210 can also authenticate an identity of the user submitting the request and determine whether that user has an existing account with the resource provider, where the account data may be stored in at least one data store 212 in the resource provider environment 206. For example, the request can be used to instantiate custom voice system 222 on host machine 230.
It should be noted that although host machine 230 is shown outside the provider environment, in accordance with various embodiments, one or more components of custom voice system 222 can be included in provider environment 206, while in other embodiments some of the components may reside on host machine 230. It should be further noted that host machine 230 can include, or at least be in communication with, other components, for example, content training and classification systems, image analysis systems, audio analysis systems, etc.
The various computing devices described herein are exemplary and for illustration purposes only. The system may be reorganized or consolidated, as understood by a person of ordinary skill in the art, to perform the same tasks on one or more other servers or computing devices without departing from the scope of the invention. The resources may be hosted on multiple server computers and/or distributed across multiple systems. Additionally, the components may be implemented using any number of different computers and/or systems. Thus, the components may be separated into multiple services and/or over multiple different systems to perform the functionality described herein. In some embodiments, at least a portion of the resources can be “virtual” resources supported by these and/or other components.
One or more links couple one or more systems, engines or devices to the network 204. In particular embodiments, one or more links each includes one or more wired, wireless, or optical links. In particular embodiments, one or more links each includes an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a MAN, a portion of the Internet, or another link or a combination of two or more such links. The present disclosure contemplates any suitable links coupling one or more systems, engines or devices to the network 204.
In particular embodiments, each system or engine may be a unitary server or may be a distributed server spanning multiple computers or multiple datacenters. Systems may be of various types, such as, for example and without limitation, web server, advertising server, file server, application server, or proxy server. In particular embodiments, each system may include hardware, software, or embedded logic components or a combination of two or more such components for carrying out the appropriate functionalities implemented or supported by their respective servers. For example, a web server is generally capable of hosting websites containing web pages or particular elements of web pages. More specifically, a web server may host HTML files or other file types or may dynamically create or constitute files upon a request and communicate them to client devices or other devices in response to HTTP or other requests from client devices or other devices.
In particular embodiments, one or more data storages may be communicatively linked to one or more servers via one or more links. In particular embodiments, data storages may be used to store various types of information. In particular embodiments, the information stored in data storages may be organized according to specific data structures. In particular embodiments, each data storage may be a relational database. Particular embodiments may provide interfaces that enable servers or clients to manage, e.g., retrieve, modify, add, or delete, the information stored in data storage.
The system may also contain other subsystems and databases, which are not illustrated here.
Configurable neural speech synthesis uses a generative neural network that is a product of a training process, such as one implemented using voice training system 224. Multiple approaches to training are possible; some examples are described below and can be utilized in voice training system 224. Some examples of a training process use supervised or semi-supervised learning, which requires samples of speech labeled according to discrete values of a voice property. Training labels are discrete values such as Booleans or enumerated types. Some examples of types of labels for training samples include: child or not; male or female; one of several languages; one of several regional accents of a language such as New York, Texas, or China; timbre such as nasal, bright, or croaky; happy or sad; calm or excited; and formal or casual. Limiting the possible values of labels makes it easier for humans to label training samples at an acceptable rate. Asking human labelers to listen to speech recordings and estimate values on a continuous scale would slow labeling down.
Inference

A model capable of inferring probabilities of properties of certain input samples is both a part of training and a result of training a configurable neural speech synthesis model.
In this example, custom voice system 222 can include ingestion component 302, voice synthesis engine 306, text data store 304, and voice property value data store 308. Voice synthesis engine 306 can include configurable neural speech synthesis inference model 310.
Ingestion component 302 is operable to obtain text data and user preference data (e.g., voice property value data) from various sources via an interface. Sources may include one or more content providers. Content providers can include, for example, users, movie agencies, broadcast companies, cable companies, internet companies, game companies, vending and retail services companies, music and video distribution companies, government agencies, automobile companies, etc. In an embodiment, once the sources are identified, a variety of methodologies may be used to retrieve the relevant media data via the interface, including but not limited to, data scrapes, API access, etc. The text data may be stored in text data store 304 and the voice property value data may be stored in voice property value data store 308.
In an embodiment, the interface may include a data interface and a service interface that may be configured to periodically receive text, voice property value data, and/or other data. The interface can include any appropriate components known or used to receive requests or other data from across a network, such as may include one or more application programming interfaces (APIs) or other such interfaces for receiving such requests and/or data.
Configurable neural speech synthesis inference model 310 is capable of inferring probabilities of properties of certain input samples and is both a part of training and a result of training a configurable neural speech synthesis model. For example, configurable neural speech synthesis inference model 310 is operable to receive input text and one or more voice property values and generate synthesized speech audio as an output. The output can be stored in synthesized audio data store 312 or other appropriate data store, and/or otherwise utilized. FIG. 3B illustrates example 320 of an audio wave 322 of synthesized voice audio output from voice synthesis engine 306.
In an embodiment, some neural speech synthesis models may use more than one internal neural network. For example, one may be trained to produce an audio spectrogram, and another to use the spectrogram to produce a waveform. Other ways of dividing the work of speech synthesis between different neural and expert-designed models are possible.
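For illustration, the following is a minimal sketch of such a two-network division of work, assuming toy PyTorch modules; the layers and dimensions are placeholders, not a real spectrogram-predictor or vocoder design.

```python
import torch
import torch.nn as nn

class TextToSpectrogram(nn.Module):
    """Toy first network: text tokens to a mel-like spectrogram."""
    def __init__(self, vocab=256, n_mels=80, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, tokens):                # (batch, n_tokens)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                    # (batch, n_tokens, n_mels)

class SpectrogramToWaveform(nn.Module):
    """Toy second network: a stand-in vocoder that upsamples frames."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)

    def forward(self, mel):                   # (batch, frames, n_mels)
        return self.upsample(mel).flatten(1)  # (batch, frames * hop)

tokens = torch.randint(0, 256, (1, 12))
wave = SpectrogramToWaveform()(TextToSpectrogram()(tokens))
print(wave.shape)                             # torch.Size([1, 3072])
```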
In an embodiment, one example of neural speech synthesis uses a discriminator as part of the training process. The discriminator takes in an audio sample sourced from a corpus of training audio samples and computes a probability of it being associated with one or more specific labels. In some examples, the discriminator is a model trained using machine learning, such as a neural network; supervised or semi-supervised training is possible. It is also possible to use an expert-designed model that is not trained from data.
An example loss function for training the discriminator is represented by:

loss = probability of property − Boolean property label    Eq. (1)
It should be noted that other loss functions are possible, such as ones that sum the loss of multiple properties. Such sums could be weighted based on the relative importance of each property. Other mathematical functions in the loss function may be appropriate for specific system constraints.
The training process 502 proceeds to compute, for parameters of the discriminator neural network, error gradients. It is not strictly necessary to compute a gradient for each parameter. The training process 502 proceeds to apply adjustments to the weights of the discriminator model 504 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible.
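For illustration, the following is a minimal sketch of one such discriminator training step, assuming a toy PyTorch convolutional discriminator, random placeholder data, and an absolute difference applied to the Eq. (1) terms so that the loss is non-negative.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Toy property discriminator over mel spectrograms."""
    def __init__(self, n_mels=80, n_props=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_props),
            nn.Sigmoid(),  # independent sigmoids: one probability per property
        )

    def forward(self, mel):   # mel: (batch, n_mels, frames)
        return self.net(mel)  # (batch, n_props), each value in [0, 1]

disc = Discriminator()
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

mel = torch.randn(8, 80, 100)                 # placeholder labeled source samples
labels = torch.randint(0, 2, (8, 3)).float()  # discrete Boolean property labels

probs = disc(mel)
loss = (probs - labels).abs().mean()  # |probability - label|, per Eq. (1)
opt.zero_grad()
loss.backward()                       # error gradients for the discriminator weights
opt.step()                            # adjustment scaled by the learning rate
```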
Different source samples will produce different probabilities within the range of 0 to 1. A trained discriminator should tend to produce output values distributed across the range between 0 and 1, advantageously providing diversity of output probabilities. For example, if diversity is low, some experimentation with removing a SoftMax output or having independent sigmoid outputs for different properties can be helpful. Limiting the amount of training, and therefore the prediction certainty, can also be helpful. The requirements may be application specific.
Transfer Training

A trained neural speech synthesis model can be a baseline model, which can be adapted to vary based on parameter input values as expected by users. Training neural speech synthesis models, such as Tacotron and its progeny, can use a loss function that compares model output to source training samples. This can be done, for example, by comparing spectrograms with a loss function such as one represented by:
loss = sum over bins(abs(recording spectrogram bin − speech spectrogram bin))    Eq. (2)
Mean squared error or other alternatives to an absolute value are appropriate for some models and applications.
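For illustration, Eq. (2) and its mean-squared-error alternative can be expressed as follows, assuming placeholder spectrogram tensors:

```python
import torch

# Placeholder spectrograms with shape (n_mels, frames).
recording_spec = torch.randn(80, 100)
synthesized_spec = torch.randn(80, 100)

# Eq. (2): sum over bins of abs(recording bin - synthesized bin).
l1_loss = (recording_spec - synthesized_spec).abs().sum()

# Mean squared error alternative noted above.
mse_loss = ((recording_spec - synthesized_spec) ** 2).mean()
```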
In an embodiment, the training process 602 proceeds to compute an error gradient for each parameter of the speech synthesis model 604. In certain embodiments, gradients are computed only for selected parameters. The training process 602 proceeds to apply adjustments to the weights of the speech synthesis model 604 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. In certain embodiments, the factor is dynamic. For example, the factor can be based on one or more performance metrics. Various other machine learning techniques for training neural networks are possible in accordance with embodiments described herein.
A pre-trained baseline speech synthesis model generates a particular voice for the speech that it synthesizes. For example, a target voice with a general accent, middle to young age, and neutral sounding gender may be preferred. After having pre-trained a baseline speech synthesis model, it is possible to perform transfer training by training an improved speech synthesis model that has one or more additional input nodes to the neural network, the nodes indicating voice property values. This can enable the speech synthesis model to learn how to adapt the sound of the synthesized voice according to the voice property values.
For example, training process 708 can use a loss function represented by:
loss = probability of property − voice property value    Eq. (3)
This has an effect equivalent to minimizing the cross-entropy loss between two models, where, effectively, the output of one of the models is defined by the voice property values. It should be noted that other loss functions are possible in accordance with embodiments described herein. The training process 708 proceeds to compute an error gradient for parameters of speech synthesis model 704. For example, training process 708 computes an error gradient for one or more parameters. Training process 708 proceeds to apply adjustments to the weights of the speech synthesis model 704 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible in accordance with various embodiments.
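For illustration, the following is a minimal sketch of one such transfer-training step, assuming toy PyTorch modules: the discriminator is frozen, and the Eq. (3) difference (taken as an absolute value so the loss is non-negative) is back-propagated into the synthesis model only.

```python
import torch
import torch.nn as nn

# Toy synthesis model: text features concatenated with 3 voice property
# values produce one "spectrogram frame". Real models are far larger.
synth = nn.Sequential(nn.Linear(16 + 3, 64), nn.ReLU(), nn.Linear(64, 80))

# Toy frozen discriminator: frame in, property probabilities out.
disc = nn.Sequential(nn.Linear(80, 3), nn.Sigmoid())
for p in disc.parameters():
    p.requires_grad = False  # only the synthesis model is adjusted

opt = torch.optim.Adam(synth.parameters(), lr=1e-3)

text_feats = torch.randn(8, 16)  # placeholder text encodings
prop_values = torch.rand(8, 3)   # diverse continuous voice property values

frame = synth(torch.cat([text_feats, prop_values], dim=1))
probs = disc(frame)
loss = (probs - prop_values).abs().mean()  # Eq. (3) difference, per property
opt.zero_grad()
loss.backward()  # property-learning weight adjustment for the synthesis model
opt.step()
```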
Joint Training

Rather than pre-training a baseline neural speech synthesis model and using transfer training to turn it into a configurable neural speech synthesis model, it is possible to train a model jointly to simultaneously learn speech synthesis in general and configurability according to voice parameters.
Training process 802 compares the synthesized speech with the source training audio sample corresponding to the text transcription. Training process 802 proceeds to compute a loss value and/or weight adjustment according to an error gradient for parameters of the speech synthesis model 804.
Discriminator 810, trained as described above, computes probabilities of the voice properties for the synthesized speech samples; training process 808 compares those probabilities to the input voice property values.
A combination 806 of the weight adjustments, or a computation of weight adjustments from the loss values, from training process 802 and training process 808 produces a combined weight adjustment according to the loss function represented by:
loss = WS (sum over bins(abs(recording spectrogram bin − speech spectrogram bin))) + WP (probability of property − voice property value)    Eq. (4)
where WS and WP are relative weightings of the effect of training sample voice matching and voice property value matching, respectively. This has the effect of training a synthesis model that can generate sounds according to voice property values, while not learning to generate an output signal that satisfies the voice property values without generating the sounds represented by the input text.
In an embodiment, using a manual approach, the relative weightings that give the most accuracy per training time can be determined through experimentation. Additionally, or alternatively, the relative weightings can be based on one or more performance metrics or other such factors. The combined weight adjustment is applied to the weights of the speech synthesis model 804 according to the gradients. The amount of adjustment can be scaled by a factor that controls the learning rate. Various other machine learning techniques for training neural networks are possible.
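For illustration, the combined loss of Eq. (4) can be sketched as follows, with placeholder tensors and illustrative values for WS and WP:

```python
import torch

W_S, W_P = 1.0, 0.5  # illustrative relative weightings

def joint_loss(recording_spec, synthesized_spec, probs, prop_values):
    # Eq. (4): weighted sum of source matching and property matching.
    source_matching = (recording_spec - synthesized_spec).abs().sum()
    property_matching = (probs - prop_values).abs().sum()
    return W_S * source_matching + W_P * property_matching

loss = joint_loss(torch.randn(80, 100), torch.randn(80, 100),
                  torch.rand(8, 3), torch.rand(8, 3))
print(loss.item())
```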
The result is a speech synthesis model 804 that can take in text and one or more voice property values that the model 804 has learned and produce synthesized speech audio with a voice as defined by a user's setting of the voice property values.
Synthesis Using the Model

In an embodiment, a service of synthesizing speech audio from text and a vector of voice property values for a specific desirable voice is provided. This is useful, for example, to create pre-recorded messages for a telephone service interactive voice response (IVR) menu with menu messages such as “to continue in English, press 1” or “to check your account balance, press 2”. It is also useful for pre-recorded messages in devices such as voice interactive web sites, mobile apps, advertisements, robots, or automobiles with messages such as “opening windows” or “as you wish”. The voice, and its configuration, create a brand identity that users and consumers recognize.
The configuration operations can be provided through an application programming interface (API) that gives user-controlled access to the synthesis operation on a server across a network. The synthesis can be performed directly or locally. An API request or local function call can take as arguments relevant voice parameters such as accent, vocal tract parameters such as deepness, and attitudes such as speed or excitement level.
A user, such as a system engineer, or a higher-level function that calls the speech synthesis engine 902 can then incorporate the audio samples into a product. Providing a configurable speech synthesis service may be part of a company's business model in which they charge money, for example, per-message, as a subscription, per-project, or in a per-unit royalty agreement.
Users may call a speech synthesis function using a command line program, such as one in a Linux® shell, or a software development environment in Linux®, Macintosh®, or Windows®. It is also possible to provide a web or browser-based graphical user interface (GUI) for system designers to synthesize speech audio with values of configurable speech parameters.
For example, such a GUI can provide a text entry field and sliders that allow a user to set values for each configurable voice property.
After configuring a set of parameters 1004, a user can select a play button 1006 to hear a sample of some or all of the text synthesized into speech audio played from the browser. This allows experimentation with the sound of the voice before committing to a final output audio file. Some systems only synthesize and play a portion or multiple non-contiguous portions of the entered text to make it difficult for a user to capture and save the playback sample without paying for the custom-configured synthesized audio.
After a user is satisfied with the sound of the voice that they have configured, they may select a button 1008 to download a file with the synthesized speech audio of their input text. In an embodiment, a charge or other consideration may be debited for the download according to some business models.
Configuring a Speech Synthesizer

Some developers of computerized applications and embedded systems such as automobiles, robots, smart speakers, appliances, and servers provide voice interfaces for such systems that require an ability to generate speech audio for essentially any words at essentially any time that it is needed to provide a user experience. To provide a desired brand voice, such systems can utilize a speech synthesis engine configured for their specific voice but not configurable for any other voice. In other words, a speech synthesis engine that is locked to a custom voice configuration is “frozen” with locked voice property values. A frozen or locked voice property value is a voice property value that remains constant. Speech synthesis technology providers can support that by providing speech synthesis engines generated with selected voice property parameter values and configured by a configurator interface.
A configurator can be provided through an application programming interface (API), a software development kit (SDK) or similar methods. The configurator can provide user-controlled access to the synthesis operation on a server across a network. In certain embodiments, the configurator can be provided directly or locally. An API request or local function call may take as arguments relevant voice parameters such as accent, vocal tract parameters such as deepness, and attitudes such as speed or excitement level.
A user, such as a system engineer, or a higher-level function can then incorporate the generated speech synthesis engine into a product. In an embodiment, providing a speech synthesis engine configurator service may be part of a company's business model in which they charge money, for example, per-message, as a subscription, per-project, or in a per-unit royalty agreement.
Voice designers may access a configurator using a command line program such as one in a Linux® shell or a software development environment in Linux®, Macintosh®, or Windows®. It is also possible to provide a web or browser-based graphical user interface (GUI) for system designers to configure a speech synthesis engine.
The speech synthesis engine may be provided as an executable binary, as human-readable programming code in a language such as Python, or as a neural network architecture parameter set for use by standard neural network software. Some generated speech synthesis engines that are delivered as executables or source code may support SSML tags or other dynamic tags to affect the sound of synthesized speech.
Freezing Voice Parameters

After a user requests that the system generate a speech synthesis engine with a frozen set of voice parameters, the method of generation starts by treating the voice parameter values as a set of neural network input features to a neural network trained to be configurable according to the voice values. The system then treats those input values to the network as constants and forward propagates the constants into the hidden layer(s) of the neural network. Whereas the speech synthesis engine 902 described above accepts voice property values as run-time inputs, a frozen engine has those values built into its weights.
Each node of the first hidden layer comprises an activation function fed by a sum of input parameters multiplied by weights. The weights are learned from the training process of the speech synthesis neural network, such as the training processes described above.
The result is a neural network comprising one or more inputs for text but no inputs for the frozen voice parameters. The multiplications, additions, and activation functions in appropriate combinations may be provided as human-readable source code in a language such as Python and/or in a framework such as TensorFlow. They may be compiled into an executable. Before the compiling or as part of the compilation process, hardware-architecture-specific optimizations may be performed such as parallelizing functions to make use of single instruction multiple data (SIMD) instructions within high-performance general-purpose processors and digital signal processing (DSP) processors or may be divided as appropriate for the processing elements within graphics processing units (GPU).
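For illustration, the following numeric sketch shows the bias-folding step with arbitrary dimensions and random weights standing in for a trained network; the assertion verifies that the frozen network computes exactly the same hidden-layer output as the original network with the voice inputs held constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n_text, n_props, n_hidden = 16, 3, 8

# Random stand-ins for weights learned by the training process.
W_text = rng.normal(size=(n_hidden, n_text))
W_props = rng.normal(size=(n_hidden, n_props))
b = rng.normal(size=n_hidden)

v_frozen = np.array([0.8, 0.3, 0.9])  # chosen voice property values

# Each hidden node's summation gains a constant term derived from the
# product of the voice property values and their learned weights.
b_frozen = b + W_props @ v_frozen

x = rng.normal(size=n_text)  # some text-derived input
full = np.tanh(W_text @ x + W_props @ v_frozen + b)  # original network
frozen = np.tanh(W_text @ x + b_frozen)              # voice inputs removed
assert np.allclose(full, frozen)  # identical output, fewer inputs
```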
Sets of voice properties constitute a voice vector. The speech configurator described below can receive a requested voice property vector and compare it to voice property vectors stored in a brand database.
Another possible service and method is to accept, through a user interface, a recording of speech by a person with a voice that has approximately the sound desired for a product identity. A system can process the recordings using a discriminator, such as discriminator 504 trained as described above, to compute a voice property vector that approximates the sound of the recorded voice.
Some end-user systems that provide configurable neural speech synthesis present a visual character to the user. Such a character may appear as an avatar, hologram, or other graphically generated display of a character that can speak. Users may interact with the system through typing, mouse-clicking, touch, gestures, or voice control. The user may configure the character that they see. The configuration may be done through a menu, keyboard commands, or voice commands. An example of a menu would look similar to the slider-based configuration interface described above.
A provider of voices may maintain a database. Also, or instead, an industry-standard body may maintain a database, or one or more national trademark offices may maintain a database. The database stores voice vectors that produce voices associated with brands. The database can be used to ensure that no two brands have the same voice or voices that are confusingly similar. However, it may be permissible for different brands to use similar voices as long as the brands are for different classes of goods and services.
If the smallest cosine distance of the requested voice property vector to voice property vectors from the brand database 1214 is below a threshold distance, the method proceeds 1212 to provide an error message. It may then proceed to the step of receiving a voice property vector 1202 for a new voice property vector. If the smallest cosine distance is above the threshold distance, the method may proceed to a step 1208 of storing the requested voice vector in the brand database 1214 so that it may be compared to future requested voice vectors. After storing the requested voice vector, the method may proceed to a further step 1210 of generating code for a speech synthesizer. Additionally, or alternatively, the method may proceed to synthesize input text in the voice defined by the requested voice property vector. There may be other intermediate steps within implementations of the method.
It is also possible to store in the brand database 1214 an allowable distance of exclusivity associated with each brand's voice property vector. Accordingly, the threshold for comparison is based on the exclusivity distance associated with each brand's voice property vector. Brand owners may pay to have a larger exclusivity distance. That will give them a more distinct voice.
The allowable distance may be dynamic. For example, the allowable distance may depend on the distance between goods and services within the same or different classes. For example, goods and services within the same or similar class may be associated with stricter thresholds than goods and services in different classes.
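For illustration, the following sketch checks a requested voice property vector against a toy brand database in which each entry carries its own exclusivity distance; all names, vectors, and thresholds are made up for the example:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy brand database: each entry pairs a voice property vector with that
# brand's exclusivity distance. Brands, vectors, and radii are made up.
brand_db = [
    {"brand": "A", "vector": np.array([0.9, 0.2, 0.1]), "exclusivity": 0.05},
    {"brand": "B", "vector": np.array([0.1, 0.8, 0.7]), "exclusivity": 0.20},
]

def check_and_register(requested, db, default_exclusivity=0.05):
    for entry in db:
        if cosine_distance(requested, entry["vector"]) < entry["exclusivity"]:
            return f"error: too close to brand {entry['brand']}"
    db.append({"brand": "new", "vector": requested,
               "exclusivity": default_exclusivity})
    return "registered"

print(check_and_register(np.array([0.88, 0.22, 0.12]), brand_db))
# -> error: too close to brand A
```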
Trademark Examination

It is in the public interest for consumers to be able to identify the source of goods and services. Most major countries of the world have legal systems to prevent passing off of goods and services. To support the enforcement of the uniqueness of identifiers of goods and services, such countries keep registries of trademarks. These can include names, descriptive words, logos, distinguishing colors, and sounds. As we enter an era of voice-enabled goods and services, where voices are distinctive to brands, it is desirable to register voices as trademarks. Such voices can be defined with voice property vectors as described above. To ensure that an application for trademark registration is requesting an appropriately distinctive trademark, it is necessary to examine trademarks. However, a problem arises because it is difficult for a human examiner to compare a voice specified in a trademark application with other existing voice trademarks.
The trademark office 1304 performs a step 1310 of applying a discriminator to the audio segment 1306. A discriminator such as the one described above computes a voice property vector from the audio segment, and the trademark office computes distances between that vector and voice property vectors stored in a database 1316 of registered voice trademarks.
If the smallest computed distance between the voice property vector of the audio segment 1306 of the registration application and another voice in the database 1316, for a claimed set of goods and services in a matching class, is within a threshold distance, then the trademark registration is to be refused. Otherwise, it may be further examined for possible registration. The trademark office 1304 proceeds to prepare an office action 1314 for the brand owner 1302 indicating whether the trademark registration is refused because of similarity to other registered voice trademarks.
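For illustration, the following sketch performs the examination comparison, assuming the discriminator has already produced a voice property vector for the specimen; marks, vectors, classes, and the threshold are made up for the example:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The specimen's vector would come from applying the discriminator to the
# submitted audio segment; here it is simply given.
specimen_vector = np.array([0.3, 0.9, 0.4])
claimed_classes = {9, 42}  # classes of goods and services in the application

registered = [
    {"mark": "X", "vector": np.array([0.31, 0.88, 0.41]), "classes": {9}},
    {"mark": "Y", "vector": np.array([0.9, 0.1, 0.2]), "classes": {42}},
]
THRESHOLD = 0.05  # illustrative distance of allowable similarity

# Compare only against registrations for a matching class.
conflicts = [r for r in registered
             if r["classes"] & claimed_classes
             and cosine_distance(specimen_vector, r["vector"]) < THRESHOLD]
print("refuse" if conflicts else "allow for further examination")  # -> refuse
```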
Some examples described above are best performed on servers such as ones in data centers. For example, training of neural networks and hosting of APIs for speech synthesis or synthesis engine generation tend to be performed on servers. The servers run software stored on non-transitory computer readable media.
Some implementations described above are best performed on personal computers such as laptops, mobile devices such as mobile phones and tablets, and embedded systems such as automobiles, robots, and appliances. For example, requesting configurable neural speech synthesis through an API, downloading and running speech synthesis engines, and running trademark examination software are best performed on such devices.
The skilled person will be aware of a range of possible modifications of the various embodiments described above. Accordingly, the present invention is defined by the claims and their equivalents.
Any type of computer-readable medium is appropriate for storing code comprising instructions according to various embodiments.
The Server

Servers, such as ones common in data centers, are often implemented as rack-mounted server blades. They have fans hidden behind cooling openings, blinking lights, and cable connections.
Such a server comprises a multiplicity of network-connected computer processors that run software in parallel.
Some embodiments function by running software on general-purpose programmable processors (CPUs) such as ones with ARM or x86 architectures. Some power-sensitive embodiments, and some embodiments that require especially high performance such as for neural network algorithms, use hardware optimizations. Some embodiments use application-customizable processors with configurable instruction sets in specialized systems-on-chip, such as ARC processors from Synopsys and Xtensa processors from Cadence. Some embodiments use dedicated hardware blocks burned into field programmable gate arrays (FPGAs). Some embodiments use arrays of graphics processing units (GPUs). Some embodiments use application-specific integrated circuits (ASICs) with customized logic to give the best performance. Some embodiments are implemented in hardware description language code, such as code written in Verilog.
Some embodiments of physical machines described and claimed herein are programmable in numerous variables, combinations of which provide essentially an infinite variety of operating behaviors. Some embodiments herein are configured by software tools that provide numerous parameters, combinations of which provide for essentially an infinite variety of embodiments of the invention described and claimed.
Hardware blocks, custom processor instructions, co-processors, and hardware accelerators perform neural network processing or parts of neural network processing algorithms with particularly high performance and power efficiency. This provides long battery life for battery-powered devices and reduces heat removal costs in data centers that serve many client devices simultaneously.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. For example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.
As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, the words “a” or “an” are employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.
Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for creating an interactive message through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various apparent modifications, changes and variations may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.
Claims
1. A computerized process of training a neural speech synthesis model that can generate speech audio conditioned on a value of a voice property, the computerized process comprising:
- obtaining source samples of speech audio;
- labeling the source samples with discrete values of a voice property;
- training, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and
- training the neural speech synthesis model by:
- synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples,
- computing corresponding probabilities for the synthesized speech samples using the discriminator, and
- computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.
2. A speech synthesis model obtained by the computerized process of claim 1.
3. The speech synthesis model of claim 2, wherein the speech synthesis model is configured to:
- receive a string of text and at least one voice property value with a perceptible meaning;
- synthesize speech audio corresponding to the string of text using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio; and
- output the synthesized speech audio, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value.
4. The speech synthesis model of claim 3, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.
5. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to:
- enable download of the synthesized speech audio.
6. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to:
- enable playback of the synthesized speech audio.
7. The speech synthesis model of claim 3, wherein the speech synthesis model is further configured to:
- provide a graphical user interface that includes one of a text input field or a voice property value input field.
8. The speech synthesis model of claim 3, wherein the string of text is associated with at least one text tag.
9. The speech synthesis model of claim 3, wherein the string of text indicates dynamically configurable voice parameter values.
10. The computerized process of claim 1, wherein the synthesizing uses a transcription of source samples, the computerized process further comprising:
- computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.
11. A speech synthesis model obtained by the computerized process of claim 10.
12. The computerized process of claim 1, wherein the source samples of the speech audio are obtained from one of a person and an audio generation system.
13. The computerized process of claim 1, wherein the voice property includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.
14. A computer system for training a neural speech synthesis model to generate speech audio conditioned on a value of a voice property, comprising:
- at least one processor; and
- memory including instructions that, when executed by the at least one processor, cause the computer system to: obtain source samples of speech audio; label the source samples with discrete values of a voice property; train, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and train the neural speech synthesis model by: synthesize a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples, compute corresponding probabilities for the synthesized speech samples using the discriminator, and compute a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.
15. The computer system of claim 14, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.
16. The computer system of claim 14, wherein the neural speech synthesis model is further configured to:
- enable download of the synthesized speech audio.
17. The computer system of claim 14, wherein the neural speech synthesis model is further configured to:
- enable playback of the synthesized speech audio.
18. The computer system of claim 14, wherein the neural speech synthesis model is further configured to:
- provide a graphical user interface that includes one of a text input field or a voice property value input field.
19. The computer system of claim 14, wherein the string of text is associated with at least one text tag.
20. The computer system of claim 14, wherein the system uses a transcription of source samples, and wherein the instructions when executed further cause the computer system to:
- compute a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.
Type: Application
Filed: Jul 14, 2023
Publication Date: Jan 18, 2024
Applicant: SoundHound, Inc. (Santa Clara, CA)
Inventor: Andrew RICHARDS (Toulouse)
Application Number: 18/352,980