Abstract: A system and method are disclosed for ignoring a wakeword received at a speech-enabled listening device when it is determined the wakeword is reproduced audio from an audio-playing device. Determination can be by detecting audio distortions, by an ignore flag sent locally between an audio-playing device and speech-enabled device, by an ignore flag sent from a server, by comparison of received audio or played audio to a wakeword within an audio-playing device or a speech-enabled device, and other means.
Type:
Application
Filed:
February 4, 2020
Publication date:
August 5, 2021
Applicant:
SoundHound, Inc.
Inventors:
Hsuan Yang, Qìndí Zhäng, Warren S. Heit
Abstract: A method of providing a platform for configuring device-specific speech recognition is provided. The method includes providing a user interface for developers to select a set of at least two acoustic models appropriate for a specific type of a device, receiving, from a developer, a selection of the set of the at least two acoustic models, and configuring a speech recognition system to perform device-specific speech recognition by using one acoustic model selected from the at least two acoustic models of the set.
Abstract: A method of building a natural language understanding application is provided. The method includes receiving at least one electronic record containing programming code and creating executable code from the programming code. Further, the executable code, when executed by a processor, causes the processor to create a parse and an interpretation of a sequence of input tokens. The programming code includes an interpret-block, and the interpret-block includes an interpret-statement. Additionally, the interpret-statement includes a pattern expression and an action statement.
Type:
Application
Filed:
April 8, 2021
Publication date:
July 22, 2021
Applicant:
SoundHound, Inc.
Inventors:
Bernard Mont-Reynaud, Seyed M. Emami, Chris Wilson, Keyvan Mohajer
Abstract: A voice morphing apparatus having adjustable parameters is described. The disclosed system and method include a voice morphing apparatus that morphs input audio to mask a speaker's identity. Parameter adjustment uses evaluation of an objective function that is based on the input audio and output of the voice morphing apparatus. The voice morphing apparatus includes objectives that are based adversarially on speaker identification and positively on audio fidelity. Thus, the voice morphing apparatus is adjusted to reduce identifiability of speakers while maintaining fidelity of the morphed audio. The voice morphing apparatus may be used as part of an automatic speech recognition system.
Abstract: A method and system for responding to multiple voice requests sent from a group of devices in substantive response to a single spoken utterance of a user. In one embodiment, if the devices have a same group ID, a server determines if any of the group of received voice requests are duplicate. In one embodiment, voice requests received within a predetermined time window are examined to determine if they are duplicate. If so, the server deems one of the received voice requests as non-duplicate and the others as duplicate and sends a substantive response for the non-duplicate voice request. In some embodiments, a no-op is sent to the devices that do not receive the substantive response.
Type:
Application
Filed:
January 6, 2020
Publication date:
July 8, 2021
Applicant:
SoundHound, Inc.
Inventors:
Arvinderpal S. Wander, Evelyn Jiang, Matthias Eichstaedt, Timothy Calhoun
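The duplicate-request handling described in the abstract above (multiple voice requests from a group of devices for one utterance) might be sketched roughly as follows. This is only an illustration under stated assumptions: the `VoiceRequest` fields, the one-second window, and the choice of the earliest request as the non-duplicate are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class VoiceRequest:
    device_id: str
    group_id: str
    received_at: float  # arrival time at the server, in seconds


def assign_responses(requests, window_seconds=1.0):
    """Return (device_id, decision) pairs in arrival order.

    Assumption: within one group, the earliest request in a time window is
    treated as non-duplicate; later requests inside the window get a no-op.
    """
    decisions = []
    last_primary = {}  # group_id -> arrival time of the latest non-duplicate
    for req in sorted(requests, key=lambda r: r.received_at):
        anchor = last_primary.get(req.group_id)
        if anchor is not None and req.received_at - anchor <= window_seconds:
            decisions.append((req.device_id, "no-op"))
        else:
            last_primary[req.group_id] = req.received_at
            decisions.append((req.device_id, "substantive"))
    return decisions
```

Requests from devices with different group IDs are never compared with each other, which mirrors the "same group ID" condition in the abstract.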
Abstract: Systems and methods for training a voice morphing apparatus are described. The voice morphing apparatus is trained to morph input audio data to mask an identity of a speaker. Training is performed by evaluating an objective function that is a function of the input audio data and an output of the voice morphing apparatus. The objective function may have a first term that is based on speaker identification and a second term that is based on audio fidelity. By optimizing the objective function, parameters of the voice morphing apparatus may be adjusted so as to reduce a confidence of speaker identification and maintain an audio fidelity of the morphed audio data. The voice morphing apparatus, once trained, may be used as part of an automatic speech recognition system.
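One way to write down a two-term objective of the kind this abstract describes is sketched below. The notation is an assumption made for illustration and does not come from the application: f_theta is the morphing apparatus with parameters theta, x is the input audio, s is the true speaker, p_spk is the confidence of a speaker identifier, d is an audio distortion measure, and alpha and beta are weighting constants.

```latex
\min_{\theta} \; J(\theta)
  = \alpha \, p_{\mathrm{spk}}\bigl(s \mid f_{\theta}(x)\bigr)
  + \beta \, d\bigl(x, f_{\theta}(x)\bigr),
  \qquad \alpha, \beta > 0
```

Minimizing the first term lowers the confidence of speaker identification on the morphed audio, while the second term penalizes loss of fidelity relative to the input.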
Abstract: A system and method are disclosed for capturing a segment of speech audio, performing phoneme recognition on the segment of speech audio to produce a segmented phoneme sequence, comparing the segmented phoneme sequence to stored phoneme sequences that represent incorrect pronunciations of words to determine if there is a match, and identifying an incorrect pronunciation for a word in the segment of speech audio. The system builds a library based on the data collected for the incorrect pronunciations.
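The matching step in the abstract above (comparing a recognized phoneme sequence with stored incorrect pronunciations) could look roughly like the sketch below. The library contents, the ARPAbet-style phoneme symbols, and the function name are hypothetical examples, not details from the application.

```python
# Hypothetical library: incorrect phoneme sequence -> intended word.
MISPRONUNCIATIONS = {
    ("N", "UW", "K", "Y", "AH", "L", "ER"): "nuclear",
    ("EH", "K", "S", "P", "R", "EH", "S", "OW"): "espresso",
}


def find_mispronunciations(phonemes, library=MISPRONUNCIATIONS):
    """Scan a phoneme sequence for stored incorrect pronunciations.

    Returns a list of (start_index, intended_word) for every exact match.
    """
    hits = []
    for start in range(len(phonemes)):
        for wrong, word in library.items():
            end = start + len(wrong)
            if tuple(phonemes[start:end]) == wrong:
                hits.append((start, word))
    return hits
```

This sketch only performs exact matching; a real system would likely need some tolerance for phoneme recognition errors.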
Abstract: Systems and methods for distributed training of a neural network model are described. Various embodiments include a master device and a slave device. The master device has a first version of the neural network model. The slave device is communicatively coupled to a first data source and the master device, and the first data source is inaccessible by the master device, in accordance with one embodiment. The slave device is remote from the master device. The master device is configured to output first configuration data for the neural network model based on the first version of the neural network model. The slave device is configured to use the first configuration data to instantiate a second version of the neural network model. The slave device is configured to train the second version of the neural network model using data from the first data source and to output second configuration data for the neural network model.
Abstract: Training and enhancement of neural network models, such as from private data, are described. A slave device receives a version of a neural network model from a master. The slave accesses a local and/or private data source and uses the data to perform optimization of the neural network model. This can be done, for example, by computing gradients or by performing knowledge distillation to locally train an enhanced second version of the model. The slave sends the gradients or enhanced neural network model to a master. The master may use the gradients or second version of the model to improve a master model.
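The gradient-sharing variant in the abstract above can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "model" is a single weight fitted to y ≈ weight * x, the loss is mean squared error, and the function names, learning rate, and round count are hypothetical. The point is only that each slave computes a gradient on data the master never sees.

```python
def local_gradient(weight, private_data):
    """Slave side: mean squared-error gradient for y ≈ weight * x."""
    n = len(private_data)
    return sum(2 * (weight * x - y) * x for x, y in private_data) / n


def master_update(weight, gradients, learning_rate=0.05):
    """Master side: average the received gradients and take one step."""
    mean_grad = sum(gradients) / len(gradients)
    return weight - learning_rate * mean_grad


def train(weight, slaves, rounds=50):
    """Alternate between slave gradient computation and master updates."""
    for _ in range(rounds):
        grads = [local_gradient(weight, data) for data in slaves]
        weight = master_update(weight, grads)
    return weight
```

For example, `train(0.0, [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]])` moves the weight toward 3, the slope shared by both private data sets.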
Abstract: A system and method are disclosed capable of parsing a spoken utterance into a natural language request and a speech audio segment, where the natural language request directs the system to use the speech audio segment as a new wakeword. In response to this wakeword assignment directive, the system and method are further capable of immediately building a new wakeword spotter to activate the device upon matching the new wakeword in the input audio. Different approaches to promptly building a new wakeword spotter are described. Variations of wakeword assignment directives can make the new wakeword public or private. They can also add the new wakeword to earlier wakewords, or replace earlier wakewords.
Abstract: A neural speech-to-meaning system is trained on speech audio expressing specific intents. The system receives speech audio and produces indications of when the speech in the audio matches the intent. Intents may include variables that can have a large range of values, such as the names of places. The neural speech-to-meaning system simultaneously recognizes enumerated values of variables and general intents. Recognized variable values can serve as arguments to API requests made in response to recognized intents. Accordingly, neural speech-to-meaning supports voice virtual assistants that serve users based on API hits.
Abstract: To train a speech recognizer, such as for recognizing variables in a neural speech-to-meaning system, compute, within an embedding space, a range of vectors of features of natural speech. Generate parameter sets for speech synthesis and synthesize speech according to the parameters. Analyze the synthesized speech to compute vectors in the embedding space. Using a cost function that favors an even spread (minimal clustering), generate a multiplicity of speech synthesis parameter sets. Using the multiplicity of parameter sets, generate a multiplicity of speech of known words that can be used as training data for speech recognition.
Abstract: A method is provided for advertisement selection. The method includes recognizing words from user speech over a large number of interactions, computing a number of unique words uttered during the interactions, classifying the user by the number of unique words uttered during the interactions, and selecting an advertisement targeted to the classified users.
Type:
Grant
Filed:
April 18, 2019
Date of Patent:
June 8, 2021
Assignee:
SoundHound, Inc.
Inventors:
Jun Huang, Kiran Garaga Lokeswarappa, Joel Gedalius, Bernard Mont-Reynaud
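The advertisement-selection steps in the abstract above (count unique words, classify the user, select an ad) might be sketched as follows. The thresholds, class labels, and ad names are hypothetical, and the thresholds are unrealistically small so the example stays short; the abstract speaks of a large number of interactions.

```python
def count_unique_words(transcripts):
    """Count distinct words across all recognized utterances."""
    vocabulary = set()
    for text in transcripts:
        vocabulary.update(text.lower().split())
    return len(vocabulary)


def classify_user(unique_word_count, thresholds=((5, "small"), (10, "medium"))):
    """Map a vocabulary size to a class label (toy thresholds, assumed)."""
    for upper_bound, label in thresholds:
        if unique_word_count < upper_bound:
            return label
    return "large"


# Hypothetical mapping from user class to a targeted advertisement.
ADS_BY_CLASS = {"small": "ad_a", "medium": "ad_b", "large": "ad_c"}


def select_ad(transcripts):
    """Pick the ad targeted at the user's vocabulary class."""
    return ADS_BY_CLASS[classify_user(count_unique_words(transcripts))]
```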
Abstract: A method for processing a natural language query. The method includes receiving a text query, the query referring to a plurality of objects, attributes, qualifiers and other arguments, and parsing the query to produce an argument tree representing the substance and structure of the query. The method also includes the capability to define qualifiers as being possibly projectable onto other arguments and indicate their direction of projectability, and the capability to denote nodes of the argument tree as foldable, as splittable, or as containing sequences of qualifier arguments. The method additionally includes defining validity rules for a domain of knowledge, used to determine whether a list of arguments forms a valid granular query component, and processing the argument tree in view of the above to derive a corresponding plurality of granular query components that collectively request the plurality of pieces of information representing the intent of the query.
Abstract: The technology disclosed relates to performing speech recognition for a plurality of different devices or devices in a plurality of conditions. This includes storing a plurality of acoustic models associated with different devices or device conditions, receiving speech audio including natural language utterances, receiving metadata indicative of a device type or device condition, selecting an acoustic model from the plurality in dependence upon the received metadata, and employing the selected acoustic model to recognize speech from the natural language utterances included in the received speech audio. Each of speech recognition and the storage of acoustic models can be performed locally by devices or on a network-connected server. Also provided is a platform and interface, used by device developers to select, configure, and/or train acoustic models for particular devices and/or conditions.
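The model-selection step in the abstract above (choosing an acoustic model in dependence upon received metadata) could be sketched like this. The metadata keys, model names, and fallback order are assumptions made for illustration and are not specified in the abstract.

```python
# Hypothetical registry: (device_type, device_condition) -> acoustic model name.
# A condition of None marks the model used for a device type in general.
ACOUSTIC_MODELS = {
    ("car", "window_open"): "car_noisy_model",
    ("car", None): "car_model",
    ("smart_speaker", None): "far_field_model",
}
DEFAULT_MODEL = "general_model"


def select_acoustic_model(metadata, models=ACOUSTIC_MODELS, default=DEFAULT_MODEL):
    """Pick the most specific model available for the received metadata."""
    device = metadata.get("device_type")
    condition = metadata.get("device_condition")
    # Try an exact (device, condition) match, then a device-only match.
    for key in ((device, condition), (device, None)):
        if key in models:
            return models[key]
    return default
```

The same lookup could run on a device or on a network-connected server, since the abstract allows either location for model storage and recognition.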
Abstract: A command-processing server provides natural language processing services to applications. The command-processing server stores a set of code blocks, each code block being able to interpret a set of corresponding natural language expressions. The command-processing server accepts natural language expressions and identifies the code blocks that are capable of interpreting those expressions by attempting to parse the natural language expressions using the code blocks. The command-processing server then provides a list of the identified code blocks to the developers, who can then incorporate the code blocks into their applications.
Abstract: The technology disclosed relates to authoring of vertical applications of natural language understanding (NLU), which analyze text or utterances and construct their meaning. In particular, it relates to new programming constructs and tools and data structures implementing those new applications.
Type:
Grant
Filed:
December 4, 2018
Date of Patent:
May 4, 2021
Assignee:
SoundHound, Inc.
Inventors:
Keyvan Mohajer, Seyed M. Emami, Chris Wilson, Bernard Mont-Reynaud
Abstract: [Object] Technology is provided to enable a mobile terminal to function as a digital assistant even when the mobile terminal is in a state where it cannot communicate with a server apparatus. [Solution] When a user terminal 200 receives a query A from a user, user terminal 200 sends query A to a server 100. Server 100 interprets the meaning of query A using a grammar A. Server 100 obtains a response to query A based on the meaning of query A and sends the response to user terminal 200. Server 100 further sends grammar A to user terminal 200. That is, server 100 sends to user terminal 200 a grammar used to interpret the query received from user terminal 200.
Abstract: A system and method for masking an identity of a speaker of natural language speech, such as speech clips to be labeled by humans in a system generating voice transcriptions for training an automatic speech recognition model. The natural language speech is morphed prior to being presented to the human for labeling. In one embodiment, morphing comprises pitch shifting the speech randomly either up or down, then frequency shifting the speech, then pitch shifting the speech in a direction opposite the first pitch shift.
Abstract: The technology disclosed relates to authoring of vertical applications of natural language understanding (NLU), which analyze text or utterances and construct their meaning. In particular, it relates to new programming constructs and tools and data structures implementing those new applications.
Type:
Grant
Filed:
March 15, 2013
Date of Patent:
March 23, 2021
Assignee:
SoundHound, Inc.
Inventors:
Keyvan Mohajer, Seyed Majid Emami, Chris Wilson, Bernard Mont-Reynaud