Abstract: A transcription of a query for content discovery is generated, and a context of the query is identified, as well as a first plurality of candidate entities to which the query refers. A search is performed based on the context of the query and the first plurality of candidate entities, and results are generated for output. A transcription of a second voice query is then generated, and it is determined whether the second transcription includes a trigger term indicating a corrective query. If so, the context of the first query is retrieved. A second term of the second query similar to a term of the first query is identified, and a second plurality of candidate entities to which the second term refers is determined. A second search is performed based on the second plurality of candidates and the context, and new search results are generated for output.
Type:
Grant
Filed:
March 2, 2023
Date of Patent:
April 16, 2024
Assignee:
Rovi Guides, Inc.
Inventors:
Jeffry Copps Robert Jose, Sindhuja Chonat Sri
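The corrective-query flow described in this abstract can be sketched in a few lines. This is a toy illustration only, not the patented method: the trigger terms, the `difflib` similarity test, and the 0.6 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Assumed trigger terms that mark a follow-up query as corrective.
TRIGGER_TERMS = {"no", "not", "i meant", "actually"}

def is_corrective(transcription: str) -> bool:
    """Return True if the transcription starts with a trigger term."""
    text = transcription.lower()
    return any(text.startswith(t) for t in TRIGGER_TERMS)

def similar_term(prev_terms, new_term, threshold=0.6):
    """Find the first-query term most similar to a term of the
    corrective query, so the stored context can be reused."""
    best, best_score = None, 0.0
    for term in prev_terms:
        score = SequenceMatcher(None, term, new_term).ratio()
        if score > best_score:
            best, best_score = term, score
    return best if best_score >= threshold else None
```

With the first query's terms cached, `similar_term(["harry potter", "frozen"], "harry porter")` picks out `"harry potter"` as the term being corrected, and the second search can reuse the first query's context.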
Abstract: Speech processing techniques are disclosed that enable determining a text representation of alphanumeric sequences in captured audio data. Various implementations include determining a contextual biasing finite state transducer (FST) based on contextual information corresponding to the captured audio data. Additional or alternative implementations include modifying probabilities of one or more candidate recognitions of the alphanumeric sequence using the contextual biasing FST, where the FST further comprises a grammar as well as a speller finite state transducer.
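A full implementation of the technique above would compose the recognizer's hypothesis lattice with a weighted contextual-biasing FST (plus the grammar and speller FSTs). As a rough stand-in, the sketch below only re-weights whole candidate transcriptions: it boosts the log-probability of hypotheses containing a contextual phrase. The bonus value and substring matching rule are assumptions, not taken from the patent.

```python
def rescore(candidates, context_terms, bonus=1.5):
    """candidates: list of (text, log_prob) recognition hypotheses.
    Add a fixed log-probability bonus to hypotheses containing any
    contextual term, then sort best-first."""
    rescored = [
        (text, logp + (bonus if any(t in text for t in context_terms) else 0.0))
        for text, logp in candidates
    ]
    return sorted(rescored, key=lambda pair: -pair[1])
```

For an alphanumeric sequence such as a flight number, a contextual term like `"ba123"` pulls the matching hypothesis to the top even when its acoustic score was lower.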
Abstract: An information processing system includes at least one memory storing a program and at least one processor. The at least one processor implements the program to input a piece of sound source data obtained by encoding first identification data representative of a sound source, a piece of style data obtained by encoding second identification data representative of a performance style, and synthesis data representative of sounding conditions into a synthesis model generated by machine learning; to generate, using the synthesis model, feature data representative of acoustic features of a target sound of the sound source to be generated in the performance style and according to the sounding conditions; and to generate an audio signal corresponding to the target sound using the generated feature data.
Abstract: An apparatus for encoding an audio signal includes: a core encoder for core encoding first audio data in a first spectral band; a parametric coder for parametrically coding second audio data in a second spectral band being different from the first spectral band, wherein the parametric coder includes: an analyzer for analyzing first audio data in the first spectral band to obtain a first analysis result and for analyzing second audio data in the second spectral band to obtain a second analysis result; a compensator for calculating a compensation value using the first analysis result and the second analysis result; and a parameter calculator for calculating a parameter from the second audio data in the second spectral band using the compensation value.
Type:
Grant
Filed:
August 11, 2022
Date of Patent:
March 19, 2024
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Sascha Disch, Franz Reutelhuber, Jan Büthe, Markus Multrus, Bernd Edler
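The analyzer/compensator/parameter-calculator chain in this abstract can be illustrated with a minimal sketch. The concrete analysis (band energy), the log-ratio compensation value, and the `alpha` tuning constant are all assumptions for illustration; the patent does not specify them.

```python
import math

def band_energy(spectrum):
    """Energy of a band's spectral coefficients (the assumed 'analysis result')."""
    return sum(x * x for x in spectrum)

def compensation(first_band, second_band):
    """Illustrative compensation value: log energy ratio of the
    core-coded band to the parametrically coded band."""
    e1, e2 = band_energy(first_band), band_energy(second_band)
    return math.log10((e1 + 1e-12) / (e2 + 1e-12))

def band_parameter(second_band, comp, alpha=0.5):
    """A parameter for the second band, adjusted by the compensation
    value (alpha is an assumed tuning constant)."""
    return band_energy(second_band) * (1.0 + alpha * comp)
```

When both bands have equal energy the compensation value is zero and the parameter reduces to the plain band energy.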
Abstract: A method for generating speech includes uploading a reference set of features that were extracted from sensed movements of one or more target regions of skin on faces of one or more reference human subjects in response to words articulated by the subjects and without contacting the one or more target regions. A test set of features is extracted from the sensed movements of at least one of the target regions of skin on a face of a test subject in response to words articulated silently by the test subject and without contacting the one or more target regions. The extracted test set of features is compared to the reference set of features, and, based on the comparison, a speech output is generated that includes the articulated words of the test subject.
Type:
Grant
Filed:
March 7, 2023
Date of Patent:
February 20, 2024
Assignee:
Q (Cue) Ltd.
Inventors:
Aviad Maizels, Avi Barliya, Yonatan Wexler
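The comparison step above, matching a test feature set against a reference feature set, can be caricatured as nearest-neighbour lookup. This is a deliberately simplistic sketch: the Euclidean distance, the word-to-vector reference table, and the feature dimensionality are illustrative assumptions, not the patented comparison.

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def decode_word(test_features, reference_set):
    """reference_set: {word: feature_vector} built from the reference
    subjects. Return the word whose reference features are closest to
    the silently articulated test features."""
    return min(reference_set,
               key=lambda w: euclidean(reference_set[w], test_features))
```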
Abstract: An apparatus for generating an enhanced signal from an input signal, wherein the enhanced signal has spectral values for an enhancement spectral region, the spectral values for the enhancement spectral region not being contained in the input signal, includes a mapper for mapping a source spectral region of the input signal to a target region in the enhancement spectral region, the source spectral region including a noise-filling region; and a noise filler configured for generating first noise values for the noise-filling region in the source spectral region of the input signal and for generating second noise values for a noise region in the target region, wherein the second noise values are decorrelated from the first noise values in the source region.
Type:
Grant
Filed:
January 19, 2022
Date of Patent:
February 20, 2024
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Sascha Disch, Ralf Geiger, Andreas Niedermeier, Matthias Neusinger, Konstantin Schmidt, Stephan Wilde, Benjamin Schubert, Christian Neukam
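The decorrelation requirement in this abstract, second noise values that are decorrelated from the first, can be illustrated by simply drawing independent noise for the source and target regions. The Gaussian distribution, the seed, and the scale are assumptions for the sketch; a codec would shape the noise to the band's envelope.

```python
import random

def fill_noise(source_len, target_len, scale=1.0, seed=0):
    """Generate first noise values for the source noise-filling region
    and second, decorrelated noise values for the target region by
    drawing independent samples from the same generator."""
    rng = random.Random(seed)
    first = [rng.gauss(0.0, scale) for _ in range(source_len)]
    # Independent draws: the second sequence is decorrelated from the first.
    second = [rng.gauss(0.0, scale) for _ in range(target_len)]
    return first, second
```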
Abstract: A plurality of pieces of emotional state information corresponding to a plurality of speech frames in a current utterance is obtained based on a first neural network model. A statistical operation is performed on the plurality of pieces of emotional state information to obtain a statistical result. The emotional state information corresponding to the current utterance is then obtained based on a second neural network model, the statistical result corresponding to the current utterance, and statistical results corresponding to a plurality of utterances before the current utterance.
Type:
Grant
Filed:
October 15, 2021
Date of Patent:
February 13, 2024
Assignee:
Huawei Technologies Co., Ltd.
Inventors:
Yang Zhang, Oxana Verkholyak, Alexey Karpov, Li Qian
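The two-stage scheme above, per-frame emotion posteriors pooled into utterance statistics and then combined with the history of earlier utterances, can be sketched as follows. Mean pooling stands in for the patent's "statistical operation" and a fixed-weight blend stands in for the second neural network; both are assumptions for illustration.

```python
def pool_frames(frame_posteriors):
    """Mean-pool per-frame emotion posteriors into an utterance-level
    statistical result."""
    n = len(frame_posteriors)
    dims = len(frame_posteriors[0])
    return [sum(f[i] for f in frame_posteriors) / n for i in range(dims)]

def utterance_emotion(current_stats, history_stats, weight=0.7):
    """Blend the current utterance's statistics with statistics of
    earlier utterances; a toy stand-in for the second network."""
    if not history_stats:
        return current_stats
    hist = pool_frames(history_stats)
    return [weight * c + (1 - weight) * h
            for c, h in zip(current_stats, hist)]
```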
Abstract: This application discloses an audio processing method and a terminal. The method may include: collecting, by a first terminal, an original speech of a first user, translating the original speech of the first user into a translated speech of the first user, receiving an original speech of a second user that is sent by a second terminal, and translating the original speech of the second user into a translated speech of the second user; sending at least one of the original speech of the first user, the translated speech of the first user, and the translated speech of the second user to the second terminal based on a first setting; and playing at least one of the original speech of the second user, the translated speech of the second user, and the translated speech of the first user based on a second setting.
Abstract: The present invention provides a multilingual speech recognition and translation method for a conference. The conference includes at least one attendee, and the method includes: receiving, at a server, at least one piece of audio data and at least one piece of video data generated by at least one terminal apparatus; analyzing the video data to generate a video recognition result related to the attendance and ethnicity of the attendee, as well as the body movements and facial movements of the attendee when talking; generating at least one language family recognition result according to the video recognition result and the audio data, and obtaining a plurality of audio segments corresponding to the attendee; performing speech recognition on and translating the audio segments; and displaying a translation result on the terminal apparatus. The method further determines the quantity of conference attendees according to their respective distances from their device microphones.
Abstract: A method is described that processes an audio signal. A discontinuity between a filtered past frame and a filtered current frame of the audio signal is removed using linear predictive filtering.
Type:
Grant
Filed:
February 3, 2022
Date of Patent:
January 9, 2024
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Emmanuel Ravelli, Manuel Jander, Grzegorz Pietrzyk, Martin Dietz, Marc Gayer
Abstract: An apparatus for processing an encoded audio signal, which includes a sequence of access units, each access unit including a core signal with a first spectral width and parameters describing a spectrum above the first spectral width, has a demultiplexer generating, from an access unit of the encoded audio signal, the core signal and a set of the parameters; an upsampler upsampling the core signal of the access unit and outputting a first upsampled spectrum and a temporally consecutive second upsampled spectrum, both having the same content as the core signal and having a second spectral width greater than the first spectral width of the core spectrum; a parameter converter converting parameters of the set of parameters of the access unit to obtain converted parameters; and a spectral gap filling processor processing the first upsampled spectrum and the second upsampled spectrum using the converted parameters.
Type:
Grant
Filed:
August 19, 2021
Date of Patent:
January 2, 2024
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Abstract: A method and system for extracting and labeling Named-Entity Recognition (NER) data in a target language for use in a multi-lingual software module has been developed. First, a textual sentence is translated to the target language using a translation module. A named entity is identified and extracted within the translated sentence. The named entity is identified by either: exact mapping; a semantically similar translated named entity that meets a predetermined minimum threshold of similarity; or utilizing a rule-based library for the target language. Once identified, the named entity is labeled with a pre-determined category and stored in a retrievable electronic database.
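The entity-identification fallback chain described above (exact mapping, then a semantically similar match above a minimum threshold) can be sketched with a fuzzy window search over the translated sentence. The `difflib` similarity measure and the 0.8 threshold are illustrative assumptions standing in for the patent's semantic-similarity test; the rule-based-library fallback is omitted.

```python
from difflib import SequenceMatcher

def map_entity(translated_sentence, translated_entity, threshold=0.8):
    """Locate the translated named entity in the translated sentence:
    exact mapping first, then the best fuzzy word-window match that
    meets the minimum similarity threshold."""
    if translated_entity in translated_sentence:
        return translated_entity
    words = translated_sentence.split()
    n = len(translated_entity.split())
    best, best_score = None, 0.0
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, window.lower(),
                                translated_entity.lower()).ratio()
        if score > best_score:
            best, best_score = window, score
    return best if best_score >= threshold else None
```

The returned span would then be labeled with the pre-determined category and stored.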
Abstract: Systems and methods for matching entities to target objects using an ensemble model are disclosed. The ensemble model includes a general trained machine learning (ML) model (which is trained using the entirety of a training dataset) and a subarea trained ML model (which is trained using a subset of the training dataset corresponding to a specific, defined subarea) that provides potential matches to a meta-model of the ensemble model to generate a final match. The ensemble model may also include a general trained natural language processing (NLP) model and a subarea trained NLP model that provides potential matches to the meta-model. The meta-model of a quad-ensemble ML model combines the four potential matches (such as probabilities and similarities of matching specific pairs of targets objects and entities) to generate a final match (such as a final probability used to identify the final match).
Type:
Grant
Filed:
November 21, 2022
Date of Patent:
December 12, 2023
Assignee:
Intuit Inc.
Inventors:
Natalie Bar Eliyahu, Noga Noff, Omer Wosner, Yair Horesh
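The meta-model of the quad-ensemble above combines four base-model scores into one final probability. The sketch below uses a fixed weighted average as a stand-in; an actual meta-model would learn these weights, and the equal 0.25 weights are an assumption.

```python
def meta_model(scores, weights=None):
    """Combine the four base scores (general ML, subarea ML, general
    NLP, subarea NLP) into a final match probability by weighted
    averaging; a learned meta-model would replace the fixed weights."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(s * w for s, w in zip(scores, weights))

def best_match(entity_scores):
    """entity_scores: {target_object: [four base scores]}.
    Return the target object with the highest combined probability."""
    return max(entity_scores, key=lambda t: meta_model(entity_scores[t]))
```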
Abstract: The system provides a synthesized speech response to a voice input, based on the prosodic character of the voice input. The system receives the voice input and calculates at least one prosodic metric of the voice input. The at least one prosodic metric can be associated with a word, phrase, grouping thereof, or the entire voice input. The system also determines a response to the voice input, which may include the sequence of words that form the response. The system generates the synthesized speech response by determining prosodic characteristics based on the response and on the prosodic character of the voice input. The system outputs the synthesized speech response, which provides a more natural and relevant answer to the voice input. The prosodic character of the voice input and/or response may include pitch, note, duration, prominence, timbre, rate, and rhythm, for example.
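Two of the prosodic metrics named above, pitch and rate, can be computed trivially from per-frame pitch tracks and per-word durations. The input representations here (an f0 track with 0 marking unvoiced frames, word durations in seconds) are assumptions for the sketch, not the patent's feature set.

```python
def prosodic_metrics(f0_values, durations):
    """Toy prosodic metrics over a voice input.
    f0_values: per-frame fundamental frequency in Hz (0 = unvoiced);
    durations: per-word durations in seconds."""
    voiced = [f for f in f0_values if f > 0]
    mean_pitch = sum(voiced) / len(voiced) if voiced else 0.0
    rate = len(durations) / sum(durations) if durations else 0.0  # words/sec
    return {"mean_pitch_hz": mean_pitch, "rate_wps": rate}
```

A responder could then raise or lower the synthesis pitch target toward `mean_pitch_hz` and pace the response near `rate_wps`.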
Abstract: According to one embodiment, a signal processing apparatus correlates a plurality of communication terminals as a group and enables one-to-many communications in the group. The signal processing apparatus includes processing circuitry. The processing circuitry assigns a transmission right to one of the communication terminals in the group. The processing circuitry generates text data based on voice data from said one of the communication terminals in possession of the transmission right. The processing circuitry gives a texting completion notice indicative of completion of texting processing to the communication terminals in the group. The processing circuitry transmits, after the texting completion notice is given, the generated text data to at least one of the communication terminals in the group.
Abstract: The present disclosure provides techniques for graphics translation. A plurality of natural language image descriptions is collected for an image of a product. An overall description for the image is generated using one or more models, based on the plurality of natural language image descriptions, by: identifying a set of shared descriptors used in at least a subset of the plurality of natural language image descriptions, and aggregating the set of shared descriptors to form the overall description. A first request to provide a description of the image is received, and the overall description is returned in response to the first request, where the overall description is output using one or more text-to-speech techniques.
Type:
Grant
Filed:
July 27, 2021
Date of Patent:
October 31, 2023
Assignee:
Toshiba Global Commerce Solutions Holdings Corporation
Inventors:
Manda Miller, Kirk Goldman, Jon A. Hoffman, John Pistone, Dimple Nanwani, Theodore Clark
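The shared-descriptor aggregation above can be sketched as a frequency cut over the collected descriptions. Treating single words as descriptors, the 50% cutoff, and the stopword list are all assumptions; the patent's models could use richer phrase-level descriptors.

```python
from collections import Counter

def shared_descriptors(descriptions, min_fraction=0.5, stopwords=None):
    """Return descriptors (here: words) that appear in at least
    min_fraction of the collected descriptions."""
    stopwords = stopwords or {"a", "an", "the", "with", "of", "and"}
    counts = Counter()
    for desc in descriptions:
        # Count each word once per description.
        counts.update({w for w in desc.lower().split() if w not in stopwords})
    cutoff = min_fraction * len(descriptions)
    return sorted(w for w, c in counts.items() if c >= cutoff)

def overall_description(descriptions):
    """Aggregate the shared descriptors into one overall description."""
    return "A product described as: " + ", ".join(shared_descriptors(descriptions))
```

The overall description would then be rendered with text-to-speech when a description request arrives.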
Abstract: A system of reducing transmissions of packetized data in a voice activated data packet based computer network environment is provided. A natural language processor component can parse an input audio signal to identify a request and a trigger keyword. Based on the input audio signal, a direct action application programming interface can generate a first action data structure, and a content selector component can select a content item. An interface management component can identify candidate interfaces and determine whether prior instances of the packetized data were transmitted to the candidate interfaces. The interface management component can prevent the transmission of the packetized data if it is determined to be redundant, such as when the interface has previously received the data, and instead transmit it to a separate client device of a different device type.
Abstract: Methods and systems for voice-based identification of related products/services are provided. Exemplary systems may include a wireless communication-based tag reader that polls for a wireless transmission-based tag and reads information associated with the wireless transmission-based tag and a processor that executes instructions to identify a product/service associated with the wireless transmission-based tag, identify a plurality of products/services stored in a product/service database identified as related to the product/service associated with the wireless transmission-based tag based on a trend related to prior purchases to identify a related product/service, and generate a voice-based utterance based on the identified set of one or more related products/services.
Abstract: Example techniques involve systems with multiple acoustic echo cancellers. An example implementation captures first audio within an acoustic environment and detects, within the captured first audio content, a wake-word. In response to the wake-word and before playing an acknowledgement tone, the implementation activates (a) a first sound canceller when one or more speakers are playing back audio content or (b) a second sound canceller when the one or more speakers are idle. In response to the wake-word and after activating either (a) the first sound canceller or (b) the second sound canceller, the implementation outputs the acknowledgement tone via the one or more speakers. The implementation then captures second audio within the acoustic environment and cancels the acoustic echo of the acknowledgement tone from the captured second audio using the activated sound canceller.
Abstract: Extracting data from documents is challenging due to the variation in structure, content, and styles across geographies and functional areas. Further, complex relation types are characterized by one or more of: N-ary entity mention arguments, cross-sentence spans of entity mentions for a relation mention, missing entity mention arguments, and multi-valued entity mention arguments. The present disclosure addresses these gaps in the art by extracting entity mentions and relation mentions using a joint neural network model including two sequence labelling layers that are trained jointly. The mentions are extracted from documents to facilitate downstream processing. A first RNN layer creates sentence embeddings for each sentence in the document being processed and predicts entity mentions. A second RNN layer predicts labels for each sentence span corresponding to a relation type.
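The output side of a sequence-labelling layer like the first RNN layer above is typically a BIO tag sequence that must be decoded into entity-mention spans. The decoder below is a generic sketch of that step (the BIO scheme is a common convention, not something this patent specifies), tolerating an `I-` tag without a preceding `B-`.

```python
def decode_bio(tokens, tags):
    """Decode BIO tags (as a sequence-labelling layer would emit) into
    (label, mention_text) spans."""
    mentions, start, label = [], None, None
    # Append a sentinel "O" so a mention ending at the last token is flushed.
    for i, tag in enumerate(list(tags) + ["O"]):
        if (tag.startswith("B-") or tag == "O") and start is not None:
            mentions.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]  # tolerate I- without a preceding B-
    return mentions
```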