Abstract: An apparatus receives a first audio signal captured by a first microphone of a device and at least a second audio signal captured by at least a second microphone of the device. The apparatus estimates a diffuseness of sound based on the received first and at least second audio signals. The apparatus may then form at least one final audio signal based on at least one of the received first audio signal and the received at least second audio signal by adjusting an audibility of diffuse sound for the final audio signal in response to the estimated diffuseness, in order to enable an enhanced perception of sound with respect to at least one criterion with the at least one final audio signal.
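The diffuseness estimation described in the abstract above can be illustrated with a coherence-based estimator, a common approach for two-microphone setups: coherent (direct) sound yields low diffuseness, uncorrelated (diffuse) sound yields high diffuseness. This is an illustrative sketch, not the patented method; the frame length and the coherence-to-diffuseness mapping are assumptions.

```python
import numpy as np

def estimate_diffuseness(x1, x2, frame=256):
    """Estimate per-band diffuseness in [0, 1] from two microphone
    signals via magnitude-squared coherence: fully coherent (direct)
    sound gives ~0, uncorrelated (diffuse) sound gives ~1."""
    # Short-time spectra of each channel (rectangular frames for brevity).
    n = len(x1) // frame
    X1 = np.fft.rfft(x1[:n * frame].reshape(n, frame), axis=1)
    X2 = np.fft.rfft(x2[:n * frame].reshape(n, frame), axis=1)
    # Cross- and auto-spectral densities averaged over frames.
    s12 = np.mean(X1 * np.conj(X2), axis=0)
    s11 = np.mean(np.abs(X1) ** 2, axis=0)
    s22 = np.mean(np.abs(X2) ** 2, axis=0)
    coherence = np.abs(s12) ** 2 / (s11 * s22 + 1e-12)
    return 1.0 - coherence  # high value -> diffuse sound
```

A final audio signal could then attenuate or boost bands in proportion to this per-band estimate, depending on the perception criterion.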
Abstract: An encoder for encoding an audio signal includes an analyzer for analyzing the audio signal and for determining analysis prediction coefficients from the audio signal. The encoder includes a converter for deriving converted prediction coefficients from the analysis prediction coefficients, a memory for storing a multitude of correction values and a calculator. The calculator includes a processor for processing the converted prediction coefficients to obtain spectral weighting factors. The calculator includes a combiner for combining the spectral weighting factors and the multitude of correction values to obtain corrected weighting factors. A quantizer of the calculator is configured for quantizing the converted prediction coefficients using the corrected weighting factors to obtain a quantized representation of the converted prediction coefficients.
Type:
Grant
Filed:
May 5, 2016
Date of Patent:
November 14, 2017
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Konstantin Schmidt, Guillaume Fuchs, Matthias Neusinger, Martin Dietz
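The quantization step in the abstract above, i.e. quantizing converted prediction coefficients using spectral weighting factors combined with stored correction values, can be sketched as a weighted nearest-neighbor codebook search. This is an illustrative sketch under assumptions: the combination is taken to be element-wise multiplication, and the codebook and weights are hypothetical.

```python
import numpy as np

def quantize_lsf(lsf, codebook, weights, corrections):
    """Pick the codebook vector minimizing a weighted squared error,
    where the spectral weighting factors are combined (here:
    multiplied) with stored correction values."""
    w = weights * corrections                    # corrected weighting factors
    err = np.sum(w * (codebook - lsf) ** 2, axis=1)
    idx = int(np.argmin(err))
    return idx, codebook[idx]
```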
Abstract: Audio encoding methods/terminals, audio decoding methods/terminals, and audio codec systems are provided. A plurality of continuous audio signals is obtained. Whether each of the audio signals includes a designated signal type is determined according to an audio parameter of each audio signal. A marked audio encoding stream is obtained by marking each audio signal as having or not having the designated signal type. The marking is used, at a decoding terminal, to apply an enhancement process to the audio signals having the designated signal type; the enhancement process is not applied to audio signals that do not have the designated signal type.
Type:
Grant
Filed:
January 14, 2015
Date of Patent:
November 7, 2017
Assignee:
TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED
Inventors:
Guoming Chen, Yuanjiang Peng, Wenjun Ou, Hong Liu
Abstract: A comfort noise controller for generating CN (Comfort Noise) control parameters is described. A buffer of a predetermined size is configured to store CN parameters for SID (Silence Insertion Descriptor) frames and active hangover frames. A subset selector is configured to determine a CN parameter subset relevant for SID frames based on the age of the stored CN parameters and on residual energies. A comfort noise control parameter extractor (50B) is configured to use the determined CN parameter subset to determine the CN control parameters for a first SID frame following an active signal frame.
Abstract: Example methods and systems use multiple sensors to determine whether a speaker is speaking. Audio data in an audio-channel speech band detected by a microphone can be received. Vibration data in a vibration-channel speech band representative of vibrations detected by a sensor other than the microphone can be received. The microphone and the sensor can be associated with a head-mountable device (HMD). It is determined whether the audio data is causally related to the vibration data. If the audio data and the vibration data are causally related, an indication can be generated that the audio data contains HMD-wearer speech. Causally related audio and vibration data can be used to increase accuracy of text transcription of the HMD-wearer speech. If the audio data and the vibration data are not causally related, an indication can be generated that the audio data does not contain HMD-wearer speech.
Type:
Grant
Filed:
August 17, 2015
Date of Patent:
October 3, 2017
Assignee:
Google Inc.
Inventors:
Michael Patrick Johnson, Jianchun Dong, Mat Balez
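The causality test between microphone audio and vibration data in the abstract above can be approximated with a normalized cross-correlation check: if the two channels are strongly correlated at some lag, the audio is attributed to the HMD wearer. This is a simplified stand-in for a true causality analysis; the threshold and normalization are assumptions.

```python
import numpy as np

def is_wearer_speech(audio, vibration, threshold=0.5):
    """Declare HMD-wearer speech when the audio and vibration signals
    are strongly correlated at some lag (a crude stand-in for the
    causality test described in the abstract)."""
    a = (audio - audio.mean()) / (audio.std() + 1e-12)
    v = (vibration - vibration.mean()) / (vibration.std() + 1e-12)
    xcorr = np.correlate(a, v, mode="full") / len(a)
    return float(np.max(np.abs(xcorr))) > threshold
```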
Abstract: The present disclosure relates to an audio signal coding method and apparatus. The method includes categorizing audio signals into high-frequency audio signals and low-frequency audio signals, coding the low-frequency audio signals using a corresponding low-frequency coding manner according to characteristics of low-frequency audio signals, and selecting a bandwidth extension mode to code the high-frequency audio signals according to the low-frequency coding manner and/or characteristics of the audio signals.
Abstract: An audio encoding method and an apparatus are provided. The method includes: determining sparseness of distribution, on spectra, of energy of N input audio frames (101), where the N audio frames include a current audio frame, and N is a positive integer; and determining, according to the sparseness of distribution, on the spectra, of the energy of the N audio frames, whether to use a first encoding method or a second encoding method to encode the current audio frame (102), where the first encoding method is an encoding method that is based on time-frequency transform and transform coefficient quantization and that is not based on linear prediction, and the second encoding method is a linear-prediction-based encoding method. The method can reduce encoding complexity and ensure relatively high encoding accuracy.
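One simple way to realize the sparseness-driven encoder selection above is to measure how much spectral energy is concentrated in the largest transform coefficients across the N frames. This sketch uses a top-decile energy share as the sparseness measure; the measure, thresholds, and decision rule are assumptions, not the patented criteria.

```python
import numpy as np

def choose_encoder(frames, top_frac=0.1, sparse_thresh=0.7):
    """Share of total spectral energy held by the top 10% of transform
    coefficients across N frames. Sparse (concentrated) spectra favor
    transform coding; otherwise fall back to linear prediction."""
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    energy = np.sort(spectra.ravel())[::-1]      # descending energies
    k = max(1, int(top_frac * energy.size))
    concentration = energy[:k].sum() / (energy.sum() + 1e-12)
    return "transform" if concentration > sparse_thresh else "lpc"
```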
Abstract: An objective of the present invention is to correct a temporal envelope shape of a decoded signal with a small information volume and to reduce perceptible distortions.
Abstract: Computer-implemented method for estimating user interests, executable by a computing device in communication with an output device, comprising: determining a first input vector corresponding to a first user event and a second input vector corresponding to a second user event; mapping the first input vector to a first output vector and the second input vector to a second output vector in a first multidimensional space using a first vector-mapping module; determining a third input vector based on the first output vector and the second output vector; mapping the third input vector to a third output vector in a second multidimensional space using a second vector-mapping module; determining a message to be provided to a user based on an analysis of at least one of the first output vector and the third output vector; and causing the output device to provide the message to the user. Also a non-transitory computer-readable medium storing program instructions for carrying out the method.
Abstract: A method of processing an audio signal includes determining an average signal-to-noise ratio for the audio signal over time. The method includes determining a formant-sharpening factor based on the determined average signal-to-noise ratio. The method also includes applying a filter that is based on the determined formant-sharpening factor to a codebook vector that is based on information from the audio signal.
Type:
Grant
Filed:
September 13, 2013
Date of Patent:
August 8, 2017
Assignee:
QUALCOMM Incorporated
Inventors:
Venkatraman S. Atti, Vivek Rajendran, Venkatesh Krishnan
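The SNR-dependent formant sharpening in the abstract above can be sketched with a standard formant-emphasis filter A(z/g1)/A(z/g2) applied to the codebook vector, where the factor g1 is derived from the average SNR. The SNR-to-gamma mapping and the constants here are illustrative assumptions, not the values used in the patent.

```python
import numpy as np

def sharpen_codebook_vector(vec, lpc, avg_snr_db):
    """Filter a codebook vector with A(z/g1)/A(z/g2), g1 <= g2, where
    g1 is mapped from the average SNR (illustrative mapping: more
    sharpening at low SNR, less in clean conditions)."""
    gamma2 = 0.9
    gamma1 = float(np.clip(0.75 - 0.005 * avg_snr_db, 0.2, 0.75))
    num = lpc * gamma1 ** np.arange(len(lpc))    # coefficients of A(z/g1)
    den = lpc * gamma2 ** np.arange(len(lpc))    # coefficients of A(z/g2)
    out = np.zeros_like(vec, dtype=float)
    for n in range(len(vec)):                    # direct-form IIR filtering
        acc = sum(num[k] * vec[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * out[n - k] for k in range(1, len(den)) if n - k >= 0)
        out[n] = acc / den[0]
    return out
```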
Abstract: In some embodiments, a server establishes communication with user devices; the server receives location data identifying a location of each user device; the server determines a geographical region based on the location of each user device, each geographical region being different for user devices not at the same location; the server receives an audio stream from the first of the user devices; the server determines which of the other user devices are within the geographical region of the first user device; and the server transmits the audio stream to the other user devices that are within the geographical region of the first user device. In some embodiments, the server creates altered audio streams based on the distances between user devices. In some embodiments, the server changes the number of the other user devices that are within the geographical region of the first user device.
Abstract: An integrated sensor-array processor and method includes sensor array time-domain input ports to receive sensor signals from time-domain sensors. A sensor transform engine (STE) creates sensor transform data from the sensor signals and applies sensor calibration adjustments. Transducer time-domain input ports receive time-domain transducer signals, and a transducer output transform engine (TTE) generates transducer output transform data from the transducer signals. A spatial filter engine (SFE) applies suppression coefficients to the sensor transform data, to suppress target signals received from noise locations and/or amplification locations. A blocking filter engine (BFE) applies subtraction coefficients to the sensor transform data, to subtract the target signals from the sensor transform data. A noise reduction filter engine (NRE) subtracts noise signals from the BFE output. An inverse transform engine (ITE) generates time-domain data from the NRE output.
Abstract: Provided are methods and systems for providing situation-dependent transient noise suppression for audio signals. Different strategies (e.g., levels of aggressiveness) of transient suppression and signal restoration are applied to audio signals associated with participants in a video/audio conference depending on whether or not each participant is speaking (e.g., whether a voiced segment or an unvoiced/non-speech segment of audio is present). If no participants are speaking or there is an unvoiced/non-speech sound present, a more aggressive strategy for transient suppression and signal restoration is utilized. On the other hand, where voiced audio is detected (e.g., a participant is speaking), the methods and systems apply a softer, less aggressive suppression and restoration process.
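The situation-dependent suppression strategy above reduces to choosing a gain per frame based on voice activity: a hard gain when no participant is speaking, a softer one during voiced audio. The specific gain values and the per-frame formulation below are illustrative assumptions.

```python
import numpy as np

def suppress_transients(frames, voiced, transient, soft=0.6, hard=0.05):
    """Scale frames flagged as containing a transient: a hard
    (aggressive) gain when no speech is present, a softer gain while
    a participant is speaking (gain values are illustrative)."""
    out = frames.astype(float).copy()
    for i in range(len(out)):
        if transient[i]:
            out[i] *= soft if voiced[i] else hard
    return out
```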
Abstract: An apparatus and a method for integrally encoding and decoding a speech signal and an audio signal. The encoding apparatus may include: an input signal analyzer to analyze a characteristic of an input signal; a first conversion encoder to convert the input signal to a frequency domain signal and to encode the input signal when the input signal is an audio characteristic signal; a Linear Predictive Coding (LPC) encoder to perform LPC encoding of the input signal when the input signal is a speech characteristic signal; a frequency band expander conforming to the spectral band replication (SBR) standard, which expands the frequency band of the input signal and whose output is transmitted to either the first conversion encoder or the LPC encoder based on the input characteristic; and a bitstream generator to generate a bitstream using an output signal of the first conversion encoder and an output signal of the LPC encoder.
Type:
Grant
Filed:
January 26, 2015
Date of Patent:
July 18, 2017
Assignees:
ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE, KWANGWOON UNIVERSITY INDUSTRY-ACADEMIC COLLABORATION FOUNDATION
Inventors:
Tae Jin Lee, Seung Kwon Baek, Min Je Kim, Dae Young Jang, Jeongil Seo, Kyeongok Kang, Jin Woo Hong, Hochong Park, Young Cheol Park
Abstract: Apparatus, system and method for encoding and/or decoding persistent universal media identification (ID) codes embedded in audio. For encoding, a persistent identifier code is generated or received from a registry database, where the code includes data for uniquely identifying a media object. Audio code components including frequency characteristics are generated to represent symbols of the persistent identifier code and the audio code components are psychoacoustically embedded into an audio portion of the media object to include the persistent identifier code within one or more of a plurality of encoding layers. Such embedded audio may be subsequently decoded by transforming the audio data into a frequency domain and processing the transformed audio data to detect the persistent identifier code.
Abstract: A display apparatus includes: a display device; a display device driver which drives the display device; a compression section which generates compressed data by compression processing performed on image data; and a transmission section which, when receiving the compressed data from the compression section, transmits the compressed data to the display device driver by using a serial data signal. The compression section performs the compression processing with a data compression ratio selected in response to a frame rate with which the display device driver drives the display device. The display device driver receives the serial data signal from the transmission section, generates decompressed data by decompressing the compressed data transmitted by the serial data signal, and drives the display device in response to the decompressed data.
Abstract: A system and method is presented for performing dual mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition operations on the query, returning a transcription and confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, then the system accepts the result having the higher confidence score. If only one source succeeds, then that result is accepted. In either case, if the remote recognition engine does succeed in transcribing the query, then a client vocabulary is updated if the remote system result includes information not present in the client vocabulary.
Type:
Grant
Filed:
March 30, 2016
Date of Patent:
June 27, 2017
Assignee:
SoundHound, Inc.
Inventors:
Timothy Stonehocker, Keyvan Mohajer, Bernard Mont-Reynaud
Abstract: A particular method includes encoding a first frame of an audio signal using a first encoder. The method also includes generating, during encoding of the first frame, a baseband signal that includes content corresponding to a high band portion of the audio signal. The method further includes encoding a second frame of the audio signal using a second encoder, where encoding the second frame includes processing the baseband signal to generate high band parameters associated with the second frame.
Type:
Grant
Filed:
March 27, 2015
Date of Patent:
June 20, 2017
Assignee:
QUALCOMM Incorporated
Inventors:
Venkatraman S. Atti, Venkatesh Krishnan
Abstract: Improved methods for coding an ensemble of pulse vectors utilize statistical models (i.e., probability models) for the ensemble of pulse vectors, to more efficiently code each pulse vector of the ensemble. At least one pulse parameter describing the non-zero pulses of a given pulse vector is coded using the statistical models and the number of non-zero pulse positions for the given pulse vector. In some embodiments, the number of non-zero pulse positions are coded using range coding. The total number of unit magnitude pulses may be coded using conditional (state driven) bitwise arithmetic coding. The non-zero pulse position locations may be coded using adaptive arithmetic coding. The non-zero pulse position magnitudes may be coded using probability-based combinatorial coding, and the corresponding sign information may be coded using bitwise arithmetic coding. Such methods are well suited to coding non-independent-identically-distributed signals, such as coding video information.
Abstract: The present invention provides a bandwidth extension method and apparatus. The method includes: acquiring a bandwidth extension parameter, where the bandwidth extension parameter includes one or more of the following parameters: a linear predictive coefficient (LPC), a line spectral frequency (LSF) parameter, a pitch period, a decoding rate, an adaptive codebook contribution, and an algebraic codebook contribution; and performing, according to the bandwidth extension parameter, bandwidth extension on a decoded low-frequency signal, to obtain a high frequency band signal. The high frequency band signal recovered by using the bandwidth extension method and apparatus in the embodiments of the present invention is close to an original high frequency band signal, and the quality is satisfactory.
Abstract: A system and method for detecting and correcting bit errors in received packets is disclosed. The presence of bit errors in a received packet are detected using CRC bits carried in the received packet. One or more erroneous bits may be identified in a header of the packet. The erroneous bits are corrected by setting the erroneous bits to match the expected bit settings. The corrected packet is then error-checked using the CRC bits. Errors may be detected in two sequential packets where a second packet is a retransmission of a first packet. Differing bits are identified in the two sequential packets. A packet is modified to include additional combinations of the differing bits and then error-checked with each combination of the differing bits. If a modified packet passes error checking, then process the modified packet.
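The CRC-based detect-and-correct flow above can be illustrated with a small sketch: detect the error with a CRC, then try candidate bit corrections in the header and re-check each candidate against the CRC. The patent corrects by setting header bits to their expected values; this sketch instead brute-forces single-bit flips, a simplification, and the CRC-8 polynomial is an illustrative choice.

```python
def crc8(data: bytes, poly=0x07, init=0x00) -> int:
    """Bitwise CRC-8 (polynomial x^8+x^2+x+1), used only to
    illustrate the error check."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def correct_header(packet: bytearray, header_len: int, crc: int):
    """Try single-bit flips in the header until the packet passes the
    CRC check; return the corrected packet, or None on failure."""
    if crc8(bytes(packet)) == crc:
        return packet                      # no error detected
    for bit in range(header_len * 8):
        trial = bytearray(packet)
        trial[bit // 8] ^= 1 << (bit % 8)  # flip one candidate bit
        if crc8(bytes(trial)) == crc:
            return trial
    return None
```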
Abstract: A communication system comprising a server capable of establishing telephone communications between at least two users of a communication network and of transcribing audio and/or voice communication signals of either or both users of a telephone call established by the server. The transcription is performed in accordance with user-selectable feature information entered by the users during a registration procedure at a website residing in a registration server in communication with the communication system. The communication system may be part of the PSTN, the Internet, or both.
Abstract: A system has multiple audio-enabled devices that communicate with one another over an open microphone mode of communication. When a user says a trigger word, the nearest device validates the trigger word and opens a communication channel with another device. As the user talks, the device receives the speech and generates an audio signal representation that includes the user speech and may additionally include other background or interfering sound from the environment. The device transmits the audio signal to the other device as part of a conversation, while continually analyzing the audio signal to detect when the user stops talking. This analysis may include watching for a lack of speech in the audio signal for a period of time, or an abrupt change in context of the speech (indicating the speech is from another source), or canceling noise or other interfering sound to isolate whether the user is still speaking.
Abstract: The present disclosure provides techniques for adjusting a temporal gain parameter and for adjusting linear prediction coefficients. A value of the temporal gain parameter may be based on a comparison of a synthesized high-band portion of an audio signal to a high-band portion of the audio signal. If a signal characteristic of an upper frequency range of the high-band portion satisfies a first threshold, the temporal gain parameter may be adjusted. A linear prediction (LP) gain may be determined based on an LP gain operation that uses a first value for an LP order. The LP gain may be associated with an energy level of an LP synthesis filter. The LP order may be reduced if the LP gain satisfies a second threshold.
Abstract: A method for a machine or group of machines to watermark an audio signal includes receiving an audio signal and a watermark signal including multiple symbols, and inserting at least some of the multiple symbols in multiple spectral channels of the audio signal, each spectral channel corresponding to a different frequency range. Optimization of the design minimizes the audibility of the watermark channels to the human auditory system by taking into account perceptual time-frequency masking, pattern detection of watermarking messages, and the statistics of worst-case program content such as speech and speech-like programs.
Abstract: A method for speech retrieval includes acquiring a keyword designated by a character string, and a phoneme string or a syllable string, detecting one or more coinciding segments by comparing a character string that is a recognition result of word speech recognition with words as recognition units performed for speech data to be retrieved and the character string of the keyword, calculating an evaluation value of each of the one or more segments by using the phoneme string or the syllable string of the keyword to evaluate a phoneme string or a syllable string that is recognized in each of the detected one or more segments and that is a recognition result of phoneme speech recognition with phonemes or syllables as recognition units performed for the speech data, and outputting a segment in which the calculated evaluation value exceeds a predetermined threshold.
Abstract: A method for encoding encrypted data for further processing includes: receiving an input data vector of length m; splitting the input data vector into k multiple vectors; multiplying each of the multiple vectors by a power of 2 to obtain k intermediate vectors; summing the k intermediate vectors to obtain a single summed vector; encrypting the single summed vector to obtain an encrypted vector; sending the encrypted vector to an operational unit to have the encrypted vector operated on to obtain a processed encrypted vector; receiving the processed encrypted vector; decrypting the received encrypted vector; dividing the processed decrypted vector by a power of 2, modulo a power of 2, to obtain multiple transitional vectors of the same dynamic range and the same length; and concatenating the multiple transitional vectors to obtain a recovered vector of length m.
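The power-of-2 packing and recovery steps above amount to placing each sub-vector in its own bit field of a single vector. This sketch shows only that pack/unpack arithmetic; the encryption, processing, and decryption steps are omitted, and it assumes non-negative integer values that fit in the chosen bit width.

```python
import numpy as np

def pack(vectors, shift_bits):
    """Pack k small-range integer vectors into one by multiplying each
    by a distinct power of 2 (a left shift) and summing, so each
    sub-vector occupies its own bit field."""
    return sum(v << (i * shift_bits) for i, v in enumerate(vectors))

def unpack(summed, k, shift_bits):
    """Recover the k vectors: divide by the power of 2 (right shift),
    then take modulo 2**shift_bits (mask)."""
    mask = (1 << shift_bits) - 1
    return [(summed >> (i * shift_bits)) & mask for i in range(k)]
```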
Abstract: Techniques for providing self-voice feedback in a communications headset include processing signals carrying near-end speech in parallel digital and analog signal processing paths to produce a combined gain-adjusted near-end signal carrying the near-end speech for output to transducers of the communications device.
Type:
Grant
Filed:
June 13, 2014
Date of Patent:
April 11, 2017
Assignee:
Bose Corporation
Inventors:
Shaun Cassidy Keller, Paul D. Gjeltema, Jordan Jeffery Bonner, Simon G. Ravvin
Abstract: A processing circuit for a digital sensor, including: a control stage, which generates a control signal; a multiplexing stage, which may be electrically coupled to a plurality of sensing structures for receiving corresponding detection signals and generates a multiplexed signal, on the basis of one of the detection signals, as a function of the control signal; an analog-to-digital conversion stage, which is connected to the multiplexing stage and generates an encoded signal on the basis of the multiplexed signal; and an equalizer, which multiplies the encoded signal by a coefficient that depends upon the control signal.
Abstract: An apparatus, system, and method are provided for energy conversion. For example, the apparatus can include a trans-impedance node, a reactive element, and a trans-impedance circuit. The reactive element can be configured to transfer energy to the trans-impedance node. The trans-impedance circuit can be configured to receive one or more control signals and to dynamically adjust an impedance of the trans-impedance node. The trans-impedance node, as a result, can operate as an RF power switching supply based on the one or more control signals.
Abstract: Systems and methods for generating charging data records for commercial group-based messages broadcast by a telecommunications network to a group of subscribers are provided. In one aspect, a network element of a telecommunications network is configured to identify individual user devices that successfully received and processed a message that was broadcast over the telecommunications network to a target group of user devices. The identification of the particular devices in the target group of devices that successfully received and processed the broadcast message is used for generating sender based charging or sender-plus-receiver based charging data records for the subscribers of the telecommunication network.
Abstract: An encoding concept which is linear prediction based and uses spectral domain noise shaping is rendered less complex at a comparable coding efficiency in terms of, for example, rate/distortion ratio, by using the spectral decomposition of the audio input signal into a spectrogram having a sequence of spectra for both linear prediction coefficient computation and spectral domain shaping based on the linear prediction coefficients. The coding efficiency may be maintained even if the spectral decomposition uses a lapped transform that causes aliasing and necessitates time-aliasing cancellation, such as a critically sampled lapped transform, e.g., an MDCT.
Type:
Grant
Filed:
August 14, 2013
Date of Patent:
March 14, 2017
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Goran Markovic, Guillaume Fuchs, Nikolaus Rettelbach, Christian Helmrich, Benjamin Schubert
Abstract: A parametric stereo upmix method for generating a left signal and a right signal from a mono downmix signal based on spatial parameters includes predicting a difference signal comprising a difference between the left signal and the right signal based on the mono downmix signal scaled with a prediction coefficient. The prediction coefficient is derived from the spatial parameters. The method further includes deriving the left signal and the right signal based on a sum and a difference of the mono downmix signal and said difference signal.
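The prediction-based upmix described above can be sketched directly: the side (difference) signal is predicted as the mono downmix scaled by a prediction coefficient, and left/right are formed as sum and difference. Real parametric stereo decoders add a decorrelated component as well; this sketch keeps only the predicted part, and the scaling convention is an assumption.

```python
import numpy as np

def upmix(mono, pred_coeff):
    """Generate left/right from a mono downmix: predict the L-R
    difference signal by scaling the downmix with a prediction
    coefficient (derived from spatial parameters), then take the sum
    and difference."""
    side = pred_coeff * mono     # predicted difference (L - R) signal
    left = mono + side
    right = mono - side
    return left, right
```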
Abstract: An encoding apparatus comprises a frame processor (105) which receives a multi-channel audio signal comprising at least a first audio signal from a first microphone (101) and a second audio signal from a second microphone (103). An ITD processor (107) then determines an inter-time difference between the first audio signal and the second audio signal, and a set of delays (109, 111) generates a compensated multi-channel audio signal from the multi-channel audio signal by delaying at least one of the first and second audio signals in response to the inter-time difference. A combiner (113) then generates a mono signal by combining channels of the compensated multi-channel audio signal, and a mono signal encoder (115) encodes the mono signal. The inter-time difference may specifically be determined by an algorithm based on determining cross-correlations between the first and second audio signals.
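The cross-correlation-based inter-time-difference estimation and delay compensation above can be sketched as a lag search followed by alignment and downmix. This is an illustrative sketch (circular shifts stand in for true delay lines, and the lag range is an assumption).

```python
import numpy as np

def encode_mono(ch1, ch2, max_delay=32):
    """Estimate the inter-time difference via a cross-correlation lag
    search, delay-compensate the second channel, and combine both
    channels into a mono signal."""
    best_lag, best_val = 0, -np.inf
    for lag in range(-max_delay, max_delay + 1):
        # Correlate over the interior samples to avoid edge effects.
        val = np.sum(ch1[max_delay:-max_delay] *
                     np.roll(ch2, lag)[max_delay:-max_delay])
        if val > best_val:
            best_lag, best_val = lag, val
    aligned = np.roll(ch2, best_lag)   # compensate the estimated ITD
    mono = 0.5 * (ch1 + aligned)       # downmix of the compensated channels
    return mono, best_lag
```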
Abstract: In accordance with an implementation of the disclosure, systems and methods are provided for providing an estimate for noise in a speech signal. An instantaneous power value is received that corresponds to a frequency index of a portion of the speech signal. A first weighted power value is updated based on the instantaneous power value and a first weighting parameter. A second weighted power value is updated based on the first weighted power value and a second weighting parameter. An estimate of the noise is computed from the instantaneous power value and the second weighted power value.
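The two-stage update above corresponds to cascaded exponential smoothers per frequency bin. The abstract does not specify how the instantaneous and twice-smoothed powers are combined; taking the minimum of the two, as below, is one plausible choice (it discounts speech peaks), and the smoothing constants are assumptions.

```python
def update_noise_estimate(inst_power, state, alpha1=0.9, alpha2=0.99):
    """One update of a two-stage noise estimator for a single
    frequency bin: the first smoother tracks the instantaneous power,
    the second smooths the first; the estimate takes the smaller of
    the instantaneous and twice-smoothed powers."""
    p1, p2 = state
    p1 = alpha1 * p1 + (1 - alpha1) * inst_power   # first weighted power
    p2 = alpha2 * p2 + (1 - alpha2) * p1           # second weighted power
    noise = min(inst_power, p2)
    return noise, (p1, p2)
```

Called once per frame and per bin, with `state` carried between calls.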
Abstract: A method, system, and computer program product for enablement of a phone conversation. The method includes receiving a combined signal including an interference signal and a first voice signal from a first user having a communication with a second user. The interference signal can be used to prevent the first voice signal from being overheard by people near the first user. The first voice signal can be extracted from the combined signal based at least in part on the interference signal and transmitting the extracted first voice signal to the second user. The system and computer program product are also provided.
Type:
Grant
Filed:
October 16, 2015
Date of Patent:
February 7, 2017
Assignee:
INTERNATIONAL BUSINESS MACHINES CORPORATION
Inventors:
Feng Cao, Xing Zhi Sun, Jianbin Tang, Yini Wang
Abstract: Technologies for measuring a data throughput rate of a link typically used for transferring media catalogs and media between a media provider and a UPnP Control Point.
Abstract: A system, article, and method of acoustic signal mixing comprises use of a total pair count, i.e., the number of addition coefficients and subtraction coefficients in a mixture configuration, which is used with a function applied to a frame of acoustic samples to determine an outcome.
Type:
Grant
Filed:
December 9, 2014
Date of Patent:
January 17, 2017
Assignee:
Intel Corporation
Inventors:
Phani Kumar Nyshadham, Niranjan Avadhanam, Shivakumar D R
Abstract: The present invention relates to an audio signal coding method and apparatus. The method includes: categorizing audio signals into high-frequency audio signals and low-frequency audio signals; coding the low-frequency audio signals by using a corresponding low-frequency coding manner according to characteristics of low-frequency audio signals; and selecting a bandwidth extension mode to code the high-frequency audio signals according to the low-frequency coding manner and/or characteristics of the audio signals.
Abstract: Methods, systems, and terminal devices for transmitting information are provided. An exemplary system includes a sending end and at least one receiving end. The sending end is configured to obtain audio data to be transmitted, encode the obtained audio data according to an M-bit unit length, and use a pre-set cross-platform audio interface to control an audio outputting device of the sending end to send the encoded audio data to the at least one receiving end. The M-bit unit length is an encoding length corresponding to each frequency of a number N of frequencies, N is greater than or equal to 2, and M is greater than 0. The at least one receiving end is configured to use the pre-set cross-platform audio interface to control an audio inputting device of the at least one receiving end to receive the encoded audio data.
Type:
Grant
Filed:
February 6, 2015
Date of Patent:
November 29, 2016
Assignee:
TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED
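The encoding described in the abstract above, where each M-bit unit maps to one of N frequencies (N ≥ 2), can be sketched as a simple tone mapper. The tone table, symbol duration, and sample rate below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def encode_symbols(data: bytes, m: int, freqs, sample_rate=44100, symbol_len=0.05):
    """Map each M-bit group in `data` to one of N tones (N >= 2**m) and
    synthesize the resulting audio; parameters are illustrative."""
    assert len(freqs) >= 2 ** m
    bits = ''.join(f'{b:08b}' for b in data)
    bits += '0' * (-len(bits) % m)  # pad so the bit string splits into M-bit units
    symbols = [int(bits[i:i + m], 2) for i in range(0, len(bits), m)]
    t = np.arange(int(sample_rate * symbol_len)) / sample_rate
    audio = np.concatenate([np.sin(2 * np.pi * freqs[s] * t) for s in symbols])
    return audio, symbols
```

A receiving end would run the inverse mapping, detecting the dominant tone in each symbol window and emitting the corresponding M bits.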
Abstract: Features described herein relate to providing the capability to play back audiovisual content in a comprehensible manner at a rate adjustable by the viewer. For example, if a viewer wishes to watch a one-hour news program but has only thirty minutes to view it, playback of the program at twice the rate, while remaining comprehensible, is provided. To provide the playback of the video at the adjustable rate, substitute audio is generated by adding or removing audio content without changing the playback rate of the audio. The video at the adjusted playback rate and the substitute audio at the normal playback rate may have the same duration and, in some embodiments, may be presented synchronously.
Abstract: An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses the prediction capability to provide an automatic acoustic-driven template-versus-model decision maker with an output quality that is high, stable, and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material.
Type:
Grant
Filed:
September 7, 2012
Date of Patent:
November 1, 2016
Assignee:
Nuance Communications, Inc.
Inventors:
Alexander Sorin, Slava Shechtman, Vincent Pollet
Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for building an automatic speech recognition system through an Internet API. A network-based automatic speech recognition server configured to practice the method receives feature streams, transcriptions, and parameter values as inputs from a network client independent of knowledge of internal operations of the server. The server processes the inputs to train an acoustic model and a language model, and transmits the acoustic model and the language model to the network client. The server can also generate a log describing the processing and transmit the log to the client. On the server side, a human expert can intervene to modify how the server processes the inputs. The inputs can include an additional feature stream generated from speech by algorithms in the client's proprietary feature extraction.
Type:
Grant
Filed:
November 23, 2010
Date of Patent:
November 1, 2016
Assignee:
AT&T Intellectual Property I, L.P.
Inventors:
Enrico Bocchieri, Dimitrios Dimitriadis, Horst J. Schroeter
Abstract: In accordance with an example embodiment of the present invention, disclosed is a method and an apparatus for voice activity detection (VAD). The VAD comprises creating a signal indicative of a primary VAD decision and determining hangover addition. The determination on hangover addition is made in dependence on a short-term activity measure and/or a long-term activity measure. A signal indicative of a final VAD decision is then created.
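The hangover mechanism described above can be sketched as follows: after the primary VAD goes inactive, activity is held for a number of extra frames, and that number is set from short- and long-term activity measures. The window sizes, thresholds, and hangover lengths here are illustrative assumptions, not the patented values.

```python
from collections import deque

class HangoverVAD:
    """Sketch of VAD hangover control driven by short/long-term activity."""

    def __init__(self, short_win=10, long_win=100):
        self.short = deque(maxlen=short_win)   # recent primary decisions
        self.long = deque(maxlen=long_win)     # longer history of decisions
        self.hangover = 0                      # frames of activity still to hold

    def decide(self, primary: bool) -> bool:
        self.short.append(primary)
        self.long.append(primary)
        short_act = sum(self.short) / len(self.short)
        long_act = sum(self.long) / len(self.long)
        if primary:
            # grant a longer hangover when recent activity has been high
            self.hangover = 8 if (short_act > 0.6 or long_act > 0.4) else 2
            return True
        if self.hangover > 0:
            self.hangover -= 1
            return True   # final decision stays active during hangover
        return False
```

The point of the hangover is to avoid clipping trailing speech: the final decision stays active for a few frames after the primary detector drops out.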
Abstract: An audio encoder has a window function controller, a windower, a time warper with a final quality check functionality, a time/frequency converter, a TNS stage, or a quantizer encoder. The window function controller, the time warper, the TNS stage, or an additional noise filling analyzer are controlled by signal analysis results obtained by a time warp analyzer or a signal classifier. Furthermore, a decoder applies a noise filling operation using a manipulated noise filling estimate depending on a harmonic or speech characteristic of the audio signal.
Type:
Grant
Filed:
November 11, 2014
Date of Patent:
October 11, 2016
Assignee:
Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V.
Inventors:
Stefan Bayer, Sascha Disch, Ralf Geiger, Guillaume Fuchs, Max Neuendorf, Gerald Schuller, Bernd Edler
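The decoder-side noise filling mentioned in the abstract above can be illustrated with a minimal sketch: zero-quantized spectral bins are filled with noise, and the injected level is manipulated according to a harmonicity measure. The linear harmonicity weighting below is an illustrative stand-in, not the patent's manipulation rule.

```python
import numpy as np

def noise_fill(spectrum, noise_level, harmonicity, rng=None):
    """Fill zero-quantized bins with noise, scaled down for harmonic or
    speech-like frames (harmonicity in [0, 1]; weighting is illustrative)."""
    rng = np.random.default_rng(0) if rng is None else rng
    gain = noise_level * (1.0 - harmonicity)  # less noise for harmonic frames
    out = spectrum.copy().astype(float)
    zeros = out == 0.0                        # bins the quantizer set to zero
    out[zeros] = gain * rng.standard_normal(zeros.sum())
    return out
```

Non-zero coefficients are left untouched; only the quantization holes receive noise, which is why a fully harmonic frame (gain 0) passes through unchanged.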
Abstract: The present disclosure provides a voice recognition method for use in an electronic apparatus comprising a voice input module. The method comprises: receiving voice data by the voice input module; performing a first pattern voice recognition on the received voice data, including identifying whether the voice data comprises first voice recognition information; performing a second pattern voice recognition on the voice data if the voice data comprises the first voice recognition information; and performing or refusing an operation corresponding to the first voice recognition information according to a result of the second pattern voice recognition. The present disclosure also provides a voice controlling method, an information processing method, and an electronic apparatus.
Abstract: A coding method, a decoding method, a coder, and a decoder are disclosed herein. A coding method includes: obtaining the pulse distribution, on a track, of the pulses to be encoded on the track; determining a distribution identifier for identifying the pulse distribution according to the pulse distribution; and generating a coding index that includes the distribution identifier. A decoding method includes: receiving a coding index; obtaining a distribution identifier from the coding index, wherein the distribution identifier is configured to identify the pulse distribution, on a track, of the pulses to be encoded on the track; determining the pulse distribution, on a track, of all the pulses to be encoded on the track according to the distribution identifier; and reconstructing the pulse order on the track according to the pulse distribution.
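The distribution-identifier idea in the abstract above can be illustrated by enumerating every possible pulse distribution (a multiset of positions) on a track and using the enumeration index as the identifier. This exhaustive table is only an illustration of the encode/decode round trip, not the patent's indexing scheme.

```python
from itertools import combinations_with_replacement

def build_codebook(track_len, n_pulses):
    """Enumerate all pulse distributions on a track of `track_len` positions;
    each distribution's index serves as its identifier (illustrative only)."""
    dists = sorted(combinations_with_replacement(range(track_len), n_pulses))
    return {d: i for i, d in enumerate(dists)}, dists

def encode(positions, codebook):
    """Map a set of pulse positions to its distribution identifier."""
    return codebook[tuple(sorted(positions))]

def decode(identifier, dists):
    """Reconstruct the pulse positions on the track from the identifier."""
    return list(dists[identifier])
```

Because the enumeration covers exactly the valid distributions, the identifier needs only ceil(log2(len(dists))) bits, which is the kind of compaction such a coding index is after.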
Abstract: The present invention relates to a method and a background estimator in a voice activity detector for updating a background noise estimate for an input signal. The input signal for a current frame is received, and it is determined whether the current frame of the input signal comprises non-noise. Further, an additional determination is performed as to whether the current frame of the non-noise input comprises noise, by analyzing characteristics related at least to the correlation and energy level of the input signal, and the background noise estimate is updated if it is determined that the current frame comprises noise.
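The gated update described above can be sketched as a one-pole smoother that only runs when the frame looks like noise: low correlation (no pitch structure) and energy close to the current estimate. The smoothing factor and both thresholds are illustrative assumptions, not the patented criteria.

```python
def update_noise_estimate(noise_est, frame_energy, correlation,
                          alpha=0.9, corr_thresh=0.7, energy_margin=2.0):
    """Update the background noise estimate only for noise-like frames.
    Returns the (possibly updated) estimate and the noise decision."""
    is_noise = (correlation < corr_thresh
                and frame_energy < energy_margin * noise_est)
    if is_noise:
        # exponential smoothing toward the current frame energy
        noise_est = alpha * noise_est + (1 - alpha) * frame_energy
    return noise_est, is_noise
```

Freezing the estimate during speech-like frames is what keeps the noise floor from being dragged upward by voiced segments.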