Xuejing Sun has filed for patents to protect the following inventions. This listing includes patent applications that are pending as well as patents that have already been granted by the United States Patent and Trademark Office (USPTO).
Abstract: Embodiments of the present invention relate to video content assisted audio object extraction. A method of audio object extraction from channel-based audio content is disclosed. The method comprises extracting at least one video object from video content associated with the channel-based audio content, and determining information about the at least one video object. The method further comprises extracting from the channel-based audio content an audio object to be rendered as an upmixed audio signal based on the determined information. Corresponding system and computer program product are also disclosed.
Abstract: The disclosure relates to handling nuisances in a teleconference system. An endpoint device (400) for use in a teleconference includes an acquiring unit (401), a judging unit (402), a controller (403) and a processing unit (404). The acquiring unit acquires a media stream for presentation in the teleconference and receives information from another device. The information includes a first estimation of whether the media stream is a nuisance to the teleconference. A nuisance to a teleconference is an audio or video signal that users perceive as irrelevant to the conference session or as causing unpleasantness or confusion. The judging unit decides whether the media stream is a nuisance based at least on the information. If the media stream is judged to be a nuisance, the controller controls the processing of the media stream to degrade or suppress its presentation. The processing unit processes the media stream under the control of the controller.
Abstract: A method of determining a near-optimal forward error correction scheme for the transmission of audio data over a lossy packet-switched network having preallocated estimated bandwidth, delay and packet losses, between at least a first and a second communication device, the method including the steps of: determining a first coding rate for the audio data; determining a peak redundancy coding rate for redundant versions of the audio data; determining an average redundancy coding rate over a period of time for redundant versions of the audio data; determining an objective function which maximizes a bitrate-to-perceptual-audio-quality mapping of the transmitted audio data, including a playout function formulation; and optimizing the objective function to produce a forward error correction scheme providing high perceptual audio quality for the bitrate.
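A minimal sketch of the final optimization step, assuming the candidate redundancy schemes and their expected bitrate-to-quality scores (e.g. MOS estimates) have already been tabulated. The scheme names, rates, scores and budgets below are illustrative stand-ins for the patent's objective function, not values from the patent:

```python
def best_fec_scheme(schemes, quality, peak_budget_kbps, avg_budget_kbps):
    """Pick the redundancy scheme with the best expected perceptual
    quality among those whose peak and average redundancy coding rates
    fit the preallocated bandwidth. `schemes` lists
    (name, peak_kbps, avg_kbps); `quality` maps names to an expected
    bitrate-to-quality score (e.g. a MOS estimate)."""
    feasible = [s for s in schemes
                if s[1] <= peak_budget_kbps and s[2] <= avg_budget_kbps]
    # Maximizing the quality score plays the role of the objective function.
    return max(feasible, key=lambda s: quality[s[0]])[0]
```

With these illustrative numbers, the double-redundancy scheme is infeasible under the average-rate budget, so the single-redundancy scheme wins despite its lower quality score.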
Abstract: Various disclosed implementations involve processing and/or playback of a recording of a conference involving a plurality of conference participants. Some implementations disclosed herein involve analyzing conversational dynamics of the conference recording. Some examples may involve searching the conference recording to determine instances of segment classifications. The segment classifications may be based, at least in part, on conversational dynamics data. Some implementations may involve segmenting the conference recording into a plurality of segments, each of the segments corresponding with a time interval and at least one of the segment classifications. Some implementations allow a listener to scan through a conference recording quickly according to segments, words, topics and/or talkers of interest.
Abstract: Some implementations involve controlling a jitter buffer size during a teleconference according to a jitter buffer size estimation algorithm based, at least in part, on a cumulative distribution function (CDF). The CDF may be based, at least in part, on a network jitter parameter. The CDF may be initialized according to a parametric model. At least one parameter of the parametric model may be based, at least in part, on legacy network jitter information.
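The CDF-driven buffer sizing could be sketched as follows, assuming an exponential parametric model whose mean is seeded from legacy network jitter information; the exponential model, millisecond grid and 95% quantile are illustrative assumptions:

```python
import math

def init_cdf_exponential(mean_jitter_ms, max_delay_ms=200, step_ms=5):
    """Initialize the jitter CDF from a parametric model; an exponential
    model is assumed here, with its mean taken from legacy network
    jitter information."""
    return {d: 1.0 - math.exp(-d / mean_jitter_ms)
            for d in range(0, max_delay_ms + 1, step_ms)}

def jitter_buffer_size_ms(cdf, quantile=0.95):
    """Smallest buffer delay whose CDF value reaches the target quantile,
    i.e. the delay expected to absorb that fraction of jitter spikes."""
    for delay in sorted(cdf):
        if cdf[delay] >= quantile:
            return delay
    return max(cdf)  # fall back to the largest modelled delay
```

During a call, the parametric CDF would be updated from measured network jitter; the quantile lookup stays the same.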
Abstract: In a packet-switched voice delivery application which utilizes a jitter buffer for the delivery of sequential packet data, a method of determining a measure of the output jitter with which packets are taken out of the buffer, the method including the step of: (a) forming a pull jitter measure comprising the differential fetch times between sequential pull packets divided by an expected time interval between packets.
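Step (a) translates almost directly into code; the millisecond units are an assumption:

```python
def pull_jitter(fetch_times_ms, expected_interval_ms):
    """Pull jitter measure: the differential fetch time between each pair
    of sequentially pulled packets, divided by the expected packet
    interval. A value of 1.0 means a packet was fetched exactly on
    schedule; values above/below 1.0 indicate late/early pulls."""
    return [(later - earlier) / expected_interval_ms
            for earlier, later in zip(fetch_times_ms, fetch_times_ms[1:])]
```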
Abstract: A method for processing audio data, the method comprising: receiving audio data corresponding to a plurality of instances of audio, including at least one of: (a) audio data from multiple endpoints, recorded separately or (b) audio data from a single endpoint corresponding to multiple talkers and including spatial information for each of the multiple talkers; rendering the audio data in a virtual acoustic space such that each of the instances of audio has a respective different virtual position in the virtual acoustic space; and scheduling the instances of audio to be played back with a playback overlap between at least two of the instances of audio, wherein the scheduling is performed, at least in part, according to a set of perceptually-motivated rules.
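One hypothetical perceptually-motivated rule is to overlap consecutive clips only when the talkers differ, on the grounds that distinct talkers rendered at distinct virtual positions stay intelligible when overlapped. This is an illustrative stand-in for the patent's rule set:

```python
def schedule(clips, overlap_frac=0.5):
    """Lay out (talker, duration_s) clips on a playback timeline,
    overlapping each clip with its predecessor by `overlap_frac` of the
    predecessor's duration only when the talkers differ (a hypothetical
    perceptually-motivated rule)."""
    timeline = []          # entries: (talker, start_s, duration_s)
    next_start = 0.0
    prev_talker = None
    for talker, duration in clips:
        start = next_start
        if timeline and talker != prev_talker:
            # Different talker: pull this clip earlier to overlap.
            start = next_start - overlap_frac * timeline[-1][2]
        timeline.append((talker, start, duration))
        next_start = start + duration
        prev_talker = talker
    return timeline
```

In the usage below, talker B's first clip starts halfway through A's clip, while B's second clip (same talker) is not overlapped.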
Abstract: Example embodiments disclosed herein relate to filter coefficient updating in time domain filtering. A method of processing an audio signal is disclosed. The method includes obtaining a predetermined number of target gains for a first portion of the audio signal by analyzing the first portion of the audio signal. Each of the target gains corresponds to a linear subband of the audio signal. The method also includes determining filter coefficients for time domain filtering of the first portion of the audio signal so as to approximate a frequency response given by the target gains. The filter coefficients are determined by iteratively selecting at least one target gain from the target gains and updating the filter coefficients based on the selected at least one target gain. Corresponding system and computer program product for processing an audio signal are also disclosed.
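A frequency-sampling sketch of the iterative update, with the target gains placed on a uniform DFT bin grid (the grid placement is an assumption; the patent's linear subbands need not coincide with DFT bins). Each iteration selects the target gain with the largest response error and updates the coefficients for that band alone:

```python
import cmath
import math

def response_at(h, k, n):
    """Frequency response of FIR taps h at bin k of an n-point grid."""
    return sum(h[t] * cmath.exp(-2j * math.pi * k * t / n)
               for t in range(len(h)))

def design_filter(target_gains, n_taps, iterations=None):
    """Iteratively update n_taps FIR coefficients toward per-band target
    gains at bins 0..len(target_gains)-1. Each iteration picks the band
    whose response deviates most from its target and adds a cosine basis
    correction for that band (a frequency-sampling sketch, not the
    patented update rule)."""
    n = n_taps
    h = [0.0] * n
    iterations = iterations or len(target_gains)
    for _ in range(iterations):
        # Error of the current response against every target gain.
        errs = [g - response_at(h, k, n).real
                for k, g in enumerate(target_gains)]
        k = max(range(len(errs)), key=lambda i: abs(errs[i]))
        # Cosine basis with unit response at bin k on this grid.
        scale = 1.0 / n if k in (0, n // 2) else 2.0 / n
        for t in range(n):
            h[t] += errs[k] * scale * math.cos(2 * math.pi * k * t / n)
    return h
```

On this uniform grid the cosine bases are orthogonal, so each iteration fixes one band's response without disturbing the others.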
Abstract: Various disclosed implementations involve processing and/or playback of a recording of a conference involving a plurality of conference participants. Some implementations disclosed herein involve receiving speech recognition results data, including a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, for a conference recording. A primary word candidate and alternative word hypotheses may be determined for hypothesized words in the speech recognition lattices. A term frequency metric may be calculated for sorting the primary word candidates and the alternative word hypotheses. Hypothesized words may be rescored according to an alternative hypothesis list.
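A confidence-weighted term frequency is one plausible form of the sorting metric; the weighting scheme below is an assumption, not the patented formula:

```python
from collections import defaultdict

def term_frequency_scores(hypotheses):
    """Pool (word, confidence) pairs from the speech recognition lattices
    and accumulate a confidence-weighted term frequency per word, so
    primary word candidates and alternative hypotheses can be ranked
    together. The confidence weighting is an assumed form of the metric."""
    scores = defaultdict(float)
    for word, confidence in hypotheses:
        scores[word] += confidence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A rescoring pass could then promote alternative hypotheses (e.g. "budget" over the acoustically similar "gadget") whose pooled score dominates.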
Abstract: Various disclosed implementations involve processing and/or playback of a recording of a conference involving a plurality of conference participants. Some implementations disclosed herein involve receiving audio data corresponding to a recording of at least one conference involving a plurality of conference participants. In some examples, only a portion of the received audio data will be selected as playback audio data. The selection process may involve a topic selection process, a talkspurt filtering process and/or an acoustic feature selection process. Some examples involve receiving an indication of a target playback time duration. Selecting the portion of audio data may involve making the time duration of the playback audio data fall within a threshold time difference of the target playback time duration.
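A greedy sketch of selecting playback audio so that its total duration lands within a threshold of the target; the rank-then-greedily-accumulate policy is an illustrative assumption, not the patented selection process:

```python
def select_playback(segments, target_s, threshold_s):
    """Greedily keep the highest-ranked segments until the total
    playback time is within `threshold_s` of `target_s`. `segments`
    holds (rank_score, duration_s) pairs, where the rank could come from
    topic, talkspurt or acoustic-feature selection."""
    chosen, total = [], 0.0
    for score, duration in sorted(segments, reverse=True):
        if total + duration > target_s + threshold_s:
            continue  # this segment would overshoot the allowed window
        chosen.append((score, duration))
        total += duration
        if total >= target_s - threshold_s:
            break  # close enough to the target playback duration
    return chosen, total
```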
Abstract: Example embodiments disclosed herein relate to spatial congruency adjustment. A method for adjusting spatial congruency in a video conference is disclosed. The method involves unwarping a visual scene captured by a video endpoint device into at least one rectilinear scene, the video endpoint device being configured to capture the visual scene in an omnidirectional manner, and detecting spatial congruency between the at least one rectilinear scene and an auditory scene captured by an audio endpoint device that is positioned in relation to the video endpoint device. The spatial congruency is a degree of alignment between the auditory scene and the at least one rectilinear scene. In response to the detected spatial congruency being below a threshold, the spatial congruency is adjusted. Corresponding system and computer program products are also disclosed.
Abstract: Embodiments are described for harmonicity estimation, audio classification, pitch determination and noise estimation. Measuring the harmonicity of an audio signal includes calculating a log amplitude spectrum of the audio signal. A first spectrum is derived by calculating each of its components as a sum of components of the log amplitude spectrum at frequencies which, on a linear frequency scale, are odd multiples of that component's frequency. A second spectrum is derived by calculating each of its components as a sum of components of the log amplitude spectrum at frequencies which, on a linear frequency scale, are even multiples of that component's frequency. A difference spectrum is derived by subtracting the first spectrum from the second spectrum. A measure of harmonicity is generated as a monotonically increasing function of the maximum component of the difference spectrum within a predetermined frequency range.
Abstract: Example embodiments disclosed herein relate to audio signal processing based on remote user control. A method of processing an audio signal in an audio sender device is disclosed. The method includes receiving, at a current device, a control parameter from a remote device, the control parameter being generated based on a user input of the remote device and specifying a user preference for an audio signal to be transmitted to the remote device. The method also includes processing the audio signal based on the received control parameter and transmitting the processed audio signal to the remote device. Corresponding computer program product of processing an audio signal and corresponding device are also disclosed. Corresponding method in an audio receiver device and computer program product of processing an audio signal as well as corresponding device are also disclosed.
Abstract: Example embodiments disclosed herein relate to separated audio analysis and processing. A system for processing an audio signal is disclosed. The system includes an audio analysis module configured to analyze an input audio signal to determine a processing parameter for the input audio signal, the input audio signal being represented in time domain. The system also includes an audio processing module configured to process the input audio signal in parallel with the audio analysis module. The audio processing module includes a time domain filter configured to filter the input audio signal to obtain an output audio signal in the time domain, and a filter controller configured to control a filter coefficient of the time domain filter based on the processing parameter determined by the audio analysis module. Corresponding method and computer program product of processing an audio signal are also disclosed.
Abstract: Example embodiments disclosed herein relate to user experience oriented audio signal processing. There is provided a method for user experience oriented audio signal processing. The method includes obtaining a first audio signal from an audio sensor of an electronic device; computing, based on the first audio signal, a compensation factor for an acoustic path from the electronic device to a listener and applying the compensation factor to a second audio signal outputted from the electronic device. Corresponding system and computer program products are disclosed.
Abstract: The present application provides an acoustic echo mitigation apparatus and method, an audio processing apparatus and a voice communication terminal. According to an embodiment, an acoustic echo mitigation apparatus is provided, including: an acoustic echo canceller for cancelling estimated acoustic echo from a microphone signal and outputting an error signal; a residual echo estimator for estimating residual echo power; and an acoustic echo suppressor for further suppressing residual echo and noise in the error signal based on the residual echo power and noise power. Here, the residual echo estimator is configured to be continuously adaptive to power changes in the error signal. According to the embodiments of the present application, the acoustic echo mitigation apparatus and method can, at least, adapt well to changes in the power of the error signal after AEC processing, such as those caused by changes in double-talk status, echo path properties, noise level, etc.
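The suppressor stage might apply a per-band gain derived from the estimated residual echo and noise powers, in the style of a Wiener filter. The gain formula and floor below sketch only that suppression step under this assumption, not the patented continuously adaptive residual echo estimator:

```python
def suppression_gain(error_power, residual_echo_power, noise_power,
                     floor=0.05):
    """Per-band suppressor gain: keep the fraction of the AEC error
    power not explained by residual echo and noise (Wiener-style).
    The gain floor limits musical-noise artifacts."""
    wanted = max(error_power - residual_echo_power - noise_power, 0.0)
    gain = wanted / error_power if error_power > 0.0 else floor
    return max(gain, floor)
```

The gain would be applied per frequency band to the error signal; when residual echo and noise explain the whole band, the floor keeps a small amount of signal to avoid audible gating.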
Abstract: Disclosed are a system and computer program product for encoding audio content, and a corresponding method. The method includes determining a characteristic of the audio content, the characteristic including at least one of a type or a property of the audio content. The method also includes classifying the audio content based on the characteristic and determining probabilities for multiple predefined audio coding symbols associated with the audio content by calculating a probability for each of the audio coding symbols based on the result of the classification, the probability for an audio coding symbol indicating the frequency at which the symbol occurs in the audio content. Further, the method encodes the audio content based on the audio coding symbols and the corresponding probabilities to obtain a code value, the code value representing a compression coding format of the audio content.
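Selecting a symbol probability model by content class might look like the sketch below; the class labels, counts and the Shannon code-length helper are illustrative (a real codec would feed the probabilities to an entropy coder such as an arithmetic coder to produce the code value):

```python
import math

def symbol_probabilities(class_tables, content_class):
    """Select the symbol occurrence table for the classified content
    class and normalize it into probabilities. Class labels ("speech",
    "music") and counts are illustrative."""
    counts = class_tables[content_class]
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def ideal_code_length_bits(stream, probs):
    """Shannon code length (bits) that an arithmetic coder driven by
    `probs` approaches for the given symbol stream."""
    return sum(-math.log2(probs[s]) for s in stream)
```

The better the class-conditional probabilities match the actual symbol frequencies, the shorter the resulting code value.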
April 13, 2016
March 22, 2018
DOLBY LABORATORIES LICENSING CORPORATION, DOLBY INTERNATIONAL AB
Abstract: In an audio processing system (300), a filtering section (350, 400): receives subband signals (410, 420, 430) corresponding to audio content of a reference signal (301) in respective frequency subbands; receives subband signals (411, 421, 431) corresponding to audio content of a response signal (304) in the respective subbands; and forms filtered inband references (412, 422, 432) by applying respective filters (413, 423, 433) to the subband signals of the reference signal.
Abstract: Methods and corresponding apparatuses for transmitting and receiving audio signals are described. A transformation is performed on the audio signals in units of frames in order to obtain transformed audio data for each frame, said transformed audio data consisting of multiple signal components in the frequency domain. The signal components of each frame are distributed into multiple adjacent packets in order to generate packets in which signal components distributed from multiple frames are interleaved. Subsequently, the generated packets are transmitted. Accordingly, if packet loss occurs during transmission, the audio signals can be recovered based on the received signal components without consuming additional bandwidth. Therefore, robustness against packet loss can be achieved with little overhead.
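The interleaving can be sketched directly: each frame's components are spread over `depth` adjacent packets, so a single lost packet removes only a 1/depth share of any frame's components, which the receiver can then recover around. The round-robin assignment and the (frame, component, value) packet payload are illustrative choices:

```python
def interleave(frames, depth):
    """Distribute each transformed frame's components round-robin over
    `depth` adjacent packets, interleaving components from multiple
    frames within each packet."""
    packets = [[] for _ in range(len(frames) + depth - 1)]
    for f, frame in enumerate(frames):
        for c, comp in enumerate(frame):
            # Component c of frame f goes into one of `depth` packets.
            packets[f + (c % depth)].append((f, c, comp))
    return packets

def deinterleave(packets, n_frames, frame_len, lost=()):
    """Rebuild frames; components from lost packets stay None and would
    be concealed from the neighbouring received components."""
    frames = [[None] * frame_len for _ in range(n_frames)]
    for i, packet in enumerate(packets):
        if i in lost:
            continue
        for f, c, comp in packet:
            frames[f][c] = comp
    return frames
```

With two frames and depth 2, losing the middle packet leaves every frame with half of its components intact.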
Abstract: A voice communication method and apparatus, and a method and apparatus for operating a jitter buffer, are described. Audio blocks are acquired in sequence. Each of the audio blocks includes one or more audio frames. Voice activity detection is performed on the audio blocks. In response to deciding voice onset for a present one of the audio blocks, a subsequence of the sequence of acquired audio blocks is retrieved. The subsequence immediately precedes the present audio block, has a predetermined length, and consists of audio blocks for which non-voice was decided. The present audio block and the audio blocks in the subsequence are transmitted to a receiving party, and the audio blocks in the subsequence are identified as reprocessed audio blocks. In response to deciding non-voice for the present audio block, the present audio block is cached.
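The onset-lookback behaviour could be sketched as follows, with the voice activity decision supplied by the caller; the class name and block representation are illustrative:

```python
from collections import deque

class OnsetRescueTransmitter:
    """Caches non-voice blocks; on a voice-onset decision, transmits the
    cached lookback blocks (flagged as reprocessed) ahead of the present
    block, so the start of an utterance is not clipped by the VAD."""

    def __init__(self, lookback=3):
        self.cache = deque(maxlen=lookback)   # predetermined length
        self.sent = []                        # (block, is_reprocessed)

    def push(self, block, is_voice):
        if is_voice:
            for cached in self.cache:         # retrieve the subsequence
                self.sent.append((cached, True))
            self.cache.clear()
            self.sent.append((block, False))
        else:
            self.cache.append(block)          # non-voice: just cache it
```

In the usage below, voice onset at block b3 causes the two most recently cached non-voice blocks to be transmitted first, flagged as reprocessed.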