COMPUTATIONALLY EFFICIENT AUDIO CODER
The present invention provides a computationally efficient technique for compression encoding of an audio signal, and further provides a technique to enhance the sound quality of the encoded audio signal. This is accomplished by including more accurate attack detection and a computationally efficient quantization technique. The improved audio coder converts the input audio signal to a digital audio signal. The audio coder then divides the digital audio signal into larger frames having a long-block frame length and partitions each of the frames into multiple short-blocks. The audio coder then computes short-block audio signal characteristics for each of the partitioned short-blocks based on changes in the input audio signal. The audio coder further compares the computed short-block characteristics to a set of threshold values to detect presence of an attack in each of the short-blocks and changes the long-block frame length of one or more short-blocks upon detecting the attack in the respective one or more short-blocks.
This application is a Divisional of U.S. application Ser. No. 10/466,027, filed on May 20, 2004, which claims the priority benefit and is a National Stage Application under 371 of PCT Application Serial No. PCT/IB01/01371, published on Jul. 18, 2001 as WO 02/056297 A1, which applications and publication are incorporated herein by reference in their entirety.
FIELD OF THE INVENTION
This invention relates generally to processing of information signals and more particularly pertains to techniques for encoding audio signals inclusive of voice and music using a perceptual audio coder.
BACKGROUND
A perceptual audio coder is an apparatus that takes a series of audio samples as input and compresses them to save disk space or bandwidth. The perceptual audio coder uses properties of the human ear to achieve the compression of the audio signals.
The technique of compressing audio signals involves recording an audio signal through a microphone and then converting the recorded analog audio signal to a digital audio signal using an A/D converter. The digital audio signal is nothing but a series of numbers. The audio coder transforms the digital audio signal into large frames of fixed-length. Generally, the fixed length of each large frame is around 1024 samples. The analog signal is sampled at a specific rate (called the sampling frequency) and this results in a series of audio samples. Typically a frame of samples is a series of numbers. The audio coder can only process one frame at a time. This means that the audio coder can process only 1024 samples at a time. Then the audio coder transforms the received fixed-length frames (1024 samples) into a corresponding frequency domain. The transformation to a frequency domain is accomplished by using an algorithm, and the output of this algorithm is another set of 1024 samples representing a spectrum of the input. In the spectrum of samples, each sample corresponds to a frequency. Then the audio coder computes masking thresholds from the spectrum of samples. Masking thresholds are nothing but another set of numbers, which are useful in compressing the audio signal. The following illustrates the computing of masking thresholds.
The audio coder computes an energy spectrum by squaring the spectrum of the 1024 samples. Then the samples are further divided into a series of bands. For example, the first 10 samples can be one band and the next 10 samples can be another subsequent band and so on. Note that, in general, the number of samples (width) in each band varies. The width of the bands is designed to best suit the properties of the human ear for listening to frequencies of sound. Then the computed energy spectrum values are summed within each of the bands separately to produce a grouped energy spectrum.
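The grouping step described above can be sketched as follows. This is only an illustration: the function name and the band edges are assumptions made for the example, since the text says only that band widths vary and follow the properties of the human ear.

```python
def grouped_energy(spectrum, band_edges):
    """Sum the squared spectral values inside each band.

    band_edges holds the start index of every band plus one final end index,
    so band b covers spectrum[band_edges[b]:band_edges[b + 1]].
    """
    energy = [x * x for x in spectrum]  # energy spectrum
    return [
        sum(energy[band_edges[b]:band_edges[b + 1]])
        for b in range(len(band_edges) - 1)
    ]


# Toy 8-sample "spectrum" split into three bands of unequal width
# (the edges are assumed for illustration only).
spectrum = [1.0, -2.0, 0.5, 3.0, 0.0, 1.0, -1.0, 2.0]
band_edges = [0, 2, 5, 8]
print(grouped_energy(spectrum, band_edges))  # [5.0, 9.25, 6.0]
```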
The audio coder applies a spreading function to the grouped energy spectrum to obtain an excitation pattern. This operation involves simulating and applying the effects of sounds in one critical band to a subsequent (neighboring) critical band. Generally this step involves convolution with a spreading function, which results in another set of fixed numbers.
Then, based on the tonal or noise-like nature of the spectrum in each critical band, a certain amount of frequency-dependent attenuation is applied to obtain initial masking threshold values. Then, by using an absolute threshold of hearing, the final masked thresholds are obtained. Absolute threshold of hearing is a set of amplitude values below which the human ear will not be able to hear.
Then the audio coder combines the initial masking threshold values with the absolute threshold values to obtain the final masked threshold values. Masked threshold value means a sound value below which a sound is not audible to the human ear (i.e., an estimate of maximum allowable noise that can be introduced during quantization).
Using the masked threshold values, the audio coder computes perceptual entropy (PE) of a current frame. The perceptual entropy is a measure of the minimum number of bits required to code a current frame of audio samples. In other words, the PE indicates how much the current frame of audio samples can be compressed. Various types of algorithms are currently used to compute the PE.
The audio coder receives the grouped energy spectrum, the computed masking threshold values, and the PE and quantizes (compresses) the audio signals. The audio coder has only a restricted number of bits allocated for each frame depending on a bit rate. It distributes these bits across the spectrum based on the masking threshold values. If the masking threshold value is high, then the audio signal is not important and is hence represented using a smaller number of bits. Similarly, if masking threshold is low, the audio signal is important and hence represented using a higher number of bits. Also, the audio coder checks to ensure that the allocated number of bits for the audio signals is not exceeded. The audio coder generally applies a two-loop strategy to allocate and monitor the number of bits to the spectrum. The loops are generally nested and are called Rate Control and Distortion Control Loops. The Rate Control Loop controls the distribution of the bits not to exceed the allocated number of bits, and the Distortion control loop does the distribution of the bits to the received spectrum. Quantization is a major part of the perceptual audio coder. The performance of the audio coder can be significantly improved by reducing the number of calculations performed in the control loops. The current quantization algorithms are very computation intensive and hence result in a slower operation.
Earlier we have seen that the audio coder receives one frame of samples (1024 samples in length) as input, converts the frame of samples into a spectrum, and then quantizes the spectrum using masking thresholds. Sometimes the input audio signal may vary quickly (when the properties of a signal change abruptly). For example, if there is a sudden heavy beat in the audio signal and the audio coder codes it within a single frame of 1024 samples, a problem called pre-echo can occur because temporal masking is inadequate in a signal that includes abrupt changes. The sound signal contains error after quantization, and this error can result in an audible noise before the onset of the heavy beat, hence the name pre-echo. Heavy beats are also called ‘attacks.’ A signal is said to have an attack if it exhibits a significant amount of non-stationarity within the duration of a frame under analysis. For example, a sudden increase in the amplitudes of a time signal within a typical duration of analysis is an attack. To avoid this problem, the audio signal is coded with frames having smaller frame lengths instead of the long 1024 samples. To keep continuity in the number of samples given as input, usually 8 smaller blocks of 128 samples are coded (8×128 samples=1024 samples). This restricts the heavy beat to one set of 128 samples among the 8 smaller blocks, and hence the noise introduced will not spread to the neighboring smaller blocks as pre-echo. But the disadvantage of coding in 8 smaller blocks of 128 samples is that they require more bits to code than the larger blocks of 1024 samples in length. So the compression efficiency of the audio coder is significantly reduced. To improve the compression efficiency, the heavy beats have to be detected accurately so that the smaller blocks can be applied only around the heavy beats. It is important that the heavy beats be accurately detected, or else pre-echo can occur.
Also, a false detection of heavy beats can result in significantly reduced compression efficiency. Current methods to detect the heavy beats use the PE. Calculating the PE is computationally very intensive and also not very accurate.
Also, we have seen earlier that the blocks that have attacks should be coded as smaller blocks having 128 samples and others as larger blocks having 1024 samples. The smaller frame lengths of 128 samples are called ‘short-blocks’, and the 1024 samples frame length are called ‘long-blocks.’ We have also seen that the short-blocks require more bits to code than the long-blocks. Also for each large frame there is a fixed number of bits allocated. If we can intelligently save some bits while coding a long-block and use the saved bits in a short-block, the compression efficiency of the audio coder can be significantly increased. For storing the bits, a ‘Bit Reservoir mechanism’ is needed. Since long-blocks do not need a large number of bits, the unused bits from the long-blocks can be saved in the bit reservoir and used later for a short-block. Currently there are no efficient techniques to save and allocate bits between long and short-blocks to improve the compression efficiency of the audio coder.
The audio signal can be of two types (i) single channel or mono-signal and (ii) multi-channel or stereo signal to produce spatial effects. The stereo signal is a multi-channel signal comprised of two channels, namely left and right channels. Generally the audio signals in the two channels have a large correlation between them. By using this correlation the stereo channels can be coded more efficiently. Instead of directly coding the stereo channels, if their sum and difference signals are coded and transmitted where the correlation is high, a better quality of sound is achieved at a same bit rate. When the audio signal is a stereo signal, the audio coder can operate in two modes (a) normal mode and (b) M-S mode. The M-S mode means encoding the sum and difference of the left and right channels of the stereo. Currently the decision to switch between the normal and M-S modes is based on the PE. As explained before, computing PE is very computation intensive and inconsistent.
Therefore, there is a need in the art for a computationally efficient quantization technique. Also, there is a need in the art for an improved attack detection technique that is computationally less intensive and more accurate, to improve the compression efficiency of the audio coder. In addition, there is a need in the art for a technique to allocate the bits between the long and short-blocks to improve the computation efficiency of the audio coder. Furthermore, there is also a need in the art for a technique that is computationally efficient and more accurate in switching between the normal and the M-S modes when the audio signal is a stereo signal.
SUMMARY OF THE INVENTION
The present invention provides an improved technique for detecting an attack in an input audio signal to reduce pre-echo artifacts caused by attacks during compression encoding of the input audio signal. This is accomplished by providing a computationally efficient and more accurate attack detection technique. The improved audio coder converts the input audio signal to a digital audio signal. The audio coder then divides the digital audio signal into larger frames having a long-block frame length and partitions each of the frames into multiple short-blocks. The audio coder then computes short-block audio signal characteristics for each of the partitioned short-blocks based on changes in the input audio signal. The audio coder further compares the computed short-block characteristics to a set of threshold values to detect presence of an attack in each of the short-blocks and changes the long-block frame length of one or more short-blocks upon detecting the attack in the respective one or more short-blocks.
Further, the improved audio coder increases compression efficiency by efficiently allocating bits between long and short-blocks. The audio coder is also computationally efficient and more accurate in switching between the normal and M-S modes when the audio signal is a stereo signal. In addition, the present invention also describes a technique for reducing the computational complexity of quantization.
The present invention provides an improved audio coder by increasing the efficiency of the audio coder during compression of an input audio signal. This is accomplished by providing computationally efficient and more accurate attack detection and quantization technique. Also, compression efficiency is improved by providing a technique to allocate bits between long and short-blocks. In addition, the present invention provides an audio coder that is computationally efficient and more accurate in switching between the normal and M-S modes when the audio signal is a stereo signal. The words ‘encode’ and ‘code’ are used interchangeably throughout this document to represent the same audio compression scheme. Also the words ‘encoder’ and ‘coder’ are used interchangeably throughout this document to represent the same audio compression system.
The Time frequency generator 110 receives the series of numbers in large frames (blocks) of fixed-length 105. Generally, the fixed length of each frame is around 1024 samples (series of numbers). Time frequency generator 110 can only process one frame at a time. This means that the audio coder 100 can process only 1024 samples at a time. The Time frequency generator 110 then transforms the received fixed-length frames (1024 samples) into corresponding frequency domains. The transformation to the frequency domain is accomplished by using an algorithm, and the output of this algorithm is another set of 1024 samples called a spectrum of the input. In the spectrum, each sample corresponds to a frequency. Then the Time frequency generator 110 computes masking thresholds from the spectrum. Masking thresholds are nothing but another set of numbers that are useful in compressing the audio signal. The following illustrates one example embodiment of computing masking thresholds.
The Time frequency generator 110 computes an energy spectrum by squaring the spectrum of 1024 samples. Then the samples are further divided into a series of bands. For example, the first 10 samples can be one band and the next 10 samples can be another subsequent band and so on. Note that, in general, the number of samples (width) in each band varies. The width of the bands is designed to best suit the properties of the human ear for listening to frequencies of sound. Then the computed energy spectrum values are summed within each of the bands separately to produce a grouped energy spectrum.
The Time frequency generator 110 then applies a spreading function to the grouped energy spectrum to obtain an excitation pattern. This operation involves simulating and applying the effects of sounds in one critical band to a subsequent (neighboring) critical band. Generally this step involves using a convolution algorithm between the spreading function and the energy spectrum.
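As a rough sketch of the spreading step, the code below convolves a grouped energy spectrum with a small kernel so that energy in one band leaks into its neighbors. The three-tap symmetric kernel is an assumption chosen for readability. The text does not give the actual spreading function, and practical spreading functions are generally asymmetric.

```python
def spread(grouped, kernel):
    """Convolve per-band energies with a spreading kernel centred on each band."""
    half = len(kernel) // 2
    excitation = []
    for b in range(len(grouped)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            src = b - (k - half)  # convolution index into the grouped spectrum
            if 0 <= src < len(grouped):
                acc += weight * grouped[src]
        excitation.append(acc)
    return excitation


# Assumed toy kernel: full weight on the band itself, a quarter on each neighbor.
kernel = [0.25, 1.0, 0.25]
print(spread([4.0, 0.0, 8.0], kernel))  # [4.0, 3.0, 8.0]
```

The silent middle band ends up with a non-zero excitation value, which is the effect of neighboring bands that the paragraph describes.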
Based on the tonal or noise-like nature of the spectrum in each critical band, a certain amount of frequency dependent attenuation is applied to obtain initial masking threshold values. Using an absolute threshold of hearing, the final masked thresholds are obtained. Absolute threshold of hearing is a set of amplitude values below which the human ear will not be able to hear.
The Psychoacoustic model 120 combines the initial masking threshold values with the absolute threshold values to obtain the final masked threshold values. Masked threshold value means a sound value below which quantization noise is not audible to the human ear (it is an estimate of the maximum allowable noise that can be introduced during quantization).
Using the masked threshold values, the Psychoacoustic model 120 computes perceptual entropy (PE). The perceptual entropy is a measure of the minimum number of bits required to code a current frame of audio samples. In other words, the PE indicates how much the current frame of audio samples can be compressed. Various types of algorithms are currently used to compute the PE.
The Quantizer 130 then receives the spectrum, the computed masking threshold values, and the PE, and compresses the audio signals. The Quantizer 130 has only a specific number of bits allocated for each frame. It distributes these bits across the spectrum based on the masking threshold values. If the masking threshold value is high, then the audio signal is not important and hence can be represented using a smaller number of bits. Similarly, if the masking threshold is low, the audio signal is important and hence can only be represented using a higher number of bits. Also, the Quantizer 130 checks to make sure that the allocated number of bits for the audio signals is not exceeded. The Quantizer 130 generally applies a two-loop strategy to allocate and monitor the number of bits to the received spectrum. The loops are generally nested and are called Rate Control and Distortion Control loops. The Rate Control loop controls the global gain so that the number of bits used to code the spectrum does not exceed the allocated number of bits, and the Distortion Control loop distributes the bits across the received spectrum. Quantization is a major part of the perceptual audio coder 100. The performance of the Quantizer 130 can be significantly improved by reducing the number of calculations performed in the control loops. The current quantization algorithms used in the Quantizer 130 are very computation intensive and hence result in slower operation.
BitStream formatter 140 receives the compressed audio signal (coded bits) from the Quantizer 130 and converts it into a desired format/syntax (specified coding standard) such as ISO MPEG-2 AAC.
In operation, the transient detection module 210 receives the input audio signal 105 as a series of numbers in frames of fixed-length and partitions each of the frames into multiple short-blocks. In some embodiments, the fixed length is a long-block frame length of 1024 samples of digital audio signal. The digital audio signal comprises series of numbers. The long-block is used when there is no attack in the input audio signal. In some embodiments, the short-blocks have a frame length in the range of about 100 to 300 samples of digital audio signal.
The transient detection module 210 computes short-block audio signal characteristics for each of the partitioned short-blocks. In some embodiments, computing the short-block audio signal characteristics includes computing inter-block differences (xdiff(m) for an mth short-block) and inter-block ratios, and further determining maximum inter-block difference and ratio, respectively. In some embodiments, computing the inter-block differences includes summing a square of the differences between samples in adjacent short-blocks. Further, in some embodiments, the inter-block ratios are computed to better isolate (detect) the attacks. In this embodiment, the inter-block ratios are computed by dividing the adjacent computed inter-block differences as follows:
r[0]=xdiff[0]/pxdif
r[1]=xdiff[1]/xdiff[0]
r[2]=xdiff[2]/xdiff[1]
r[3]=xdiff[3]/xdiff[2]
r[4]=xdiff[4]/xdiff[3]
where ‘pxdif’ is xdiffp[4] (which is xdiff[4] of the previous frame)
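The ratio definitions above translate directly into code. The sketch below is illustrative only: the function name is hypothetical, and the small `eps` guard against division by zero (for example after digital silence) is an assumption that the text does not mention.

```python
def inter_block_ratios(xdiff, pxdif, eps=1e-12):
    """r[0] = xdiff[0] / pxdif and r[m] = xdiff[m] / xdiff[m - 1] for m >= 1.

    pxdif is the last inter-block difference of the previous frame.
    eps is an assumed guard so that a zero denominator does not raise.
    """
    previous = [pxdif] + list(xdiff[:-1])
    return [cur / max(prev, eps) for cur, prev in zip(xdiff, previous)]


# Made-up differences with a sudden jump at the third short-block.
xdiff = [2.0, 2.0, 40.0, 10.0, 5.0]
ratios = inter_block_ratios(xdiff, pxdif=1.0)
print(ratios)       # [2.0, 1.0, 20.0, 0.25, 0.5]
print(max(ratios))  # 20.0
```

The large ratio at index 2 marks the short-block where the signal changes abruptly, which is why the ratios help isolate an attack.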
The transient detection module 210 compares the computed short-block characteristics with a set of threshold values to detect the presence of an attack in each of the short-blocks. Then the transient detection module 210 changes the long-block frame length of the frame including the attack based on the outcome of the comparison, and inputs the changed frame length to the time frequency generator 110 to reduce the effect of the pre-echo caused by the attack. In some embodiments, the time frequency generator uses short-blocks to restrict the attack to a smaller frame so that the attack does not spread to adjacent smaller frame lengths to reduce the pre-echo artifact caused by the attack. In this embodiment, the smaller frames have a frame length in the range of about 100 to 200 samples of digital audio signal.
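A minimal sketch of the comparison step is shown below. The text does not state the threshold values or how the individual comparisons are combined, so both thresholds and the "both maxima must exceed their thresholds" rule are assumptions made purely for illustration.

```python
def has_attack(xdiff, ratios, diff_threshold, ratio_threshold):
    """Flag an attack when the maximum inter-block difference and the maximum
    inter-block ratio both exceed their thresholds (assumed combination rule)."""
    return max(xdiff) > diff_threshold and max(ratios) > ratio_threshold


def block_type(xdiff, ratios, diff_threshold=30.0, ratio_threshold=10.0):
    """Choose the frame length: short-blocks around an attack, else a long-block.

    The default thresholds are hypothetical values, not taken from the text.
    """
    return "short" if has_attack(xdiff, ratios, diff_threshold, ratio_threshold) else "long"


print(block_type([2.0, 2.0, 40.0, 10.0, 5.0], [2.0, 1.0, 20.0, 0.25, 0.5]))  # short
print(block_type([2.0, 2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0, 1.0]))      # long
```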
The following computational sequence is used in detecting the presence of an attack in the adjacent frame 320:
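The inter-block differences xdiff(m) 340 in the time domain are computed by summing the squares of the differences between corresponding samples in adjacent short-blocks, the sum being taken over the samples j of a short-block:
xdiff(m)=Σj[s(j,m)−s(j,m−1)]²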
where s(j,m) is the j'th time domain sample of the m'th short-block and s(j,m−1) corresponds to time domain samples of the last short-block of the adjacent frame 320. The Diff blocks 350 shown in
In some embodiments, the short-block frame lengths are tuned to the application in use. In these embodiments, distance between the large frames is computed to determine an optimum size for the short-block frame lengths. The following algorithm is used to compute the distance between the large frames:
xdiff(m)=d(Ŝm,Ŝm-1)
where Ŝm and Ŝm-1 380 are the signal sub-vectors for the mth and (m−1)th short-blocks, and d(•) is a function that returns a distance measure between the two vectors.
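The distance formulation can be sketched with a pluggable distance function. In the sketch below the default distance is the squared Euclidean distance, which reduces to the sum of squared sample differences described earlier for the inter-block differences. The function names and the requirement that the frame length be a multiple of the short-block length are assumptions of this example.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two equally long sub-vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inter_block_differences(frame, prev_last_block, block_len, distance=squared_euclidean):
    """Return xdiff(m) = d(S_m, S_(m-1)) for every short-block of the frame.

    prev_last_block is the last short-block of the adjacent (previous) frame,
    which serves as S_(m-1) for the first short-block.
    """
    if len(frame) % block_len != 0:
        raise ValueError("frame length must be a multiple of block_len")
    xdiff = []
    previous = prev_last_block
    for start in range(0, len(frame), block_len):
        block = frame[start:start + block_len]
        xdiff.append(distance(block, previous))
        previous = block
    return xdiff


# Tiny made-up frame of three 4-sample short-blocks.
frame = [0, 0, 0, 0, 0, 0, 3, 4, 3, 4, 3, 4]
print(inter_block_differences(frame, [0, 0, 0, 0], block_len=4))  # [0, 25, 25]
```

Passing a different `distance` callable is one way to tune the measure to an application without changing the surrounding logic.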
The Quantizer 130 receives the large and small frames including the samples of digital audio signal from the time frequency generator 110. Further, the Quantizer 130 receives the computed perceptual entropy from the psychoacoustic model 120 shown in
The Rate Control Loop 430 computes global gain, which is commonly referred to as “common_scalefac” for a given set of spectral values with a pre-determined value for the maximum number of bits available for encoding the frame (referred to as “available_bits”). The Rate Control Loop arrives at a unique solution for the common_scalefac value for a given set of spectral data for a fixed value of available_bits, so any other variation of the Rate Control Loop must necessarily arrive at the same solution. Efficiency of the Rate Control Loop is increased by reducing the number of iterations required to compute the common_scalefac value. The technique of reducing the number of iterations required to compute the common_scalefac value according to the teachings of the present invention is discussed in detail in the following section.
The Quantizer 130 stores a start_common_scalefac value of a previous adjacent frame to use in quantization of a current frame. The Rate Control Loop 430 computes the common_scalefac value for the current frame using the stored start_common_scalefac value as a starting value during computation of iterations by the Rate Control Loop 430 to reduce the number of iterations required to compute the common_scalefac value of the current frame. Further, the Rate Control Loop 430 computes counted_bits using the common_scalefac value of the current frame. The comparator 427 coupled to the Rate Control Loop compares the computed counted_bits with available_bits. The Rate Control Loop changes the computed common_scalefac value based on the outcome of the comparison. In some embodiments, the counted_bits comprises bits required to encode a given set of spectral values for the current frame.
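The benefit of starting from the previous frame's value can be illustrated with a toy search. Everything in this sketch is an assumption made for illustration: `count_bits` is a hypothetical stand-in for quantizing and counting coded bits (a real coder would count Huffman code lengths), and the search is a plain one-step-at-a-time search rather than the step rule of the described embodiments. It shows only that a good starting value reaches the same solution in fewer steps.

```python
def count_bits(spectrum, common_scalefac):
    """Hypothetical bit counter: quantize with a step that grows with
    common_scalefac and charge each non-zero value a crude cost."""
    step = 2.0 ** (common_scalefac / 4.0)
    bits = 0
    for x in spectrum:
        q = int((abs(x) / step) ** 0.75 + 0.4054)
        if q:
            bits += 2 * q.bit_length() + 1
    return bits


def rate_control(spectrum, available_bits, start_common_scalefac):
    """Return (common_scalefac, steps) for the smallest non-negative
    common_scalefac whose counted bits fit into available_bits."""
    common_scalefac = start_common_scalefac
    steps = 0
    # Coarsen the quantizer until the frame fits.
    while count_bits(spectrum, common_scalefac) > available_bits:
        common_scalefac += 1
        steps += 1
    # Refine again as long as the frame still fits.
    while common_scalefac > 0 and count_bits(spectrum, common_scalefac - 1) <= available_bits:
        common_scalefac -= 1
        steps += 1
    return common_scalefac, steps


spectrum = [1000.0 / (k + 1) for k in range(64)]
cold_sf, cold_steps = rate_control(spectrum, 200, 0)
# Pretend the previous frame ended one step away from this frame's solution.
warm_sf, warm_steps = rate_control(spectrum, 200, cold_sf + 1)
print(cold_sf == warm_sf, warm_steps < cold_steps)  # True True
```

Because the stand-in bit count never increases as common_scalefac grows, both searches arrive at the same value, and only the number of steps differs.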
The Distortion Control Loop 440 is coupled to the Rate Control Loop 430 to distribute the bits among the samples in the spectrum based on the masking thresholds received from the psychoacoustic model. Also, the Distortion Control Loop 440 tries to allocate bits in such a way that quantization noise is below the masking thresholds. The Distortion Control Loop 440 also sets the starting value of start_common_scalefac to be used in the Rate Control Loop 430.
Step 520 includes dividing the converted digital audio signal into large frames having a long-block frame length. In some embodiments, the long-block frame length comprises 1024 samples of digital audio signal. In this embodiment, the samples of digital audio signal comprise series of numbers. In this embodiment, the long-block frame length comprises a frame length used when there is no attack in the input audio signal.
Step 530 includes partitioning each of the large frames into multiple short-blocks. In some embodiments, partitioning large frames into short-blocks includes partitioning short-blocks having short-block frame lengths in the range of about 100 to 300 samples.
Step 540 includes computing short-block characteristics for each of the partitioned short-blocks based on changes in the input audio signal. In some embodiments, the computing of the short-block characteristics includes computing inter-block differences and determining a maximum inter-block difference from the computed inter block differences. In some embodiments, the computing of short-block characteristics further includes computing inter-block ratios and determining a maximum inter-block ratio from the computed inter-block ratios. In this embodiment, the computing of inter-block differences includes summing a square of the differences between samples in adjacent short-blocks. Also in this embodiment the computing of the inter-block ratios includes dividing the adjacent computed inter-block differences. The process of computing the short-block characteristics is discussed in more detail with reference to
Step 550 includes comparing the computed short-block characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks. Step 560 includes changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack. In some embodiments, the changing of the long-block frame length means changing to include multiple smaller frames to restrict the attack to one or more smaller frames so that the pre-echo artifact caused by the attack does not spread to the adjacent larger frames. In some embodiments, the smaller frame lengths include about 100 to 200 samples of digital audio signal.
Step 620 includes computing a perceptual entropy for the current frame of audio samples using the masking thresholds computed as described in detail with reference to
The following example further illustrates the operation of the above-described operation 600 of the bit allocation strategy:
For example, if a given mono (single-channel) audio signal is sampled at a sampling frequency of 44100 Hz (meaning there are 44100 samples per second) and is to be encoded at a bit rate of 64 kbps (64000 bits per second), and the long-block frame length is 1024 samples, the average number of bits per frame is computed as follows:
average bits per frame=(64000 bits/s×1024 samples)/(44100 samples/s)≈1486 bits
Therefore each frame is coded using 1486 bits. Each of the frames does not require the same number of bits. Also each of the frames does not require all of the bits. Assuming the first frame to be coded requires 1400 bits, the remaining unused 86 bits are stored in the Bit Reservoir and can be used in succeeding frames. For the next adjacent frame we will have a total of 1572 bits (1486 bits+86 bits in the Bit Reservoir) available for coding. For example, if the next adjacent frame is a short frame more bits can be allocated for coding.
In some embodiments, less than the average number of bits are used for encoding the large frames (using a reduction factor) and the remaining bits are stored in the Bit Reservoir. For example, in the above case only 1300 bits are allocated for each of the large frames. Then the remaining 186 bits (reduction factor) are stored in the Bit Reservoir.
Generally the Bit Reservoir cannot be used to store a large number of remaining bits. Therefore, a maximum limit is set for the number of bits that can be stored in the Bit Reservoir, and anytime the number of bits exceeds the maximum limit, the excess bits are allocated to the next frame. In the above example, if the bit reservoir has exceeded the maximum limit, then the next frame will receive 1300 bits along with the number of bits by which the Bit reservoir has exceeded the limit.
In the above-described operation 600 when the next frame is a small frame (small frames generally occur rarely), then more bits are allocated to the small frame from the Bit Reservoir. The number of extra bits that can be allocated to the small frame is dependent on two factors. One is the number of bits present in the Bit Reservoir and the other is the number of consecutive small blocks present in the input audio signal. Basically the strategy described in the above operation 600 is to remove bits from the long frames and to allocate the removed bits to the small frames as needed.
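The bit allocation strategy of operation 600 can be sketched as a small simulation. The numbers 64000, 44100, 1024 and 1300 come from the example above. The reservoir limit of 400 bits, the rule that a small frame drains the whole reservoir, and the simplification that every frame spends its full budget are assumptions made for this illustration.

```python
def average_bits_per_frame(bit_rate, sample_rate, frame_len):
    """Average bit budget of one frame, rounded down to whole bits."""
    return bit_rate * frame_len // sample_rate


def allocate_bits(frame_types, average_bits, long_bits, max_reservoir):
    """Return the bit budget handed to each frame.

    Long frames get the reduced long_bits budget and the difference goes to
    the reservoir. Bits above max_reservoir are passed to the next frame.
    Short frames get the average plus everything in the reservoir (assumed).
    """
    reservoir = 0
    excess = 0
    budgets = []
    for kind in frame_types:
        if kind == "long":
            budget = long_bits + excess
            excess = 0
            reservoir += average_bits - long_bits
            if reservoir > max_reservoir:
                excess = reservoir - max_reservoir
                reservoir = max_reservoir
        else:
            budget = average_bits + excess + reservoir
            excess = 0
            reservoir = 0
        budgets.append(budget)
    return budgets


avg = average_bits_per_frame(64000, 44100, 1024)
print(avg)  # 1486
print(allocate_bits(["long", "long", "long", "short", "long"], avg, 1300, 400))
# [1300, 1300, 1300, 2044, 1300]
```

In this run the third long frame pushes the reservoir past the assumed limit, so the 158 excess bits are handed to the following frame together with the 400 stored bits.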
At 720 counted_bits associated with the current frame are computed. In some embodiments, computing counted_bits includes quantizing the spectrum of the current frame and then computing the number of bits required to encode the quantized spectrum of the current frame.
At 730 a difference between the computed counted_bits and available_bits is computed. In some embodiments, the available_bits are the number of bits made available to encode the spectrum of the current frame. In some embodiments, the difference between the computed counted_bits and the available_bits is computed by comparing the computed counted_bits with the available_bits.
At 740 the computed difference is compared with a pre-determined MAXDIFF value. Generally, the value of pre-determined MAXDIFF is set to be in the range of about 300-500.
At 750 the common_scalefac value and quantizer_change value are reset based on the outcome of the comparison. In some embodiments, the common_scalefac value is reset when the computed difference is greater than the pre-determined MAXDIFF, and the common_scalefac value is changed based on the outcome of the comparison when the computed difference is less than or equal to the pre-determined MAXDIFF value.
In some embodiments, the changing of the common_scalefac value based on the outcome of the comparison further includes storing the computed counted_bits along with the associated common_scalefac value, then comparing the counted_bits with the available bits, and finally changing the common_scalefac value based on the outcome of the comparison.
In some embodiments, changing the common_scalefac value based on the outcome of the comparison further includes assigning a value to a quantizer_change, and changing the common_scalefac value using the assigned value to the quantizer_change and repeating the above steps when the counted_bits is greater than the available_bits. Some embodiments include restoring the counted_bits and outputting the common_scalefac value when the counted_bits is less than or equal to available_bits.
In some embodiments, resetting the common_scalefac value further includes computing predicted_common_scalefac value based on stored common_scalefac value of the previous frame adjacent to the current frame, and resetting the common_scalefac value. In case counted_bits is greater than available_bits, common_scalefac is set to the start_common_scalefac value+64, when the start_common_scalefac value+64 is not greater than predicted_common_scalefac value, otherwise common_scalefac is set to predicted_common_scalefac and quantizer_change is set to 64. Some embodiments include setting common_scalefac to start_common_scalefac+32, and further setting quantizer_change to 32 when the counted_bits is less than or equal to available_bits and the common_scalefac is not greater than start_common_scalefac+32 and if predicted_common_scalefac is greater than the present common_scalefac, recomputing counted bits. Further, some embodiments include setting the start_common_scalefac+64 when the counted_bits is less than or equal to available_bits, and the common_scalefac value is greater than the start_common_scalefac+32 and if predicted_common_scalefac is greater than the present common_scalefac, recomputing counted_bits.
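The top-level decision made at 730 through 750 can be sketched as below. Only the branch on MAXDIFF is shown. The value 400 is an assumed choice from the stated range of about 300-500, taking the difference as an absolute value is an assumption, and the detailed reset rules with the +64 and +32 offsets are deliberately left out of this sketch.

```python
MAXDIFF = 400  # assumed value inside the "about 300-500" range given in the text


def next_action(counted_bits, available_bits, maxdiff=MAXDIFF):
    """Decide whether common_scalefac should be reset or changed incrementally.

    The difference is taken as an absolute value, which is an assumption.
    """
    difference = abs(counted_bits - available_bits)
    return "reset" if difference > maxdiff else "adjust"


print(next_action(2500, 1486))  # reset  (far from the target)
print(next_action(1600, 1486))  # adjust (close to the target)
```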
Step 830 includes partitioning each of the frames into corresponding multiple left and right short-blocks having short-block frame length. In some embodiments, the short-block frame-length includes samples in the range of about 100 to 300 samples of digital audio signal.
Step 840 includes computing left and right short-block characteristics for each of the partitioned left and right short-blocks. In some embodiments, the computing the short-block characteristics includes computing the sum and difference short-block characteristics by summing and subtracting respective samples of the digital audio signals in the left and right short-blocks. In some embodiments, computing the sum and difference short-block characteristics further includes computing sum and difference energies in each of the short-blocks in the left and right short-blocks by squaring each of the samples and adding the squared samples in each of the left and right short-blocks. In addition, the short-block energy ratio is computed for each of the short-blocks computed sum and difference energies, further determining a number of short-blocks whose computed short-block energy ratio exceeds a pre-determined energy ratio value.
Step 850 includes encoding the stereo audio signal based on the computed short-block characteristics. In some embodiments, the encoding of the stereo signal includes using a sum and difference compression encoding technique to encode the left and right audio signals based on the determined number of short-blocks exceeding the pre-determined energy ratio value. In some embodiments, the pre-determined energy value is greater than 0.75 and less than 0.25.
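Steps 840 and 850 can be illustrated with a short sketch. The text does not give the exact form of the short-block energy ratio or of the final decision rule, so the example assumes the ratio is the sum energy divided by the total of the sum and difference energies. It also assumes that sum and difference encoding is selected when at least min_blocks short-blocks exceed the threshold. The function names, the default threshold of 0.75, and min_blocks are all hypothetical.

```python
def block_energy(samples):
    """Energy of a short-block: square each sample and add the squares."""
    return sum(x * x for x in samples)


def sum_difference_energy_ratios(left_blocks, right_blocks):
    """Return one energy ratio per pair of left/right short-blocks.

    Assumed definition: ratio = E_sum / (E_sum + E_diff).
    """
    ratios = []
    for left, right in zip(left_blocks, right_blocks):
        sum_block = [l + r for l, r in zip(left, right)]    # sum signal
        diff_block = [l - r for l, r in zip(left, right)]   # difference signal
        e_sum = block_energy(sum_block)
        e_diff = block_energy(diff_block)
        total = e_sum + e_diff
        # Treat an all-silent block as ratio 0.0 to avoid dividing by zero.
        ratios.append(e_sum / total if total > 0 else 0.0)
    return ratios


def use_sum_difference(left_blocks, right_blocks,
                       ratio_threshold=0.75, min_blocks=1):
    """Assumed decision rule: count the short-blocks whose ratio exceeds the
    threshold and choose sum/difference encoding if enough of them do."""
    ratios = sum_difference_energy_ratios(left_blocks, right_blocks)
    exceeding = sum(1 for r in ratios if r > ratio_threshold)
    return exceeding >= min_blocks


# Two short-blocks: identical channels in the first, opposite channels in the second.
left = [[1.0, 2.0], [1.0, -1.0]]
right = [[1.0, 2.0], [-1.0, 1.0]]
ratios = sum_difference_energy_ratios(left, right)
```

In this toy input the first short-block has all of its energy in the sum signal (ratio 1.0) and the second has all of it in the difference signal (ratio 0.0), so exactly one short-block exceeds the 0.75 threshold.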
The above-described invention increases compression efficiency by providing a technique to allocate bits between long and short-blocks. Also, the present invention significantly enhances the sound quality of the encoded audio signal by more accurately detecting an attack and reducing pre-echo artifacts caused by attacks. In addition, the present invention provides an audio coder that is computationally efficient and more accurate in switching between the normal and the M-S modes when the audio signal is a stereo signal.
The above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those skilled in the art. The scope of the invention should therefore be determined by the appended claims, along with the full scope of equivalents to which such claims are entitled.
Claims
1. (canceled)
2. A method for processing an audio signal, comprising:
- converting the audio signal into a digital audio signal;
- dividing the digital audio signal into large frames having a long-block frame length;
- partitioning each of the large frames into multiple short-blocks;
- computing short-block audio signal characteristics for each of the short-blocks based on changes in the input audio signal;
- comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks; and
- changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack.
3. The method of claim 2, wherein detecting the attack comprises:
- detecting a sudden increase in amplitude within the long-block frame length.
4. The method of claim 2, wherein the long-block frame length comprises 1024 samples of digital audio signal.
5. The method of claim 4, wherein the samples of digital audio signal comprise series of numbers.
6. The method of claim 5, wherein the long-block frame length comprises a frame length used when there is no attack in the input audio signal.
7. The method of claim 5, wherein the short-blocks comprise:
- short-blocks having short-block frame lengths in the range of about 100 to 300 samples.
8. The method of claim 5, wherein computing the short-block audio signal characteristics further comprises:
- computing inter-block differences; and
- determining a maximum inter-block difference from the computed inter-block differences.
9. An apparatus to detect an attack in an input digital audio signal to reduce a pre-echo artifact caused by the attack during compression encoding of the input digital audio signal, comprising:
- a time frequency generator to receive the digital audio signal and divide the digital audio signal into large frames having a long-block frame length, and to further partition each of the large frames into multiple short-blocks; and
- a transient detection module coupled to the time frequency generator to receive the multiple short-blocks and compute short-block audio signal characteristics for each of the received multiple short-blocks based on changes in the input digital audio signal, wherein the transient detection module compares the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the multiple short-blocks, and the transient detection module further changes the long-block frame length of one or more large frames including the attack based on the outcome of the comparison, wherein the time frequency generator receives the changed one or more large frames and compresses the changed one or more large frames to reduce the pre-echo artifact caused by the attack.
10. The apparatus of claim 9, wherein the attack comprises:
- a sudden increase in amplitude within the long-block frame length of the large frame of digital audio signal.
11. The apparatus of claim 10, wherein the long-block frame length comprises 1024 samples of digital audio signal.
12. A computer readable storage device comprising instructions that when executed by a processor execute a process for processing an audio signal by:
- converting the audio signal into a digital audio signal;
- dividing the digital audio signal into large frames having a long-block frame length;
- partitioning each of the large frames into multiple short-blocks;
- computing short-block audio signal characteristics for each of the short-blocks based on changes in the input audio signal;
- comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks; and
- changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack.
13. The computer readable storage device of claim 12, wherein detecting the attack comprises:
- detecting a sudden increase in amplitude within the long-block frame length.
14. The computer readable storage device of claim 12, wherein the long-block frame length comprises 1024 samples of digital audio signal.
15. The computer readable storage device of claim 14, wherein the samples of digital audio signal comprise series of numbers.
16. The computer readable storage device of claim 15, wherein the long-block frame length comprises a frame length used when there is no attack in the input audio signal.
17. The computer readable storage device of claim 15, wherein the short-blocks comprise:
- short-blocks having short-block frame lengths in the range of about 100 to 300 samples.
18. The computer readable storage device of claim 15, wherein computing the short-block audio signal characteristics further comprises:
- computing inter-block differences; and
- determining a maximum inter-block difference from the computed inter-block differences.
19. A method to detect an attack in an input digital audio signal to reduce a pre-echo artifact caused by the attack during compression encoding of the input digital audio signal, comprising:
- receiving the digital audio signal and dividing the digital audio signal into large frames having a long-block frame length, and further partitioning each of the large frames into multiple short-blocks;
- receiving the multiple short-blocks and computing short-block audio signal characteristics for each of the received multiple short-blocks based on changes in the input digital audio signal;
- comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the multiple short-blocks;
- changing the long-block frame length of one or more large frames including the attack based on the outcome of the comparison; and
- receiving the changed one or more large frames and compressing the changed one or more large frames to reduce the pre-echo artifact caused by the attack.
20. The method of claim 19, wherein the attack comprises:
- a sudden increase in amplitude within the long-block frame length of the large frame of digital audio signal.
21. The method of claim 20, wherein the long-block frame length comprises 1024 samples of digital audio signal.
Type: Application
Filed: Mar 21, 2013
Publication Date: Sep 12, 2013
Applicant: Sasken Communication Technologies Limited (Bangalore)
Inventors: K. P. P. Kalyan Chakravarthy (Bangalore), Navaneetha K. Ruthramoorthy (Framingham, MA), Pushkar P. Patwardhan (Powai), Bishwarup Mondal (Kolkata)
Application Number: 13/848,457
International Classification: G10L 19/00 (2006.01)