Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders

A system and method for enhancing the speech quality of the mixed-excitation linear predictive (MELP) coder and other low bit-rate speech coders are disclosed. The system includes a robust pitch-detection algorithm, which adjusts or slides a pitch-analysis window to provide the speech coder with more reliable pitch information. In addition, the system is shown to be compatible with the existing MELP coder in terms of the bit stream.

Description
CLAIM OF PRIORITY

[0001] This application is a divisional application of a co-pending U.S. Utility Application, entitled, “Apparatus and Quality Enhancement Algorithm for Mixed Excitation Linear Predictive (MELP) and Other Speech Coders,” to Unno et al., filed Sep. 29, 1999, granted Ser. No. 09/408,195, which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

[0002] The present invention relates to speech signal coding using a parametric coder to model a speech waveform. The speech signal parameters are communicated via a communications channel and used to synthesize the speech waveform at the receiver. More specifically, the present invention enhances the speech quality and reduces the computations of the mixed excitation linear predictive (MELP) speech coder.

BACKGROUND OF THE INVENTION

[0003] Low bit-rate speech coding technology is widely used for digital voice communication in narrow-bandwidth channels. The common objective of this technology is to transfer the digital speech signal information at a low bit rate (typically 2,400 bits/sec) while providing good quality speech synthesis at the destination. This technology also strives to provide low computational complexity, low memory requirements, and a small algorithmic delay particularly for real-time low-cost voice communications. FIG. 1A illustrates the general environment surrounding speech encoders and decoders as used in a one-way communications system. Full duplex communications are easily enabled by integrating both an encoder and decoder at both ends of the communications system.

[0004] The first widely-used low bit-rate speech coder was the Federal Standard linear predictive coding (LPC) vocoder (FS1015) in which either a periodic pulse train or white noise excites an all-pole filter in order to synthesize speech. While the 2.4 kbps bit rate was attractive, the LPC vocoder was not acceptable for many speech applications as users characterized the synthesized speech as synthetic and buzzy.

[0005] The LPC vocoder analyzes the speech waveform and extracts such parameters as filter coefficients, pitch period, voicing decision, and gain, which are updated every 20-30 ms and transmitted to the communications channel. The artifacts residing in the traditional LPC vocoder include buzzes, clicks, and tonal noise. In addition, the speech quality is very poor in the presence of background noise. These unintended additions to the synthesized speech are the result of the simple excitation model and the binary voicing decision error.

[0006] Over the years, several low bit-rate speech coding algorithms have been developed, and some state-of-the-art coders now provide a good natural quality. The mixed excitation linear predictive (MELP) coder is one of these speech coders. The MELP coder is a linear-prediction-based speech coder, which includes five features not found in the LPC vocoder: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features improve the synthesized speech quality by removing distortions resident in the LPC vocoder. FIG. 1B and FIG. 1C illustrate block diagrams of the MELP encoder and decoder respectively.

[0007] However, the MELP still has some perceivable distortions, particularly around the non-stationary speech segments and for some low-pitch male speakers. These distortions can also be observed with other low bit-rate speech coders. The distortion around the non-stationary speech segments results from the update of speech parameters at a low frame rate (typically 30-50 frames/sec). It is known that increasing the frame rate helps to solve this problem. Unfortunately, this solution requires a much higher bit rate. Another possible solution is a variable frame-rate system that updates the speech parameters in the less stationary segments at a higher frame rate while maintaining a low frame rate in the stationary segments. Such an approach is provided by the delayed decision approach based on dynamic programming, which uses the future frame information to control the frame rate. This system can produce high-quality speech while maintaining a relatively low bit rate by reducing the average frame rate. However, this method requires a considerably longer algorithmic delay (around 150 ms), which is unacceptable in many applications (such as two-way voice communications).

[0008] The distortion for low-pitch male speakers in the MELP is characterized by a high-pass filtered quality of the coded speech. In other words, the synthesized speech lacks “sound pressure” in the low frequencies. This distortion is caused by a post filter and a preprocessing high-pass filter, which are used in the modern low bit-rate speech coders to remove 60 Hz noise and to enhance the coded speech quality. These filters suppress the harmonic magnitudes in the low frequencies, particularly for low-pitch male speakers whose fundamental frequencies are less than 100 Hz. The suppression of these low frequency harmonics results in a high-pass filtered speech that is perceived as too synthetic.

[0009] The most significant speech distortion present in the prior art is the lack of a suitable model or method to accurately synthesize a plosive sound. Plosive sounds are characterized by the sudden opening or closing of the vocal cords. Plosive phonemes are created when most English speaking persons create sounds such as: “b,” “d,” “g,” “k,” “p,” “t,” “th,” “ch,” or “tch.” It is important to note that the preceding list of plosive phonemes is not exclusive and that not all speakers will create like sounds. Plosive phonemes may be created both at the start and at the end of syllables (e.g., “pop,” “tank,” “tot”), at the end of syllables (e.g., “sound,” “sat,” “shrug”), or at the start of syllables (e.g., “toy,” “boy,” “boss”). Plosive sounds are easily identified in a speech waveform but difficult to model and synthesize in low bit-rate speech coders. Plosive sounds are characterized by an impulse of energy followed by a brief period where the speech waveform is aperiodic. Prior art speech encoders have been unable to model and synthesize plosive sounds in a manner acceptable to the human ear.

SUMMARY OF THE INVENTION

[0010] As described briefly, an object of the present invention is to enhance the coded speech quality of the existing low bit-rate speech coders including the MELP vocoder while maintaining its low bit rate, small algorithmic delay, and low computational complexity.

[0011] Another object of the present invention is to provide an efficient mixed excitation algorithm to reduce the computational complexity of the existing MELP vocoder. Another object of the present invention is to provide bit-stream compatibility with the existing MELP vocoder in order to permit the introduction of the invention into systems where only the present MELP decoder is available. This would allow for backward compatibility through the introduction of an updated encoder while allowing for full system upgrades where both the encoder and the decoder could be updated.

[0012] The present invention provides four embodiments. The first is a robust pitch-detection algorithm. In the encoder, the fixed-length pitch analysis window is manipulated around the original position to seek the position that contains the signal with the highest pitch correlation. Once the window position is determined, pitch is estimated using the signal that is contained in the selected window. Other parameters such as LPC coefficients, gain, and voicing decision are also estimated using the signal corresponding to the selected window. The estimated parameters are used to synthesize the coded speech in the decoder on each sample window in the same manner as earlier fixed-position windows in the prior art.

[0013] The second embodiment is a plosive analysis/synthesis method. In the encoder, the system first detects the frame that contains the plosive signal. The plosive detection is performed with sliding-window peakiness analysis. The detected plosive signal is quantized to only a small number of bits and transmitted via the communication channel to the decoder. In the decoder, the plosive signal is synthesized independently and added back to the coded speech.

[0014] The third embodiment is a post-processor for the Fourier magnitude model. In the decoder, the harmonic magnitudes of the coded speech in the low frequencies are emphasized to overcome the muffling effect of the high pass filter. In this way, the decoded speech is synthesized without the muffling effect often observed in the high-pass filtered speech of current low bit-rate speech encoders.

[0015] The fourth embodiment is a new mixed-excitation algorithm. In the decoder, a pulse train is mixed with random noise in the frequency domain in unvoiced frequency bands to eliminate the band-pass filtering operations, which are required to generate the mixed-excitation signal in the existing MELP coder. The elimination of the filters results in a significant reduction of computational complexity in the MELP decoder. As a result, the present system is shown to be compatible in terms of bit-stream and is interchangeable with the coder/decoder of the existing MELP speech coder.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The present invention will be more fully understood from the accompanying drawings of the embodiments of the invention, which however, should not be taken to limit the invention to the specific embodiments enumerated, but are for explanation and for better understanding only. Finally, like reference numerals in the figures designate corresponding parts throughout the drawings.

[0017] FIG. 1A is a block diagram of a communications system having a MELP speech encoder and decoder;

[0018] FIG. 1B is a block diagram illustrating the MELP encoder of FIG. 1A;

[0019] FIG. 1C is a block diagram illustrating the MELP decoder of FIG. 1A;

[0020] FIG. 2A is a block diagram highlighting the new embodiments of the present system;

[0021] FIG. 2B is a block diagram illustrating the new encoder of FIG. 2A;

[0022] FIG. 2C is a block diagram illustrating the new decoder of FIG. 2A;

[0023] FIG. 3A illustrates plosive signal types and locations in a sample sentence and reveals how plosive sounds remain undetected in the prior art;

[0024] FIG. 3B illustrates plosive signal synthesis in coded speech;

[0025] FIG. 3C illustrates a typical LPC residual waveform for a plosive signal;

[0026] FIG. 3D illustrates the Fourier spectrums of an original plosive sound along with the replacement plosive model;

[0027] FIG. 3E illustrates the Fourier spectrums of an original plosive sound with a click with the replacement plosive model;

[0028] FIG. 4 illustrates the relative time shifting in the robust pitch detector shown in FIG. 2B;

[0029] FIG. 5 illustrates a block diagram of the plosive analysis/synthesis system of the present invention as shown in FIG. 2B and FIG. 2C;

[0030] FIG. 6 illustrates the plosive detector of the present invention as shown in FIG. 5;

[0031] FIG. 7 illustrates a block diagram of the plosive synthesizer of the present invention as shown in FIG. 5;

[0032] FIG. 8 illustrates a block diagram of the post-processor for the Fourier magnitude of the present invention as shown in FIG. 2C;

[0033] FIG. 9 illustrates a block diagram of the new mixed excitation method of the present invention as shown in FIG. 2C;

[0034] FIG. 10 illustrates the flow diagram of bit packing for the plosive signal parameters within voiced and unvoiced frames;

[0035] FIG. 11 illustrates the flow diagram of the bit unpacking for the plosive signal parameters for voiced and unvoiced frames.

[0036] FIG. 12 illustrates words with plosive sounds;

[0037] FIG. 13 illustrates the replacement of different plosive types in the present invention;

[0038] FIG. 14 reveals the bit allocation for the plosive signal model;

[0039] FIG. 15 reveals the 99-level Pitch and Voicing level quantization in the existing MELP;

[0040] FIG. 16A reveals the bit allocation in the existing MELP frame; and

[0041] FIG. 16B reveals the bit transmission order in the existing MELP frame.

DESCRIPTION OF THE PREFERRED EMBODIMENT

[0042] The present invention is embedded in the existing MELP coder as shown in FIG. 2A to enhance coded speech quality. It will be apparent to those skilled in the art that the MELP coder can be replaced with other low bit-rate speech coders that are based on a parametric speech coding algorithm in order to practice the current invention. The present invention consists of four embodiments. The first embodiment, a robust pitch detector, is shown as 52 in FIG. 2A. The robust pitch detector 52 replaces a portion of the refinement of pitch and voicing decision 37 in the MELP coder and does not require additional bits for transmission.

[0043] The second embodiment, the plosive analysis/plosive synthesis function is illustrated in FIG. 2A. Plosive analysis 55 is added to the encoder. Plosive synthesis 59 is added to the decoder and requires two bits for transmission.

[0044] The third embodiment, a post-processor for the Fourier magnitude 62, is shown in FIG. 2A. It is added to the decoder and does not require additional bits for transmission.

[0045] The fourth embodiment, a new mixed excitation 35, is also shown in FIG. 2A. It replaces the mixed excitation method of the prior art. The new mixed excitation 35 is embedded in the decoder, and does not require additional bits for transmission.

[0046] MELP Encoder

[0047] FIG. 1B illustrates a block diagram of the processing flow within the MELP encoder. A frame of speech data is processed by the MELP coder every 22.5 ms. Each frame contains 180 voice samples taken at a sampling rate of 8,000 samples per second. The MELP is a parametric speech coder that creates a 54-bit per frame concatenated code that is used by the MELP decoder to synthesize the speech waveform at the receiver. Each frame contains the following parameters: Line Spectral Frequencies (LSFs), Fourier Magnitudes, Gain, Pitch, Band-pass Voicing, Aperiodic Flag, Error Protection (in unvoiced frames only), and a synchronization bit.

[0048] Input speech is encoded as follows. First, the input speech signal is processed through high-pass filter 11 with a cut-off frequency of 60 Hz to remove low-frequency noise. A buffer containing the most recent samples of the actual input speech signal is maintained in the encoder. One of the samples is identified as the last sample of the current frame. The buffer contains samples that extend beyond the current frame both in the past and into the future to enable the coding process. This designated last sample of the frame is the reference point for many of the encoder calculations.

[0049] Next, the speech signal is band-pass filtered into 5 frequency bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz for voicing analysis. An initial pitch estimation is made using the 0-500 Hz filter output signal. The measurement is centered on the filter output produced when its input is the last sample in the current frame. The initial pitch estimation from the first band-pass filter is used as the initial reference point for robust pitch detector 52 (FIG. 2B). For each of the remaining frequency bands, the band-pass voicing strength is determined using the pitch determined by the robust pitch detector 52 described below. The time envelopes of each of the band-pass filters are calculated by full-wave rectification followed by a smoothing filter. The analysis windows for each of the remaining frequency bands are centered on the last sample in the current frame as in the case of the first band.

[0050] Robust Pitch Detection

[0051] Most low bit-rate speech coders use the normalized pitch correlation to estimate pitch lag. In the MELP coder, the pitch correlation is also used to make band-pass voicing decisions. The normalized pitch correlation r(T) is computed with the signal in the fixed-position analysis window in the prior art as follows:

$$ r(T) = \frac{c_T(0,T)}{\sqrt{c_T(0,0)\,c_T(T,T)}}, \qquad c_T(m,n) = \sum_{k=-\frac{T}{2}-\frac{N}{2}}^{-\frac{T}{2}+\frac{N}{2}-1} s_{k+m}\,s_{k+n}, \qquad \text{Eq. (1)} $$

[0052] where $s_k$ is the kth sample in the fixed-position window, $s_0$ is the signal at the center of the fixed-position window, T is a pitch lag, and N is the number of samples accumulated for the correlation computation.

[0053] The binary voicing decision forces the MELP to use either periodic pulse or noise excitation for each frequency band even in frames containing an irregular or ill-defined pitch. As a result, noise excitation for bands inappropriately designated as noise or pitch excitation inappropriately matched with an inaccurate pitch lag leads to distortion in transitions. To solve this problem, a sliding-sample window is used in the present invention. This method seeks the pitch analysis window position that provides the highest pitch correlation by sliding the window around the original position. This is equivalent to using a more periodically stable signal rather than using a portion of the signal with an irregular pitch for pitch analysis. By using a periodically stable portion of the signal for pitch analysis, the present invention avoids inappropriate voicing decisions and pitch estimates, thus reducing the artifactual noise in the non-periodically stable signal segments.

[0054] FIG. 4 shows a robust pitch detector used in the present invention. In FIG. 4, the normalized pitch correlation in the window 43 is first computed in the same manner as the fixed window pitch detection as shown in Equation (1), where $s_k$ is the kth signal and $s_0$ is the signal at the center of the original fixed-position window. The normalized pitch correlation in the window 43 is computed recursively as follows:

$$ r_i(T) = \frac{c_T(i,\,T+i)}{\sqrt{c_T(i,i)\,c_T(T+i,\,T+i)}}, $$

where

$$ c_T(i,j) = c_T(i-1,\,j-1) + s_{i-\frac{T}{2}+\frac{N}{2}-1}\,s_{j-\frac{T}{2}+\frac{N}{2}-1} - s_{i-1-\frac{T}{2}-\frac{N}{2}}\,s_{j-1-\frac{T}{2}-\frac{N}{2}} \qquad \text{Eq. (2)} $$

[0055] In each window, the maximum normalized pitch correlation $r_i(T_i)$ and the associated pitch lag $T_i$ are determined, and the final pitch lag is selected as the pitch lag associated with the maximum normalized pitch correlation r(T) in all windows as follows:

$$ r(T) = \max_{i=-N_s}^{N_s-1} \left[ \max_{T} \{ r_i(T) \} \right], \qquad \text{Eq. (3)} $$

[0056] where $N_s$ is the maximum window-sliding range from the original fixed-position window. In the present invention, an LPC parameter, a gain, band-pass voicing decision, and fractional pitch are computed using the signal in the window that maximizes the normalized pitch correlation. A direct implementation of Equation (2) solving for $r_i(T)$ for all values of i would result in a significant increase in the computational complexity. To reduce the additional complexity, the recursion in Equation (2) for $c_T(i, j)$ is used to compute the autocorrelation.
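The sliding-window search of Equations (1) to (3) can be illustrated with a minimal Python sketch. It is not the coder's implementation: the function names, the mapping of sample index 0 to a list position `center`, and the use of integer division for T/2 and N/2 are all assumptions made for illustration. The search evaluates each correlation directly for clarity, and `corr_step` shows the one-sample recursive update separately.

```python
import math

def corr(s, center, T, N, m, n):
    """c_T(m, n) of Eq. (1); sample index 0 is mapped to s[center] (assumption)."""
    lo = -(T // 2) - N // 2          # integer halves are an assumption
    hi = -(T // 2) + N // 2 - 1
    return sum(s[center + k + m] * s[center + k + n] for k in range(lo, hi + 1))

def corr_step(prev, s, center, T, N, i, j):
    """Recursive update of Eq. (2): c_T(i, j) obtained from c_T(i-1, j-1)."""
    a = -(T // 2) + N // 2 - 1       # offset of the sample entering the window
    b = -(T // 2) - N // 2           # offset of the first sample of the window
    return (prev + s[center + i + a] * s[center + j + a]
                 - s[center + i - 1 + b] * s[center + j - 1 + b])

def norm_corr(s, center, T, N, i=0):
    """r_i(T): normalized correlation for the window shifted by i samples."""
    num = corr(s, center, T, N, i, T + i)
    den = math.sqrt(corr(s, center, T, N, i, i) * corr(s, center, T, N, T + i, T + i))
    return num / den if den > 0 else 0.0

def robust_pitch(s, center, lags, N, Ns):
    """Eq. (3): best (correlation, lag, shift) over shifts i in [-Ns, Ns - 1]."""
    best = (-1.0, None, None)
    for i in range(-Ns, Ns):
        for T in lags:
            r = norm_corr(s, center, T, N, i)
            if r > best[0]:
                best = (r, T, i)
    return best
```

For a pure sinusoid with a 40-sample period, the search returns a lag of 40 with a correlation very close to 1, which is the behavior expected of a periodically stable segment.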

[0057] The aperiodic flag is set to 1 if Vbp1, determined in the voicing analysis for the 0 to 500 Hz band-pass, is less than 0.5 and set to 0 otherwise. When set, the flag informs the decoder that the voiced component of the excitation should be aperiodic.

[0058] A 10th order linear prediction analysis is performed on the input speech signal using a 200 sample (25 ms) Hamming window centered on the last sample in the current frame. A traditional autocorrelation analysis procedure is implemented using Levinson-Durbin recursion. In addition, a bandwidth expansion constant of 0.994 (15 Hz) is applied to the prediction coefficients by multiplying each coefficient by the bandwidth expansion constant.

[0059] Next, a linear prediction residual signal is calculated by filtering the input speech signal with the prediction filter using the coefficients determined above and an inverse of the prediction filter using those same coefficients. The two resulting signals are summed to create the linear prediction residual signal.

[0060] Plosive Analysis

[0061] The plosive analysis/synthesis system of the current invention consists of three parts: plosive detection, plosive modeling, and plosive synthesis. FIG. 5 shows the plosive analysis/synthesis system.

[0062] Plosive Detection

[0063] With reference to FIG. 5, the plosive detector 56 uses a sliding window for “peakiness” computation to detect the frame that contains a plosive signal. The peakiness value is sensitive to the phase of the plosive signal. By using a sliding window to detect a window position that maximizes the peakiness value, the phase sensitivity of the plosive is reduced. The peakiness, P, is defined as a ratio of the L2 norm to the L1 norm of the signal:

$$ P = \frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} r_n^2}}{\frac{1}{N}\sum_{n=0}^{N-1} |r_n|}, \qquad \text{Eq. (4)} $$

[0064] where $r_n$ is an LPC residual signal and N is a frame size. As shown in FIG. 6, the plosive detector slides the peakiness analysis window 63 to find the maximum peakiness value in all windows. The peakiness of each window is given by:

$$ P_i = \frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1} r_{n+i}^2}}{\frac{1}{N}\sum_{n=0}^{N-1} |r_{n+i}|} = \frac{\sqrt{\frac{1}{N} B_i}}{\frac{1}{N} A_i}, \qquad \text{Eq. (5)} $$

[0065] where $P_i$ is the peakiness of the ith window from the past, and $r_0$ is the first LPC residual signal in the original fixed-position window. In FIG. 6, the peakiness in the window 63 ($P_{-N_s}$) is first computed. The peakiness in the window 63 is computed recursively as follows:

$$ A_i = A_{i-1} + |r_{N-1+i}| - |r_{i-1}| $$

$$ B_i = B_{i-1} + r_{N-1+i}^2 - r_{i-1}^2, \qquad \text{Eq. (6)} $$

[0066] Then, the maximum peakiness value in all windows is used as the peakiness value P of the frame:

$$ P = \max_{i=-N_s}^{N_s-1} \left[ P_i \right], \qquad \text{Eq. (7)} $$

[0067] where $N_s$ is the maximum window-sliding range, which is also used for the pitch detector of the present invention. The peakiness value with the sliding window is illustrated in FIG. 3A along with that of the fixed position window and a corresponding speech input waveform. In addition to the peakiness value, the low pass energy is computed and used to distinguish the rapid onset of a vowel from the plosive signal.
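The sliding-window peakiness search can be sketched as follows. This is an illustrative Python sketch only: the function names and the convention that `origin` is the list index of the first residual sample of the original fixed-position window are assumptions. `peakiness` evaluates the L2-to-L1 ratio directly, and `max_peakiness` slides the window one sample at a time while keeping running sums of absolute values and squares, in the manner of the recursion described above.

```python
import math

def peakiness(r, start, N):
    """L2-to-L1 norm ratio of the N residual samples beginning at r[start]."""
    seg = r[start:start + N]
    l1 = sum(abs(x) for x in seg) / N
    l2 = math.sqrt(sum(x * x for x in seg) / N)
    return l2 / l1 if l1 > 0 else 0.0

def max_peakiness(r, origin, N, Ns):
    """Maximum peakiness over window shifts i in [-Ns, Ns - 1], using running sums."""
    start = origin - Ns
    A = sum(abs(x) for x in r[start:start + N])   # running L1 sum
    B = sum(x * x for x in r[start:start + N])    # running sum of squares
    best = math.sqrt(B / N) / (A / N) if A > 0 else 0.0
    for i in range(-Ns + 1, Ns):
        new, old = r[origin + N - 1 + i], r[origin + i - 1]
        A += abs(new) - abs(old)
        B += new * new - old * old
        if A > 0:
            best = max(best, math.sqrt(max(B, 0.0) / N) / (A / N))
    return best
```

A constant signal has a peakiness of 1, while a single impulse in an otherwise silent N-sample window reaches the upper bound of sqrt(N), which is why the measure responds strongly to the burst of a plosive.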

[0068] Plosive Modeling

[0069] In the present invention, a simple model is applied to the plosive signal expression in plosive modeling 57 of FIG. 5 so as to minimize the additional transmission bits. FIG. 12 shows the plosive signals detectable in the English language. Analysis of the frequency spectrums associated with the identified plosive sounds in FIG. 12 reveals that the 28 separate plosive sounds could be closely represented by the frequency spectrums of 18 replacement plosive sounds by aligning the maximum amplitude positions of each plosive signal. Near transparent replacement requires at least a rough spectral fit for each frequency. FIG. 13 illustrates the replacement matrix for the plosive sounds in the current invention.

[0070] In this model, all plosive signals p(n) are produced by scaling and LPC synthesis filtering the single pre-stored template LPC residual signal v(n) as follows:

$$ p(n) = g_p\,v(n) + \sum_{i=1}^{P} a_i\,p(n-i), \qquad \text{Eq. (8)} $$

[0071] where $g_p$ is the scaling factor based on the energy of the input plosive signal, and $a_i$ are the LPC coefficients computed from the input plosive signal. The template plosive signal v(n) was chosen arbitrarily and filtered with the 14th order inverse linear prediction filter. Since only a rough spectral fit between the input and the synthesized plosive signals provides a near transparent sound, an accurate LPC analysis is not required for the input plosive signal. In order to minimize the additional bits required for the plosive model, the same 10th order LPC model used for voiced pitch modeling is used for the production of the plosive signal.
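As an illustration of the plosive model, the following Python sketch scales a template residual and passes it through an all-pole LPC synthesis recursion in which the i-th coefficient multiplies the output delayed by i samples. The function name, the zero initial filter state, and the sign convention (coefficients added, as in the model equation) are assumptions of this sketch rather than details confirmed by the text.

```python
def synthesize_plosive(template, gain, lpc):
    """Scale the template residual and apply an all-pole LPC synthesis recursion.

    template: pre-stored template residual v(n)
    gain:     scaling factor g_p
    lpc:      coefficients a_1..a_P (zero initial filter state is assumed)
    """
    out = []
    for n, v in enumerate(template):
        acc = gain * v
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += a * out[n - i]   # feedback from earlier output samples
        out.append(acc)
    return out
```

With a unit impulse as the template, a gain of 2, and a single coefficient of 0.5, the output decays geometrically as 2, 1, 0.5, 0.25, which is the impulse response of the one-pole filter scaled by the gain.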

[0072] The parameters for transmission are a plosive flag, a plosive location, and plosive gain. The gain is computed by comparing the energy of the LPC residual of the plosive signal with that of the template signal. For the specific embodiment of the present invention, the gain is quantized with two bits. The position of the plosive signal is identified by seeking the maximum amplitude position in the frame and representing the plosive signal position with one bit in either the first half or the second half of the current frame. Thus, for the specific embodiment of the present invention, the plosive signal is quantized with only four bits including one bit for a plosive flag, two bits for a plosive gain and one bit for plosive position as is shown in FIG. 14. In the present invention, plosive synthesis is performed in the MELP decoder and will be disclosed in the description of the decoder.

[0073] Next, the input speech signal gain is measured twice per frame using a pitch adaptive window length. This adaptive length is identical for both gain measurements and is determined as follows. When Vbp1 > 0.6, the length is the shortest multiple of P2 which is longer than 120 samples. If this length exceeds 320 samples, it is divided by 2. When Vbp1 is less than or equal to 0.6, the window length is 120 samples. The gain calculation for the first window produces G1 and is centered 90 samples before the last sample of the current frame. The calculation for the second window produces G2 and is centered on the last sample of the current frame. The gain is the RMS value, measured in dB, of the signal in the window $s_n$:

$$ G_i = 10 \log_{10}\left( 0.01 + \frac{1}{L} \sum_{n=1}^{L} s_n^2 \right), \qquad \text{Eq. (9)} $$

[0074] where, L is the window length. The 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is −32768 to 32767.
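The gain measurement can be sketched in Python as shown below. The function names are hypothetical, and the sketch assumes an integer pitch value for the window-length rule; it simply follows the stated steps of choosing the window length, computing the offset log energy, and clamping negative results to zero.

```python
import math

def gain_window_length(vbp1, pitch):
    """Pitch-adaptive window length in samples (integer pitch assumed)."""
    if vbp1 > 0.6:
        length = pitch * (120 // pitch + 1)   # shortest multiple longer than 120
        if length > 320:
            length = length / 2
        return length
    return 120

def gain_db(window):
    """Gain in dB of the samples in the window, clamped at 0 dB."""
    L = len(window)
    g = 10.0 * math.log10(0.01 + sum(x * x for x in window) / L)
    return max(g, 0.0)
```

For a silent window the log argument is 0.01, giving -20 dB before clamping, so the reported gain is 0.0; a window of constant amplitude 10 yields approximately 20 dB.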

[0075] Next, the encoder performs a quantization of the LPC coefficients. First, the LPC coefficients are converted into line spectrum frequencies (LSFs). All adjacent pairs of the LSF components are organized such that each is in ascending frequency order with a minimum of 50 Hz separation. The resulting LSF vector f is quantized using a multi-stage vector quantizer. The resulting vector is used in the Fourier magnitude calculation in the decoder.

[0076] The final pitch value, P3, is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all zero codeword represents the unvoiced state and is sent if Vbp1 is less than or equal to 0.6. All 28 codewords with Hamming weight of 1 or 2 are reserved for error protection.
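The codeword budget in this paragraph can be checked with a short Python sketch. Of the 128 possible 7-bit codewords, one is the all-zero code and 28 have a Hamming weight of 1 or 2, which leaves exactly 99 codewords for the 99 pitch levels. The quantizer index function below is an illustrative assumption (nearest level on a uniform log scale); the actual index-to-codeword assignment comes from the lookup table and is not reproduced here.

```python
import math

def pitch_index(pitch, levels=99, lo=20.0, hi=160.0):
    """Nearest level of a uniform quantizer on a log10 scale (illustrative)."""
    p = min(max(pitch, lo), hi)
    step = (math.log10(hi) - math.log10(lo)) / (levels - 1)
    return round((math.log10(p) - math.log10(lo)) / step)

# 7-bit codewords that remain after reserving the all-zero code and
# every codeword of Hamming weight 1 or 2.
reserved = [c for c in range(128) if 1 <= bin(c).count("1") <= 2]
pitch_codes = [c for c in range(128) if bin(c).count("1") >= 3]
```

The two list lengths come out as 28 and 99, matching the figures stated above.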

[0077] The two gain values are quantized as follows. G2 is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB. G1 is quantized to 3 bits using the following adaptive algorithm. If G2 for the current frame is within 5 dB of G2 for the previous frame, and G1 is within 3 dB of the average of G2 values for the current and previous frames, then the frame is steady-state and a code of all zeros is sent to indicate that the decoder should set G1 to the mean of G2 values for the current and previous frames. Otherwise, the frame represents a transition and G1 is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G2 values for the current and previous frames to 6 dB above the maximum of those G2 values.
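The adaptive quantization of the two gain values can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, "within" is interpreted as a strict inequality, code 0 is reserved for the steady-state case, and the 7 transition levels are assigned codes 1 to 7 in ascending order.

```python
def quantize_uniform(x, lo, hi, levels):
    """Index of the nearest level of a uniform quantizer spanning [lo, hi]."""
    x = min(max(x, lo), hi)
    return round((x - lo) * (levels - 1) / (hi - lo))

def quantize_g2(g2):
    """5-bit uniform quantizer from 10 to 77 dB (32 levels)."""
    return quantize_uniform(g2, 10.0, 77.0, 32)

def quantize_g1(g1, g2, g2_prev):
    """3-bit code for G1: 0 signals steady state, 1..7 index the transition levels."""
    mean = 0.5 * (g2 + g2_prev)
    if abs(g2 - g2_prev) < 5.0 and abs(g1 - mean) < 3.0:
        return 0                          # decoder uses the mean of the G2 values
    lo = min(g2, g2_prev) - 6.0
    hi = max(g2, g2_prev) + 6.0
    return 1 + quantize_uniform(g1, lo, hi, 7)
```

For example, with G2 values of 41 dB and 40 dB and a G1 of 40 dB, the frame is treated as steady-state and code 0 is sent; with G2 values of 50 dB and 30 dB the frame is a transition and G1 is quantized over the range 24 dB to 56 dB.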

[0078] Band-pass voicing quantization occurs as follows. When Vbp1 is less than or equal to 0.6 (unvoiced state), the remaining strengths Vbpi, i=2, 3, 4, 5 are set to 0. When Vbp1 is greater than 0.6, the remaining voicing strengths are quantized to 1.

[0079] Fourier Magnitude calculation and quantization occurs as follows. The Fourier magnitudes of the first 10 pitch harmonics are measured from the prediction residual signal generated by the quantized prediction coefficients. The measurement uses a 512 point Fast Fourier Transform (FFT) of a 200 sample window centered at the end of the frame. First, a set of quantized predictor coefficients is calculated from the quantized LSF vector. Then, the residual window is generated using the quantized prediction coefficients. Next, a 200 sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed. Finally, the complex FFT output is transformed into magnitudes and the harmonics are found with a spectral peak-selecting algorithm.

[0080] The peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered around the initial estimate for each pitch harmonic, where P is the quantized pitch. This width is truncated to an integer. The initial estimate for the location of the ith harmonic is 512 i/P. The number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have a RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0.
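The peak-selecting step can be illustrated with the Python sketch below, which operates on a list of FFT magnitudes that has already been computed. The function name and the exact placement of the search window around each estimate (starting half a width below the rounded estimate) are assumptions; the text specifies only the window width, the initial estimates, the harmonic count limit, and the normalization.

```python
import math

def pick_harmonics(mag, pitch, nfft=512, max_h=10):
    """Pick harmonic peaks from FFT magnitudes and normalize them to unit RMS."""
    width = int(nfft / pitch)                 # search width, truncated to an integer
    count = min(max_h, int(pitch / 4))        # number of harmonics searched for
    peaks = []
    for i in range(1, count + 1):
        centre = round(nfft * i / pitch)      # initial estimate of the ith harmonic
        lo = max(centre - width // 2, 0)      # window placement is an assumption
        hi = min(lo + width, len(mag))
        peaks.append(max(mag[lo:hi]))
    rms = math.sqrt(sum(p * p for p in peaks) / len(peaks))
    if rms > 0:
        peaks = [p / rms for p in peaks]
    return peaks + [1.0] * (max_h - len(peaks))   # pad missing harmonics with 1.0
```

For a pitch of 20 samples only 5 harmonics are searched, so the last five returned magnitudes are the default value of 1.0.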

[0081] The 10 magnitudes are quantized with an 8-bit quantizer. The codebook is searched for a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over higher frequencies. The weights are given by:

$$ w_i = \left[ \frac{117}{25 + 75\left(1 + 1.4\left(\frac{f_i}{1000}\right)^2\right)^{0.69}} \right]^2, \qquad i = 1, 2, \ldots, 10 \qquad \text{Eq. (10)} $$

[0082] where $f_i = 8000i/60$ is the frequency in Hz corresponding to the ith harmonic for a default pitch period of 60 samples. The weights are applied to the squared difference between the input Fourier magnitudes and the codebook values.
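The perceptual weights and the weighted distance can be evaluated with a short Python sketch. The function names are hypothetical; the constants are those of the weighting formula, with harmonic frequencies taken for the default pitch period of 60 samples at 8000 samples per second.

```python
def fourier_weights(n=10, default_pitch=60, fs=8000):
    """Fixed perceptual weights for the first n harmonic magnitudes."""
    weights = []
    for i in range(1, n + 1):
        f = fs * i / default_pitch                      # harmonic frequency in Hz
        denom = 25 + 75 * (1 + 1.4 * (f / 1000) ** 2) ** 0.69
        weights.append((117 / denom) ** 2)
    return weights

def weighted_distance(mags, codeword, weights):
    """Weighted squared Euclidean distance used for the codebook search."""
    return sum(w * (m - c) ** 2 for w, m, c in zip(weights, mags, codeword))
```

The weights decrease monotonically with harmonic index, from roughly 1.33 for the first harmonic, which is how the search emphasizes low frequencies over higher ones.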

[0083] Lastly, the MELP encoder adds error protection and structures the 54-bit frame as follows. FIG. 16A shows the bit allocation for the MELP coder. To improve performance in channel errors, the unused coder parameters for the unvoiced mode are replaced with forward error correction. Three Hamming (7,4) codes and one Hamming (8,4) code may be used. The (7,4) code corrects single bit errors, while the (8,4) code detects double bit errors. The (8,4) code is applied to the 4 most significant bits (MSBs) of the first multi-stage vector quantization index, and the 4 parity bits are written over the band-pass voicing. The remaining three bits of the first multi-stage vector quantization index, along with the reserved bit, are covered by a (7,4) code with the resulting 3 parity bits written to the MSBs of the Fourier series vector quantization index. The 4 MSBs of the G2 codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes. Finally, the least significant bit (LSB) of the second gain index and the 3 bit G1 codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit. The parity generator matrix for the Hamming (7,4) code is:

$$ G_{7,4} = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{bmatrix}. \qquad \text{Eq. (11)} $$

[0084] The parity generator matrix for the Hamming (8,4) code is:

$$ G_{8,4} = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}. \qquad \text{Eq. (12)} $$
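A minimal Python sketch of parity generation is given below. It assumes that the twelve and sixteen matrix entries are arranged row by row as a 3-by-4 and a 4-by-4 matrix, that each row produces one parity bit as the modulo-2 sum of the selected data bits, and that the data bits are ordered most significant first; these conventions and the function names are assumptions for illustration.

```python
from itertools import product

G74 = [[1, 1, 0, 1],
       [1, 0, 1, 1],
       [0, 1, 1, 1]]
G84 = G74 + [[1, 1, 1, 0]]

def parity(gen, data):
    """Parity bits: each generator row gives the modulo-2 sum of selected data bits."""
    return [sum(g * d for g, d in zip(row, data)) % 2 for row in gen]

def min_distance(gen):
    """Smallest weight of any nonzero codeword (data bits followed by parity bits)."""
    weights = []
    for data in product([0, 1], repeat=4):
        if any(data):
            weights.append(sum(data) + sum(parity(gen, list(data))))
    return min(weights)
```

Under these assumptions the (7,4) code has a minimum distance of 3, enough to correct single bit errors, and the (8,4) code has a minimum distance of 4, which additionally allows double bit errors to be detected.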

[0085] FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame. FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes.

[0086] MELP Decoder

[0087] The received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords. Parameter decoding differs for the voiced and unvoiced frames. Pitch is decoded first as it contains the voiced/unvoiced mode information. If the pitch code is all zeros or has only 1 bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used.
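The mode decision made from the pitch codeword can be sketched in Python as follows. The function name and the string labels are hypothetical; the rule itself follows the Hamming weight of the 7-bit pitch code as described above.

```python
def frame_mode(pitch_code):
    """Classify a 7-bit pitch codeword by its Hamming weight."""
    weight = bin(pitch_code & 0x7F).count("1")
    if weight <= 1:
        return "unvoiced"   # all zeros, or a single bit set
    if weight == 2:
        return "erasure"    # two bits set indicates a frame erasure
    return "voiced"         # otherwise the pitch value is decoded
```

Because the all-zero unvoiced code and every valid voiced code differ in at least three bit positions, a single bit error in an unvoiced pitch code is still decoded as unvoiced, and a double bit error is flagged as an erasure.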

[0088] In the unvoiced mode, the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors.

[0089] If an erasure is indicated in the current frame, by the Hamming code, by the pitch code, or directly signaled from the communication channel 18, then a frame repeat mechanism is implemented. All of the parameters for the current frame are replaced with the parameters from the previous frame. In addition, the first gain term is set equal to the second gain term so that no gain transitions are permitted.

[0090] If an erasure is not indicated, the remaining parameters are decoded. The LSFs are checked for ascending order and a minimum separation of 50 Hz. In the unvoiced mode, default parameter values are used for the pitch, jitter, band-pass voicing, and Fourier magnitudes. The pitch value is set to 50 samples, the jitter is set to 25%, the band-pass voicing strengths are set to 0, and the Fourier magnitudes are set to 1.0. In the voiced mode, Vbp1 is set to 1; jitter is set to 25% if the aperiodic flag is set; otherwise, jitter is set to 0%. The band-pass voicing strength for each of the upper four bands is set to 1.0 if the corresponding bit is a 1; otherwise, the voicing strength is set to 0.

[0091] When the special all-zero code for the first gain parameter, G1, is received, some errors in the second gain parameter, G2, can be detected and corrected. This correction process provides improved performance in channel errors.

[0092] For quiet input signals, a small amount of gain attenuation is applied to both gain parameters using a power subtraction rule. This attenuation is a simplified, frequency invariant case of a smooth spectral subtraction noise suppression method. The background noise estimate is also used in the adaptive spectral enhancement calculation.

[0093] Gain, G1, is then modified by subtracting a positive correction term, Gatt, given in dB by:

\[ G_{att} = -10\log_{10}\left(1 - 10^{0.1\,(G_n + 3 - G_1)}\right). \qquad \text{Eq. (13)} \]
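A minimal Python sketch of the attenuation term follows, reading Eq. (13) as G_att = -10 log10(1 - 10^(0.1 (G_n + 3 - G_1))) with all quantities in dB. The function names are illustrative. The formula is only defined when G1 exceeds G_n + 3 dB; how the coder handles quieter inputs is not stated in this passage, so the sketch simply raises an error in that case.

```python
import math


def gain_attenuation_db(g1_db, gn_db):
    """Positive correction term G_att in dB, per the reading of Eq. (13) above."""
    arg = 1.0 - 10.0 ** (0.1 * (gn_db + 3.0 - g1_db))
    if arg <= 0.0:
        # Assumption: behavior for G1 <= Gn + 3 is not described in the text.
        raise ValueError("G1 must exceed Gn + 3 dB for Eq. (13) to be defined")
    return -10.0 * math.log10(arg)


def attenuated_gain_db(g1_db, gn_db):
    """G1 after subtracting the correction term."""
    return g1_db - gain_attenuation_db(g1_db, gn_db)
```

For example, with G1 = 40 dB and G_n = 27 dB the argument of the logarithm is 0.9, so the attenuation is a little under half a decibel.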

[0094] All MELP speech synthesis parameters are interpolated pitch synchronously for each synthesized pitch period. The interpolated parameters are the gain in dB, LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and spectral-tilt coefficient for the adaptive spectral-enhancement filter. Gain is linearly interpolated between the gain of the prior frame, G2p, and the first gain of the current frame, G1, if the starting point, t0, t0=0, 1, . . . , 179, of the new pitch period is less than 90; otherwise, gain is interpolated between G1 and G2. Normally, the other parameters are linearly interpolated between the past and current frame values. The interpolation factor, int, for these parameters is based on the starting point of the new pitch period:

int=t0/180   Eq. (14)

[0095] There are two exceptions to the interpolation procedure. First, if there is an onset with a high pitch frequency, pitch interpolation is disabled and the new pitch is immediately used. This condition is met when G1 is more than 6 dB greater than G2 and the current frame's pitch period is less than half the prior frame's pitch period. The second exception also involves a gain onset. If G2 differs from G2p by more than 6 dB, then the LSFs, spectral tilt, and pitch are interpolated using the interpolated gain trajectory as a basis, since the gain is transmitted twice per frame and has a more accurate interpolation path. In this case, the interpolation factor is given by:

\[ \mathrm{int} = \frac{G_{int} - G_{2p}}{G_2 - G_{2p}}, \qquad \text{Eq. (15)} \]

[0096] where Gint is the interpolated gain. This interpolation factor is then clamped between 0 and 1.
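The interpolation rules above can be sketched in Python as follows. The function names are illustrative. The text says only that gain is linearly interpolated within each half frame; the exact fractions used here (t0/90 in the first half and (t0 - 90)/90 in the second) are an assumption.

```python
FRAME_LEN = 180  # samples per frame, so t0 ranges over 0..179


def clamp01(x):
    """Clamp a value to the range [0, 1]."""
    return min(1.0, max(0.0, x))


def interpolate_gain(t0, g2p, g1, g2):
    """Gain in dB for a pitch period starting at sample t0.

    Assumption: the position inside each half frame is used as the
    linear interpolation weight.
    """
    half = FRAME_LEN // 2
    if t0 < half:
        return g2p + (t0 / half) * (g1 - g2p)
    return g1 + ((t0 - half) / half) * (g2 - g1)


def interpolation_factor(t0, g_int, g2p, g2):
    """Eq. (14) normally; Eq. (15), clamped, when G2 and G2p differ by more than 6 dB."""
    if abs(g2 - g2p) > 6.0:
        return clamp01((g_int - g2p) / (g2 - g2p))
    return t0 / FRAME_LEN
```

In use, `interpolate_gain` would be evaluated first and its result passed as `g_int` to `interpolation_factor`, so that the other parameters follow the gain trajectory during an onset.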

[0097] New Mixed Excitation Algorithm

[0098] Although the mixed excitation method in the existing MELP coder minimizes the band-pass filtering operations, it still requires two 32nd order FIR filtering operations for a pulse train and noise. The present invention removes these filters to reduce the computational complexity of the existing MELP. FIG. 9 shows a new mixed-excitation algorithm in the present invention. The existing MELP uses the Fourier magnitudes to generate a pulse train. The pulse train is mixed with random noise in the time domain by band-pass filtering. In the present invention, noise is mixed with a pulse train in the frequency domain by adding a random phase to the Fourier magnitudes. Block 64 shows the random phase generator. The random phase is added only to the Fourier magnitudes in unvoiced frequency bands. The mixed excitation signal in the present method is given by:

\[ e_m(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} E_M(e^{j\omega})\, e^{j\omega n}\, d\omega, \qquad E_M(e^{j\omega}) = E_0(e^{j\omega}) \]

[0099] if $\omega = 0$, $\omega = \pi$, or $\omega$ lies in a voiced band;

[0100] otherwise,

\[ E_M(e^{j\omega}) = E_0(e^{j\omega})\, e^{j\phi}, \qquad \phi = U[-\alpha\pi, \alpha\pi], \qquad \text{Eq. (16)} \]

[0101] where $\alpha$ is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates a pulse pitch-synchronously, the band-pass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced).
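The frequency-domain mixing can be illustrated with a small Python sketch. It takes the expression in Eq. (16) to be E_0 multiplied by exp(j*phi), with phi drawn uniformly from [-alpha*pi, alpha*pi] for unvoiced harmonics and zero phase kept for voiced ones. The continuous inverse transform is replaced by a discrete inverse DFT over one pitch period, and the DC term is set to zero; both are simplifications for this example, and the function and argument names are hypothetical.

```python
import cmath
import math
import random


def mixed_excitation_period(magnitudes, voiced, alpha, period, rng=None):
    """Synthesize one pitch period of excitation from harmonic magnitudes.

    magnitudes[k-1] and voiced[k-1] describe harmonic k (k = 1..K), and
    K must be smaller than period / 2. Voiced harmonics keep zero phase;
    unvoiced harmonics receive a random phase scaled by alpha.
    """
    if rng is None:
        rng = random.Random()
    num = len(magnitudes)
    if len(voiced) != num or 2 * num >= period:
        raise ValueError("need one voicing flag per harmonic and K < period/2")

    spectrum = [0j] * period  # bin 0 (DC) stays zero in this sketch
    for k in range(1, num + 1):
        if voiced[k - 1]:
            phase = 0.0
        else:
            phase = rng.uniform(-alpha * math.pi, alpha * math.pi)
        value = cmath.rect(magnitudes[k - 1], phase)
        spectrum[k] = value
        spectrum[period - k] = value.conjugate()  # mirror bin keeps output real

    # Direct inverse DFT; fine for the short lengths of a pitch period.
    out = []
    for n in range(period):
        acc = sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / period)
                  for k in range(period))
        out.append(acc.real / period)
    return out
```

With every harmonic marked voiced the result is a zero-phase pulse centered on sample 0. Marking harmonics unvoiced spreads their energy over the period without changing the total energy, which is the noise-like behavior the paragraph describes.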

[0102] The adaptive spectral enhancement filter is then applied to the mixed excitation signal. This filter is a 10th order pole/zero filter with additional first order tilt compensation. The coefficients are generated by bandwidth expansion of the LPC filter transfer function A(z), corresponding to the interpolated LSFs. The transfer function of the enhancement filter, $H_{ase}(z)$, is given by:

\[ H_{ase}(z) = \frac{A(\alpha z^{-1})}{A(\beta z^{-1})} \times \left(1 + \mu z^{-1}\right), \qquad \text{Eq. (17)} \]

[0103] where,

\[ \alpha = 0.5p, \qquad \beta = 0.8p, \qquad \text{Eq. (18)} \]

[0104] and the tilt coefficient, $\mu$, is first calculated as $\max(0.5k_1, 0)$, then interpolated and multiplied by $p$, the signal probability. The first reflection coefficient, $k_1$, is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, $k_1$ is usually negative for voiced spectra. The signal probability $p$ is estimated by comparing the current interpolated gain, $G_{int}$, to the background noise estimate, $G_n$, using the formula:

\[ p = \frac{G_{int} - G_n - 12}{18}. \qquad \text{Eq. (19)} \]

[0105] This signal probability is clamped between 0 and 1.
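The coefficient computation for the enhancement filter might look like the following Python sketch. It reads Eq. (18) as alpha = 0.5p and beta = 0.8p, reads the tilt expression as max(0.5 k1, 0), and treats bandwidth expansion of A(z) by a factor gamma as multiplying the k-th coefficient by gamma to the power k. The function names are illustrative, and the frame-to-frame interpolation of the tilt coefficient is omitted.

```python
def clamp01(x):
    """Clamp a value to the range [0, 1]."""
    return min(1.0, max(0.0, x))


def signal_probability(g_int_db, g_n_db):
    """Eq. (19), clamped between 0 and 1."""
    return clamp01((g_int_db - g_n_db - 12.0) / 18.0)


def bandwidth_expand(lpc, gamma):
    """Coefficients of A(gamma * z^-1): the k-th coefficient times gamma**k.

    lpc[0] is the leading coefficient of A(z), normally 1.0.
    """
    return [a * gamma ** k for k, a in enumerate(lpc)]


def enhancement_filter(lpc, k1, g_int_db, g_n_db):
    """Return (numerator, denominator, mu) for the filter of Eq. (17)."""
    p = signal_probability(g_int_db, g_n_db)
    numerator = bandwidth_expand(lpc, 0.5 * p)    # A(alpha z^-1)
    denominator = bandwidth_expand(lpc, 0.8 * p)  # A(beta z^-1)
    mu = max(0.5 * k1, 0.0) * p                   # interpolation step omitted
    return numerator, denominator, mu
```

When the interpolated gain is within 12 dB of the noise estimate, p is 0, both polynomials collapse to 1, and the filter has no effect, so the enhancement fades out in background noise.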

[0106] Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs in a direct-form filter.

[0107] Since the excitation of the synthesized voice signal is generated at an arbitrary level, a speech gain adjustment must be performed on the synthesized speech. The correct scaling factor, $S_{gain}$, is computed for each synthesized pitch period of length $T$ by dividing the desired RMS value ($G_{int}$ must be converted from dB) by the RMS value of the unscaled synthetic speech signal $s_n$:

\[ S_{gain} = \frac{10^{G_{int}/20}}{\sqrt{\dfrac{1}{T}\sum_{n=1}^{T} s_n^{2}}}. \qquad \text{Eq. (20)} \]

[0108] To prevent discontinuities in the synthesized speech, this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
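The gain adjustment can be sketched in Python as below, taking the scale factor to be the target RMS (10 raised to G_int/20) divided by the RMS of the unscaled pitch period. The function names and the exact shape of the ten-sample ramp are assumptions; the text specifies only that the scale factor is linearly interpolated over the first ten samples.

```python
import math


def scale_factor(samples, g_int_db):
    """Target RMS (converted from dB) divided by the RMS of the unscaled signal."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("cannot scale an all-zero pitch period")
    return 10.0 ** (g_int_db / 20.0) / rms


def apply_gain(samples, g_int_db, prev_scale=None, ramp=10):
    """Scale one pitch period; return (scaled samples, current scale factor).

    Assumption: the ramp reaches the current scale factor at sample
    index ramp - 1 and moves in equal steps from the previous one.
    """
    cur = scale_factor(samples, g_int_db)
    if prev_scale is None:
        prev_scale = cur  # first period: nothing to ramp from
    out = []
    for n, s in enumerate(samples):
        if n < ramp:
            weight = (n + 1) / ramp
            scale = prev_scale + weight * (cur - prev_scale)
        else:
            scale = cur
        out.append(scale * s)
    return out, cur
```

The returned scale factor would be passed as `prev_scale` when processing the next pitch period.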

[0109] The pulse dispersion filter is a 65th order FIR filter derived from a spectrally flattened triangular pulse. The coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction herein enclosed for reference.

[0110] Post-Processor for the Fourier Magnitude Model

[0111] In the present invention, a post-processor for the Fourier magnitude model 62 is added to the MELP decoder as shown in FIG. 2A. In the prior art, it was observed that the first few harmonic magnitudes of the coded speech for some low-pitch male speakers were suppressed by the preprocessing high-pass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C. It was found that this effect led to a high-pass filtered quality for low-pitch male speakers. To provide more natural speech quality for such speakers, the present invention adaptively emphasizes the harmonic magnitudes in low frequencies by removing the effect of the two filters. The emphasized harmonic magnitude is given by:

\[ \left|\tilde{S}(e^{j\omega_i})\right| = \left|S(e^{j\omega_i})\right| \frac{G}{H(e^{j\omega_i})}, \qquad \text{Eq. (21)} \]

[0112] where $\omega_i$ is the ith harmonic frequency, $G$ is the average Fourier spectrum energy, and $|S(e^{j\omega_i})|$ is the non-emphasized Fourier magnitude of the ith harmonic. As shown in FIG. 8, the present invention uses the MELP Fourier magnitude parameters, which are the Fourier magnitudes of the LPC residual signal 23, for the harmonic magnitude emphasis rather than using the harmonic magnitude of the synthesized speech $S(e^{j\omega})$. From Parseval's theorem, the average Fourier spectrum magnitude $G$ is given by:

\[ G = \sum_{n=0}^{N-1} \left| h(n)^{2} \right|, \qquad \text{Eq. (22)} \]

[0113] where $h(n)$ is the impulse response of the filter $H(e^{j\omega})$, and $N$ is the length of the impulse response. The magnitude response of the filter, $|H(e^{j\omega})|$, is given by:

\[ \left|H(e^{j\omega})\right| = \left|H_1(e^{j\omega})\right| \left|H_2(e^{j\omega})\right|, \qquad \text{Eq. (23)} \]

[0114] where $H_1(e^{j\omega})$ and $H_2(e^{j\omega})$ are the magnitude responses of the ASEF 30 and preprocessing high-pass filter 11 respectively. To avoid losing the advantage of the ASEF 30 in the prior art, the harmonic magnitude emphasis is applied to only the harmonics that are 200 Hz less than the first formant frequency of the frame. The first formant frequency $F_1$ is roughly estimated using quantized line spectrum frequencies (LSFs) as follows:

\[ F_1 = \frac{\hat{f}_1 + \hat{f}_2}{2}, \]

[0115] otherwise,

\[ F_1 = \frac{\hat{f}_2 + \hat{f}_3}{2}, \qquad \text{Eq. (23)} \]

[0116] where $\hat{f}_i$ is the ith quantized LSF. Based on experimental results, the emphasized harmonic magnitude $|\tilde{S}(e^{j\omega})|$ is further emphasized by 2 dB in the present invention.
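A hedged Python sketch of the harmonic emphasis follows. It rests on several assumptions: Eq. (21) is read as the original magnitude multiplied by G and divided by the combined filter magnitude at that harmonic; "200 Hz less than the first formant frequency" is read as harmonic frequencies below F1 minus 200 Hz; and the extra 2 dB is treated as an amplitude gain of 10 raised to 2/20. The first formant estimate is taken as an input, and all function and argument names are hypothetical.

```python
def average_energy(impulse_response):
    """Sum of squared impulse-response samples, as in Eq. (22)."""
    return sum(abs(h) ** 2 for h in impulse_response)


def emphasize_harmonics(mags, harmonic_hz, filter_mag, g, f1_hz, extra_db=2.0):
    """Undo the combined filter response on low harmonics.

    mags[i]        Fourier magnitude of harmonic i
    harmonic_hz[i] frequency of harmonic i in Hz
    filter_mag[i]  combined magnitude |H1||H2| evaluated at harmonic i
    g              average energy from average_energy()
    f1_hz          first formant estimate (computed elsewhere)
    """
    boost = 10.0 ** (extra_db / 20.0)  # assumed amplitude-dB convention
    out = []
    for m, f, h in zip(mags, harmonic_hz, filter_mag):
        if f < f1_hz - 200.0 and h > 0.0:
            out.append(m * g / h * boost)
        else:
            out.append(m)  # harmonics near or above F1 are left unchanged
    return out
```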

[0117] Plosive Synthesis

[0118] FIG. 7 shows the block diagram of the plosive synthesis 66. As shown in FIG. 7, all plosive signals are produced by scaling and LPC synthesis/filtering 32 the plosive residual template 71, which is pre-stored in the synthesizer. This plosive residual template 71 was chosen arbitrarily and filtered with the 14th order LPC inverse filter. The LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis. The gain of the synthesized plosive signal is adjusted by applying plosive gain 76 to the MELP gain 34. In the present invention, the length of the synthesized plosive signal is half the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position as shown in block 73. Before the plosive is added back to the coded speech, the gain of the coded speech is adjusted in gain suppressor 75 such that the gain of the half frame to which the plosive is added back is suppressed. This is realized by simply replacing the gain of the half frame to which the plosive is added back with that of the previous half frame:

[0119] $g_i(0) = g_{i-1}(1)$ if the plosive position is the first half of the frame; otherwise, $g_i(1) = g_i(0)$ if the plosive position is the second half of the frame, where $g_i(j)$ is the jth gain (j=0,1) in the ith frame. Since plosive detection, modeling and synthesis are performed independently from the MELP coder as shown in FIG. 5, this embodiment can be applied to other low bit-rate speech coders.
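The gain replacement rule can be illustrated with a short Python sketch. The data layout (a list holding a pair of half-frame gains for each frame) and the function name are assumptions made for this example.

```python
def suppress_gain(gains, frame, plosive_half):
    """Replace the gain of the half frame that will receive the plosive.

    gains is a list of [g(0), g(1)] pairs, one pair per frame. A modified
    copy is returned; the input is left untouched.
    """
    out = [list(pair) for pair in gains]
    if plosive_half == 0:
        if frame == 0:
            raise ValueError("first frame has no previous half frame")
        out[frame][0] = out[frame - 1][1]  # g_i(0) = g_{i-1}(1)
    elif plosive_half == 1:
        out[frame][1] = out[frame][0]      # g_i(1) = g_i(0)
    else:
        raise ValueError("plosive_half must be 0 or 1")
    return out
```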

[0120] Bit Allocation

[0121] Another advantage of the present invention is bit-stream compatibility with the existing MELP coder. The present invention consists of four embodiments including a robust pitch detector, a plosive analysis/synthesis system, a post-processor for the Fourier magnitude model and a new mixed-excitation algorithm. As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission. In the present invention, the additional bits for the plosive can be packed into the bit-stream of the existing MELP. There are two different modes for the bit allocation of the existing MELP: one voiced, the other unvoiced. The mode is selected as voiced if the first band is voiced and as unvoiced if the first band is unvoiced. For unvoiced mode, the existing MELP coder sets only the first and fifth bands to voiced and sets the index for the pitch lag to less than three so as to indicate that the frame is unvoiced. In the decoder, if the index for the pitch lag is less than three, the frame is regarded as unvoiced. Otherwise, the frame is regarded as voiced. In the present invention, a frame that contains a plosive is assumed to be an unvoiced frame. FIG. 10 shows the bit packing flow diagram for the plosive signal. To identify the plosive frame in the decoder of the present invention, the first and the fifth bands are set to voiced but the pitch is set to three as a dummy. Then, the plosive gain and position are packed into the bits for the Fourier magnitude, which are used for the voiced frame in the existing MELP. FIG. 11 shows the bit unpacking flow diagram for the plosive signal. The decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination in which only the first and the fifth bands are voiced will never occur in the existing MELP. In the decoder of the present invention, the frame is regarded as the plosive frame if this combination occurs. Then, the plosive parameters such as the gain and position are extracted from the bits for the Fourier magnitude. Since the bit-stream specification is maintained in the present invention, the present system can interchange the encoder/decoder with the existing MELP.
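The frame-type signaling can be sketched in Python as follows, assuming the plosive marker is the combination "only the first and fifth bands voiced" together with a pitch index of three, as set by the encoder. The names and the representation of band voicing as a list of five 0/1 flags are assumptions for illustration.

```python
# Only bands 1 and 5 voiced: the combination used to mark a plosive frame.
PLOSIVE_PATTERN = (1, 0, 0, 0, 1)


def mark_plosive():
    """Encoder side: dummy pitch index 3 plus the plosive voicing pattern."""
    return 3, list(PLOSIVE_PATTERN)


def classify_frame(pitch_index, band_voicing):
    """Decoder side: decide whether a frame is unvoiced, plosive, or voiced."""
    if len(band_voicing) != 5:
        raise ValueError("expected five band-pass voicing flags")
    if pitch_index < 3:
        return "unvoiced"
    if tuple(band_voicing) == PLOSIVE_PATTERN:
        return "plosive"  # plosive gain and position ride in the Fourier-magnitude bits
    return "voiced"
```

Because the pitch index check comes first, frames that the existing coder marks as unvoiced are classified exactly as before, and only the otherwise unused voicing combination is reinterpreted.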

[0122] While preferred embodiments of the invention have been disclosed in detail in the foregoing description and drawings, it will be understood by those skilled in the art that variations and modifications thereof can be made without departing from the spirit and scope of the invention as set forth in the following claims.

Claims

1. A method of enhancing the speech quality of a speech coder comprising the steps of:

digitally sampling speech to create a speech waveform over a multiplicity of frames;
using a sliding-sample window to locate a frame position with the highest pitch correlation; and
formulating at least one synthesized voice parameter in response to the speech waveform within the located frame position.

2. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch over multiple frame positions defined by the sliding-sample window.

3. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for a fixed-length sliding-sample window.

4. The method of claim 1, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch for up to a predetermined number of frames.

5. The method of claim 1, wherein the step of formulating comprises estimating a frame pitch in response to the signal contained within the located frame position.

6. The method of claim 5, further comprising the step of:

estimating linear predictive coding (LPC) coefficients in response to the signal contained within the located frame position.

7. The method of claim 5, further comprising the step of:

estimating gain in response to the signal contained within the located frame position.

8. The method of claim 5, further comprising the step of:

estimating a voicing decision in response to the signal contained within the located frame position.

9. The method of claim 5, further comprising the step of:

estimating a fractional pitch in response to the signal contained within the located frame position.

10. A speech coder comprising:

means for sampling a speech waveform to generate a discrete representation of the speech waveform over a multiplicity of frames; and
means for locating a pitch-analysis window over that frame position with the highest pitch correlation.

11. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation compares pitch analysis results associated with multiple frames.

12. The coder of claim 10, wherein the means for locating a frame position with the highest pitch correlation performs a recurrence calculation on the autocorrelation of the pitch for multiple frame positions defined by the sliding-sample window.

13. The coder of claim 10, wherein the means for locating a pitch-analysis window comprises a fixed-length window.

14. The coder of claim 13, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of pitch results from multiple frame positions defined by the fixed-length window.

15. The coder of claim 10, wherein the frame position with the highest pitch correlation is determined by performing a recurrence calculation on the autocorrelation of the pitch from up to a predetermined number of frames defined by the sliding-sample window.

16. The coder of claim 10, further comprising:

means for estimating a plurality of speech parameters in response to the signal contained within the located frame position.

17. The coder of claim 16, wherein the means for estimating comprises at least one digital signal processor in the mixed-excitation linear predictive (MELP) coder.

18. The coder of claim 16, wherein the means for estimating comprises at least one algorithm stored within the mixed-excitation linear predictive (MELP) coder.

Patent History
Publication number: 20020052734
Type: Application
Filed: Nov 16, 2001
Publication Date: May 2, 2002
Inventors: Takahiro Unno (Richardson, TX), Thomas P. Barnwell (Atlanta, GA), Kwan K. Truong (Lilburn, GA)
Application Number: 09991387
Classifications
Current U.S. Class: Pitch (704/207)
International Classification: G10L013/08; G10L011/04;