ACOUSTIC ECHO CANCELLATION SYSTEM AND ASSOCIATED METHOD

Info

Publication number: 20240048905
Type: Application
Filed: Aug 7, 2023
Publication Date: Feb 8, 2024
Applicant: MediaTek Singapore Pte. Ltd. (Singapore)
Inventors: BOZHONG LIU (Singapore), Xiaoxi Yu (Singapore), HANTAO HUANG (Singapore), Chia-Hsin Yang (Hsinchu City), Li-Wei Cheng (Hsinchu City)
Application Number: 18/230,672

Abstract

An acoustic echo cancellation (AEC) system includes an adaptive filter, a subtraction circuit, and a processor executing a model. The adaptive filter is arranged to generate an estimated echo signal according to a first microphone signal played by a loudspeaker. The subtraction circuit is arranged to subtract the estimated echo signal from a signal that is output from a microphone receiving both of a speech signal and an echo signal, to generate a second microphone signal, wherein the first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone. The model is arranged to perform short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, and generate an estimated speech signal through a neural network according to a first transformed microphone signal and a second transformed microphone signal.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/396,218, filed on Aug. 8, 2022. The content of the application is incorporated herein by reference.

BACKGROUND

The present invention is related to acoustic echo cancellation (AEC), and more particularly, to an AEC system that uses an adaptive filter to generate an estimated echo signal for canceling/reducing an echo signal in a data signal before the data signal is fed into a neural network.

Acoustic echo often occurs in audio/video calls if a far-end speaker's voice is played by a near-end loudspeaker and is picked up by a near-end microphone (e.g. a near-end microphone signal generated by the near-end microphone may include an echo signal). For a conventional AEC system, an adaptive filter and a neural network are utilized to suppress the echo signal, wherein the adaptive filter is a part of the neural network architecture. Some problems may occur, however. Since the neural network architecture includes the adaptive filter, the adaptive filter also needs to be considered during training, which may reduce the training efficiency or degrade the training effect. As a result, a novel AEC system that uses the adaptive filter to generate an estimated echo signal for canceling/reducing the echo signal in a data signal (e.g. the near-end microphone signal) before the data signal is fed into the neural network is urgently needed.

SUMMARY

It is therefore one of the objectives of the present invention to provide an AEC system that uses the adaptive filter to suppress the echo signal in the data signal before the data signal is fed into the neural network, to address the above-mentioned issues.

According to an embodiment of the present invention, an AEC system is provided. The AEC system comprises an adaptive filter, a subtraction circuit, and a processor. The adaptive filter is arranged to generate an estimated echo signal according to a first microphone signal played by a loudspeaker. The subtraction circuit is arranged to subtract the estimated echo signal from a signal that is output from a microphone receiving both a speech signal and an echo signal, to generate a second microphone signal. The first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone. The processor is arranged to execute a model. The model is arranged to perform short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal, and generate an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.

According to an embodiment of the present invention, an AEC method is provided. The AEC method comprises: generating an estimated echo signal according to a first microphone signal played by a loudspeaker; subtracting the estimated echo signal from a signal that is output from a microphone receiving both of a speech signal and an echo signal, to generate a second microphone signal, wherein the first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone; performing short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal; and generating an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.

One of the benefits of the present invention is that, in the AEC system and associated method of the present invention, before a data signal (e.g. a signal that is output from the microphone receiving an echo signal) is fed into an AEC model for training, an adaptive filter is utilized to generate an estimated echo signal to cancel/reduce most of the echo signal in the data signal. In this way, the training efficiency and the training effect of the AEC model can be improved.

These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating an electronic device according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating an acoustic echo cancellation (AEC) system according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating implementation details of the AEC model shown in FIG. 2 according to an embodiment of the present invention.

FIG. 4 is a flow chart of an AEC method according to an embodiment of the present invention.

DETAILED DESCRIPTION

Certain terms are used throughout the following description and claims, which refer to particular components. As one skilled in the art will appreciate, electronic equipment manufacturers may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not in function. In the following description and in the claims, the terms “include” and “comprise” are used in an open-ended fashion, and thus should be interpreted to mean “include, but not limited to . . . ”.

FIG. 1 is a diagram illustrating an electronic device 10 according to an embodiment of the present invention. By way of example, but not limitation, the electronic device 10 may be a portable device such as a smartphone or a tablet. The electronic device 10 may include a processor 12 and a storage device 14. The processor 12 may be a single-core processor or a multi-core processor. The storage device 14 is a non-transitory machine-readable medium, and is arranged to store computer program code PROG and a model MD. The processor 12 is equipped with software execution capability. The computer program code PROG may include multiple artificial intelligence (AI)-based algorithms. When loaded and executed by the processor 12, the computer program code PROG instructs the processor 12 to train the model MD according to the AI-based algorithms. The electronic device 10 may be regarded as a computer system using a computer program product that includes a computer-readable medium containing the computer program code PROG. Regarding a model in an acoustic echo cancellation (AEC) system as proposed by the present invention, it may be embodied on the electronic device 10. That is, the model MD may be an AEC model mentioned hereinafter.

FIG. 2 is a diagram illustrating an AEC system 20 according to an embodiment of the present invention. As shown in FIG. 2, the AEC system 20 may include a loudspeaker 200, a microphone 202, an adaptive filter 204, and an AEC model 206. The loudspeaker 200 may be arranged to receive and play a far-end microphone signal x(n), and the microphone 202 may be arranged to receive a speech signal s(n), wherein the far-end microphone signal x(n) is not output from the microphone 202, an echo signal d(n) is transmitted from the loudspeaker 200 to the microphone 202, and an external noise signal v(n) may also be received by the microphone 202. The AEC model 206 may be arranged to receive the far-end microphone signal x(n) and a near-end microphone signal y(n) derived from a signal MS that is output from the microphone 202 (which receives the speech signal s(n), the echo signal d(n), and the external noise signal v(n)), and generate an estimated speech signal s′(n) through a neural network according to the far-end microphone signal x(n) and the near-end microphone signal y(n).

It should be noted that, before the near-end microphone signal y(n) is transmitted to the AEC model 206, the adaptive filter 204 may be arranged to generate an estimated echo signal d′(n) according to the far-end microphone signal x(n) for canceling the echo signal d(n) in the signal MS output from the microphone 202. For example, the adaptive filter 204 may be arranged to multiply a room response r(n) and an impulse response h(n) (which is a time-varying impulse response between the loudspeaker 200 and the microphone 202), to generate the estimated echo signal d′(n) (i.e. d′(n)=r(n)*h(n), where d′(n)≈d(n)). In this embodiment, the AEC system 20 may further include a subtraction circuit (which may be implemented by an adder that is configured to perform a subtraction operation) 208, wherein the subtraction circuit 208 may be coupled to the microphone 202, the adaptive filter 204, and the AEC model 206, and may be arranged to subtract the estimated echo signal d′(n) from the signal MS to generate the near-end microphone signal y(n) (i.e. y(n)=MS−d′(n)=s(n)+d(n)+v(n)−d′(n)≈s(n)+v(n)). In this way, before the near-end microphone signal y(n) is transmitted to the AEC model 206 for training, most of the echo signal d(n) in the near-end microphone signal y(n) has been canceled/reduced by the adaptive filter 204.
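By way of illustration only, the pre-cancellation stage described above may be sketched as follows. The description does not specify how the adaptive filter 204 is adapted, so the sketch assumes a normalized least-mean-squares (NLMS) update, which is a common choice and not necessarily the one used in the embodiment. The function name `nlms_echo_canceller`, the tap count, the step size, and the synthetic echo path `h` are all hypothetical.

```python
import numpy as np

def nlms_echo_canceller(x, ms, num_taps=64, mu=0.5, eps=1e-8):
    """Return (y, d_hat): the echo-reduced signal y(n) and the estimated echo d'(n).

    x  : far-end signal x(n) played by the loudspeaker
    ms : microphone output MS = s(n) + d(n) + v(n)
    NLMS is an assumed adaptation rule; the description does not name one.
    """
    w = np.zeros(num_taps)        # current estimate of the echo path
    x_buf = np.zeros(num_taps)    # most recent far-end samples, newest first
    y = np.zeros(len(ms))
    d_hat = np.zeros(len(ms))
    for n in range(len(ms)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x[n]
        d_hat[n] = w @ x_buf      # estimated echo d'(n)
        y[n] = ms[n] - d_hat[n]   # subtraction circuit: y(n) = MS - d'(n)
        # NLMS weight update driven by the residual y(n)
        w += mu * y[n] * x_buf / (x_buf @ x_buf + eps)
    return y, d_hat

# Toy usage with synthetic signals (all values are illustrative).
rng = np.random.default_rng(0)
n_samples = 8000
taps = np.arange(32)
x = rng.standard_normal(n_samples)            # far-end signal x(n)
h = np.exp(-0.2 * taps) * np.cos(taps)        # hypothetical loudspeaker-to-microphone path
d = np.convolve(x, h)[:n_samples]             # echo signal d(n)
s = 0.1 * rng.standard_normal(n_samples)      # stand-in for the speech signal s(n)
v = 0.01 * rng.standard_normal(n_samples)     # external noise signal v(n)
ms = s + d + v                                # signal MS output from the microphone
y, d_hat = nlms_echo_canceller(x, ms)
```

In this toy setting the residual echo d(n)−d′(n) left in y(n) becomes small once the filter has converged, so the signal handed to the model is approximately s(n)+v(n).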

FIG. 3 is a diagram illustrating implementation details of the AEC model 206 shown in FIG. 2 according to an embodiment of the present invention. As shown in FIG. 3, the AEC model 206 may include multiple segment modules 300 and 302, multiple fast Fourier transform (FFT) modules 304 and 306, multiple instant layer normalization (iLN) modules 308 and 310, a concat module 312, and a separation kernel 314. The AEC model 206 may be arranged to perform short-time Fourier transform (STFT) upon the far-end microphone signal x(n) and the near-end microphone signal y(n), respectively, to generate a first transformed microphone signal X_T and a second transformed microphone signal Y_T. For example, the segment module 300 may be arranged to split the far-end microphone signal x(n) to generate a first segmented microphone signal X_S, and the FFT module 304 may be arranged to perform FFT upon the first segmented microphone signal X_S to generate the first transformed microphone signal X_T. The segment module 302 may be arranged to split the near-end microphone signal y(n) to generate a second segmented microphone signal Y_S, and the FFT module 306 may be arranged to perform FFT upon the second segmented microphone signal Y_S to generate the second transformed microphone signal Y_T.
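By way of illustration only, the segment-then-FFT structure described above may be sketched as follows. The frame length, the hop size, and the Hann window are assumptions, since the description does not give them, and the function names `segment` and `stft` are hypothetical.

```python
import numpy as np

def segment(signal, frame_len=512, hop=128):
    """Split a 1-D signal into overlapping frames (role of a segment module)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                               # shape: (n_frames, frame_len)

def stft(signal, frame_len=512, hop=128):
    """Segment the signal, then apply an FFT to each frame (role of an FFT module)."""
    frames = segment(signal, frame_len, hop)
    window = np.hanning(frame_len)                   # assumed analysis window
    return np.fft.rfft(frames * window, axis=-1)     # shape: (n_frames, frame_len // 2 + 1)

# Toy usage: the same transform would be applied to x(n) and y(n) separately.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # stand-in for the far-end microphone signal x(n)
X_T = stft(x)                    # first transformed microphone signal
```

With these assumed settings each frame yields 257 complex frequency bins, and the second transformed microphone signal Y_T would be obtained by calling `stft` on y(n) in the same way.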

Afterwards, the AEC model 206 may be arranged to generate the estimated speech signal s′(n) through the separation kernel 314 according to the first transformed microphone signal X_T and the second transformed microphone signal Y_T. In this embodiment, the separation kernel 314 may include multiple long short term memory (LSTM) layers (e.g. 3 LSTM layers 316-320) and a fully-connected layer 322 (labeled as “FC” in FIG. 3) with a sigmoid activation 324. The iLN module 308 may be arranged to normalize the first transformed microphone signal X_T to generate a first normalized microphone signal X_N. The iLN module 310 may be arranged to normalize the second transformed microphone signal Y_T to generate a second normalized microphone signal Y_N. The concat module 312 may be arranged to concatenate the first normalized microphone signal X_N and the second normalized microphone signal Y_N to generate a concatenated result CR. The separation kernel 314 may be arranged to predict and generate two masks (e.g. a real part mask RM and an imaginary part mask IM) according to the concatenated result CR through the LSTM layers 316-320 and the fully-connected layer 322 with the sigmoid activation 324, wherein the real part mask RM corresponds to magnitude information of the far-end microphone signal x(n) and the near-end microphone signal y(n), the imaginary part mask IM corresponds to phase information of the far-end microphone signal x(n) and the near-end microphone signal y(n), and the estimated speech signal s′(n) is generated according to the real part mask RM and the imaginary part mask IM.
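By way of illustration only, the data flow from the iLN modules through the concat module to the separation kernel may be sketched as a forward pass with untrained, randomly initialized weights. Several details are assumptions that the description does not state: the iLN is taken to normalize each frame over its frequency bins with the learned gain and bias omitted, the magnitude of each transformed signal is used as the input feature, and the hidden size and helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def instant_layer_norm(feat, eps=1e-7):
    """Normalize each frame over its feature axis (learned gain/bias omitted)."""
    mean = feat.mean(axis=-1, keepdims=True)
    std = feat.std(axis=-1, keepdims=True)
    return (feat - mean) / (std + eps)

def lstm_layer(inputs, w_x, w_h, b):
    """Run one LSTM layer over a (n_frames, n_in) sequence."""
    hidden = w_h.shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in inputs:
        i, f, g, o = np.split(x_t @ w_x + h @ w_h + b, 4)   # four gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def init_params(n_in, hidden, n_bins, rng, n_layers=3):
    """Random (untrained) weights for 3 LSTM layers and one fully-connected layer."""
    lstm, size = [], n_in
    for _ in range(n_layers):
        lstm.append((0.1 * rng.standard_normal((size, 4 * hidden)),
                     0.1 * rng.standard_normal((hidden, 4 * hidden)),
                     np.zeros(4 * hidden)))
        size = hidden
    fc = (0.1 * rng.standard_normal((hidden, 2 * n_bins)), np.zeros(2 * n_bins))
    return {"lstm": lstm, "fc": fc}

def separation_kernel(cr, params):
    """LSTM layers, then a fully-connected layer with sigmoid, giving two masks."""
    out = cr
    for w_x, w_h, b in params["lstm"]:
        out = lstm_layer(out, w_x, w_h, b)
    w_fc, b_fc = params["fc"]
    masks = sigmoid(out @ w_fc + b_fc)        # shape: (n_frames, 2 * n_bins)
    rm, im = np.split(masks, 2, axis=-1)      # real part mask, imaginary part mask
    return rm, im

# Toy usage with random stand-ins for the transformed signals X_T and Y_T.
rng = np.random.default_rng(0)
n_frames, n_bins = 10, 257
X_T = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
Y_T = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
X_N = instant_layer_norm(np.abs(X_T))         # first normalized microphone signal
Y_N = instant_layer_norm(np.abs(Y_T))         # second normalized microphone signal
CR = np.concatenate([X_N, Y_N], axis=-1)      # concatenated result
params = init_params(CR.shape[-1], hidden=32, n_bins=n_bins, rng=rng)
RM, IM = separation_kernel(CR, params)
```

Because the final activation is a sigmoid, every entry of both masks lies between 0 and 1, and each mask has one value per frame and frequency bin.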

Specifically, for the training of the AEC model 206, a noisy speech signal Y(k, l) is a sum of a clean speech signal S(k, l) and a noise signal N(k, l) (which may correspond to the external noise signal v(n) and the remnant echo signal that is generated by subtracting the estimated echo signal d′(n) from the echo signal d(n)), that is, Y(k, l)=S(k, l)+N(k, l), wherein k is a frame index, and l is a frequency bin index. After the AEC model 206 is trained according to the AI-based algorithms, a spectral magnitude mask (SMM) may be predicted and generated through the LSTM layers 316-320 and the fully-connected layer 322 with the sigmoid activation 324, wherein the SMM is equal to a ratio of a spectral magnitude of the clean speech signal S(k, l) and a spectral magnitude of the noisy speech signal Y(k, l) (i.e.

$SMM = \frac{|S(k, l)|}{|Y(k, l)|}$),

the real part mask RM is a real part of the SMM, and the imaginary part mask IM is an imaginary part of the SMM. In this way, a real part of the estimated speech signal s′(n) can be obtained by multiplying the real part mask RM by a real part of the near-end microphone signal y(n), and an imaginary part of the estimated speech signal s′(n) can be obtained by multiplying the imaginary part mask IM by an imaginary part of the near-end microphone signal y(n).
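By way of illustration only, the magnitude ratio and the element-wise mask application described above may be sketched as follows. The sketch assumes the masks are applied to the second transformed microphone signal Y_T in the STFT domain, with the masks taken as given; the function names and the small constant `eps` are hypothetical, and the return to the time domain is not shown.

```python
import numpy as np

def spectral_magnitude_mask(S, Y, eps=1e-8):
    """Ratio |S(k, l)| / |Y(k, l)|; eps only guards against division by zero."""
    return np.abs(S) / (np.abs(Y) + eps)

def apply_masks(Y_T, rm, im):
    """Scale the real part of Y_T by RM and the imaginary part by IM."""
    return rm * Y_T.real + 1j * (im * Y_T.imag)

# Toy usage with hand-picked values.
Y_T = np.array([[3 + 4j, 1 - 2j]])      # one frame, two frequency bins
RM = np.array([[0.5, 1.0]])             # real part mask
IM = np.array([[0.25, 0.5]])            # imaginary part mask
S_hat = apply_masks(Y_T, RM, IM)        # [[1.5+1.j, 1.-1.j]]
smm = spectral_magnitude_mask(np.array(3 + 4j), np.array(6 + 8j))   # approximately 0.5
```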

FIG. 4 is a flow chart of an AEC method according to an embodiment of the present invention. Provided that the result is substantially the same, the steps are not required to be executed in the exact order shown in FIG. 4. For example, the AEC method shown in FIG. 4 may be employed by the AEC system 20 shown in FIG. 2.

In Step 400, the far-end microphone signal x(n) is received and played by the loudspeaker 200.

In Step 402, the speech signal s(n) is received by the microphone 202, wherein the far-end microphone signal x(n) is not output from the microphone 202, the echo signal d(n) is transmitted from the loudspeaker 200 to the microphone 202, and the external noise signal v(n) may also be received by the microphone 202.

In Step 404, the estimated echo signal d′(n) is generated by the adaptive filter 204 according to the far-end microphone signal x(n).

In Step 406, by the subtraction circuit 208, the estimated echo signal d′(n) is subtracted from the signal MS that is output from the microphone 202 receiving the speech signal s(n), the echo signal d(n), and the external noise signal v(n), to generate the near-end microphone signal y(n).

In Step 408, by the AEC model 206, the short-time Fourier transform is performed upon the far-end microphone signal x(n) and the near-end microphone signal y(n), respectively, to generate the first transformed microphone signal X_T and the second transformed microphone signal Y_T.

In Step 410, the estimated speech signal s′(n) is generated through the neural network according to the first transformed microphone signal X_T and the second transformed microphone signal Y_T.

Since a person skilled in the pertinent art can readily understand details of the steps after reading the above paragraphs directed to the AEC system 20 shown in FIG. 2, further description is omitted here for brevity.

In summary, in the AEC system 20 and associated method of the present invention, before the signal MS that is output from the microphone 202 receiving the echo signal d(n) is fed into the AEC model 206 for training, the adaptive filter 204 is utilized to generate the estimated echo signal d′(n) to cancel/reduce most of the echo signal d(n) in the signal MS. In this way, the training efficiency and the training effect of the AEC model 206 can be improved.

Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.

Claims

1. An acoustic echo cancellation (AEC) system, comprising:

an adaptive filter, arranged to generate an estimated echo signal according to a first microphone signal played by a loudspeaker;

a subtraction circuit, arranged to subtract the estimated echo signal from a signal that is output from a microphone receiving both of a speech signal and an echo signal, to generate a second microphone signal, wherein the first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone; and

a processor, arranged to execute: a model, arranged to perform short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal, and generate an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.

2. The AEC system of claim 1, wherein the model comprises:

multiple segment modules, arranged to split the first microphone signal and the second microphone signal, respectively, to generate a first segmented microphone signal and a second segmented microphone signal;

multiple fast Fourier transform modules, arranged to perform fast Fourier transform upon the first segmented microphone signal and the second segmented microphone signal, respectively, to generate the first transformed microphone signal and the second transformed microphone signal;

multiple instant layer normalization (iLN) modules, arranged to normalize the first transformed microphone signal and the second transformed microphone signal, respectively, to generate a first normalized microphone signal and a second normalized microphone signal;

a concat module, arranged to concatenate the first normalized microphone signal and the second normalized microphone signal, to generate a concatenated result; and

a separation kernel, arranged to predict and generate two masks according to the concatenated result, wherein the estimated speech signal is generated according to the two masks.

3. The AEC system of claim 2, wherein a noisy speech signal is a sum of a clean speech signal and a noise signal; the two masks are a real part mask and an imaginary part mask, the real part mask and the imaginary part mask are a real part and an imaginary part of a ratio of a spectral magnitude of the clean speech signal and a spectral magnitude of the noisy speech signal, respectively; the real part mask corresponds to magnitude information of the first microphone signal and the second microphone signal, and the imaginary part mask corresponds to phase information of the first microphone signal and the second microphone signal.

4. The AEC system of claim 3, wherein a real part of the estimated speech signal is obtained by multiplying the real part mask by a real part of the second microphone signal, and an imaginary part of the estimated speech signal is obtained by multiplying the imaginary part mask by an imaginary part of the second microphone signal.

5. The AEC system of claim 2, wherein the separation kernel comprises multiple long short term memory (LSTM) layers and a fully-connected layer with sigmoid activation, and the two masks are predicted and generated by the multiple LSTM layers and the fully-connected layer with sigmoid activation.

6. The AEC system of claim 1, wherein the adaptive filter is arranged to multiply a room response and an impulse response between the loudspeaker and the microphone, to generate the estimated echo signal.

7. An acoustic echo cancellation (AEC) method, comprising:

generating an estimated echo signal according to a first microphone signal played by a loudspeaker;

subtracting the estimated echo signal from a signal that is output from a microphone receiving both of a speech signal and an echo signal, to generate a second microphone signal, wherein the first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone;

performing short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal; and

generating an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.

8. The AEC method of claim 7, wherein the step of performing the short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate the first transformed microphone signal and the second transformed microphone signal comprises:

splitting the first microphone signal and the second microphone signal, respectively, to generate a first segmented microphone signal and a second segmented microphone signal; and

performing fast Fourier transform upon the first segmented microphone signal and the second segmented microphone signal, respectively, to generate the first transformed microphone signal and the second transformed microphone signal.

9. The AEC method of claim 7, wherein the step of generating the estimated speech signal through the neural network according to the first transformed microphone signal and the second transformed microphone signal comprises:

normalizing the first transformed microphone signal and the second transformed microphone signal, respectively, to generate a first normalized microphone signal and a second normalized microphone signal;

concatenating the first normalized microphone signal and the second normalized microphone signal, to generate a concatenated result; and

predicting and generating two masks according to the concatenated result, wherein the estimated speech signal is generated according to the two masks.

10. The AEC method of claim 9, wherein a noisy speech signal is a sum of a clean speech signal and a noise signal; the two masks are a real part mask and an imaginary part mask, the real part mask and the imaginary part mask are a real part and an imaginary part of a ratio of a spectral magnitude of the clean speech signal and a spectral magnitude of the noisy speech signal, respectively; the real part mask corresponds to magnitude information of the first microphone signal and the second microphone signal, and the imaginary part mask corresponds to phase information of the first microphone signal and the second microphone signal.

11. The AEC method of claim 10, wherein the method further comprises:

obtaining a real part of the estimated speech signal by multiplying the real part mask by a real part of the second microphone signal; and

obtaining an imaginary part of the estimated speech signal by multiplying the imaginary part mask by an imaginary part of the second microphone signal.

12. The AEC method of claim 9, wherein the step of predicting and generating the two masks according to the concatenated result comprises:

predicting and generating the two masks by multiple long short term memory (LSTM) layers and a fully-connected layer with sigmoid activation.

13. The AEC method of claim 7, wherein the step of generating the estimated echo signal according to the first microphone signal comprises:

multiplying a room response and an impulse response between the loudspeaker and the microphone, to generate the estimated echo signal.