Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
A time-variant noise spatial covariance matrix is estimated effectively. Using time-frequency-divided observation signals based on observation signals acquired by collecting acoustic signals emitted from one or a plurality of sound sources and mask information expressing the occupancy probability of a component of each of the time-frequency-divided observation signals that corresponds to each noise source, a time-independent first noise spatial covariance matrix corresponding to the time-frequency-divided observation signals and the mask information belonging to a long time interval is acquired for each noise source. Further, using the mask information of each of a plurality of different short time intervals, a mixture weight corresponding to each noise source in each short time interval is acquired. Furthermore, a time-variant third noise spatial covariance matrix is acquired, the third noise spatial covariance matrix being based on a time-variant second noise spatial covariance matrix, which corresponds to the time-frequency-divided observation signals and the mask information belonging to each short time interval and relates to noise formed by adding together all of the noise sources, and a weighted sum of the first noise spatial covariance matrices with the mixture weights of the respective short time intervals.
This application is a U.S. 371 Application of International Patent Application No. PCT/JP2020/008216, filed on 28 Feb. 2020, which application claims priority to and the benefit of JP Application No. 2019-045649, filed on 13 Mar. 2019, the disclosures of which are hereby incorporated herein by reference in their entireties.
TECHNICAL FIELD
The present invention relates to a technique for generating a noise spatial covariance matrix.
BACKGROUND ART
A noise spatial covariance matrix is often used to analyze an acoustic signal. NPL 1, for example, discloses a technique for suppressing noise from an observation signal in the frequency domain using a noise spatial covariance matrix. In this method, a beamformer that minimizes the power of noise in the frequency domain is estimated from a noise spatial covariance matrix acquired from an observation signal in the frequency domain and a steering vector representing a sound source direction (or an estimate thereof), under the constraint condition that sound arriving at a microphone from the sound source is not distorted. Noise is then suppressed by applying the beamformer to the observation signal in the frequency domain.
CITATION LIST
Non Patent Literature
[NPL 1] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise", Proc. IEEE ICASSP-2016, pp. 5210-5214, 2016.
SUMMARY OF THE INVENTION
Technical Problem
In conventional methods such as that of NPL 1, the noise spatial covariance matrix is estimated using the entirety of an acoustic signal input over a long time interval as a subject. Then, when a beamformer is estimated in each time block, the noise spatial covariance matrix determined for the entire input signal is used. In other words, the beamformer is estimated for each time block on the basis of a common noise spatial covariance matrix.
In an actual environment, noise to be suppressed may include signals such as a voice signal, in which the sound level varies greatly from moment to moment, and in this case, the noise spatial covariance matrix may differ in each time block. It is therefore desirable to estimate a time-variant noise spatial covariance matrix for each time block. As a simple method, a noise spatial covariance matrix may be estimated for each time block using only the acoustic signal of each time block as a subject, but with this method, the time interval of the acoustic signal used for estimation shortens, leading to a reduction in the precision of the noise spatial covariance matrix.
In consideration of this problem, an object of the present invention is to provide a technique for effectively estimating a time-variant noise spatial covariance matrix.
Means for Solving the Problem
Hereafter, in the present invention, time-frequency signals acquired by dividing an acoustic signal into discrete time points (time frames) and discrete frequencies (frequency bands) are used. An observation signal expressed as a time-frequency signal will be referred to as a time-frequency-divided observation signal, for example.
In the present invention, using time-frequency-divided observation signals based on observation signals acquired by collecting acoustic signals emitted from one or a plurality of sound sources and mask information expressing the occupancy probability of a component of each of the time-frequency-divided observation signals that corresponds to each noise source, a time-independent first noise spatial covariance matrix corresponding to the time-frequency-divided observation signals and the mask information belonging to a long time interval is acquired for each noise source. Further, using the mask information of each of a plurality of different short time intervals, a mixture weight corresponding to each noise source in each short time interval is acquired. Furthermore, a time-variant third noise spatial covariance matrix is acquired, the third noise spatial covariance matrix being based on a time-variant second noise spatial covariance matrix, which corresponds to the time-frequency-divided observation signals and the mask information belonging to each short time interval and relates to noise formed by adding together all of the noise sources, and a weighted sum of the first noise spatial covariance matrices with the mixture weights of the respective short time intervals.
Effects of the Invention
The third noise spatial covariance matrix can respond to variation over the short time intervals on the basis of the respective second noise spatial covariance matrices and mixture weights of the short time intervals, and at the same time, the third noise spatial covariance matrix can be acquired with a high degree of precision on the basis of the first noise spatial covariance matrix of the long time interval. As a result, a time-variant noise spatial covariance matrix can be estimated effectively.
Embodiments of the present invention will be described below with reference to the figures.
Definitions of Reference Symbols
First, reference symbols used in the following embodiments will be defined.
I: I is a positive integer expressing the number of microphones. For example, I≥2.
i: i is a positive integer expressing a microphone number, where 1≤i≤I is satisfied.
A microphone having the microphone number i (in other words, an ith microphone) will be written as “microphone i”. Values and vectors corresponding to the microphone number i are expressed using reference symbols having the subscript suffix “i”.
S: S is a positive integer expressing the number of sound sources. For example, S≥2. The sound sources include a target sound source and noise sources other than the target sound source.
s: s is a positive integer expressing a sound source number, where 1≤s≤S is satisfied. A sound source having the sound source number s (in other words, an sth sound source) will be written as “sound source s”.
J: J is a positive integer expressing the number of noise sources. For example, S≥J≥1.
j, j′: j and j′ are positive integers expressing a noise source number, where 1≤j, j′≤J is satisfied. A noise source having the noise source number j (in other words, a jth noise source) will be written as “noise source j”. Further, the noise source number is expressed using an upper right suffix in round parentheses. Values and vectors based on the noise source having the noise source number j are expressed using reference symbols having the upper right suffix “(j)”. This applies likewise to j′. Furthermore, in this specification, a sound acquired by adding together sounds emitted from all of the noise sources is treated as noise.
L: L expresses a long time interval. The long time interval may be the entire time interval subject to processing or a partial time interval of the entire time interval subject to processing.
Bk: Bk expresses a single short time interval (a short time block). A plurality of different short time intervals are expressed by B1, . . . , BK, where K is an integer of 1 or more and k=1, . . . , K. For example, the short time intervals B1, . . . , BK are acquired by separating the long time interval L into K time intervals. Some or all of the short time intervals B1, . . . , BK may be included in an interval other than the long time interval L.
t, τ: t and τ are positive integers expressing a time frame number. Values and vectors corresponding to the time frame number t are expressed using symbols having the subscript suffix "t". This applies likewise to τ.
f: f is a positive integer expressing a frequency band number. Values and vectors corresponding to the frequency band number f are expressed using symbols having the subscript suffix “f”.
T: T expresses a non-conjugate transpose of a matrix or a vector. αT represents a matrix or a vector acquired by implementing non-conjugate transpose on α.
H: H expresses a conjugate transpose (a Hermitian transpose) of a matrix or a vector. αH represents a matrix or a vector acquired by implementing conjugate transpose on α.
α∈β: α∈β indicates that α belongs to β.
First Embodiment
Next, a first embodiment will be described with reference to the figures.
As shown in the figures, a noise spatial covariance matrix estimation device 10 according to the first embodiment includes a noise spatial covariance matrix calculation unit 11 (a first noise spatial covariance matrix calculation unit), a mixture weight calculation unit 12, and a noise spatial covariance matrix calculation unit 13 (a second noise spatial covariance matrix calculation unit).
<Noise Spatial Covariance Matrix Calculation Unit 11 (First Noise Spatial Covariance Matrix Calculation Unit)>
The noise spatial covariance matrix calculation unit 11 receives, as input, time-frequency-divided observation signals xt, f based on observation signals acquired by collecting acoustic signals emitted from one or a plurality of sound sources s∈{1, . . . , S}, and mask information λt, f(j) expressing the occupancy probability of the component of each time-frequency-divided observation signal xt, f corresponding to each noise source j. Using these inputs, the unit acquires and outputs, for each noise source j∈{1, . . . , J}, a time-independent noise spatial covariance matrix ψf(j) (a first noise spatial covariance matrix) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to the long time interval L (step S11). Note that the noise sources are assumed to include both sounds generated from a single location (point sound sources), such as voices, and sounds arriving from any peripheral direction (diffusive noise), such as background noise. Further, the upper right suffix "(j)" of "λt, f(j)" should actually be written directly above the lower right suffix "t, f" but due to notation limitations has been written to the upper right of "t, f". This applies likewise to other notation using the upper right suffix "(j)", such as "ψf(j)".
<<Illustration of Time-Frequency-Divided Observation Signals xt, f>>
Acoustic signals emitted from the sound sources s are collected by the I microphones i∈{1, . . . , I} (not shown). For example, one of the sound sources s∈{1, . . . , S} may be a noise source j∈{1, . . . , J}. The collected acoustic signals are converted into digital signals Xτ, 1, . . . , Xτ, I in the time domain, whereupon the time-domain digital signals Xτ, 1, . . . , Xτ, I are converted into the frequency domain in units of a predetermined time interval. An example of conversion into the frequency domain in time interval units is the short-time Fourier transform. For example, the signals acquired by conversion into the frequency domain in time interval units may be set as the time-frequency-divided observation signals xt, f, 1, . . . , xt, f, I, where xt, f=(xt, f, 1, . . . , xt, f, I)T. Alternatively, the result of performing arithmetic of some kind on the signals acquired by conversion into the frequency domain in time interval units may be set as xt, f, 1, . . . , xt, f, I, where xt, f=(xt, f, 1, . . . , xt, f, I)T. In other words, the time-frequency-divided observation signals corresponding to the frequency band f at the time frame t, acquired from the observation signals collected by the ith microphone, are xt, f, i (i∈{1, . . . , I}), where xt, f=(xt, f, 1, . . . , xt, f, I)T.

The time-frequency-divided observation signals xt, f (where t∈L) belonging at least to the long time interval L are input into the noise spatial covariance matrix calculation unit 11 according to this embodiment. The time-frequency-divided observation signals xt, f belonging to the long time interval L may be input alone, or the time-frequency-divided observation signals xt, f belonging to a time interval that is longer than and includes the long time interval L may be input. There are no limitations on the long time interval L. For example, the entire time interval during which sound is collected, a voice interval extracted therefrom, a predetermined time interval, or a specified time interval may be set as the long time interval L. An example of the long time interval L is a time interval of approximately 1 second to several tens of seconds. The time-frequency-divided observation signals xt, f may be either stored in a storage device not shown in the figures or transmitted over a network.
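As an illustration only, the following sketch shows one way in which the time-frequency-divided observation signals xt, f might be formed with a short-time Fourier transform. The function name, sampling rate, frame length, and the (T, F, I) array layout are illustrative assumptions rather than part of the embodiment.

import numpy as np
from scipy.signal import stft

def to_time_frequency(observations, fs=16000, nperseg=512):
    """observations: array of shape (I, num_samples), one row per microphone i.
    Returns an array of shape (T, F, I) so that x[t, f] = (x_{t,f,1}, ..., x_{t,f,I})^T."""
    channels = []
    for x_i in observations:                           # loop over the I microphones
        _, _, X_i = stft(x_i, fs=fs, nperseg=nperseg)  # X_i has shape (F, T)
        channels.append(X_i.T)                         # reorder to (T, F)
    return np.stack(channels, axis=-1)                 # stack the I channels as the last axis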
<<Illustration of Mask Information λt, f(j)>>
The mask information λt, f(j) expresses the occupancy probability of a component of each of the time-frequency-divided observation signal xt, f corresponding to each noise source j. In other words, the mask information λt, f(j) expresses the occupancy probabilities of the components of the respective time-frequency-divided observation signals xt, f, 1, . . . , xt, f, I in the frequency band f at the time frame t that correspond to each noise source j. In this embodiment, it is assumed that the mask information λt, f(j) corresponding to each frequency band f and each noise source j is estimated by an external device, not shown in the figures, for at least the time frames t∈L belonging to the long time interval L and the time frames t∈Bk belonging to the short time intervals Bk. There are no limitations on the method for estimating the mask information λt, f(j). Methods for estimating the mask information λt, f(j) are well-known, and various methods, for example an estimation method using a complex Gaussian mixture model (CGMM) (reference document 1, for example), an estimation method using a neural network (reference document 2, for example), an estimation method combining these methods (reference document 3, for example), and so on are available.
- Reference document 1: T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise”, Proc. IEEE ICASSP-2016, pp. 5210-5214, 2016.
- Reference document 2: J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming”, Proc. IEEE ICASSP-2016, pp. 196-200, 2016.
- Reference document 3: T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, "Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming", Proc. IEEE ICASSP-2017, pp. 286-290, 2017.
The mask information λt, f(j) may be estimated in advance and stored in a storage device, not shown in the figures, or estimated successively.
<<Illustration of Noise Spatial Covariance Matrix ψf(j)>>
The noise spatial covariance matrix calculation unit 11 according to this embodiment receives the time-frequency-divided observation signals xt, f and the mask information λt, f(j) as input, and estimates and outputs a time-independent noise spatial covariance matrix ψf(j) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to the long time interval L. For example, the noise spatial covariance matrix ψf(j) is the sum or the weighted sum of λt, f(j)×xt, f×xt, fH with respect to the frequency band f at the time frames t∈L belonging to the long time interval L. For example, the noise spatial covariance matrix calculation unit 11 calculates (estimates) and outputs the noise spatial covariance matrix ψf(j) as shown below in formula (1).
Here, νf(j) is a real number parameter (a hyperparameter), and in this embodiment, νf(j) is a constant. The significance of νf(j) will be described below.
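Because formula (1) itself is not reproduced above, the following is only a hedged sketch of step S11 based on the description given, namely the sum of λt, f(j)×xt, f×xt, fH over the time frames t∈L; any additional normalization in formula (1) involving νf(j) is not shown, and the array layout and function name are assumptions.

import numpy as np

def long_interval_noise_scm(x_long, masks_long):
    """x_long: (T_L, F, I) time-frequency-divided observation signals for t in L.
    masks_long: (T_L, F, J) mask information lambda_{t,f}^{(j)} for t in L.
    Returns psi of shape (J, F, I, I): one time-independent matrix per noise source and band."""
    # psi[j, f] = sum over t in L of lambda_{t,f}^{(j)} * x_{t,f} x_{t,f}^H
    return np.einsum('tfj,tfa,tfb->jfab', masks_long, x_long, np.conj(x_long))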
<Mixture Weight Calculation Unit 12>
The mixture weight calculation unit 12 receives the mask information λt, f(j) of each of the plurality of different short time intervals Bk (where k∈{1, . . . , K}) as input, and uses this to acquire and output a mixture weight μk, f(j) corresponding to each noise source j∈{1, . . . , J} in each short time interval Bk (step S12). An example of the mixture weight μk, f(j) is a ratio of a second sum to a first sum, as will now be described. The first sum is the sum of the mask information λt, f(j′) corresponding to the frequency band f at the time frame number t belonging to the respective short time intervals Bk with respect to all of the noise sources j′ ∈{1, . . . , J}. The second sum is the sum of the mask information λt, f(j) corresponding to the frequency band f at the time frame t belonging to the respective short time intervals Bk with respect to each noise source j. For example, the mixture weight calculation unit 12 acquires and outputs the mixture weights μk, f(j) as shown below in formula (2).
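As a sketch of step S12 under the same assumed array layout (masks of shape (T, F, J), short time intervals given as lists of frame indices), the mixture weights can be computed as the ratio of the second sum to the first sum described above.

import numpy as np

def mixture_weights(masks, blocks):
    """masks: (T, F, J) mask information; blocks: list of K index arrays, one per short interval B_k.
    Returns mu of shape (K, F, J), mu[k, f, j] = sum_{t in B_k} lambda^{(j)} / sum_{j'} sum_{t in B_k} lambda^{(j')}."""
    mu = []
    for idx in blocks:
        per_source = masks[idx].sum(axis=0)               # second sum: (F, J), one value per noise source j
        total = per_source.sum(axis=-1, keepdims=True)    # first sum: (F, 1), over all noise sources j'
        mu.append(per_source / np.maximum(total, 1e-12))  # small floor avoids division by zero
    return np.stack(mu, axis=0)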
<Noise Spatial Covariance Matrix Calculation Unit 13 (Second Noise Spatial Covariance Matrix Calculation Unit)>
The noise spatial covariance matrix calculation unit 13 acquires and outputs a noise spatial covariance matrix, described below, from the following four inputs (step S13). The four inputs are the time-frequency-divided observation signals xt, f, the mask information λt, f(j) of each noise source j∈{1, . . . , J}, the noise spatial covariance matrix ψf(j) of each noise source j, and the mixture weight μk, f(j) of each noise source j. The aforementioned noise spatial covariance matrix is a time-variant noise spatial covariance matrix R{circumflex over ( )}k, f (a third noise spatial covariance matrix) based on a time-variant noise spatial covariance matrix (a second noise spatial covariance matrix), which corresponds to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk (where k∈{1, . . . , K}) and relates to noise formed by adding together all of the noise sources j∈{1, . . . , J}, and a weighted sum of the noise spatial covariance matrices ψf(j) (the first noise spatial covariance matrices) with the mixture weights μk, f(j) of the respective short time intervals Bk. Note that the suffix "{circumflex over ( )}" to the upper right of "R" should actually be written directly above "R" but due to notation limitations has been written to the upper right of "R". For example, the time-variant noise spatial covariance matrix (the second noise spatial covariance matrix) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk and the frequency band f with respect to noise formed by adding together all of the noise sources is the sum or the weighted sum of λt, f(j)×xt, f×xt, fH over the time frames t belonging to each short time interval Bk and over all of the noise sources j. Further, the noise spatial covariance matrix R{circumflex over ( )}k, f (the third noise spatial covariance matrix) is based on a weighted sum of the time-variant noise spatial covariance matrix (the second noise spatial covariance matrix) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk and the frequency band f with respect to noise formed by adding together all of the noise sources, and the weighted sum of the noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) with respect to all of the noise sources j∈{1, . . . , J}. For example, the noise spatial covariance matrix calculation unit 13 calculates (estimates) and outputs the time-variant noise spatial covariance matrix R{circumflex over ( )}k, f as shown below in formula (3).
In this example, the noise spatial covariance matrix R{circumflex over ( )}k, f is the weighted sum of the second noise spatial covariance matrix of each short time interval Bk and the weighted sum of the noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j), where the parameter νf(j) is used to determine the weights of the noise spatial covariance matrix ψf(j) and the second noise spatial covariance matrix in the noise spatial covariance matrix R{circumflex over ( )}k, f.
Note that here, as an example, the noise spatial covariance matrix calculation unit 13 acquires the noise spatial covariance matrix R{circumflex over ( )}k, f using the time-frequency-divided observation signals xt, f, the mask information λt, f(j) of each noise source j∈{1, . . . , J}, the noise spatial covariance matrix ψf(j) of each noise source j, and the mixture weight μk, f(j) of each noise source j as input, but the present invention is not limited thereto. More specifically, the noise spatial covariance matrix calculation unit 13 may acquire the noise spatial covariance matrix R{circumflex over ( )}k, f using λt, f(j)×xt, f×xt, fH, which is acquired midway through the calculations of the noise spatial covariance matrix calculation unit 11, as input instead of the time-frequency-divided observation signals xt, f.
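Because formula (3) is likewise not reproduced above, the sketch below shows only one plausible combination consistent with the description: the second noise spatial covariance matrix of each short time interval plus the weighted sum of the matrices ψf(j) with the mixture weights μk, f(j), with νf(j) acting as the balance parameter. The exact weighting of formula (3) may differ, and all names and shapes are assumptions.

import numpy as np

def time_variant_noise_scm(x, masks, blocks, psi, mu, nu):
    """x: (T, F, I); masks: (T, F, J); blocks: list of K frame-index arrays (the intervals B_k);
    psi: (J, F, I, I) first noise SCMs; mu: (K, F, J) mixture weights; nu: (J, F) parameters.
    Returns R_hat of shape (K, F, I, I)."""
    K, F, J = mu.shape
    I = x.shape[-1]
    R_hat = np.zeros((K, F, I, I), dtype=complex)
    for k, idx in enumerate(blocks):
        # second noise SCM: sum of lambda * x x^H over t in B_k and over all noise sources j
        second = np.einsum('tfj,tfa,tfb->fab', masks[idx], x[idx], np.conj(x[idx]))
        # weighted sum of the first noise SCMs psi_f^{(j)} with the mixture weights mu_{k,f}^{(j)},
        # scaled by nu_f^{(j)} (assumed role of nu as the balance parameter)
        prior = np.einsum('jf,fj,jfab->fab', nu, mu[k], psi)
        R_hat[k] = second + prior
    return R_hat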
Features of this Embodiment
In this embodiment, the time-variant noise spatial covariance matrix R{circumflex over ( )}k, f (the third noise spatial covariance matrix) is generated on the basis of the time-variant noise spatial covariance matrix (the second noise spatial covariance matrix) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk (where k∈{1, . . . , K}) and each frequency band f with respect to noise formed by adding together all of the noise sources, and the weighted sum of the noise spatial covariance matrices ψf(j) (the first noise spatial covariance matrices) with the mixture weights μk, f(j) of the respective short time intervals Bk. Here, the noise spatial covariance matrix ψf(j) is calculated using all of the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to the long time interval L (step S11), and therefore a high degree of estimation precision can be secured for the noise spatial covariance matrix ψf(j). Meanwhile, the time-variant noise spatial covariance matrix R{circumflex over ( )}k, f, which is based on the time-variant noise spatial covariance matrix corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk with respect to noise formed by adding together all of the noise sources and on the weighted sum of the noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) of the respective short time intervals Bk, is acquired for each of the short time intervals B1, . . . , BK, and therefore the acquired noise spatial covariance matrix R{circumflex over ( )}k, f responds flexibly to temporal variation over the short time intervals Bk. According to this embodiment, therefore, a highly precise noise spatial covariance matrix that responds flexibly to temporal variation in the time-frequency-divided observation signals xt, f can be acquired.
Second Embodiment
Next, a second embodiment will be described. The second embodiment differs from the first embodiment in that the weights of the first noise spatial covariance matrix and the second noise spatial covariance matrix in the third noise spatial covariance matrix can be modified on the basis of an input parameter. The following description focuses on differences from the matter already described; with respect to the matter already described, identical reference numerals will be used and the description will be simplified.
As shown in the figures, a noise spatial covariance matrix estimation device 20 according to the second embodiment additionally receives the parameter νf(j) as input, whereby the weights of the noise spatial covariance matrix ψf(j) and the second noise spatial covariance matrix in the noise spatial covariance matrix R{circumflex over ( )}k, f can be adjusted. More specifically, as the value of the parameter νf(j) is increased, the weight of the noise spatial covariance matrix ψf(j) increases, leading to an improvement in estimation precision in exchange for a reduction in responsiveness to temporal variation in the time-frequency-divided observation signals xt, f. Conversely, as the value of the parameter νf(j) is reduced, the weight of the second noise spatial covariance matrix increases, leading to an improvement in responsiveness to temporal variation in the time-frequency-divided observation signals xt, f in exchange for reduced estimation stability. Otherwise, the second embodiment is as described in the first embodiment.
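As a purely numerical illustration of this trade-off, with toy 2x2 matrices standing in for the long-interval term ψf(j) and the short-interval (second) term, and under the same assumed additive combination as in the sketch above:

import numpy as np

psi = np.eye(2)                 # stands in for the long-interval matrix psi_f^{(j)}
second = np.array([[3.0, 0.0],  # stands in for the short-interval (second) matrix of one block
                   [0.0, 0.2]])
for nu in (0.1, 1.0, 10.0):
    r_hat = second + nu * psi   # larger nu pulls R_hat toward psi, smaller nu toward the block term
    print(nu, np.round(r_hat, 2))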
Third Embodiment
Next, a third embodiment will be described. The third embodiment is an example application of the first and second embodiments, in which the noise spatial covariance matrix R{circumflex over ( )}k, f generated as described in the first and second embodiments is used in noise suppression processing. The configuration and processing content of a noise suppression device 30 according to the third embodiment will be described below with reference to the figures.
As shown in the figures, the noise suppression device 30 according to the third embodiment includes the noise spatial covariance matrix estimation device 10 or 20, a beamformer estimation unit 32, and a suppression unit 33.
As described in the first or second embodiment, the noise spatial covariance matrix estimation device 10 or 20 generates and outputs the noise spatial covariance matrix R{circumflex over ( )}k, f using the time-frequency-divided observation signals xt, f and the mask information λt, f(j) (and, if necessary, also the parameter νf(j)) as input (step S10 (step S20)). The noise spatial covariance matrix R{circumflex over ( )}k, f is transmitted to the beamformer estimation unit 32.
The beamformer estimation unit 32 generates and outputs a beamformer (an instantaneous beamformer) Wk, f for each short time interval Bk using as input the noise spatial covariance matrix R{circumflex over ( )}k, f and a steering vector νf, 0 corresponding to the sound source to be subjected to estimation using the beamformer (step S32). Methods for generating the steering vector νf, 0 and the beamformer (the instantaneous beamformer) Wk, f are well-known, and are described in reference documents 4 and 5, and so on, for example.
Reference document 4: T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise", Proc. IEEE ICASSP-2016, pp. 5210-5214, 2016.
Reference document 5: J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming", Proc. IEEE ICASSP-2016, pp. 196-200, 2016.
The beamformer Wk, f is transmitted to the suppression unit 33.
The suppression unit 33, using the time-frequency-divided observation signals xt, f and the beamformer Wk, f as input, applies the beamformer Wk, f to the time-frequency-divided observation signals xt, f as shown below in formula (4) in order to acquire time-frequency-divided observation signals yt, f in which noise has been suppressed from the time-frequency-divided observation signals xt, f. The suppression unit 33 then outputs the time-frequency-divided observation signals yt, f.
yt, f=Wk, f×xt, f  (4)
The time-frequency-divided observation signals yt, f may be used in other processing in the frequency domain or may be converted into the time domain. When the time-frequency-divided observation signals yt, f acquired as described above are used in voice recognition processing, for example, a word error rate can be improved by approximately 20% in comparison with a case where signals acquired by estimating a beamformer using the non-time-variant noise spatial covariance matrix estimation method illustrated in NPL 1 and suppressing noise therein are used in voice recognition processing.
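As a concrete illustration of step S32 and formula (4), the sketch below builds an MVDR-style instantaneous beamformer from R{circumflex over ( )}k, f and a steering vector and applies it per time frame. The embodiment itself refers to reference documents 4 and 5 for the beamformer construction; the weight used here (the inverse of R{circumflex over ( )}k, f applied to νf, 0, normalized so that the response toward νf, 0 equals one) is one common choice and is an assumption, as are all names and array layouts.

import numpy as np

def mvdr_beamformer(R_hat_kf, steer_f):
    """R_hat_kf: (I, I) third noise SCM for one (k, f); steer_f: (I,) steering vector nu_{f,0}.
    Returns W_kf: (I,) row beamformer applied as y = W_kf @ x."""
    Rinv_v = np.linalg.solve(R_hat_kf, steer_f)
    w = Rinv_v / (steer_f.conj() @ Rinv_v)   # MVDR weight: R^{-1} v / (v^H R^{-1} v)
    return w.conj()                          # so that W_kf @ x equals w^H x

def apply_beamformer(x, W, block_of_t):
    """x: (T, F, I); W: (K, F, I) beamformers; block_of_t: (T,) index k of the interval containing frame t.
    Returns y: (T, F) noise-suppressed time-frequency signals, per formula (4)."""
    T, F, I = x.shape
    y = np.empty((T, F), dtype=complex)
    for t in range(T):
        y[t] = np.einsum('fi,fi->f', W[block_of_t[t]], x[t])
    return y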
Other Modified Examples
Note that the present invention is not limited to the embodiments described above. For example, in the above embodiments the long time interval L is not updated, but the time-variant noise spatial covariance matrix R{circumflex over ( )}k, f may be acquired for each short time interval in the manner described above while updating the long time interval L. For example, the noise spatial covariance matrix R{circumflex over ( )}k, f may be acquired in the manner described above by batch processing, or the noise spatial covariance matrix R{circumflex over ( )}k, f may be acquired in the manner described above by sequentially extracting data of a length corresponding to the long time interval L from time-frequency-divided observation signals xt, f and mask information λt, f(j) input into the noise spatial covariance matrix estimation device in real time.
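A small sketch of the block-online variant just described, assuming that the long time interval L is realized as a sliding window of the most recent frames that is advanced once per short time interval Bk; the interval lengths are arbitrary illustrative values.

def sliding_intervals(num_frames, long_len=400, block_len=20):
    """Yield (L_indices, B_k_indices): each short time interval B_k together with a long
    time interval L that ends at the same frame (truncated at the start of the recording)."""
    for end in range(block_len, num_frames + 1, block_len):
        block = list(range(end - block_len, end))                  # the current B_k
        long_interval = list(range(max(0, end - long_len), end))   # the sliding L
        yield long_interval, block

# Example: 60 frames, blocks of 20 frames, a long window of up to 50 frames
for L_idx, Bk_idx in sliding_intervals(60, long_len=50, block_len=20):
    print(len(L_idx), Bk_idx[0], Bk_idx[-1])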
Instead of formula (1), the noise spatial covariance matrix ψf(j) may be calculated as follows.
Here, β is a coefficient and may be either a constant or a variable.
Further, instead of formula (3), the noise spatial covariance matrix R{circumflex over ( )}k, f may be calculated as follows.
Here, θ is a coefficient and may be either a constant or a variable.
Further, in the third embodiment, the noise spatial covariance matrix R{circumflex over ( )}k, f is used in noise suppression processing, but the noise spatial covariance matrix R{circumflex over ( )}k, f may be used in another application such as sound source position (sound source direction) estimation.
The various processing described above does not have to be executed in time series in accordance with the description and may, depending on the processing capacity of the devices that execute the processing or as required, be executed in parallel or individually. The processing may also be modified as appropriate within a scope that does not depart from the spirit of the present invention.
The devices described above are configured by having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory) or a ROM (read-only memory) execute a predetermined program, for example. The computer may include one processor and one memory or pluralities of processors and memories. The program may be installed in the computer or recorded in advance in the ROM or the like. Further, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without using a program rather than electronic circuitry that realizes processing functions by reading a program, such as a CPU. Electronic circuitry constituting a single device may include a plurality of CPUs.
When the configurations described above are realized by a computer, the processing content of the functions to be provided in the devices is described by a program. By having the computer execute the program, the processing functions described above are realized on the computer. The program describing the processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
The program is distributed by selling, transferring, lending, or otherwise distributing a portable recording medium such as a DVD or a CD-ROM on which the program is recorded, for example. The program may also be distributed by storing the program in a storage device of a server computer and transferring the program from the server computer to another computer over a network.
For example, the computer that executes the program first temporarily stores the program, which has been recorded on a portable recording medium or transferred from a server computer, in a storage device provided therein. Then, when the processing is to be executed, the computer reads the program stored in the storage device and executes processing corresponding to the read program. Further, as another embodiment of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Furthermore, the computer may execute processing corresponding to the received program successively each time the program is transferred thereto from the server computer. Alternatively, the processing described above may be executed using a so-called ASP (Application Service Provider) service in which, instead of transferring the program from the server computer to the computer, the processing functions are realized only by issuing execution commands and acquiring results.
Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of the processing functions may be realized by hardware.
REFERENCE SIGNS LIST
- 10, 20 Noise spatial covariance matrix estimation device
Claims
1. A noise spatial covariance matrix estimation device comprising processing circuitry configured to:
- use time-frequency-divided observation signals xt, f and mask information λt, f(j) to acquire, for each noise source j, time-independent first noise spatial covariance matrices ψf(j) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) for all t∈L, wherein j is a positive integer expressing a noise source number, J is a positive integer expressing a number of the noise sources, j=1,..., J holds, t is a positive integer expressing a time frame number, f is a positive integer expressing a frequency band number, L is a long time interval, the time-frequency-divided observation signals xt, f are based on observation signals acquired using one or more microphones by collecting acoustic signals emitted from one or a plurality of sound sources, and the mask information λt, f(j) expresses an occupancy probability of a component corresponding to each noise source j in each of the time-frequency-divided observation signals xt, f;
- use the mask information λt, f(j) for t∈Bk of each of a plurality of different short time intervals B1,..., BK to acquire a mixture weight μk, f(j) corresponding to each noise source j in each short time interval Bk, wherein K is an integer greater than 1, k=1,..., K, each short time interval Bk is shorter than the long time interval L, and each short time interval Bk is a part of L; and
- acquire and output a time-variant third noise spatial covariance matrix R{circumflex over ( )}k, f for a noise of the acoustic signals based on a time-variant second noise spatial covariance matrix and a weighted sum of the first noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) for each short time interval Bk, wherein the second noise spatial covariance matrix corresponds to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) for the noise source j and t∈Bk of each short time interval Bk, and the noise is formed by all of the noise sources j=1,..., J.
2. The noise spatial covariance matrix estimation device according to claim 1, wherein
- the third noise spatial covariance matrix R{circumflex over ( )}k, f is a weighted sum of the second noise spatial covariance matrix and the weighted sum of the first noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) of each short time interval Bk, and
- respective weights of the first noise spatial covariance matrices ψf(j) and the second noise spatial covariance matrix in the third noise spatial covariance matrix R{circumflex over ( )}k, f is modifiable.
3. The noise spatial covariance matrix estimation device according to claim 1, wherein
- αT represents a non-conjugate transpose of α and αH represents a conjugate transpose of α,
- J noise sources exist, J being an integer of 1 or more,
- the observation signals are collected by I microphones, I being an integer of 2 or more,
- the time-frequency-divided observation signals that correspond to a frequency band f at a time frame t and correspond to the observation signals acquired by collecting sound in an ith microphone, are xt, f, i where xt, f=(xt, f, 1,..., xt, f, I)T,
- the mask information expressing the occupancy probability of the component that corresponds to a jth noise source in each of the time-frequency-divided observation signals xt, f, 1,..., xt, f, I in the frequency band f at the time frame t is λt, f(j),
- each of the first noise spatial covariance matrices corresponding to the jth noise source is ψf(j), ψf(j) being a sum or a weighted sum of λt, f(j)×xt, f×xt, fH with respect to the frequency band f at the time frames t belonging to the long time interval,
- with regard to the short time intervals B1,..., BK, K is an integer of 2 or more, and k=1,..., K,
- each of the mixture weights μk, f(j) corresponding to the frequency band f at each of the short time intervals Bk with respect to each of the noise sources j∈{1,..., J} is a ratio of the sum of the mask information λt, f(j) corresponding to the frequency band f at the time frame t belonging to the respective short time intervals Bk with respect to each noise source j to the sum of the mask information λt, f(j′) corresponding to the frequency band f at the time frame t belonging to the respective short time intervals Bk with respect to all of the noise sources j′∈{1,..., J},
- the second noise spatial covariance matrix that corresponds to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) belonging to each short time interval Bk and each frequency band f and relates to noise formed by adding together all of the noise sources is the sum or the weighted sum of λt, f(j)×xt, f×xt, fH at the time frames t and all of the noise sources j belonging to each short time interval Bk and each frequency f, and
- the third noise spatial covariance matrix is based on a weighted sum of the second noise spatial covariance matrix and a weighted sum of the first noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) for all of the noise sources j.
4. A noise spatial covariance matrix estimation method comprising:
- using time-frequency-divided observation signals Xt, f and mask information λt, f(j) to acquire, for each noise source j, time-independent first noise spatial covariance matrices ψf(j) corresponding to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) for all t ∈L, wherein j is a positive integer expressing a noise source number, J is a positive integer expressing a number of the noise sources, j=1,..., J holds, t is a positive integer expressing a time frame number, f is a positive integer expressing a frequency band number, L is a long time interval, the time-frequency-divided observation signals xt, f are based on observation signals acquired using one or more microphones by collecting acoustic signals emitted from one or a plurality of sound sources, and the mask information λt, f(j) expresses an occupancy probability of a component corresponding to each noise source j in each of the time-frequency-divided observation signals xt, f,
- using the mask information λt, f(j) for t ∈Bk of each of a plurality of different short time intervals B1,..., BK to acquire a mixture weight μk, f(j) corresponding to each noise source j in each short time interval Bk, wherein K is an integer greater than 1, k=1,..., K, each short time interval Bk is shorter than the long time interval L, and each short time interval Bk is a part of L; and
- acquiring and outputting a time-variant third noise spatial covariance matrix R{circumflex over ( )}k, f for a noise of the acoustic signals based on a time-variant second noise spatial covariance matrix and a weighted sum of the first noise spatial covariance matrices ψf(j) with the mixture weights μk, f(j) for each short time interval Bk, wherein the second noise spatial covariance matrix corresponds to the time-frequency-divided observation signals xt, f and the mask information λt, f(j) for the noise source j and t ∈Bk of each short time interval Bk, where the noise is formed by all of the noise sources j=1,..., J.
5. A non-transitory computer-readable recording medium storing a program for causing a computer to function as the noise spatial covariance matrix estimation device according to claim 1.
- Higuchi et al., "Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780-793, April 2017.
- Togami, "Simultaneous Optimization of Forgetting Factor and Time-Frequency Mask for Block Online Multi-Channel Speech Enhancement", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2702-2706, May 2019.
- Kubo et al., "Mask-based MVDR beamformer for noisy multisource environments: Introduction of time-varying spatial covariance model", 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 16, 2019, pp. 6855-6859, ISSN 2379-190X.
- Higuchi et al., "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise", Proc. ICASSP 2016.
Type: Grant
Filed: Feb 28, 2020
Date of Patent: Jun 13, 2023
Patent Publication Number: 20220130406
Assignee: NIPPON TELEGRAPH AND TELEPHONE CORPORATION (Tokyo)
Inventors: Tomohiro Nakatani (Tokyo), Marc Delcroix (Tokyo), Keisuke Kinoshita (Tokyo), Shoko Araki (Tokyo), Yuki Kubo (Tokyo)
Primary Examiner: Leshui Zhang
Application Number: 17/437,701