Source Localization and Source Separation

Figure 1
Figure 1: Target extraction using multi-channel signal processing

The figure above shows a sample multi-channel target enhancement scenario. Two competing speakers are recorded by a 5-channel microphone array in a noisy and reverberant environment (T60 = 0.6 s; source and interferer are each at a distance of 1.0 m from the center of the array, outside the critical distance of the room (≈ 0.84 m)). The target localisation and subsequent extraction/enhancement using the approaches developed in [1] are presented below for this example. The two speaker signals have the same power. The background noise is diffuse white noise mixed at 10 dB below the signal power. Note that the noise is correlated across the microphones.
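As a rough illustration of what "mixed at 10 dB below the signal power" means, the following single-channel sketch scales a noise sequence so that the signal-to-noise power ratio equals a requested value. The function names and the sinusoidal stand-in for speech are assumptions made for this example only; the sketch does not model the spatial correlation of diffuse noise across the array.

```python
import math
import random

def power(x):
    """Mean power (average of squared samples) of a sequence."""
    return sum(v * v for v in x) / len(x)

def scale_noise_to_snr(signal, noise, snr_db):
    """Return a scaled copy of `noise` so that signal power / noise power = snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [gain * v for v in noise]

random.seed(0)
# Hypothetical stand-in for a speech signal; any sequence works here.
speech = [math.sin(0.05 * n) for n in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]

scaled_noise = scale_noise_to_snr(speech, noise, 10.0)
mixture = [s + v for s, v in zip(speech, scaled_noise)]

# Ratio actually achieved after scaling, in dB.
achieved_snr_db = 10 * math.log10(power(speech) / power(scaled_noise))
```

The same scaling could be applied per microphone once spatially correlated noise signals are available; how those are generated is outside the scope of this sketch.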

The microphone signal: x1(n)
The output using delay-and-sum beamforming: yDSB(n)
Output using an adaptive soft-mask based on the target presence probability: ymask(n)
Output using a cepstro-temporally smoothed version [2] of the soft-mask above: ysmth(n)
Output using the parsimoniously excited generalised sidelobe canceller (PEG) algorithm: yPEG(n)

As expected, the DSB offers some enhancement, but because it does not actively cancel interference and noise, the enhancement is limited; in particular, the interferer is still clearly audible. The mask-based approach offers good noise and interference suppression. The quality of the target speech is also rather good, owing to the use of a soft mask. However, some artefacts may be observed in the output. There is also one point in the signal (towards the latter part of the sentence) where target distortion is audible: the target has low energy in these frames and is not well localised, leading to a sudden dip in the voice. This may impact intelligibility. The smoothed mask improves the target speech quality, but at the cost of reduced noise and interference suppression. The PEG approach offers good all-round performance in terms of interference and noise cancellation and preservation of the target signal. It sounds the most natural.
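The remark that the DSB does not actively cancel the interferer can be illustrated with a toy delay-and-sum sketch. It assumes integer sample delays, an anechoic setting and impulse-like sources, none of which match the recorded example; the delays and signal positions are hypothetical values chosen for illustration.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its (non-negative, integer) target delay and average."""
    n = len(channels[0])
    out = [0.0] * n
    for x, d in zip(channels, delays):
        for i in range(n - d):
            out[i] += x[i + d]
    return [v / len(channels) for v in out]

num_mics, length = 5, 50
target_delays = [0, 1, 2, 3, 4]      # hypothetical delays of the target at each mic
interferer_delays = [4, 3, 2, 1, 0]  # hypothetical delays of the interferer

# Build the microphone signals: one unit impulse per source, same power each.
channels = [[0.0] * length for _ in range(num_mics)]
for m in range(num_mics):
    channels[m][10 + target_delays[m]] += 1.0      # target emitted at sample 10
    channels[m][30 + interferer_delays[m]] += 1.0  # interferer emitted at sample 30

y = delay_and_sum(channels, target_delays)

target_peak = y[10]             # target adds coherently
interferer_peak = max(y[20:])   # interferer is spread out, not cancelled
```

In this toy case the target impulses line up and add coherently, while the interferer is only spread over several samples at reduced amplitude. It is attenuated by the averaging, but no null is placed on it, which is consistent with the interferer remaining audible in the DSB output.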

[1] N. Madhu, “Acoustic source localization: Algorithms, applications and extensions to source separation”, Dissertation, Ruhr-Universität Bochum.

[2] N. Madhu, C. Breithaupt and R. Martin, “Temporal smoothing of spectral masks in the cepstral domain for speech separation”, Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.

Figure 2

Figure 2: Extraction of 3 speakers from 2 microphones using the parsimoniously excited GSC (PEG). Source 2 is at broadside; sources 1 and 3 are 30° to the left and right of source 2, respectively.

This example demonstrates the extraction of sources from under-determined mixtures using the PEG approach. The source placements are as shown above; the microphones are 8 cm apart. The recordings were made in a reverberant room (T60 = 0.6 s, sources at 1 m from the array center, outside the critical distance of the room (dcrit. = 0.84 m)). Each source signal consists of two sentences, one spoken by a male and one by a female speaker (not necessarily in that order).
Each source has the same power. Thus, with two equal-power interferers competing against each target, the effective mixing SIR is -3 dB.
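The -3 dB figure follows from the power ratio of one target to the sum of the two equal-power interferers. A minimal check, with a function name chosen only for this example:

```python
import math

def mixing_sir_db(target_power, interferer_powers):
    """Signal-to-interference ratio in dB: target power over total interferer power."""
    return 10 * math.log10(target_power / sum(interferer_powers))

# Three equal-power sources: each target competes with two interferers.
sir_db = mixing_sir_db(1.0, [1.0, 1.0])
print(round(sir_db, 2))  # -3.01
```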

The microphone signal: x1(n)
The clean sources as perceived at microphone 1:
  1. s1(n)
  2. s2(n)
  3. s3(n)
Output of the PEG:
  1. y1(n)
  2. y2(n)
  3. y3(n)

After a short adaptation period, the target sources can be extracted quite well from the under-determined mixture, and they are much clearer after processing by the PEG. The separated source from broadside shows the least improvement, which is to be expected: separation is achieved by steering a null towards the interferers. For the source at broadside, this requires steering nulls both to the left and to the right of the source, which is difficult as we only have one degree of freedom. The system therefore nulls the strongest interference (evident for the duration of the second sentence, where the strong female interferer is suppressed, allowing the target male speaker to be heard). For sources 1 and 3, the interfering sources are all to one side of the target, making it possible to steer a broader null or to quickly vary the position of the null, thereby yielding better separation performance.
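The "one degree of freedom" argument can be made concrete with a fixed two-microphone null-steering sketch. This is not the PEG algorithm: it assumes a far-field, free-field model at a single frequency, and the speed of sound (343 m/s) and the 1 kHz analysis frequency are assumed values. Only the 8 cm spacing and the source angles are taken from the description above.

```python
import cmath
import math

C = 343.0   # speed of sound in m/s (assumed)
D = 0.08    # microphone spacing in m (from the text)
F = 1000.0  # analysis frequency in Hz (hypothetical choice)

def steering(theta_deg, f=F, d=D, c=C):
    """Far-field phase terms at two mics placed at -d/2 and +d/2; 0 deg is broadside."""
    phi = math.pi * f * d * math.sin(math.radians(theta_deg)) / c
    return (cmath.exp(1j * phi), cmath.exp(-1j * phi))

def null_steering_weights(target_deg, null_deg):
    """Solve the 2x2 system: unit response at the target, zero response at the null."""
    t0, t1 = steering(target_deg)
    n0, n1 = steering(null_deg)
    det = t0 * n1 - t1 * n0  # zero if both directions coincide
    return (n1 / det, -n0 / det)

def response(weights, theta_deg):
    """Magnitude of the array response towards theta_deg."""
    d0, d1 = steering(theta_deg)
    return abs(weights[0] * d0 + weights[1] * d1)

# Target at broadside (source 2), null placed on source 1 at -30 degrees.
w = null_steering_weights(0.0, -30.0)
gain_target = response(w, 0.0)    # constrained to 1
gain_nulled = response(w, -30.0)  # constrained to 0
gain_other = response(w, 30.0)    # unconstrained: no freedom left to control it
```

With two microphones, the two constraints (pass the target, null one direction) fully determine the weights, so the response towards the second interferer at +30° cannot be set independently. Under these assumptions it is not suppressed at all, which is in line with the broadside source showing the least improvement.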