Robust Speech Recognition

Implementation

Our JASPER speech recognition system is based on a token passing architecture. It is a hybrid model that combines hidden Markov models (HMMs) for the graph search in the back end with deep neural networks (DNNs) as state estimators in the front end.
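The token passing idea can be illustrated by the following minimal Python sketch (not JASPER's actual implementation; all names and numbers are made up): tokens carrying accumulated log scores are propagated along the arcs of a decoding graph, and only the best token per state survives each frame. In a hybrid system, the emission scores would be supplied by the DNN state estimator.

import math

# Minimal token-passing (Viterbi) sketch; purely illustrative.
# States are nodes of a decoding graph; arcs carry transition log-probs.
# emission_logp[t][s] would come from the DNN state estimator in a
# hybrid system; here it is just a lookup table.
def token_passing(n_states, arcs, emission_logp, n_frames):
    """arcs: list of (src, dst, trans_logp). Returns the best final log score."""
    NEG_INF = float("-inf")
    tokens = [NEG_INF] * n_states
    tokens[0] = 0.0  # single start state

    for t in range(n_frames):
        new_tokens = [NEG_INF] * n_states
        for src, dst, trans_logp in arcs:
            if tokens[src] == NEG_INF:
                continue
            score = tokens[src] + trans_logp + emission_logp[t][dst]
            if score > new_tokens[dst]:  # keep only the best token per state
                new_tokens[dst] = score
        tokens = new_tokens

    return max(tokens)

# Toy example: 3-state left-to-right model with self-loops.
arcs = [(0, 0, math.log(0.6)), (0, 1, math.log(0.4)),
        (1, 1, math.log(0.6)), (1, 2, math.log(0.4)),
        (2, 2, math.log(1.0))]
emissions = [[-1.0, -2.0, -3.0], [-2.0, -1.0, -3.0], [-3.0, -2.0, -1.0]]
print(token_passing(3, arcs, emissions, n_frames=3))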

Robust Recognition

To recognize speech reliably in noisy or reverberant environments, the acoustic distortions must be modeled as precisely as possible. We achieve this with the help of multichannel information, by analyzing the spectro-temporal evolution of the microphone signal(s), and by the consistent use of data augmentation.

The distortion model is used, first of all, to improve the quality of the audio signal itself (https://ieeexplore.ieee.org/document/7602938/). Despite this enhancement, residual errors remain in the signal.

To obtain the best possible recognition results despite a possibly sub-optimal signal quality, we provide the recognition engine with an estimate of the remaining errors. This allows the recognizer to focus on the most reliable components of the signal and to reduce the influence of distorted components accordingly (https://ieeexplore.ieee.org/document/7472187/).
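A common realization of this idea is uncertainty decoding: assuming Gaussian state models, the estimated variance of the residual enhancement error is added to the model variance, so that unreliable feature components obtain a flat likelihood and lose influence on the decoding. The following Python sketch illustrates this principle; it is not necessarily the exact method of the cited paper.

import numpy as np

# Uncertainty decoding sketch: for a diagonal Gaussian state model, the
# estimated error variance of each feature dimension is added to the
# model variance. Distorted dimensions are thereby down-weighted.
def gaussian_logp_with_uncertainty(x, mu, var, uncert_var):
    """x, mu, var, uncert_var: 1-D arrays over the feature dimensions."""
    total_var = var + uncert_var  # inflate the variance by the uncertainty
    diff = x - mu
    return -0.5 * np.sum(np.log(2 * np.pi * total_var) + diff**2 / total_var)

x = np.array([1.2, -0.3])
mu = np.array([1.0, 0.0])
var = np.array([0.5, 0.5])
print(gaussian_logp_with_uncertainty(x, mu, var, np.zeros(2)))           # fully reliable
print(gaussian_logp_with_uncertainty(x, mu, var, np.array([0.0, 5.0])))  # dim 2 unreliable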

In addition, adaptive stream weighting has proven helpful for fusing all available streams of information (e.g. in audiovisual speech recognition) in accordance with their current degree of reliability (https://ieeexplore.ieee.org/document/7953172/).
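As a minimal sketch (not the cited implementation), such a fusion can be realized as a weighted sum of per-stream log-likelihoods, with weights derived from per-frame reliability estimates and normalized to sum to one:

import numpy as np

# Adaptive stream weighting sketch: per-state scores of several streams
# (e.g. audio and video) are fused as a weighted sum of log-likelihoods;
# the weights reflect the estimated reliability of each stream.
def fuse_streams(stream_logps, reliabilities):
    """stream_logps: (n_streams, n_states); reliabilities: (n_streams,)."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()  # normalize the weights to sum to one
    return np.tensordot(w, stream_logps, axes=1)  # fused log-score per state

audio_logp = np.array([-2.0, -5.0, -9.0])
video_logp = np.array([-4.0, -3.0, -8.0])
# Noisy audio -> low audio reliability; the fusion leans on the video stream.
print(fuse_streams(np.stack([audio_logp, video_logp]), [0.2, 0.8]))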

Blind Source Separation

When multiple talkers are speaking simultaneously, blind source separation can be used to segregate their speech signals.

For this purpose, multiple microphone signals are recorded synchronously. These can be interpreted as weighted sums of all speech signals, each convolved with the respective room transfer function. By assuming all sources to be statistically independent, it is often possible to infer the relevant characteristics of these room transfer functions and to use this knowledge to obtain estimates of all isolated speaker signals.
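For the simplest, instantaneous (i.e. anechoic, non-reverberant) case, this independence assumption can be illustrated with FastICA from scikit-learn; real reverberant rooms require convolutive, e.g. frequency-domain, extensions. The mixing matrix and source signals below are purely illustrative:

import numpy as np
from sklearn.decomposition import FastICA  # assumes scikit-learn is installed

# BSS sketch for the instantaneous two-speaker case: two "microphones"
# record weighted sums of two sources; FastICA recovers the sources up
# to permutation and scaling by maximizing statistical independence.
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))  # source 1: square wave
s2 = np.sin(2 * np.pi * 13 * t)          # source 2: sine wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.6],                # mixing matrix (unknown in practice)
              [0.4, 1.0]])
X = S @ A.T                              # the two "microphone" signals

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # estimated source signals

# Check: each estimate correlates strongly with exactly one true source.
corr = np.corrcoef(np.c_[S, S_hat].T)[:2, 2:]
print(np.round(np.abs(corr), 2))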

The quality of the separation depends on the relative positions of the speakers and microphones, on possible background noise, and on the reverberation time of the room. The best results are obtained in anechoic chambers:

[Audio examples: Mixture 1, Mixture 2; Separation Result 1, Separation Result 2]

Separation under more realistic conditions, e.g. in a driving car, is the subject of ongoing work:

[Audio examples: Mixture 3, Mixture 4; Separation Result 3, Separation Result 4]

Details, also on the coupling of source separation and speech recognition, are described in Chapter_ICA.pdf (in: "Robust Speech Recognition of Uncertain or Missing Data - Theory and Applications", Springer Verlag, July 2011).

Audiovisual Speech Recognition

The audiovisual speech recognizer JASPER (Java Audiovisual SPEech Recognizer) and its real-time extension CASPER (CUDA Audiovisual SPEech Recognizer) are based on coupled HMMs in an efficient token passing implementation. Compared to single-modality speech recognition, this can often cut the error rate in half relative to the better of the two modalities. Moreover, even under the strongest acoustic distortions, the error rate never rises above that of lip-reading alone, which makes the system attractive for high-distortion environments such as ticket machines in public transportation or ATMs.
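As a rough illustration (not the actual JASPER/CASPER implementation), a coupled HMM can be decoded as a product HMM whose states pair an audio state with a video state, with a constraint that limits the asynchrony between the two chains; the emissions of the paired states are fused with a stream weight:

import itertools

# Coupled-HMM sketch: product states (audio_state, video_state) allow a
# limited asynchrony between the modalities, while the emission score is
# a stream-weighted combination of the per-chain log-likelihoods.
def product_states(n_audio, n_video, max_async=1):
    """All pairs (a, v) with |a - v| <= max_async, limiting cross-modal lag."""
    return [(a, v) for a, v in itertools.product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

def coupled_emission_logp(audio_logp, video_logp, state, weight_audio=0.5):
    a, v = state
    return weight_audio * audio_logp[a] + (1.0 - weight_audio) * video_logp[v]

states = product_states(3, 3)
print(states)  # e.g. (0, 1): the lips may lead the audio by one state
print(coupled_emission_logp([-1.0, -2.0, -3.0], [-2.5, -1.5, -3.5], (0, 1)))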

For our current work, we use the GRID corpus, consisting of sentences in a simple command language, which has been made available by Jon Barker, Martin Cooke, Stuart Cunningham and Xu Shao: http://www.dcs.shef.ac.uk/spandh/gridcorpus/#credits

The following video shows an example of audiovisual recognition with artificially distorted audio data. For this purpose, white noise was added to the speech signal at an SNR of 10 dB:

Video example (WMV, 366.3 KB)
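Mixing noise at a prescribed SNR only requires scaling the noise power relative to the speech power; the following sketch (with an artificial stand-in signal) shows one way to generate such test data:

import numpy as np

# Add white noise to a signal at a target SNR (here 10 dB): choose the
# noise gain g such that snr_db = 10*log10(p_speech / (g^2 * p_noise)).
def add_noise_at_snr(speech, snr_db, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.shape)
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    g = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise

speech = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in signal
noisy = add_noise_at_snr(speech, snr_db=10.0)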

The resulting audiovisual recognition rate reaches 81.1% here, compared to 57.5% using only the audio channel. For clean audio data, the system achieves 99.7% word accuracy on the GRID corpus.