Robust Speech Recognition

Implementation
Recognition under Observation Uncertainty
Statistical Speech Processing
Blind Source Separation
Audiovisual Speech Recognition


Implementation

JASPER is based on a token passing architecture, using hidden Markov models with Gaussian mixture models (GMMs) as output distributions.
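To make the decoding loop concrete, here is a minimal token-passing sketch over HMM states with diagonal-covariance GMM output densities. All names and the data layout are illustrative assumptions, not the actual JASPER API:

```java
// Minimal token-passing (Viterbi) sketch with per-state diagonal-covariance
// GMM output densities. Illustrative only, not the JASPER implementation.
public class TokenPassing {

    // log-likelihood of frame x under one state's GMM, accumulated in log domain
    static double gmmLogLik(double[] x, double[] w, double[][] mu, double[][] v) {
        double acc = Double.NEGATIVE_INFINITY;
        for (int m = 0; m < w.length; m++) {
            double lp = Math.log(w[m]);
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - mu[m][d];
                lp -= 0.5 * (Math.log(2.0 * Math.PI * v[m][d]) + diff * diff / v[m][d]);
            }
            acc = logAdd(acc, lp);
        }
        return acc;
    }

    static double logAdd(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        double hi = Math.max(a, b);
        return hi + Math.log1p(Math.exp(Math.min(a, b) - hi));
    }

    // Each state keeps its best token (log score); tokens are propagated along
    // transitions and rescored with the target state's GMM at every frame.
    static double decode(double[][] obs, double[][] logA, double[] logPi,
                         double[][] w, double[][][] mu, double[][][] v) {
        int n = logPi.length;
        double[] tok = new double[n];
        for (int j = 0; j < n; j++)
            tok[j] = logPi[j] + gmmLogLik(obs[0], w[j], mu[j], v[j]);
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, Double.NEGATIVE_INFINITY);
            for (int i = 0; i < n; i++)      // pass each token along all transitions
                for (int j = 0; j < n; j++)
                    next[j] = Math.max(next[j], tok[i] + logA[i][j]);
            for (int j = 0; j < n; j++)      // rescore survivors with the GMM
                next[j] += gmmLogLik(obs[t], w[j], mu[j], v[j]);
            tok = next;
        }
        double best = Double.NEGATIVE_INFINITY;
        for (double s : tok) best = Math.max(best, s);
        return best;                         // log score of the best path
    }
}
```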

For audiovisual recognition, we use coupled HMMs. These allow us to compensate for asynchronies between the audio and video streams. Such asynchronies can arise, for example, because speakers tend to bring their articulators into position before starting to enunciate, so that the video features often lead the audio features by up to 120 ms.
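The composite state space of a coupled HMM can be sketched as pairs of audio and video states whose indices may drift apart only within a bounded asynchrony. The following is a hypothetical illustration (not the actual JASPER code), which also shows the usual stream-weighted combination of the two output scores:

```java
// Coupled-HMM sketch: composite states pair an audio-stream state with a
// video-stream state; the admissible asynchrony between the two streams is
// bounded by maxAsync states. Illustrative names, not the JASPER internals.
import java.util.ArrayList;
import java.util.List;

public class CoupledStates {
    record Composite(int audioState, int videoState) {}

    static List<Composite> build(int numStates, int maxAsync) {
        List<Composite> states = new ArrayList<>();
        for (int a = 0; a < numStates; a++)
            for (int v = 0; v < numStates; v++)
                if (Math.abs(a - v) <= maxAsync)   // tolerate bounded stream lag
                    states.add(new Composite(a, v));
        return states;
    }

    // Composite output score with exponential stream weights:
    // lambda for the audio stream, (1 - lambda) for the video stream.
    static double logOutput(double logAudio, double logVideo, double lambda) {
        return lambda * logAudio + (1.0 - lambda) * logVideo;
    }
}
```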

The JASPER system is also equipped for recognition under observation uncertainty (see the section below). This makes it possible to weight the contributions of the individual feature components in both streams according to the distortions present in the channels, so that the most reliable features at any given time have the strongest influence on the search for the best recognition result.
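One standard way of realizing this weighting, sketched here under the assumption of Gaussian observation uncertainty with known per-dimension error variances, is uncertainty decoding: the model variance of each Gaussian is inflated by the estimated observation variance, so that unreliable dimensions automatically contribute less to the score.

```java
// Uncertainty-decoding sketch (illustrative, assuming Gaussian uncertainty
// with known per-dimension error variances obsVar):
public class UncertainScore {
    static double uncertainLogGauss(double[] x, double[] obsVar,
                                    double[] mu, double[] modelVar) {
        double lp = 0.0;
        for (int d = 0; d < x.length; d++) {
            double v = modelVar[d] + obsVar[d]; // inflate the model variance by
                                                // the observation uncertainty
            double diff = x[d] - mu[d];
            lp -= 0.5 * (Math.log(2.0 * Math.PI * v) + diff * diff / v);
        }
        return lp; // a large obsVar[d] flattens the score in dimension d
    }
}
```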

The JASPER offspring CASPER utilizes massively parallel graphics processing units (GPUs), typically equipped with a few hundred streaming processors, to compute the GMM output probabilities efficiently with the help of CUDA. This allows audiovisual recognition to run in real time, even when observation uncertainties are used: Interspeech2010.pdf.
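GPUs map well to this task because the output scores of all (frame, state) pairs are mutually independent. A CPU analogue of that data parallelism using Java parallel streams (CASPER itself uses CUDA kernels; this is only an illustration of the principle):

```java
// All (frame, state) GMM scores are independent, so they can be computed in
// parallel. gmmScore is any per-state scoring function, e.g. TokenPassing's.
public class ParallelScores {
    static double[][] scoreAll(double[][] obs, int numStates,
            java.util.function.BiFunction<double[], Integer, Double> gmmScore) {
        double[][] out = new double[obs.length][numStates];
        java.util.stream.IntStream.range(0, obs.length).parallel().forEach(t -> {
            for (int j = 0; j < numStates; j++)
                out[t][j] = gmmScore.apply(obs[t], j); // every cell independent
        });
        return out;
    }
}
```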

Recognition under Observation Uncertainty

To recognize speech reliably in noisy or reverberant environments, it is important that the acoustic distortions are modeled as precisely as possible, which we achieve with the help of multichannel information and by analyzing the spectro-temporal evolution of the microphone signal(s).

The distortion model is used, firstly, to improve the audio signal quality itself (see Statistical Speech Processing below). Despite this enhancement, residual errors remain in the signal, and, depending on the speech processing technique, new artifacts may be introduced.

To obtain the best possible speech recognition results despite the sub-optimal signal quality, we transmit an estimate of the remaining errors to the recognition engine. This allows the recognizer to place more weight on the more reliable components of the signal and to reduce the influence of distorted components accordingly.
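In the common Gaussian formulation of this idea (a sketch following the uncertainty-decoding literature, with Σ_x denoting the estimated error covariance of the enhanced feature x), the expected likelihood of a state q becomes a Gaussian with inflated covariance, which is exactly the variance inflation used in the code sketch above:

```latex
% Expected state likelihood when x is an uncertain observation of the clean
% feature y with Gaussian error covariance \Sigma_x (standard result):
p(\mathbf{x} \mid q)
  = \int \mathcal{N}(\mathbf{x}; \mathbf{y}, \boldsymbol{\Sigma}_x)\,
         \mathcal{N}(\mathbf{y}; \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)\,
    \mathrm{d}\mathbf{y}
  = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_q,
                \boldsymbol{\Sigma}_q + \boldsymbol{\Sigma}_x)
```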

Several algorithms for realizing such a tight coupling between speech processing and speech recognition are described in Chapter_UncertaintyOfObservation.pdf (in: "Robust Speech Recognition of Uncertain or Missing Data - Theory and Applications", Springer Verlag, July 2011).

Statistical Speech Processing

Many of the statistical speech processing methods developed at the Institute of Communication Acoustics, primarily for human listeners, are also important for achieving greater robustness in automatic speech recognition.

One example is the blind source separation approach described in the following section.

Blind Source Separation

When multiple talkers are speaking simultaneously, blind source separation can be used to segregate their speech signals.

For this purpose, multiple microphone signals are recorded synchronously. These can be interpreted as weighted sums of all speech signals, each of which is convolved with the respective room transfer function. By assuming all sources to be statistically independent, it is often possible to infer the relevant characteristics of these room transfer functions, and to use this knowledge for obtaining estimates of all isolated speaker signals.
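As a toy illustration of the independence principle, the following sketch separates an instantaneous (non-convolutive) mixture with the natural-gradient ICA rule W ← W + μ(I − E[g(y)yᵀ])W, using g = tanh. Real room acoustics require convolutive, e.g. frequency-domain, variants; all names here are illustrative assumptions.

```java
// Toy ICA sketch for the instantaneous mixing case (assumes zero-mean,
// roughly whitened input x with one row per microphone channel).
public class IcaSketch {
    static double[][] separate(double[][] x, int iters, double mu) {
        int n = x.length, T = x[0].length;
        double[][] w = identity(n);                  // unmixing matrix W
        for (int it = 0; it < iters; it++) {
            double[][] y = matmul(w, x);             // current source estimates
            double[][] c = new double[n][n];         // c = E[g(y) y^T]
            for (int t = 0; t < T; t++)
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        c[i][j] += Math.tanh(y[i][t]) * y[j][t] / T;
            double[][] g = identity(n);              // g = I - c
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    g[i][j] -= c[i][j];
            double[][] dw = matmul(g, w);            // natural-gradient step
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    w[i][j] += mu * dw[i][j];
        }
        return matmul(w, x);                         // estimated sources y = W x
    }

    static double[][] identity(int n) {
        double[][] id = new double[n][n];
        for (int i = 0; i < n; i++) id[i][i] = 1.0;
        return id;
    }

    static double[][] matmul(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] r = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < m; j++)
                    r[i][j] += a[i][p] * b[p][j];
        return r;
    }
}
```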

The quality of the separation depends on the relative positions of the speakers and microphones, on any background noise, and on the reverberation time of the room. The best results are obtained in anechoic chambers:

Mixture 1 | Mixture 2
Separation Result 1 | Separation Result 2

while separation under more realistic conditions, e.g. in a driving car:

Mixture 3 | Mixture 4
Separation Result 3 | Separation Result 4

is the subject of ongoing work.

Details, also on the coupling of source separation and speech recognition, are described in Chapter_ICA.pdf (in: "Robust Speech Recognition of Uncertain or Missing Data - Theory and Applications", Springer Verlag, July 2011).

Audiovisual Speech Recognition

The audiovisual speech recognizer JASPER (Java Audiovisual SPEech Recognizer) and its real-time extension CASPER (CUDA Audiovisual SPEech Recognizer) are based on coupled HMMs in an efficient token-passing implementation. Compared to single-modality speech recognition, this often cuts the error rate in half relative to the better of the two modalities. Moreover, even under the strongest acoustic distortions, the recognition rate never drops below that of pure lip-reading, which makes the system attractive for high-distortion environments such as public-transport ticket machines or ATMs.

In our current work, we use the GRID corpus, consisting of sentences in a simple command language, which has been made available by Jon Barker, Martin Cooke, Stuart Cunningham and Xu Shao: http://www.dcs.shef.ac.uk/spandh/gridcorpus/#credits.

The following video shows an example of audiovisual recognition with artificially distorted audio data. For this purpose, white noise was added to the speech signal at 10 dB SNR:

Video example (WMV, 366.3 KB)

The resulting audiovisual recognition rate reaches 81.1% here, compared to 57.5% using only the audio channel. For clean audio data, the system achieves 99.7% word accuracy on the GRID corpus.
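The white-noise distortion used above is simple to reproduce. A sketch (illustrative code, not the original tooling) that scales white Gaussian noise so the mixture reaches a prescribed SNR:

```java
// Adds white Gaussian noise to a clean speech signal at a given SNR in dB,
// e.g. addWhiteNoise(speech, 10.0, 42L) for the 10 dB condition above.
import java.util.Random;

public class AddNoise {
    static double[] addWhiteNoise(double[] speech, double snrDb, long seed) {
        double signalPower = 0.0;
        for (double s : speech) signalPower += s * s;
        signalPower /= speech.length;

        // required noise power: SNR = 10 * log10(Ps / Pn)  =>  Pn = Ps / 10^(SNR/10)
        double noisePower = signalPower / Math.pow(10.0, snrDb / 10.0);
        double noiseStd = Math.sqrt(noisePower);

        Random rng = new Random(seed);
        double[] noisy = new double[speech.length];
        for (int t = 0; t < speech.length; t++)
            noisy[t] = speech[t] + noiseStd * rng.nextGaussian();
        return noisy;
    }
}
```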