Audiovisual speech recognition

The audiovisual speech recognizer JASPER (Java Audiovisual SPEech Recognizer) and its real-time extension CASPER (CUDA Audiovisual SPEech Recognizer) are based on coupled HMMs in an efficient token-passing implementation. Compared to single-modality speech recognition, we can often cut the error rate in half relative to the better of the two modalities. Moreover, even under the strongest acoustic distortions, the error rate never rises above that of lip-reading alone, which makes the system attractive for high-distortion environments such as public transportation ticket machines or ATMs.

Our current work uses the GRID corpus, which consists of sentences in a simple command language and has been made available by Jon Barker, Martin Cooke, Stuart Cunningham and Xu Shao.

The following video shows an example of audiovisual recognition with artificially distorted audio data. For this purpose, white noise was added to the speech signal at an SNR of 10 dB:

Video example (WMV, 366.3 KB)
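The distortion described above can be sketched in a few lines. This is only an illustration, not the code used to produce the video: the function names `add_white_noise` and `measured_snr_db`, the use of Gaussian noise, and the 440 Hz test tone are assumptions made for the example. The noise is rescaled so that the requested SNR is met exactly for the given signal.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return signal plus white Gaussian noise at the requested SNR (in dB).

    Assumption: 'white noise' is modelled as zero-mean Gaussian samples.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Rescale the drawn noise so its empirical power equals the target power.
    drawn_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(target_noise_power / drawn_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of a noisy signal relative to its clean version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a speech signal: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
clean = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(sample_rate)]
noisy = add_white_noise(clean, snr_db=10.0, seed=0)
print(round(measured_snr_db(clean, noisy), 3))
```

At 10 dB SNR the noise power is one tenth of the signal power, which is clearly audible but leaves the speech recognizable to a human listener.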

With the audio distorted in this way, the audiovisual recognition rate reaches 81.1%, compared to 57.5% when only the audio channel is used. For clean audio data, the system achieves 99.7% word accuracy on the GRID corpus.
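To relate these recognition rates to error rates, the short calculation below converts both figures and computes the relative reduction. It is a sketch under one assumption: the recognition rates are treated as accuracies, so the error rate is 100% minus the rate. It compares against the audio channel only, since the video-only rate for this condition is not stated here. The function name `relative_error_reduction` is made up for this example.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Relative error-rate reduction, with accuracies given in percent."""
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


# Audio only: 57.5% -> 42.5% error; audiovisual: 81.1% -> 18.9% error.
reduction = relative_error_reduction(57.5, 81.1)
print(f"{reduction:.1%}")  # prints 55.5%
```

Relative to the audio-only recognizer, the error rate therefore falls by a little more than half in this condition.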