Robust Speech Recognition

Implementation

JASPER is based on a token-passing architecture, using hidden Markov models (HMMs) with Gaussian mixture models (GMMs) as output distributions.
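The GMM output distributions can be evaluated as follows. This is a minimal, self-contained sketch of a diagonal-covariance GMM log-likelihood (the function name, shapes, and parameters are illustrative assumptions, not JASPER's actual API):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    Illustrative sketch only. Shapes (assumed): x: (D,) feature vector,
    weights: (M,) mixture weights, means and variances: (M, D).
    """
    D = x.shape[0]
    # Per-component Gaussian log-densities with diagonal covariance
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the mixture components for numerical stability
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))
```

Working in the log domain with a log-sum-exp over components is the standard way to avoid numerical underflow when many low-probability components are summed.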

For audiovisual recognition, we use coupled HMMs. These allow us to compensate for asynchronies between the audio and video streams. Such asynchronies arise, for example, because speakers tend to move their articulators into position before starting to enunciate, so that the video features often lead the audio features by up to 120 ms.
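The core idea of a coupled HMM can be sketched as a composite state space in which each stream advances through its own states, with the asynchrony between the two bounded. This is only an illustration of the topology (the function and parameter names are assumptions; JASPER's actual model structure may differ):

```python
def coupled_states(n_audio, n_video, max_async):
    """Enumerate composite (audio, video) states of a coupled HMM.

    Illustrative sketch: each stream has its own state index, and the two
    indices may drift apart by at most max_async states, so one stream
    (e.g. video, which tends to lead) can run ahead of the other by a
    bounded amount. Decoding then searches over these composite states.
    """
    return [(a, v)
            for a in range(n_audio)
            for v in range(n_video)
            if abs(a - v) <= max_async]
```

With `max_async = 0` this degenerates to a strictly synchronous product model; larger values admit exactly the kind of bounded audio-video lag described above.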

The JASPER system is also equipped for recognition under observation uncertainty. This makes it possible to weight the contributions of the individual feature components in both streams according to the distortions present in the channels. In this way, the features that are most reliable at any given time have the strongest influence on the search for the best recognition result.
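One common way to realize such reliability weighting, shown here purely as a sketch (not necessarily JASPER's exact rule), is uncertainty decoding for diagonal Gaussians: each feature dimension's estimated observation variance is added to the model variance, which flattens the likelihood along unreliable dimensions so they contribute less to the score:

```python
import numpy as np

def uncertain_log_gauss(x, var_x, mean, var_model):
    """Diagonal Gaussian log-likelihood under observation uncertainty.

    Illustrative sketch: var_x holds the per-dimension uncertainty of the
    observation x. Adding it to the model variance down-weights unreliable
    dimensions automatically; with var_x = 0 this is the standard Gaussian
    log-likelihood.
    """
    var = var_model + var_x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
```

As the uncertainty of a dimension grows, the score difference between competing models along that dimension shrinks, which is precisely the desired soft weighting of feature components.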

The JASPER offspring CASPER utilizes massively parallel graphics processing units (GPUs), typically equipped with a few hundred streaming processors, to compute the GMM output probabilities efficiently with the help of CUDA. This allows audiovisual recognition to run in real time, even when observation uncertainties are used: Interspeech2010.pdf.
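The reason GMM scoring maps so well to GPUs is that every (frame, component) pair is independent, so the whole grid of Gaussian log-densities can be computed by thousands of threads at once. The following sketch shows that batched structure with NumPy broadcasting standing in for the CUDA kernels CASPER uses (names and shapes are assumptions for illustration):

```python
import numpy as np

def batch_gmm_scores(X, weights, means, variances):
    """Score T frames against M diagonal-Gaussian components in one batch.

    Illustrative sketch of the data-parallel structure behind GPU-based GMM
    evaluation: the (T, M) grid of component log-densities is computed in
    one shot rather than frame by frame. Shapes (assumed): X: (T, D),
    weights: (M,), means and variances: (M, D). Returns (T,) log-likelihoods.
    """
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    comp = np.log(weights) + log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)
    m = comp.max(axis=1, keepdims=True)
    # Log-sum-exp over components, done per frame
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True)))[:, 0]
```

On an actual GPU, each thread (or thread block) would handle one entry of the (T, M) grid, which is what makes real-time operation feasible even with the extra arithmetic of uncertainty decoding.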