1998
DOI: 10.1007/bfb0054771

Continuous audio-visual speech recognition

Abstract: We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modelling of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This appro…

Cited by 17 publications (16 citation statements)
References 24 publications
“…Other models were developed by placing constraints on the states or the transitions in order to make the new models tractable. The Multi-Stream HMM [41,42] allows for multiple input feature streams that may have different frame rates and can be asynchronous. It assumes that the model consists of a number of sub-unit models that correspond to the level at which the streams have to synchronize, for example phoneme level or syllable level.…”
Section: Data Fusion Architecture
Mentioning confidence: 99%
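The quoted passage describes the standard Multi-Stream HMM formulation: each stream is scored by its own HMM, and per-stream log-likelihoods are combined with exponent weights at the synchronisation level (e.g. per phoneme). A minimal sketch of that fusion follows; all parameters (transition matrices, emission log-probabilities, weights) are illustrative toy values, not taken from the paper.

```python
# Sketch of multi-stream HMM score fusion at a sync point:
#   log P(O | model) = sum_s lambda_s * log P(O_s | model_s)
# Streams may have different frame rates, so each stream's forward
# pass runs over its own number of frames independently.
import numpy as np

def forward_loglik(log_trans, log_init, log_emit):
    """Log-likelihood of one stream under one HMM (forward algorithm,
    log space). log_trans: (S, S); log_init: (S,); log_emit: (T, S)
    per-frame emission log-probabilities for this stream."""
    alpha = log_init + log_emit[0]
    for t in range(1, len(log_emit)):
        # alpha_j(t) = emit_j(t) + logsumexp_i(alpha_i(t-1) + trans_ij)
        alpha = log_emit[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

def multistream_score(streams, weights):
    """Weighted combination of per-stream scores at the sync level."""
    return sum(w * forward_loglik(*s) for s, w in zip(streams, weights))

# Toy example: an audio stream (40 frames) and a video stream
# (10 frames) scoring the same hypothesised sub-unit (e.g. a phoneme).
rng = np.random.default_rng(0)
audio = (np.log(np.full((3, 3), 1/3)), np.log(np.full(3, 1/3)),
         rng.normal(size=(40, 3)))
video = (np.log(np.full((3, 3), 1/3)), np.log(np.full(3, 1/3)),
         rng.normal(size=(10, 3)))
print(multistream_score([audio, video], weights=[0.7, 0.3]))
```

The stream weights let the recogniser trust the acoustic stream more in clean audio and shift weight to the visual stream under noise, which is the usual motivation for this architecture.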
“…The goal is to use the motion of the lips in order to improve the acoustic recognition of the words. Many different studies have shown improved speech recognition (both faster and more accurate) when visual cues are available [6][7][8][9][10][11][16][17][18][19]21,22].…”
Section: Introduction
Mentioning confidence: 99%
“…In the first stage, information from the video frames is processed in order to prepare it for integration with the acoustic signal [7]. One simplistic example of this is image-based data extraction, during which the image of the mouth is selected without any processing [7,19,20,22]. While all the information contained within that frame is automatically selected, it does not include any dimensionality reduction and hence makes audiovisual information fusion extremely difficult.…”
Section: Introduction
Mentioning confidence: 99%
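To make the dimensionality problem in that quote concrete, here is a small sketch of image-based extraction as described: crop a fixed mouth region and use the raw pixels as the per-frame feature vector. The crop coordinates, sizes, and dummy frame below are hypothetical placeholders, not values from the cited work.

```python
# Sketch of raw "image-based" visual feature extraction: the mouth
# region is selected and flattened with no dimensionality reduction.
import numpy as np

def mouth_roi_features(frame, top=120, left=96, size=32):
    """Crop a fixed size-x-size mouth region from a grayscale frame
    and flatten it into one raw feature vector per video frame."""
    roi = frame[top:top + size, left:left + size]
    return roi.astype(np.float32).ravel()  # size*size dims, e.g. 1024

frame = np.zeros((240, 320), dtype=np.uint8)  # dummy 240x320 frame
print(mouth_roi_features(frame).shape)        # (1024,)
```

Even a small 32x32 crop yields a 1024-dimensional vector per frame, against roughly 39-dimensional acoustic features typical of MFCC front ends, which illustrates why fusion is difficult without reduction such as PCA ("eigenlips") or a learned appearance model like the one in this paper.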