Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition

Papandreou, George; Katsamanis, Athanasios; Pitsikalis, Vassilis; Maragos, Petros

doi:10.1109/tasl.2008.2011515

Cited by 80 publications

(32 citation statements)

References 44 publications


“…For visual features the approach used histogram-based descriptors around twelve lip landmarks determined using an AAM fitting technique and the classification involved multiple kernel learning and SVM. Similar results were reported by Papandreou et al [68] who achieved a best recognition rate of 83% in speaker independent experiments when using AAM visual features obtained from the entire lower face with six shape and six texture coefficients and when using HMM for classification.…”

Section: Comparison With Other Studies (supporting)

confidence: 87%

Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping

Ibrahim

Mulvaney

2015

Journal of Visual Communication and Image Representation


“…As already noted by [31], in coupled HMM decoding, stream weight adaptation and uncertainty compensation by UD both provide significant advantages in isolation, but using uncertainty compensation in addition to optimized stream weighting provides only small benefits. This finding was replicated in our experiments.…”

Section: Discussion (mentioning)

confidence: 86%

“…Uncertainty Decoding (denoted by GDU in the following tables) was used successfully for audiovisual speech recognition in [31]. In conjunction with uncertainty propagation techniques and stream weight optimization, however, the respective performance gains of UD become small.…”

Section: Uncertainty-based Decoding (mentioning)

confidence: 99%

Robust audiovisual speech recognition using noise-adaptive linear discriminant analysis

Zeiler

Nicheli

et al. 2016

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Automatic speech recognition (ASR) has become a widespread and convenient mode of human-machine interaction, but it is still not sufficiently reliable when used under highly noisy or reverberant conditions. One option for achieving far greater robustness is to include another modality that is unaffected by acoustic noise, such as video information. Currently the most successful approaches for such audiovisual ASR systems, coupled hidden Markov models (HMMs) and turbo decoding, both allow for slight asynchrony between audio and video features, and significantly improve recognition rates in this way. However, both typically still neglect residual errors in the estimation of audio features, so-called observation uncertainties. This paper compares two strategies for adding these observation uncertainties into the decoder, and shows that significant recognition rate improvements are achievable for both coupled HMMs and turbo decoding.
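As an illustration of what "adding observation uncertainties into the decoder" can mean, here is a minimal sketch of the common Gaussian form of uncertainty decoding: if a clean feature follows a state Gaussian and the observed feature carries zero-mean Gaussian estimation error, the state likelihood is evaluated with the two variances added. The function names, the diagonal-covariance assumption, and the toy numbers are illustrative assumptions, not taken from the cited papers.

```python
import math


def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )


def ud_log_likelihood(y, mean, var, obs_var):
    """Score y against a state Gaussian whose variance is inflated
    by the per-dimension observation uncertainty obs_var
    (hypothetical helper, assumed diagonal covariances)."""
    inflated = [v + u for v, u in zip(var, obs_var)]
    return log_gauss_diag(y, mean, inflated)


# Toy state model and an observation that is far off in dimension 0.
mean, var = [0.0, 0.0], [1.0, 1.0]
y = [3.0, 0.0]

# No uncertainty: reduces to the ordinary Gaussian score.
plain = ud_log_likelihood(y, mean, var, [0.0, 0.0])
# High uncertainty on dimension 0: the outlier is penalized less.
compensated = ud_log_likelihood(y, mean, var, [4.0, 0.0])
```

With zero uncertainty the score is the usual Gaussian log likelihood; as the uncertainty on a dimension grows, that dimension contributes less to discriminating between states, which is the intended effect for unreliable features.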


“…These approaches were all proposed by Matthews et al [7] to extract visual features. Continuing their work on visual feature extraction, Papandreou et al [32] focused on multimodal fusion scenarios, using audiovisual speech recognition as an example. They demonstrated that their visemic AAM (based on digits 0-9) with six texture coefficients outperforms their PCA-based technique with 18 texture coefficients, achieving a word accuracy rate of 83% and 71%, respectively.…”

Section: Introduction (mentioning)

confidence: 99%

Visual Speech Recognition Using Optical Flow and Support Vector Machines

Shaikh

Kumar

Gubbi

2011

Int. J. Comp. Intel. Appl.


A lip-reading technique that identifies visemes from visual data only and without evaluating the corresponding acoustic signals is presented. The technique is based on vertical components of the optical flow (OF) analysis and these are classified using support vector machines (SVM). The OF is decomposed into multiple non-overlapping fixed scale blocks and statistical features of each block are computed for successive video frames of an utterance. This technique performs automatic temporal segmentation (i.e., determining the start and the end of an utterance) of the utterances, achieved by pair-wise pixel comparison method, which evaluates the differences in intensity of corresponding pixels in two successive frames. The experiments were conducted on a database of 14 visemes taken from seven subjects and the accuracy tested using five and ten fold cross validation for binary and multiclass SVM respectively to determine the impact of subject variations. Unlike other systems in the literature, the results indicate that the proposed method is more robust to inter-subject variations with high sensitivity and specificity for 12 out of 14 visemes. Potential applications of such a system include human computer interface (HCI) for mobility-impaired users, lip reading mobile phones, in-vehicle systems, and improvement of speech based computer control in noisy environment.
