The use of visual features in audio-visual speech recognition (AVSR) is justified both by the speech production mechanism, which is essentially bimodal in its audio and visual representation, and by the need for features that are invariant to acoustic noise. As a result, current AVSR systems demonstrate significant accuracy improvements in acoustically noisy environments. In this paper, we describe two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare their performance with that of existing models on speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM make it possible to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming both the existing models and the FHMM.
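To make the distinction concrete (a sketch in our own notation, not drawn from the paper), the two models factor the joint state transition differently. Writing the audio state as q_t^a and the visual state as q_t^v, each chain's transition in a CHMM is conditioned on the previous states of both chains, whereas in an FHMM the chains evolve independently and are coupled only through the joint observation:

\[
\text{CHMM:}\quad P(q_t^a, q_t^v \mid q_{t-1}^a, q_{t-1}^v) = P(q_t^a \mid q_{t-1}^a, q_{t-1}^v)\, P(q_t^v \mid q_{t-1}^a, q_{t-1}^v)
\]
\[
\text{FHMM:}\quad P(q_t^a, q_t^v \mid q_{t-1}^a, q_{t-1}^v) = P(q_t^a \mid q_{t-1}^a)\, P(q_t^v \mid q_{t-1}^v), \qquad \text{with joint emission } P(\mathbf{o}_t \mid q_t^a, q_t^v)
\]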
In recent years, several speech recognition systems that combine visual and audio information have shown significant performance gains over standard audio-only systems. The use of visual features is justified both by the bimodality of speech generation and by the need for features that are invariant to acoustic noise. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique based on a coupled hidden Markov model (HMM). The statistical properties of the coupled HMM allow us to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio-visual speech recognition.
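As an illustration of how decoding looks in such a model, below is a minimal sketch of Viterbi decoding for a two-chain CHMM over the product state space. This is not the paper's implementation; the function name, array layout, and log-domain formulation are our own assumptions.

```python
# Minimal sketch of Viterbi decoding for a two-chain coupled HMM (CHMM),
# written from the standard CHMM definition rather than the paper's code.
# Audio and video states form a product state (i, j); the transition into
# each chain depends on the previous states of BOTH chains, which is what
# lets the two streams stay asynchronous yet correlated.
import numpy as np

def chmm_viterbi(log_pi, log_A_audio, log_A_video, log_B_audio, log_B_video):
    """log_pi:      (Na, Nv) log initial probs over product states
    log_A_audio:  (Na, Nv, Na) log P(a_t | a_{t-1}, v_{t-1})
    log_A_video:  (Na, Nv, Nv) log P(v_t | a_{t-1}, v_{t-1})
    log_B_audio:  (T, Na) per-frame log-likelihoods of the audio stream
    log_B_video:  (T, Nv) per-frame log-likelihoods of the video stream
    Returns the most likely joint state path [(a_t, v_t)]."""
    T, Na = log_B_audio.shape
    Nv = log_B_video.shape[1]
    delta = log_pi + log_B_audio[0][:, None] + log_B_video[0][None, :]
    back = np.zeros((T, Na, Nv, 2), dtype=int)
    for t in range(1, T):
        # scores[i_prev, j_prev, i, j]: best path ending in product state (i, j)
        scores = (delta[:, :, None, None]
                  + log_A_audio[:, :, :, None]
                  + log_A_video[:, :, None, :])
        flat = scores.reshape(Na * Nv, Na, Nv)
        best = flat.argmax(axis=0)
        delta = flat.max(axis=0) + log_B_audio[t][:, None] + log_B_video[t][None, :]
        back[t, :, :, 0], back[t, :, :, 1] = np.unravel_index(best, (Na, Nv))
    # trace back from the best final product state
    i, j = np.unravel_index(delta.argmax(), (Na, Nv))
    path = [(i, j)]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path.append((i, j))
    return path[::-1]
```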
With the increase in the computational power of modern computers, audio-visual speech recognition (AVSR) has become an attractive research topic that can lead to robust speech recognition in noisy environments. In the audio-visual continuous speech recognition system presented in this paper, the audio and visual observation sequences are integrated using a coupled hidden Markov model (CHMM). The statistical properties of the CHMM can describe the asynchrony of the audio and visual features while preserving their natural correlation over time. The experimental results show that the system, tested on the XM2VTS database, reduces the error rate of the audio-only speech recognition system by over 55% at an SNR of 0 dB.
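For concreteness, the reported figure is a relative reduction. If the audio-only system had, say, a 40% word error rate at 0 dB SNR (a number chosen purely for illustration and not taken from the paper), a 55% relative reduction would bring the audio-visual system to 18%:

\[
\text{relative reduction} = \frac{\mathrm{WER}_{\text{audio}} - \mathrm{WER}_{\text{AV}}}{\mathrm{WER}_{\text{audio}}}, \qquad \frac{0.40 - 0.18}{0.40} = 0.55
\]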
The growing number of multimedia applications that require robust speech recognition has generated considerable interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified both by the bimodal nature of speech generation and by the need for features that are invariant to acoustic noise. The speaker-independent audio-visual continuous speech recognition system presented in this paper relies on a robust set of visual features obtained from accurate detection and tracking of the mouth region. The visual and acoustic observation sequences are then integrated using a coupled hidden Markov model (CHMM). The statistical properties of the CHMM can model the audio and visual state asynchrony while preserving their natural correlation over time. The experimental results show that the system, tested on the XM2VTS database, reduces the error rate of the audio-only speech recognition system by over 55% at an SNR of 0 dB.
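The paper's exact visual front end is not reproduced here, but a common pipeline for mouth-region features in AVSR systems of this kind crops a tracked mouth region of interest (ROI) and keeps the low-order 2D-DCT coefficients as the visual observation vector. The sketch below is an illustration under that assumption; the function name, ROI resampling, and parameter choices are our own.

```python
# Hedged sketch of a common AVSR visual front end: crop the mouth ROI from
# each frame and keep the lowest-frequency 2D-DCT coefficients as the
# visual feature vector. The paper's actual detector and feature set may
# differ; the zig-zag truncation below is a standard simplification.
import numpy as np
from scipy.fftpack import dct

def mouth_roi_features(gray_frame, roi_box, size=32, n_coeffs=15):
    """gray_frame: 2-D uint8 array; roi_box: (x, y, w, h) from a mouth
    tracker (assumed given). Returns the n_coeffs lowest-frequency
    2D-DCT coefficients of the normalized ROI."""
    x, y, w, h = roi_box
    roi = gray_frame[y:y + h, x:x + w].astype(np.float64)
    # resample the ROI to a fixed size so features are comparable over time
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    roi = roi[np.ix_(ys, xs)]
    roi = (roi - roi.mean()) / (roi.std() + 1e-8)  # illumination normalization
    # separable 2-D DCT: transform rows, then columns
    coeffs = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    # keep low-frequency coefficients in zig-zag order
    order = sorted(((i, j) for i in range(size) for j in range(size)),
                   key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in order[:n_coeffs]])
```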