2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No. 01CH37221), 2001
DOI: 10.1109/icassp.2001.940794

Asynchronous stream modeling for large vocabulary audio-visual speech recognition

Cited by 60 publications (46 citation statements). References 8 publications.

“…The resulting observation sequences are then modeled using one HMM [12]. A model fusion system based on multi-stream HMM was proposed in [13]. The multi-stream HMM assumes that the audio and video sequences are state synchronous but allows the audio and video components to make different contributions to the overall observation likelihood.…”
Section: Related Work
confidence: 99%
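
The state-synchronous combination described in this excerpt is conventionally written as a weighted product of the per-stream emission densities; the stream exponents λ_a and λ_v below are generic placeholders, not values taken from the cited work:

\[
b_j\!\left(o_t^{(a)}, o_t^{(v)}\right) \;=\; \left[\,b_j^{(a)}\!\left(o_t^{(a)}\right)\right]^{\lambda_a} \left[\,b_j^{(v)}\!\left(o_t^{(v)}\right)\right]^{\lambda_v},
\]

where both streams share the same state j at frame t (state synchrony), and the exponents let the audio and video components contribute unequally to the overall likelihood.
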
“…A related model is the factorial HMM [25], in which there is a single observation sequence but multiple state sequences that indirectly interact through their common influence on the observations. These models have found wide use in automatic speech recognition for multi-stream [4], [39] and audio-visual modeling [36]. Multiscale statistical models in the second group have been explored in many different facets of signal processing and data fusion; see [53] for an extensive review.…”
Section: HMMs and Previous Work in Multiscale Modeling
confidence: 99%
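
As a sketch of the factorial structure mentioned in this excerpt: M state chains evolve with independent dynamics while jointly generating a single observation sequence. The notation is generic, not taken from [25]:

\[
p\!\left(s_t \mid s_{t-1}\right) \;=\; \prod_{m=1}^{M} p\!\left(s_t^{(m)} \,\middle|\, s_{t-1}^{(m)}\right), \qquad o_t \,\sim\, p\!\left(o_t \,\middle|\, s_t^{(1)}, \ldots, s_t^{(M)}\right).
\]

The chains interact only indirectly, through their common influence on o_t, which is what distinguishes the factorial HMM from running M independent HMMs.
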
“…The LDA and MLLT transforms were trained for each noise condition. The video stream features were obtained by an LDA-MLLT transform of the pixels in a region of interest around the mouth, as described in [2]. The audio-visual modeling is based on context-dependent phone models.…”
Section: Evaluation Tasks
confidence: 99%
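
A minimal sketch of the video feature step named in this excerpt, assuming a grayscale mouth region of interest and a combined LDA-MLLT projection matrix estimated offline; the function name, matrix shape, and dimensions here are illustrative, not taken from [2]:

```python
import numpy as np

def video_features(mouth_roi: np.ndarray, W_lda_mllt: np.ndarray) -> np.ndarray:
    """Project a (H, W) mouth ROI to a d-dimensional visual feature.

    W_lda_mllt has shape (d, H * W): an LDA projection followed by a
    maximum-likelihood linear transform, folded into a single matrix.
    Both are assumed to have been trained offline (hypothetical setup).
    """
    x = mouth_roi.reshape(-1).astype(np.float64)  # flatten ROI pixels
    return W_lda_mllt @ x                         # (d,) feature vector

# Example with stand-in data: a 32x32 ROI projected to 24 dimensions.
rng = np.random.default_rng(0)
roi = rng.random((32, 32))
W = rng.standard_normal((24, 32 * 32))
print(video_features(roi, W).shape)  # (24,)
```
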
“…Various models of audio-visual integration for speech recognition have been proposed, among which the multistream hidden Markov model (MSHMM) has been demonstrated to consistently improve recognition over audio-only ASR [1,2,3]. This model is based on the use of parallel HMMs to represent various streams of information.…”
Section: Introduction
confidence: 99%
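
A minimal sketch of the state-synchronous MSHMM combination this excerpt refers to, assuming per-stream emission log-likelihoods are already available per frame and state; the stream exponents are hypothetical example values, not taken from [1,2,3]:

```python
import numpy as np

def combine_streams(log_b_audio: np.ndarray,
                    log_b_video: np.ndarray,
                    lam_a: float = 0.7,
                    lam_v: float = 0.3) -> np.ndarray:
    """Weighted log-domain product of audio and video emission scores.

    Both arrays have shape (T, N): T frames, N shared HMM states.
    State synchrony means the two streams index the same state at
    every frame; only their likelihood exponents differ.
    """
    return lam_a * log_b_audio + lam_v * log_b_video

# Example: 3 frames, 2 states.
la = np.log(np.array([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]))
lv = np.log(np.array([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]]))
print(combine_streams(la, lv))  # (3, 2) combined log-scores
```

The combined scores can then be plugged into a standard Viterbi or forward pass over the shared state sequence.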