1998
DOI: 10.1007/bfb0054771

Continuous audio-visual speech recognition

Abstract: We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modelling of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This appro…

Cited by 17 publications (16 citation statements)
References 24 publications
“…Other models were developed by placing constraints on the states or the transitions in order to make the new models tractable. The Multi-Stream HMM [41,42] allows for multiple input feature streams that may have different frame rates and can be asynchronous. It assumes that the model consists of a number of sub-unit models that correspond to the level at which the streams have to synchronize, for example phoneme level or syllable level.…”
Section: Data Fusion Architecture
Mentioning confidence: 99%
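The quoted passage describes the standard Multi-Stream HMM formulation: each stream is scored by its own HMM, and per-stream log-likelihoods are combined with exponent weights at the synchronisation level (e.g. per phoneme). A minimal sketch of that fusion follows; all parameters (transition matrices, emission log-probabilities, weights) are illustrative toy values, not taken from the paper.

```python
# Sketch of multi-stream HMM score fusion at a sync point:
#   log P(O | model) = sum_s lambda_s * log P(O_s | model_s)
# Streams may have different frame rates, so each stream's forward
# pass runs over its own number of frames independently.
import numpy as np

def forward_loglik(log_trans, log_init, log_emit):
    """Log-likelihood of one stream under one HMM (forward algorithm,
    log space). log_trans: (S, S); log_init: (S,); log_emit: (T, S)
    per-frame emission log-probabilities for this stream."""
    alpha = log_init + log_emit[0]
    for t in range(1, len(log_emit)):
        # alpha_j(t) = emit_j(t) + logsumexp_i(alpha_i(t-1) + trans_ij)
        alpha = log_emit[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

def multistream_score(streams, weights):
    """Weighted combination of per-stream scores at the sync level."""
    return sum(w * forward_loglik(*s) for s, w in zip(streams, weights))

# Toy example: an audio stream (40 frames) and a video stream
# (10 frames) scoring the same hypothesised sub-unit (e.g. a phoneme).
rng = np.random.default_rng(0)
audio = (np.log(np.full((3, 3), 1/3)), np.log(np.full(3, 1/3)),
         rng.normal(size=(40, 3)))
video = (np.log(np.full((3, 3), 1/3)), np.log(np.full(3, 1/3)),
         rng.normal(size=(10, 3)))
print(multistream_score([audio, video], weights=[0.7, 0.3]))
```

The stream weights let the recogniser trust the acoustic stream more in clean audio and shift weight to the visual stream under noise, which is the usual motivation for this architecture.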
“…The goal is to use the motion of the lips in order to improve the acoustic recognition of the words. Many different studies have shown improved speech recognition (both faster and more accurate) when visual cues are available [6][7][8][9][10][11][16][17][18][19]21,22].…”
Section: Introduction
Mentioning confidence: 99%
“…In the first stage, information from the video frames is processed in order to prepare it for integration with the acoustic signal [7]. One simplistic example of this is image-based data extraction, during which the image of the mouth is selected without any processing [7,19,20,22]. While all the information contained within that frame is automatically selected, it does not include any dimensionality reduction and hence makes audiovisual information fusion extremely difficult.…”
Section: Introduction
Mentioning confidence: 99%
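To make the dimensionality problem in that quote concrete, here is a small sketch of image-based extraction as described: crop a fixed mouth region and use the raw pixels as the per-frame feature vector. The crop coordinates, sizes, and dummy frame below are hypothetical placeholders, not values from the cited work.

```python
# Sketch of raw "image-based" visual feature extraction: the mouth
# region is selected and flattened with no dimensionality reduction.
import numpy as np

def mouth_roi_features(frame, top=120, left=96, size=32):
    """Crop a fixed size-x-size mouth region from a grayscale frame
    and flatten it into one raw feature vector per video frame."""
    roi = frame[top:top + size, left:left + size]
    return roi.astype(np.float32).ravel()  # size*size dims, e.g. 1024

frame = np.zeros((240, 320), dtype=np.uint8)  # dummy 240x320 frame
print(mouth_roi_features(frame).shape)        # (1024,)
```

Even a small 32x32 crop yields a 1024-dimensional vector per frame, against roughly 39-dimensional acoustic features typical of MFCC front ends, which illustrates why fusion is difficult without reduction such as PCA ("eigenlips") or a learned appearance model like the one in this paper.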