2002
DOI: 10.1109/6046.985551
A review of speech-based bimodal recognition

Abstract: Speech recognition and speaker recognition by machine are crucial ingredients for many important applications such as natural and flexible human-machine interfaces. Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that preclude its use in many real-world applications, particularly under adverse conditions. The combination of …

Cited by 185 publications (93 citation statements)
References 114 publications
“…Successful audio-visual information fusion should take advantage of the complementary nature of the two modalities to produce a synergetic performance gain. On the other hand, the integrated recognition performance may be even worse than the performance of any modality if the integration is not performed properly, which is called "attenuating fusion" [1].…”
Section: Audio-Visual Information Fusion
confidence: 99%
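The "attenuating fusion" effect quoted above can be illustrated with a toy score-level combination. This sketch is purely illustrative (the function names, weights, and scores are assumptions, not from the paper): when one stream is degraded, a naive fixed-weight fusion can pick a class that the better stream alone would have rejected.

```python
# Illustrative sketch of score-level audio-visual fusion -- not the paper's method.
def fuse_scores(audio, visual, w=0.5):
    """Weighted linear combination of per-class scores from the two streams.
    w is the (hypothetical) audio-stream weight; 1 - w weights the visual stream."""
    return [w * a + (1 - w) * v for a, v in zip(audio, visual)]

def classify(scores):
    """Pick the class index with the highest fused score."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Assume class 0 is the true class. Audio alone gets it right; the visual
# stream is degraded and favours class 1.
audio = [0.6, 0.4]
visual = [0.1, 0.9]

print(classify(audio))                       # audio alone -> class 0
print(classify(fuse_scores(audio, visual)))  # naive fusion -> class 1 (worse than audio alone)
```

With equal weights the fused scores are [0.35, 0.65], so the combined system errs where the audio-only system succeeded, which is precisely the attenuating-fusion failure mode the quote describes.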
“…Audio-visual speech recognition (AVSR) systems which additionally observe lip movements along with acoustic speech have been proposed and shown to produce enhanced noise-robust performance due to the complementary nature of the two modalities [1]. The speakers' lip movements contain significant cues about spoken language and, besides, they are not affected by acoustic noise.…”
Section: Introduction
confidence: 99%
“…One of the first more comprehensive data sets, namely DAVID-BT, was created in 1996 (Chibelushi et al 2002). It is composed of 4 corpora with different research themes.…”
Section: Review of Audio-Visual Corpora
confidence: 99%
“…Widely used algorithms are statistically based, such as Hidden Markov models (HMMs) (Mana and Pianesi 2006) and dynamic time warping (DTW) (Rabiner and Juang 1993). Multimodal recognition is recently acknowledged as a vital component of the next generation of spoken language systems (Chibelushi et al 2002). …”
Section: Interaction Awareness
confidence: 99%
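The last quote names dynamic time warping (DTW) as one of the statistically motivated alignment algorithms. A minimal sketch of the classic DTW recursion (a generic textbook formulation, not code from any of the cited works) on 1-D sequences:

```python
# Minimal dynamic time warping (DTW) sketch on 1-D sequences.
def dtw_distance(a, b):
    """Cumulative alignment cost between sequences a and b via the DTW recursion."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# A repeated frame is absorbed by the warping path at zero extra cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0
```

The non-linear warping path is what lets DTW match utterances spoken at different rates, which is why it appears alongside HMMs in the quoted survey of recognition algorithms.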