2002
DOI: 10.1109/6046.985551
A review of speech-based bimodal recognition

Abstract: Speech recognition and speaker recognition by machine are crucial ingredients for many important applications such as natural and flexible human-machine interfaces. Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that preclude its use in many real-world applications, particularly under adverse conditions. The combination of …

Cited by 185 publications (93 citation statements)
References 114 publications
“…Successful audio-visual information fusion should take advantage of the complementary nature of the two modalities to produce a synergetic performance gain. On the other hand, the integrated recognition performance may be even worse than the performance of any modality if the integration is not performed properly, which is called "attenuating fusion" [1].…”
Section: Audio-Visual Information Fusion
confidence: 99%
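The "attenuating fusion" effect quoted above can be illustrated with a toy score-level combination. This sketch is purely illustrative (the function names, weights, and scores are assumptions, not from the paper): when one stream is degraded, a naive fixed-weight fusion can pick a class that the better stream alone would have rejected.

```python
# Illustrative sketch of score-level audio-visual fusion -- not the paper's method.
def fuse_scores(audio, visual, w=0.5):
    """Weighted linear combination of per-class scores from the two streams.
    w is the (hypothetical) audio-stream weight; 1 - w weights the visual stream."""
    return [w * a + (1 - w) * v for a, v in zip(audio, visual)]

def classify(scores):
    """Pick the class index with the highest fused score."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Assume class 0 is the true class. Audio alone gets it right; the visual
# stream is degraded and favours class 1.
audio = [0.6, 0.4]
visual = [0.1, 0.9]

print(classify(audio))                       # audio alone -> class 0
print(classify(fuse_scores(audio, visual)))  # naive fusion -> class 1 (worse than audio alone)
```

With equal weights the fused scores are [0.35, 0.65], so the combined system errs where the audio-only system succeeded, which is precisely the attenuating-fusion failure mode the quote describes.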
“…Audio-visual speech recognition (AVSR) systems which additionally observe lip movements along with acoustic speech have been proposed and shown to produce enhanced noise-robust performance due to the complementary nature of the two modalities [1]. The speakers' lip movements contain significant cues about spoken language and, besides, they are not affected by acoustic noise.…”
Section: Introduction
confidence: 99%
“…One of the first more comprehensive data sets, namely DAVID-BT, was created in 1996 (Chibelushi et al 2002). It is composed of 4 corpora with different research themes.…”
Section: Review of Audio-Visual Corpora
confidence: 99%
“…Widely used algorithms are statistically based, such as Hidden Markov models (HMMs) (Mana and Pianesi 2006) and dynamic time warping (DTW) (Rabiner and Juang 1993). Multimodal recognition is recently acknowledged as a vital component of the next generation of spoken language systems (Chibelushi et al 2002). …”
Section: Interaction Awareness
confidence: 99%
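The last quote names dynamic time warping (DTW) as one of the statistically motivated alignment algorithms. A minimal sketch of the classic DTW recursion (a generic textbook formulation, not code from any of the cited works) on 1-D sequences:

```python
# Minimal dynamic time warping (DTW) sketch on 1-D sequences.
def dtw_distance(a, b):
    """Cumulative alignment cost between sequences a and b via the DTW recursion."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# A repeated frame is absorbed by the warping path at zero extra cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0
```

The non-linear warping path is what lets DTW match utterances spoken at different rates, which is why it appears alongside HMMs in the quoted survey of recognition algorithms.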