2009
DOI: 10.1007/978-3-642-04697-1_13
Two-Level Bimodal Association for Audio-Visual Speech Recognition

Abstract: This paper proposes a new method for bimodal information fusion in audio-visual speech recognition, in which cross-modal association is considered at two levels. First, the acoustic and visual data streams are combined at the feature level using canonical correlation analysis, which addresses audio-visual synchronization and exploits the cross-modal correlation. Second, the information streams are integrated at the decision level for adaptive fusion of the streams according to t…

Cited by 3 publications (5 citation statements) · References 20 publications
“…One of the driving factors behind the adoption of multi-input AI approaches is data availability. As one can easily imagine, video data lead the list, as they consist of tuples of images and audio signals 42,43 . However, the list goes beyond video data, and several multi-modal datasets can be found publicly [44][45][46][47] .…”
Section: Multi-input Classificationmentioning
confidence: 99%
“…If the multimodal data are acquired from the same target, there usually exist cross-modal correlations between the data obtained from different sensors. For example, utilizing correlations between acoustic features and visual cues in speech recognition is beneficial to improve the overall performance [20].…”
Section: Considering Correlations Between Modalitiesmentioning
confidence: 99%
“…Multimodal data have been widely employed in recent decades [18]. One of the most prominent multimodal data is videos, which consist of image frames and audio signals [19,20,21]. In addition, human activity recognition systems usually employ data obtained from multiple sensors, including motion capture system, depth cameras, accelerometers, and microphones [12,13,15].…”
Section: Related Workmentioning
confidence: 99%