2019
DOI: 10.1088/1742-6596/1237/2/022144

A Review of Audio-Visual Fusion with Machine Learning

Abstract: Research on single-modal recognition, for example on speech signals, ECG signals, facial expressions, body postures and other physiological signals, has made some progress. However, the diversity of human brain information sources and the uncertainty of single-modal recognition mean that the accuracy of single-modal recognition is not high. Therefore, building a multimodal recognition framework that combines multiple modalities has become an effective means of improving performanc…

Cited by 9 publications
(3 citation statements)
References 7 publications
“…Fusing audio-visual data, in general, are abundantly unclear and unanswered [53]. From the model training, to improvements made to deal with the modal incompleteness, to the data processing, to modal (or sample) data imbalance; from the underlining roots of the problem to the high-level semantics, similar to contemporary multi-modal systems for biometrics with audio-visual data, FIW-MM and, thus, this work in its entirety, poses more problems than it solves; we introduce a much larger problem space than that of solutions.…”
Section: Discussion (mentioning)
confidence: 99%
“…The selection of the appropriate fusion technique depends on the specific requirements of the speech recognition task and the available computational resources. In addition, both speech processing and audio machine learning [188], [189] are other topics suitable for utilizing model fusion or ensemble learning method to combine the result of multiple models. It also worth to discuss and highlight the issues regarding how to exploit multimodal machine learning technology or multi-modal information fusion on the topic of speech processing and audio machine learning in the future.…”
Section: ) Stacking (mentioning)
confidence: 99%
“…Multimodal learning is important for many tasks, including audio visual speech recognition (Yu et al, 2020; Zhou et al, 2019; Su et al, 2017), emotion recognition (Park et al, 2020; Cao et al, 2014), multimedia event detection (Song et al, 2019), depth-based object detection (Wang et al, 2015b;a), urban dynamics modeling (Zhang et al, 2017), image-sentence matching (Liu et al, 2019), and biometric recognition (Song et al, 2019). In many cases, an individual modality does not contain sufficient information to classify the scene.…”
Section: Introduction (mentioning)
confidence: 99%