A Review of Audio-Visual Fusion with Machine Learning

Song, Xiaoyu; Chen, Hong; Wang, Qing; Chen, Yunqiang; Tian, Mengxiao; Tang, Hui

doi:10.1088/1742-6596/1237/2/022144

Cited by 9 publications

(3 citation statements)

References 7 publications


“…Fusing audio-visual data, in general, are abundantly unclear and unanswered [53]. From the model training, to improvements made to deal with the modal incompleteness, to the data processing, to modal (or sample) data imbalance; from the underlining roots of the problem to the high-level semantics, similar to contemporary multi-modal systems for biometrics with audio-visual data, FIW-MM and, thus, this work in its entirety, poses more problems than it solves; we introduce a much larger problem space than that of solutions.…”

Section: Discussion (mentioning)

confidence: 99%

Families In Wild Multimedia: A Multimodal Database for Recognizing Kinship

Robinson, Khan, Yin et al. 2020

Preprint


Recognizing kinship, a soft biometric with vast applications, in photos has piqued the interest of many machine vision researchers. The large-scale Families In the Wild (FIW) database promoted the problem by supporting annual kinship-based vision challenges that saw consistent performance improvements. We have now begun to approach performance levels for image-based systems acceptable for practical use, something unforeseeable a decade ago. However, biometric systems can benefit from multi-modal perspectives, as information contained in multimedia can add to and complement that of still images. Thus, we aim to narrow the gap from research-to-reality by extending FIW with multimedia data (i.e., video, audio, and contextual transcripts). Specifically, we introduce the first large-scale dataset for recognizing kinship in multimedia, the FIW in Multimedia (FIW-MM) database. We utilize automated machinery to collect, annotate, and prepare the data with minimal human input and no financial cost. This large-scale, multimedia corpus allows problem formulations to follow more realistic template-based protocols. We show significant improvements in benchmarks for multiple kin-based tasks when additional media-types are added. Experiments provide insights by highlighting edge cases to inspire future research and areas of improvement. Emphasis is put on short and long-term research directions, with the overarching intent to increase the potential of systems built to automatically detect kinship in multimedia. Furthermore, we expect a broader range of researchers with recognition tasks, generative modeling, speech understanding, and nature-based narratives.


“…The selection of the appropriate fusion technique depends on the specific requirements of the speech recognition task and the available computational resources. In addition, both speech processing and audio machine learning [188], [189] are other topics suitable for utilizing model fusion or ensemble learning method to combine the result of multiple models. It also worth to discuss and highlight the issues regarding how to exploit multimodal machine learning technology or multi-modal information fusion on the topic of speech processing and audio machine learning in the future.…”

Section: Stacking (mentioning)

confidence: 99%

Ensemble Multifeatured Deep Learning Models and Applications: A Survey

Abimannan, El-Alfy, Chang et al. 2023

IEEE Access


Ensemble multifeatured deep learning methodology has emerged as a powerful approach to overcome the limitations of single deep learning models in terms of generalization, robustness, and performance. This survey provides an extended review of ensemble multifeatured deep learning models, and their applications, challenges, and future directions. We explore potential applications of these models across various domains, including computer vision, medical imaging, natural language processing, and speech recognition. By combining the strengths of multiple models and features, ensemble multifeatured deep learning models have demonstrated improved performance and adaptability in diverse problem settings. We also discuss the challenges associated with these models, such as model interpretability, computational complexity, ensemble model selection, adversarial robustness, and personalized and federated learning. This survey highlights recent advancements in addressing these challenges and emphasizes the importance of continued research in tackling these issues to enable widespread adoption of ensemble multifeatured deep learning models. It provides an outlook on future research directions, focusing on the development of new algorithms, frameworks, and hardware architectures that can efficiently handle the large-scale computations required by these models. Moreover, it underlines the need for a better understanding of the trade-offs between model complexity, accuracy, and computational resources to optimize the design and deployment of ensemble multifeatured deep learning models.


“…Multimodal learning is important for many tasks, including audio visual speech recognition (Yu et al, 2020;Zhou et al, 2019;Su et al, 2017), emotion recognition (Park et al, 2020;Cao et al, 2014), multimedia event detection (Song et al, 2019), depth-based object detection (Wang et al, 2015b;a), urban dynamics modeling (Zhang et al, 2017), image-sentence matching (Liu et al, 2019), and biometric recognition (Song et al, 2019). In many cases, an individual modality does not contain sufficient information to classify the scene.…”

Section: Introduction (mentioning)

confidence: 99%

On the Benefits of Early Fusion in Multimodal Representation Learning

Talukder, Barnum, Yue 2020

Preprint


Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information. Prior work in multimodal learning fuses input modalities only after significant independent processing. On the other hand, the brain performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate the study of early multimodal fusion, we create a convolutional LSTM network architecture that simultaneously processes both audio and visual inputs, and allows us to select the layer at which audio and visual information combines. Our results demonstrate that immediate fusion of audio and visual inputs in the initial C-LSTM layer results in higher performing networks that are more robust to the addition of white noise in both audio and visual inputs.
