Multimodal Fusion for Audio-Image and Video Action Recognition

Shaikh, Muhammad Bilal; Chai, Douglas; Islam, Syed Mohammad Shamsul; Akhtar, Naveed

doi:10.2139/ssrn.4342070

Cited by 1 publication

(1 citation statement)

References 134 publications

(178 reference statements)


“…Data Preprocessing: The video and audio data were preprocessed separately, as described in the following subsections. The video data was transformed into frames, while the audio data was converted into six audio-image representations following [14], [23]. Standard normalization techniques were applied to both modalities.…”

Section: Proposed Methodology (mentioning)

confidence: 99%
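
The preprocessing described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' pipeline: the excerpt does not name the six audio-image representations, so a plain magnitude spectrogram is used here as an assumed stand-in for one of them, and z-score scaling is assumed for "standard normalization". The function names `standardize` and `magnitude_spectrogram`, the frame length, and the hop size are all hypothetical.

```python
import cmath
import math


def standardize(values):
    """Shift to zero mean and scale to unit variance (z-score)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # avoid dividing by zero on constant input
    return [(v - mean) / std for v in values]


def magnitude_spectrogram(signal, frame_len=8, hop=4):
    """Turn a 1-D signal into a 2-D "audio image" (frames x frequency bins).

    Assumed stand-in for one audio-image representation; it uses a naive
    DFT per frame so that only the standard library is needed.
    """
    image = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            row.append(abs(coeff))
        image.append(row)
    return image


# Toy input: a pure tone with 2 cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
audio_image = magnitude_spectrogram(tone)
normalized_rows = [standardize(row) for row in audio_image]
```

Each row of `audio_image` is one time frame and each column one frequency bin, so the result can be treated like a single-channel image. A real pipeline would more likely rely on an FFT library and would decode video into frames separately.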

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Shaikh, Chai, Islam, et al. 2022

2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)


In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast to existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance. This underscores the potential enhancements derived from integrating audio and video modalities for action recognition purposes.
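
The fusion step mentioned in the abstract could look something like the sketch below. The abstract does not say how the audio-image and video representations are combined, so simple feature concatenation followed by a linear classifier is assumed here purely for illustration; it is not the MAiVAR-T architecture. All names, feature values, and weights are hypothetical.

```python
import math


def fuse_by_concatenation(audio_image_features, video_features):
    """Join the two per-modality feature vectors into one representation."""
    return list(audio_image_features) + list(video_features)


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear layer plus softmax; one weight row per action class."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical toy features and weights, chosen only for illustration.
audio_image_features = [0.2, 0.8]
video_features = [0.5, 0.1, 0.9]
fused = fuse_by_concatenation(audio_image_features, video_features)

weights = [
    [1.0, 0.0, 0.0, 0.0, 1.0],  # class 0
    [0.0, 1.0, 1.0, 0.0, 0.0],  # class 1
]
biases = [0.0, 0.0]
probabilities = classify(fused, weights, biases)
```

Concatenation keeps both modalities' features intact and leaves it to the classifier to weigh them; a transformer-based model would more plausibly fuse token sequences with attention, which this sketch does not attempt.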

