2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)
DOI: 10.1109/vcip56404.2022.10008833

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Abstract: In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities such as vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach to combining the audio-image and video modalities, with the primary aim of improving the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of d…
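The abstract sketches a transformer that fuses audio-image and video features for action classification. Below is a minimal sketch of such a fusion stage in PyTorch, assuming pre-extracted per-modality embeddings; the dimensions, token layout, and class count are illustrative assumptions, not the paper's actual architecture:

import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Hypothetical fusion stage: a CLS token attends over one audio-image
    embedding and one video embedding, and the CLS output is classified."""
    def __init__(self, dim=512, num_classes=51, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (batch, dim) pre-extracted modality embeddings
        tokens = torch.stack([audio_feat, video_feat], dim=1)  # (batch, 2, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, tokens], dim=1))      # (batch, 3, dim)
        return self.head(x[:, 0])                              # classify from CLS

model = FusionTransformer()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 51])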

Cited by 5 publications (7 citation statements); References: 43 publications
“…This dataset can thus be used to analyze critical features from the action sequences in image form. It extends our previous publication [2], which outperforms state-of-the-art methods with an accuracy of 91.2% by focusing on multimodal representations of action sequences that present critical audio features from different perspectives, as captured from each action sample. These datasets were also a prerequisite for developing an intelligent multimodal action recognition system that classifies actions with deep learning algorithms based on the acoustic and video modalities.…”
Section: Introduction (supporting)
confidence: 58%
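The cited dataset renders audio as images so that image-based classifiers can consume sound alongside video frames. Below is a minimal sketch of one common audio-to-image representation, a log-magnitude spectrogram; the synthetic chirp signal and all parameters are illustrative assumptions, not the dataset's actual pipeline:

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

# Synthetic 1-second clip (a chirp) standing in for an action's audio track.
fs = 16000
t = np.linspace(0, 1, fs, endpoint=False)
audio = signal.chirp(t, f0=200, t1=1.0, f1=4000)

# Short-time Fourier transform -> log-magnitude spectrogram "image".
freqs, times, sxx = signal.spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
log_sxx = 10 * np.log10(sxx + 1e-10)

# Save as a 2-D image that a CNN or transformer could ingest like any picture.
plt.imsave("audio_image.png", log_sxx, origin="lower", cmap="magma")
print(log_sxx.shape)  # (freq_bins, time_frames)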
“…In the context of multimodal action recognition, as in the Multimodal Audio-Image and Video Action Recognition (MAiVAR) framework [2], these data are utilized and demonstrate superior performance compared to other audio representations. The study establishes a benchmark approach for using this dataset.…”
Section: Results (mentioning)
confidence: 99%
“…Bruce et al. [48] devised a model-level multimodal fusion approach, employing spatiotemporal graph convolutional networks on the skeletal modality to learn how to transfer attention weights from the skeletal-modality network to the RGB-modality network. Shaikh et al. [49] proposed a method for extracting meaningful image and audio representations and fused them with the video representation, demonstrating better activity recognition performance than single-modality audio or video. The first two methods rely primarily on multimodal recognition built on the multi-head self-attention mechanism, enabling global learning of the relationships between different modalities.…”
Section: Multimodal Activity Recognition (mentioning)
confidence: 99%
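Both cited designs hinge on attention shared across modalities. Below is a minimal sketch of cross-modal multi-head attention, in which tokens from one modality query tokens from another; the shapes, token counts, and modality pairing are illustrative assumptions, not either paper's exact mechanism:

import torch
import torch.nn as nn

# Hypothetical cross-modal attention: RGB tokens query skeleton (or audio)
# tokens, so relationships learned in one modality can guide the other.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

rgb_tokens = torch.randn(2, 16, 256)    # (batch, RGB tokens, dim)
other_tokens = torch.randn(2, 10, 256)  # (batch, skeleton/audio tokens, dim)

fused, weights = attn(query=rgb_tokens, key=other_tokens, value=other_tokens)
print(fused.shape, weights.shape)  # (2, 16, 256) and (2, 16, 10)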
“…The outputs of the last convolutional layer of these two separate network streams are passed to the fully connected layer. This step predicts the final output by fusing the classification scores, considering the individual class labels at the score layer [105]. Different decision rules are deployed to fuse the scores at this stage.…”
Section: Late Fusion (mentioning)
confidence: 99%
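The passage describes score-level (late) fusion: each stream classifies independently, and their class scores are merged by a decision rule. Below is a minimal sketch with two common rules, average and max; the logits are illustrative, not taken from any cited experiment:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Class scores from two independent streams (e.g., audio-image and video),
# taken after each stream's final fully connected layer.
audio_logits = np.array([[2.0, 0.5, 0.1]])
video_logits = np.array([[1.2, 1.8, 0.3]])

audio_scores = softmax(audio_logits)
video_scores = softmax(video_logits)

# Two common decision rules for fusing scores at the score layer.
avg_fused = (audio_scores + video_scores) / 2        # average (sum) rule
max_fused = np.maximum(audio_scores, video_scores)   # max rule

print("average rule ->", avg_fused.argmax(axis=-1))  # predicted class index
print("max rule     ->", max_fused.argmax(axis=-1))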