2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)
DOI: 10.1109/vcip56404.2022.10008833

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Abstract: In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities such as vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach to combining the audio-image and video modalities, with the primary aim of improving the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of d…
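The abstract sketches a transformer that fuses audio-image and video features for action classification. Below is a minimal sketch of such a fusion stage in PyTorch, assuming pre-extracted per-modality embeddings; the dimensions, token layout, and class count are illustrative assumptions, not the paper's actual architecture:

import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Hypothetical fusion stage: a CLS token attends over one audio-image
    embedding and one video embedding, and the CLS output is classified."""
    def __init__(self, dim=512, num_classes=51, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (batch, dim) pre-extracted modality embeddings
        tokens = torch.stack([audio_feat, video_feat], dim=1)  # (batch, 2, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, tokens], dim=1))      # (batch, 3, dim)
        return self.head(x[:, 0])                              # classify from CLS

model = FusionTransformer()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 51])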

Cited by 5 publications (7 citation statements); References: 43 publications
“…This dataset can thus be used to analyze critical features from the action sequences in image form. It extends our previous publication [2], which outperforms state-of-the-art methods with an accuracy of 91.2% by focusing on multimodal representations of action sequences that present critical audio features from different perspectives, as captured from each action sample. These datasets were also a prerequisite for developing an intelligent multimodal action recognition system that classifies actions with deep learning algorithms based on the acoustic and video modalities.…”
Section: Introduction (supporting)
confidence: 58%
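The cited dataset renders audio as images so that image-based classifiers can consume sound alongside video frames. Below is a minimal sketch of one common audio-to-image representation, a log-magnitude spectrogram; the synthetic chirp signal and all parameters are illustrative assumptions, not the dataset's actual pipeline:

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

# Synthetic 1-second clip (a chirp) standing in for an action's audio track.
fs = 16000
t = np.linspace(0, 1, fs, endpoint=False)
audio = signal.chirp(t, f0=200, t1=1.0, f1=4000)

# Short-time Fourier transform -> log-magnitude spectrogram "image".
freqs, times, sxx = signal.spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
log_sxx = 10 * np.log10(sxx + 1e-10)

# Save as a 2-D image that a CNN or transformer could ingest like any picture.
plt.imsave("audio_image.png", log_sxx, origin="lower", cmap="magma")
print(log_sxx.shape)  # (freq_bins, time_frames)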
“…In the context of multimodal action recognition, as in the Multimodal Audio-Image and Video Action Recognition (MAiVAR) framework [2], these data are utilized and demonstrate superior performance compared to other audio representations. The study establishes a benchmark approach for using this dataset.…”
Section: Results (mentioning)
confidence: 99%
“…Bruce et al. [48] devised a model-level multimodal fusion approach, employing spatiotemporal graph convolutional networks on the skeletal modality to learn how to transfer attention weights from the skeletal-modality network to the RGB-modality network. Shaikh et al. [49] proposed a method for extracting meaningful image and audio representations and fused them with the video representation, demonstrating better activity recognition performance than single-modality audio or video. The first two methods rely primarily on multimodal recognition built on the multi-head self-attention mechanism, enabling global learning of the relationships between different modalities.…”
Section: Multimodal Activity Recognition (mentioning)
confidence: 99%
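Both cited designs hinge on attention shared across modalities. Below is a minimal sketch of cross-modal multi-head attention, in which tokens from one modality query tokens from another; the shapes, token counts, and modality pairing are illustrative assumptions, not either paper's exact mechanism:

import torch
import torch.nn as nn

# Hypothetical cross-modal attention: RGB tokens query skeleton (or audio)
# tokens, so relationships learned in one modality can guide the other.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

rgb_tokens = torch.randn(2, 16, 256)    # (batch, RGB tokens, dim)
other_tokens = torch.randn(2, 10, 256)  # (batch, skeleton/audio tokens, dim)

fused, weights = attn(query=rgb_tokens, key=other_tokens, value=other_tokens)
print(fused.shape, weights.shape)  # (2, 16, 256) and (2, 16, 10)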
“…The outputs of the last convolutional layer of these two separate network streams are passed to the fully connected layer. This step predicts the final output by fusing the classification scores, considering the individual class labels at the score layer [105]. Different decision rules are deployed to fuse the scores at this stage.…”
Section: Late Fusion (mentioning)
confidence: 99%
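The passage describes score-level (late) fusion: each stream classifies independently, and their class scores are merged by a decision rule. Below is a minimal sketch with two common rules, average and max; the logits are illustrative, not taken from any cited experiment:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Class scores from two independent streams (e.g., audio-image and video),
# taken after each stream's final fully connected layer.
audio_logits = np.array([[2.0, 0.5, 0.1]])
video_logits = np.array([[1.2, 1.8, 0.3]])

audio_scores = softmax(audio_logits)
video_scores = softmax(video_logits)

# Two common decision rules for fusing scores at the score layer.
avg_fused = (audio_scores + video_scores) / 2        # average (sum) rule
max_fused = np.maximum(audio_scores, video_scores)   # max rule

print("average rule ->", avg_fused.argmax(axis=-1))  # predicted class index
print("max rule     ->", max_fused.argmax(axis=-1))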