Multimodal fusion frameworks for Human Action Recognition (HAR) using depth and inertial sensor data have been proposed over the years. In most existing works, fusion is performed at a single level (feature level or decision level), missing the opportunity to fuse the rich mid-level features necessary for better classification. To address this shortcoming, in this paper we propose three novel deep multilevel multimodal (M²) fusion frameworks that capitalize on different fusion strategies at various stages and leverage the advantages of multilevel fusion. At the input, we transform the depth data into depth images called sequential front view images (SFIs) and the inertial sensor data into signal images. Each input modality, depth and inertial, is further made multimodal by convolving it with the Prewitt filter. Creating a "modality within modality" enables further complementary and discriminative feature extraction through Convolutional Neural Networks (CNNs). CNNs are trained on the input images of each modality to learn low-level, high-level, and complex features. The learned features are extracted and fused at different stages of the proposed frameworks to combine discriminative and complementary information. These highly informative features serve as input to a multi-class Support Vector Machine (SVM). We evaluate the proposed frameworks on three publicly available multimodal HAR datasets, namely UTD Multimodal Human Action Dataset (UTD-MHAD), Berkeley MHAD, and UTD-MHAD Kinect V2. Experimental results show that the proposed fusion frameworks outperform existing methods. Index Terms-Canonical correlation analysis, fusion of depth and inertial sensors, human action recognition, multimodal fusion.
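The Prewitt step described in this abstract can be illustrated with a minimal NumPy sketch. It assumes a single-channel image and the standard 3x3 Prewitt kernels; the helper names `convolve2d_valid` and `prewitt_edges` are hypothetical, and combining the two gradients into a magnitude image is an assumption rather than the authors' documented procedure.

```python
import numpy as np

# Standard 3x3 Prewitt kernels for horizontal and vertical gradients.
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)
PREWITT_Y = PREWITT_X.T


def convolve2d_valid(image, kernel):
    """Plain 2D convolution over the 'valid' region (no padding)."""
    image = np.asarray(image, dtype=float)
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    # Accumulate one shifted, weighted copy of the image per kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def prewitt_edges(image):
    """Gradient magnitude image (assumed way of building the second modality)."""
    gx = convolve2d_valid(image, PREWITT_X)
    gy = convolve2d_valid(image, PREWITT_Y)
    return np.hypot(gx, gy)


# Toy input: a vertical step edge in a 5x6 image.
step = np.zeros((5, 6))
step[:, 3:] = 1.0
edges = prewitt_edges(step)
```

Under these assumptions, the original image and its edge image would be fed to separate CNN streams, which is one reading of "modality within modality".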
This paper attempts to improve the accuracy of Human Action Recognition (HAR) by fusing depth and inertial sensor data. First, we transform the depth data into Sequential Front view Images (SFI) and fine-tune the pre-trained AlexNet on these images. Then, the inertial data is converted into Signal Images (SI), and another convolutional neural network (CNN) is trained on these images. Finally, learned features are extracted from both CNNs and fused to form a shared feature layer, and these features are fed to the classifier. We experiment with two classifiers, namely Support Vector Machines (SVM) and a softmax classifier, and compare their performance. The recognition accuracies of each modality alone (depth data and inertial sensor data) are also calculated and compared with the fusion-based accuracies to show that fusing modalities yields better results than either modality individually. Experimental results on UTD-MHAD and Kinect 2D datasets show that the proposed method achieves state-of-the-art results when compared to other recently proposed visual-inertial action recognition methods. Index Terms-Convolutional neural network, data augmentation, multimodal fusion.
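The shared feature layer mentioned in this abstract can be sketched as feature-level fusion by concatenation. This is a minimal NumPy illustration under stated assumptions: the feature dimensions are made up, the per-sample L2 normalization is an added assumption, and `fuse_features` and `softmax` are hypothetical helper names rather than the authors' code.

```python
import numpy as np


def l2_normalize(features, eps=1e-12):
    """Scale each row (one sample) to unit L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)


def fuse_features(depth_feats, inertial_feats):
    """Feature-level fusion: normalize each modality, then concatenate."""
    return np.concatenate(
        [l2_normalize(depth_feats), l2_normalize(inertial_feats)], axis=1
    )


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical CNN features: 4 samples, 8-D depth and 4-D inertial.
rng = np.random.default_rng(0)
depth_feats = rng.normal(size=(4, 8))
inertial_feats = rng.normal(size=(4, 4))
shared = fuse_features(depth_feats, inertial_feats)

# Untrained linear layer with 3 classes, only to show the data flow.
probs = softmax(shared @ rng.normal(size=(12, 3)))
```

The same `shared` matrix could instead be passed to an SVM; the abstract compares both classifiers.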
One of the major reasons for misclassification of multiplex actions during action recognition is the unavailability of complementary features that provide semantic information about the actions. In different domains, these features are present at different scales and intensities. In the existing literature, features are extracted independently in different domains, but the benefits of fusing these multidomain features are not realized. To address this challenge and to extract a complete set of complementary information, in this paper we propose a novel multidomain multimodal fusion framework that extracts complementary and distinct features from different domains of the input modality. We transform the input inertial data into signal images and then make the input modality multidomain and multimodal by transforming the spatial domain information into the frequency and time-spectrum domains using the Discrete Fourier Transform (DFT) and the Gabor wavelet transform (GWT), respectively. Features in the different domains are extracted by Convolutional Neural Networks (CNNs) and then fused by canonical correlation based fusion (CCF) to improve the accuracy of human action recognition. Experimental results on three inertial datasets show the superiority of the proposed method when compared to the state-of-the-art.
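The frequency-domain view of a signal image can be sketched with NumPy's FFT routines. This is an illustrative sketch only: the abstract does not say how the spectrum is rendered as an image, so the centered log-magnitude used here is an assumption, and `dft_log_magnitude` is a hypothetical name.

```python
import numpy as np


def dft_log_magnitude(signal_image):
    """Centered log-magnitude spectrum of a 2D signal image."""
    spectrum = np.fft.fft2(np.asarray(signal_image, dtype=float))
    centered = np.fft.fftshift(spectrum)  # move the DC term to the center
    return np.log1p(np.abs(centered))     # compress the dynamic range


# Toy signal image: a constant 4x4 image has energy only at DC.
constant = np.ones((4, 4))
freq_image = dft_log_magnitude(constant)
```

For the constant image, all energy sits in the DC bin, which `fftshift` places at index `[2, 2]`; a real signal image would spread energy across many bins.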
Stress analysis and assessment of affective states of mind using ECG as a physiological signal is an active research topic in biomedical signal processing. However, the existing literature provides only a binary assessment of stress, while multiple levels of assessment may be more beneficial for healthcare applications. Furthermore, in present research, the ECG signal for stress analysis is examined independently in the spatial domain or in transform domains, but the advantage of fusing these domains has not been fully utilized. To get the maximum advantage of fusing different domains, we introduce a dataset with multiple stress levels and then classify these levels using a novel deep learning approach that converts the ECG signal into signal images based on R-R peaks without any feature extraction. Moreover, we make the signal images multimodal and multidomain by converting them into the time-frequency and frequency domains using the Gabor wavelet transform (GWT) and the Discrete Fourier Transform (DFT), respectively. Convolutional Neural Networks (CNNs) are used to extract features from the different modalities, and decision-level fusion is then performed to improve the classification accuracy. The experimental results on an in-house dataset collected from 15 users show that, with the proposed fusion framework and ECG signal-to-image conversion, we reach an average accuracy of 85.45%.
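Decision-level fusion, as named in this abstract, combines the per-class scores of separately trained models. The abstract does not state the fusion rule, so the sketch below assumes a weighted average of class probabilities; `fuse_decisions` and the example scores are hypothetical.

```python
import numpy as np


def fuse_decisions(prob_list, weights=None):
    """Weighted average of per-model class probabilities.

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    Returns the fused probabilities and the predicted class per sample.
    """
    probs = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))  # equal weights by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # make the weights sum to 1
    fused = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return fused, fused.argmax(axis=1)


# Made-up softmax outputs of three CNNs (spatial, GWT, DFT), 2 samples, 3 classes.
spatial = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
gwt = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
dft = np.array([[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])
fused, labels = fuse_decisions([spatial, gwt, dft])
```

In this toy case the fused decision differs from the spatial model's own prediction on both samples, which is the kind of correction fusion is meant to provide.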
In this paper, we present a novel Image Fusion Model (IFM) for ECG heart-beat classification to overcome the weaknesses of existing machine learning techniques that rely either on manual feature extraction or on direct use of the 1D raw ECG signal. At the input of IFM, we first convert the ECG heart-beats into three different images using Gramian Angular Field (GAF), Recurrence Plot (RP), and Markov Transition Field (MTF), and then fuse these images to create a single imaging modality. We use AlexNet for feature extraction and classification and thus employ end-to-end deep learning. We perform experiments on PhysioNet's MIT-BIH dataset for five different arrhythmias in accordance with the AAMI EC57 standard and on the PTB diagnostics dataset for myocardial infarction (MI) classification. We achieve state-of-the-art results in terms of prediction accuracy, precision, and recall.
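The three encodings named in this abstract can be sketched in plain NumPy for a single 1D heart-beat. Several choices here are assumptions and not taken from the paper: the summation form of the GAF, a recurrence plot on scalar values with a fixed threshold, quantile bins for the MTF, and fusion by stacking the three images as channels. All function names are hypothetical.

```python
import numpy as np


def rescale(x, lo=-1.0, hi=1.0):
    """Min-max rescale a 1D series to [lo, hi]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.full_like(x, (lo + hi) / 2.0)
    return lo + (hi - lo) * (x - x.min()) / span


def gramian_angular_field(x):
    """Summation GAF: cos(phi_i + phi_j), with phi = arccos(rescaled x)."""
    phi = np.arccos(np.clip(rescale(x), -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])


def recurrence_plot(x, threshold=0.1):
    """Binary recurrence plot on scalar states (no time-delay embedding)."""
    x = rescale(x, 0.0, 1.0)
    dist = np.abs(x[:, None] - x[None, :])
    return (dist <= threshold).astype(float)


def markov_transition_field(x, n_bins=8):
    """MTF: transition probability between the quantile bins of x_i and x_j."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)  # bin index in 0..n_bins-1
    counts = np.zeros((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    trans = np.divide(counts, row_sums,
                      out=np.zeros_like(counts), where=row_sums > 0)
    return trans[states[:, None], states[None, :]]


def heartbeat_to_image(x):
    """Assumed fusion: stack GAF, RP and MTF as three image channels."""
    return np.stack([gramian_angular_field(x),
                     recurrence_plot(x),
                     markov_transition_field(x)], axis=-1)


# Toy "heart-beat": one period of a sine wave with 32 samples.
beat = np.sin(np.linspace(0, 2 * np.pi, 32))
image = heartbeat_to_image(beat)
```

The resulting `(32, 32, 3)` array has the layout of an RGB image, so after resizing it could be passed to an image classifier such as AlexNet; libraries such as pyts provide maintained versions of these three transforms.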