2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01330
MMTM: Multimodal Transfer Module for CNN Fusion

Cited by 98 publications (66 citation statements)
References 58 publications
“…Cheng et al. [15] proposed a cross-modality compensation block to learn complementary information between two modalities and compensate the unimodal features for better action recognition performance. Some works also perform cross-modality interactive feature learning at multiple levels of the network hierarchy for multi-modal fusion applications, such as the Multimodal Transfer Module (MMTM) [51] and the Information Aggregation-Distribution Module (IADM) [52]. Moreover, audio is usually converted to a spectral representation to assist action recognition [53], [54], and the multimodal fusion is performed at the feature level.…”
Section: Related Work
confidence: 99%
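
The MMTM mechanism referenced above follows a squeeze-and-excitation pattern applied across modalities. Below is a minimal PyTorch sketch of that recalibration idea, assuming two convolutional streams and an illustrative reduction ratio; it is a sketch of the technique, not the authors' released code.

```python
import torch
import torch.nn as nn

class MMTMSketch(nn.Module):
    """Minimal sketch of MMTM-style cross-modal channel recalibration.

    The reduction ratio and the 2*sigmoid gating are illustrative
    assumptions, not necessarily the authors' exact configuration.
    """
    def __init__(self, dim_a, dim_b, ratio=4):
        super().__init__()
        dim_joint = (dim_a + dim_b) // ratio
        self.squeeze_fc = nn.Linear(dim_a + dim_b, dim_joint)
        self.excite_a = nn.Linear(dim_joint, dim_a)
        self.excite_b = nn.Linear(dim_joint, dim_b)
        self.relu = nn.ReLU()

    def forward(self, feat_a, feat_b):
        # feat_a: (B, C_a, ...), feat_b: (B, C_b, ...) from two unimodal streams.
        # Squeeze: global-average-pool all spatial/temporal dims of each stream.
        z_a = feat_a.flatten(2).mean(-1)
        z_b = feat_b.flatten(2).mean(-1)
        # Joint embedding conditioned on both modalities.
        joint = self.relu(self.squeeze_fc(torch.cat([z_a, z_b], dim=1)))
        # Excite: per-modality channel gates derived from the joint embedding.
        gate_a = 2.0 * torch.sigmoid(self.excite_a(joint))
        gate_b = 2.0 * torch.sigmoid(self.excite_b(joint))
        # Broadcast the gates back over the spatial/temporal dims.
        shape_a = list(gate_a.shape) + [1] * (feat_a.dim() - 2)
        shape_b = list(gate_b.shape) + [1] * (feat_b.dim() - 2)
        return feat_a * gate_a.view(shape_a), feat_b * gate_b.view(shape_b)
```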
“…The temporal shift module (TSM) [2] shifts filter channels along the temporal dimension, complementing efficiency with good accuracy, which makes it suitable for real-time applications. In addition to using only RGB frames as input, methods [9–13] that extract poses from frames as supporting information have also achieved excellent performance.…”
Section: Related Work
confidence: 99%
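
As a rough illustration of the channel-shifting idea behind TSM, the following PyTorch sketch shifts a fraction of the channels one step forward and one step backward in time with zero padding. The fraction (`shift_div`) and the (B, T, C, H, W) tensor layout are assumptions made for illustration.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Sketch of the TSM idea: shift a fraction of channels along time.

    x: (B, T, C, H, W) video features. 1/shift_div of the channels are
    shifted toward the future, 1/shift_div toward the past, and the
    remaining channels are left in place.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the rest unshifted
    return out
```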
“…Inspired by the multi-modal transfer module, which recalibrates channel-wise features of each modality based on multi-modal information [36], and the convolutional block attention module, which focuses on the spatial information of the feature maps [30], we devised a CMIM based on an attention mechanism to adaptively recalibrate temporal- and axis-wise features in each modality by utilizing multi-modal information. The detailed CMIM is illustrated in Figure 2c.…”
Section: Cross-Modality Interaction Module
confidence: 99%
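
Since the cited CMIM is only summarized here, the sketch below is a hypothetical PyTorch rendering of the stated idea: an attention map over time and sensor axes, computed from both modalities, is used to recalibrate one modality's features. The class name, dimensions, and layer choices are all assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    """Hypothetical CMIM-style module: temporal- and axis-wise
    attention on one modality, conditioned on both modalities.
    """
    def __init__(self, channels_a, channels_b):
        super().__init__()
        # 1x1 conv collapses the concatenated channels into a single
        # attention map over the temporal and axis dimensions.
        self.attn = nn.Sequential(
            nn.Conv2d(channels_a + channels_b, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, T, A) features over time (T) and sensor axes (A).
        joint = torch.cat([feat_a, feat_b], dim=1)
        mask = self.attn(joint)   # (B, 1, T, A) temporal/axis attention weights
        return feat_a * mask      # recalibrated modality-A features
```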