2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01330
MMTM: Multimodal Transfer Module for CNN Fusion

Cited by 98 publications (66 citation statements)
References 58 publications
“…Cheng et al. [15] proposed a cross-modality compensation block to learn complementary information between two modalities and compensate the unimodal features for better action recognition performance. Some works also perform cross-modality interactive feature learning at multiple levels of the network hierarchy for multi-modal fusion applications, such as the Multimodal Transfer Module (MMTM) [51] and the Information Aggregation-Distribution Module (IADM) [52]. Moreover, audio is usually converted to a spectral representation to assist action recognition [53], [54], and the multimodal fusion is performed at the feature level.…”
Section: Related Work
confidence: 99%
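
The MMTM mechanism referenced above follows a squeeze-and-excitation pattern applied across modalities. Below is a minimal PyTorch sketch of that recalibration idea, assuming two convolutional streams and an illustrative reduction ratio; it is a sketch of the technique, not the authors' released code.

```python
import torch
import torch.nn as nn

class MMTMSketch(nn.Module):
    """Minimal sketch of MMTM-style cross-modal channel recalibration.

    The reduction ratio and the 2*sigmoid gating are illustrative
    assumptions, not necessarily the authors' exact configuration.
    """
    def __init__(self, dim_a, dim_b, ratio=4):
        super().__init__()
        dim_joint = (dim_a + dim_b) // ratio
        self.squeeze_fc = nn.Linear(dim_a + dim_b, dim_joint)
        self.excite_a = nn.Linear(dim_joint, dim_a)
        self.excite_b = nn.Linear(dim_joint, dim_b)
        self.relu = nn.ReLU()

    def forward(self, feat_a, feat_b):
        # feat_a: (B, C_a, ...), feat_b: (B, C_b, ...) from two unimodal streams.
        # Squeeze: global-average-pool all spatial/temporal dims of each stream.
        z_a = feat_a.flatten(2).mean(-1)
        z_b = feat_b.flatten(2).mean(-1)
        # Joint embedding conditioned on both modalities.
        joint = self.relu(self.squeeze_fc(torch.cat([z_a, z_b], dim=1)))
        # Excite: per-modality channel gates derived from the joint embedding.
        gate_a = 2.0 * torch.sigmoid(self.excite_a(joint))
        gate_b = 2.0 * torch.sigmoid(self.excite_b(joint))
        # Broadcast the gates back over the spatial/temporal dims.
        shape_a = list(gate_a.shape) + [1] * (feat_a.dim() - 2)
        shape_b = list(gate_b.shape) + [1] * (feat_b.dim() - 2)
        return feat_a * gate_a.view(shape_a), feat_b * gate_b.view(shape_b)
```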
“…The temporal shift module (TSM) [2] shifts filter channels along the temporal dimension, complementing efficiency with good accuracy, which makes it suitable for real-time applications. In addition to using only RGB frames as input, methods [9–13] that extract poses from frames as supporting information have also achieved excellent performance.…”
Section: Related Work
confidence: 99%
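
As a rough illustration of the channel-shifting idea behind TSM, the following PyTorch sketch shifts a fraction of the channels one step forward and one step backward in time with zero padding. The fraction (`shift_div`) and the (B, T, C, H, W) tensor layout are assumptions made for illustration.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Sketch of the TSM idea: shift a fraction of channels along time.

    x: (B, T, C, H, W) video features. 1/shift_div of the channels are
    shifted toward the future, 1/shift_div toward the past, and the
    remaining channels are left in place.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the rest unshifted
    return out
```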
“…Inspired by the multi-modal transfer module, which recalibrates channel-wise features of each modality based on multi-modal information [36], and the convolutional block attention module, which focuses on the spatial information of the feature maps [30], we devised a CMIM based on an attention mechanism to adaptively recalibrate temporal- and axis-wise features in each modality by utilizing multi-modal information. The detailed CMIM is illustrated in Figure 2c.…”
Section: Cross-Modality Interaction Module
confidence: 99%
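
Since the cited CMIM is only summarized here, the sketch below is a hypothetical PyTorch rendering of the stated idea: an attention map over time and sensor axes, computed from both modalities, is used to recalibrate one modality's features. The class name, dimensions, and layer choices are all assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    """Hypothetical CMIM-style module: temporal- and axis-wise
    attention on one modality, conditioned on both modalities.
    """
    def __init__(self, channels_a, channels_b):
        super().__init__()
        # 1x1 conv collapses the concatenated channels into a single
        # attention map over the temporal and axis dimensions.
        self.attn = nn.Sequential(
            nn.Conv2d(channels_a + channels_b, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, T, A) features over time (T) and sensor axes (A).
        joint = torch.cat([feat_a, feat_b], dim=1)
        mask = self.attn(joint)   # (B, 1, T, A) temporal/axis attention weights
        return feat_a * mask      # recalibrated modality-A features
```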