2020
DOI: 10.1016/j.image.2019.115731

Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Abstract: This letter describes a network that is able to capture multimodal correlations over arbitrary timestamps. The proposed scheme operates as a complementary, extended network over a multimodal CNN. For action recognition, the spatial and temporal streams are vital components of deep Convolutional Neural Networks (CNNs), but reducing the occurrence of overfitting and fusing these two streams remain open problems. The existing fusion approach is to average the two streams. To this end, we propose a correlation networ…
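
The abstract names averaging the two streams as the existing fusion approach. A minimal sketch of that baseline follows, assuming each stream outputs per-class logits for the same clip. It does not represent the proposed correlation network, and the function names (`softmax`, `average_fusion`, `predict`) are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw per-class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(spatial_logits, temporal_logits):
    """Late fusion by averaging the class probabilities of the two streams.

    Assumption: both streams score the same classes in the same order.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same number of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [(s + t) / 2.0 for s, t in zip(p_spatial, p_temporal)]


def predict(spatial_logits, temporal_logits):
    """Return the index of the class with the highest fused probability."""
    fused = average_fusion(spatial_logits, temporal_logits)
    return max(range(len(fused)), key=fused.__getitem__)
```

Because the average weights both streams equally and treats them independently, it cannot model how the spatial and temporal features relate to each other, which is the gap the abstract says the correlation network targets.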

Cited by 21 publications (13 citation statements)
References 20 publications
“…The main building block of EfficientNet-B0 is the mobile inverted bottleneck (MBConv), which is based on the concept of MobileNet [54,55]. As shown in Fig.…”
Section: Efficientnet (mentioning)
confidence: 99%
“…As shown in Fig. 3, MBConv consists of two convolutional layers (k1 × 1), a depthwise convolutional layer, a Squeeze and Excitation (SE) [54,55] block, and a dropout layer. The first convolutional layer is used to expand the channels.…”
Section: Efficientnet (mentioning)
confidence: 99%
“…Recently, research in multimodal models use, in addition to the RGB video streams, information about the motion within the video sequences: the optical flow can be used [77,79] or even player pose sequences [8,71]. For golf and tennis tournaments, a multimodal architecture using the reactions (such as high fives or fist pumps) and expressions of the players (aggressive, smiling, etc.…”
Section: Related Work (mentioning)
confidence: 99%