Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Yudistira, Novanto; Kurita, Takio

doi:10.1016/j.image.2019.115731

Cited by 21 publications

(13 citation statements)

References 20 publications


“…The main building block of EfficientNet-B0 is the mobile inverted bottleneck (MBConv), which is based on the concept of MobileNet [54,55]. As shown in Fig.…”

Section: EfficientNet (mentioning)

confidence: 99%

“…As shown in Fig. 3, MBConv consists of two convolutional layers (k1 × 1), a depthwise convolutional layer, a Squeeze and Excitation (SE) [54,55] block, and a dropout layer. The first convolutional layer is used to expand the channels.…”

Section: EfficientNet (mentioning)

confidence: 99%

“…This indicates that the multi-head attention mechanism is useful for recognizing the action, and the proposed MAT-EffNet is a competitive network for action recognition.

[12] | 63.3 | -
I3D-RGB [12] | 72.1 | 90.3%
ARTNet [52] | 70.7 | 89.3%
MoViNet-A5 [66] | 71.7 | -
VidTr-L [67] | 70.2 | 89%
MAT-EffNet | 72.6 | 90.8%

LRCN [38] | RGB + optical flow | 82.9 | -
C3D [26] | RGB only + 3D CNNs | 85.2 | -
IDTs [32] | RGB only + 3D CNNs | 85.9 | 57.2%
Two-stream [1] | RGB + optical flow | 88.0 | 59.4%
FSTCN [39] | RGB + optical flow | 88.1 | 59.1%
P3D-199 [65] | RGB + 3D CNNs | 89.2 | 62.9%
TDD [34] | RGB + optical flow | 90.3 | 63.2%
STS-network [17] | RGB + optical flow + others | 90.1 | 62.4%
R-M3D [11] | RGB only + 3D CNNs | 93.2 | 65.4%
STDAN + RGB difference [58] | RGB + optical flow + others | 91.0 | 60.4%
TSN Corrnet [55] | RGB + optical flow | 94.4 | 70.6%
MSM-ResNets [56] | RGB + optical flow + others | 93.5 | 66.7%
R-STAN-50 [68] | RGB + optical flow | 91.…”

Section: Exploration of MAT-EffNet on the Kinetics-400 Dataset (mentioning)

confidence: 99%

“…As shown in Table 6, we compare MAT-EffNet with several reference approaches on the UCF101 and HMDB51 datasets. We compare our approach with both conventional approaches and deep learning-based approaches such as long-term recurrent convolutional networks (LRCN) [37], 3D convolutional networks (C3D) [25], improved trajectories (iDTs) [31], two-stream neural network (Two-stream) [1], factorized spatio-temporal convolutional network (FSTCN) [38], trajectory-pooled deep-convolutional descriptors (TDD) [33], spatiotemporal saliency-based multi-stream network (STS-network) [24], multi-cue-based 3D residual network (R-M3D) [11], motion saliency-based multi-stream multiplier ResNets (MSM-ResNets) [56] and correlational convolutional LSTM network (TSN Corrnet) [55].…”

Section: Exploration of MAT-EffNet on the UCF101 and HMDB51 Datasets (mentioning)

confidence: 99%


Multi-head attention-based two-stream EfficientNet for action recognition

Zhou, Ma, Ji, et al. 2022

Multimedia Systems


Recent years have witnessed the popularity of using two-stream convolutional neural networks for action recognition. However, existing two-stream convolutional neural network-based action recognition approaches are incapable of distinguishing some roughly similar actions in videos such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which can take advantage of the efficient feature extraction of EfficientNet. The proposed network consists of two streams (i.e., a spatial stream and a temporal stream), which first extract the spatial and temporal features from consecutive frames by using EfficientNet. Then, a multi-head attention mechanism is utilized on the two streams to capture the key action information from the extracted features. The final prediction is obtained via a late average fusion, which averages the softmax score of spatial and temporal streams. The proposed MAT-EffNet can focus on the key action information at different frames and compute the attention multiple times, in parallel, to distinguish similar actions. We test the proposed network on the UCF101, HMDB51 and Kinetics-400 datasets. Experimental results show that the MAT-EffNet outperforms other state-of-the-art approaches for action recognition.
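The late average fusion step described in the abstract above can be illustrated with a minimal sketch. It assumes each stream produces a list of per-class logits; the function names (`softmax`, `late_average_fusion`, `predict`) and the example values are hypothetical and are not taken from the MAT-EffNet paper or its code.

```python
import math


def softmax(logits):
    """Convert raw per-class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_average_fusion(spatial_logits, temporal_logits):
    """Average the softmax scores of the spatial and temporal streams.

    Both inputs are assumed to be per-class logits over the same classes.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same number of classes")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [(a + b) / 2.0 for a, b in zip(p_spatial, p_temporal)]


def predict(spatial_logits, temporal_logits):
    """Return the index of the class with the highest fused score."""
    fused = late_average_fusion(spatial_logits, temporal_logits)
    return max(range(len(fused)), key=fused.__getitem__)


# Illustrative logits only: the spatial stream favours class 0,
# the temporal stream favours class 1 more strongly.
spatial = [2.0, 1.0, 0.1]
temporal = [0.5, 2.5, 0.3]
fused_scores = late_average_fusion(spatial, temporal)
predicted_class = predict(spatial, temporal)
```

Because each stream's softmax output sums to one, the fused scores also sum to one, so they can still be read as a probability distribution over the action classes.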


“…Recently, research on multimodal models uses, in addition to the RGB video streams, information about the motion within the video sequences: the optical flow can be used [77,79] or even player pose sequences [8,71]. For golf and tennis tournaments, a multimodal architecture using the reactions (such as high fives or fist pumps) and expressions of the players (aggressive, smiling, etc.…”

Section: Related Work (mentioning)

confidence: 99%

Improved Soccer Action Spotting using both Audio and Video Streams

Vanderplaetse, Dupont 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)


In this paper, we propose a study on multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks that consist of finding the temporal anchors of events in a video and determining which events they are. This is an important application of general activity understanding. Here, we propose an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate the audio stream into video-only-based architectures. We observed an average absolute improvement of the mean Average Precision (mAP) metric of 7.43% for the action classification task and of 4.19% for the action spotting task.


An Overview of Deep Learning Techniques for Biometric Systems

Almabdy, Elrefaei 2020

Artificial Intelligence for Sustainable Development: Theory, Practice and Future Applications

