2018
DOI: 10.1007/978-3-030-01225-0_18

Spatio-temporal Channel Correlation Networks for Action Classification

Abstract: The work in this paper is driven by the question of whether spatio-temporal correlations are enough for 3D convolutional neural networks (CNNs). Most traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between channels of a 3D CNN with respect to temporal and spatial features. This new block can be added as a residual unit to different parts of 3D CNNs. We name our novel block 'Spatio-Temporal Channel Correlation' (STC). By embedding this block into the cur…
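The abstract describes the STC unit only at a high level. As a rough illustration, the sketch below implements a squeeze-and-excitation-style channel gate with a spatial branch (fully pooled descriptor) and a temporal branch (per-frame descriptors), wrapped as a residual unit around a 5D feature map; the class name, branch layout, and reduction ratio are illustrative assumptions, not the published STC design.

import torch
import torch.nn as nn


class ChannelCorrelationBlock3D(nn.Module):
    """Channel gate for a (N, C, T, H, W) feature map, used as a residual unit."""

    def __init__(self, channels: int, frames: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Spatial branch: one descriptor per channel, pooled over T, H and W.
        self.spatial_fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        # Temporal branch: per-frame channel statistics (C x T), so the gate
        # can react to how channel activations evolve over time.
        self.temporal_fc = nn.Sequential(
            nn.Linear(channels * frames, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        per_frame = x.mean(dim=(3, 4))        # (N, C, T): spatially pooled
        global_desc = per_frame.mean(dim=2)   # (N, C): pooled over time as well
        gate = torch.sigmoid(
            self.spatial_fc(global_desc) + self.temporal_fc(per_frame.reshape(n, c * t))
        )
        # Rescale channels and add the input back (residual use of the block).
        return x + x * gate.view(n, c, 1, 1, 1)


# Example: insert the block after a stage with 256 channels and 16-frame clips.
block = ChannelCorrelationBlock3D(channels=256, frames=16)
out = block(torch.randn(2, 256, 16, 14, 14))  # output keeps the input shape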

Cited by 178 publications (138 citation statements)
References 41 publications (70 reference statements)

“…Our experiments are best with STCnet and 3D-ResNet/Next configuration which is of depth 101. In Table 5, we compare the performance of DynamoNet with current state-of-the-art methods on UCF101/HMDB51.…”

Results table embedded in this excerpt (method: two accuracy columns, as given):
…: 58.9 / -
C3D [17]: 55.6 / -
3D ResNet101 [17]: 62.8 / 83.9
3D ResNext101 [17]: 65.1 / 85.7
RGB-I3D [3]: 68.4 / 88
STC-ResNet101 (16 frames) [6]: 64.1 / 85.2
STC-ResNext101 (16 frames) [6]: 66.2 / 86.5
STC-ResNext101 (32 frames) [6]: 68.7 / 88.5
DynamoNet (ResNext) (…)

Section: Action Recognition (mentioning)
confidence: 99%
“…In this section, we study the proposed automatic method of designing action recognition networks to demonstrate its advantages over other well-known action recognition architectures, e.g., 3D-ResNet [19], the C3D network [20], and STC-ResNet [21]. We evaluate our algorithm on the challenging action recognition dataset UCF101, a trimmed dataset containing 13,320 video clips from 101 classes, under the training-from-scratch protocol.…”
Section: Methods (mentioning)
confidence: 99%
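For context on the evaluation setup quoted above, here is a minimal sketch of loading 16-frame UCF101 clips with torchvision's UCF101 wrapper; the local paths are placeholders and the configuration is an assumption, not the cited paper's pipeline (decoding the videos also requires a video backend such as PyAV).

from torchvision import datasets

# Placeholder paths: extracted UCF-101 videos and the official train/test split files.
train_set = datasets.UCF101(
    root="data/UCF-101",
    annotation_path="data/ucfTrainTestlist",
    frames_per_clip=16,       # 16-frame clips, matching the setups quoted earlier
    step_between_clips=16,    # non-overlapping clips
    fold=1,
    train=True,
)

video, audio, label = train_set[0]
print(video.shape, label)     # video: (T, H, W, C) uint8 tensor, label in [0, 100]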
“…
Model                #params   Model size   Accuracy
3D-ResNet 18 [19]    33.2M     252M         42.4%
3D-ResNet 101 [19]   100M+     652M         46.7%
3D-ConvNet [20]      79M       305M         51.6%
STC-ResNet 18 [21]   33.2M+    -            42.8%
STC-ResNet 50 [21]   92M+      -            46.2%
STC-ResNet 101 [21]  100M+     -            47.9%
Ours                 0.67M     7.32M        58.6%
…”
Section: Architectures (mentioning)
confidence: 99%
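The #params and model-size columns in a comparison like the one above are usually derived from the network's learnable weights. The snippet below shows one common way to compute such figures in PyTorch, using torchvision's r3d_18 purely as a stand-in backbone (an assumption, not one of the cited models); reported "model size" conventions vary, so the numbers will not reproduce the table.

from torchvision.models.video import r3d_18

model = r3d_18()  # randomly initialised 3D ResNet-18, used only as a stand-in
num_params = sum(p.numel() for p in model.parameters())
size_mb = num_params * 4 / (1024 ** 2)  # 4 bytes per float32 parameter
print(f"{num_params / 1e6:.1f}M parameters, ~{size_mb:.0f} MB at fp32")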
“…While these works analyze whether networks can be better trained with full supervision when additional modalities, including the modality of the test data, are available during training, we address the setting where the modality of the annotated training set differs from the modality of the test set. In [4], a 3D convolutional neural network is initialized by transferring the knowledge of a pre-trained 2D CNN. Cross-modal distillation has also been used for other tasks such as object detection [20], emotion recognition [21], or human pose estimation [22].…”
Section: Related Work (mentioning)
confidence: 99%
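One widely used recipe for initializing a 3D CNN from a pre-trained 2D CNN is to "inflate" each 2D kernel along the temporal axis and rescale it; the sketch below shows that generic recipe only, and is not necessarily the transfer procedure used in [4].

import torch
import torch.nn as nn


def inflate_conv2d_to_conv3d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Seed a 3D convolution by repeating a 2D kernel over time (I3D-style inflation)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, kT, kH, kW), divided by kT so the
        # inflated filter gives a similar response on a static input.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


# Example: inflate the first convolution of a 2D ResNet-style stem.
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv3d = inflate_conv2d_to_conv3d(conv2d, time_kernel=3)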
“…Action recognition is addressed in many works; in particular, deep learning methods have been proposed for various modalities such as RGB videos [1,2,3,4] or skeleton data [5,6,7,8]. Deep learning methods for action recognition, however, require large annotated datasets.…”
Section: Introduction (mentioning)
confidence: 99%