2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00558
Action Recognition With Spatial-Temporal Discriminative Filter Banks

Abstract: Action recognition has seen a dramatic performance improvement in the last few years. Most of the current state-of-the-art literature either aims at improving performance through changes to the backbone CNN network, or explores different trade-offs between computational efficiency and performance, again by altering the backbone network. However, almost all of these works keep the same last layers of the network, which simply consist of a global average pooling followed by a fully connected layer. I…
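The "last layers" the abstract refers to are the standard classification head shared by most action-recognition backbones. Below is a minimal sketch of that baseline head only (not the filter-bank head the paper proposes), assuming a PyTorch-style 3D backbone that emits a feature map of shape (N, C, T, H, W); the channel count and class count in the usage example are illustrative.

import torch
import torch.nn as nn

class StandardHead(nn.Module):
    """Baseline head: global average pooling followed by a fully connected layer."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)      # average over T, H, W
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, C, T, H, W) output of the backbone CNN
        x = self.pool(features).flatten(1)       # -> (N, C)
        return self.fc(x)                        # -> (N, num_classes) class logits

# Hypothetical usage: 2048-channel 3D-ResNet features, 174 output classes
head = StandardHead(in_channels=2048, num_classes=174)
logits = head(torch.randn(2, 2048, 8, 7, 7))
print(logits.shape)  # torch.Size([2, 174])

Because the pooling collapses all spatial and temporal locations into a single vector, every location contributes equally to the prediction; this uniform treatment of the feature map is what the paper argues against.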

Cited by 77 publications (39 citation statements)
References 36 publications (67 reference statements)
“…SAST-EN with different and deeper backbone networks such as 3D-ResNet50, 3D-ResNet101, 3D-ResNet152 and 3D-ResNet101+NL, and add several corresponding experiments in Table 5. Our SAST-EN (R101+NL) achieves comparable performance: 53.1% top-1 accuracy and 82.1% top-5 accuracy, outperforming the recent work of Martinez et al. [51] with a deeper ResNet-152 backbone by 0.3%, STM [31] by 1.7%, and Ghadiyaram et al. [32] by 3.3% in top-5 accuracy, respectively.…”
Section: Table (mentioning)
confidence: 62%
“…Finally, to further demonstrate the effectiveness of our method, we compare our model SAST against our baseline ECO-Lite [17] and other state-of-the-art approaches on the validation set of the Something-Something-V1 dataset. For a fair comparison, we only consider methods that use RGB input and report the different backbones used by each method, following [51]. The results in Table 5 show that our approach SAST-EN (R18), with the 3D-ResNet18 (stages 3-5) backbone, outperforms the baseline ECO-Lite [17] and improves its top-1 accuracy from 46.4% to 47.5%.…”
Section: Results on Something-Something-V1 (mentioning)
confidence: 99%
“…Our approach with a ResNet-18 backbone pre-trained on Kinetics improves over the previous state of the art (with the same settings) by 1.3% in top-1 accuracy (50.8 vs. 49.5, ECO [76]). Third, we further improve performance by training with the deeper ResNet-34 backbone and the larger IG-65M+Kinetics pre-training datasets, substantially increasing top-1 accuracy by 6.7% (V1) and 6.1% (V2) over the baseline model and achieving state-of-the-art performance. Also note that Martinez et al. [37] use a much deeper ResNet-152 backbone to achieve a competitive top-1 accuracy (53.4); while we have not tried it, we expect a similar improvement, given the boosts from 50.8 to 53.0 (V1) and 64.2 to 66.3 (V2) obtained by only changing the backbone from ResNet-18 to ResNet-34, which could further improve our performance.…”
mentioning
confidence: 55%
“…Video classification: Modern, DL-based video classification methods fall largely into two categories: 2D networks [48,53] that operate on 1-5 frame snippets and 3D networks [5,6,7,12,19,31,46,51,52] that operate on 16-128 frames. One of the earliest works of this type, Simonyan and Zisserman [48], trained with only 1-5 frames sampled randomly from the video.…”
Section: Related Work (mentioning)
confidence: 99%