2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00558
Action Recognition With Spatial-Temporal Discriminative Filter Banks

Abstract: Action recognition has seen a dramatic performance improvement in the last few years. Most of the current state-of-the-art literature either aims at improving performance through changes to the backbone CNN network, or explores different trade-offs between computational efficiency and performance, again by altering the backbone network. However, almost all of these works keep the same last layers of the network, which simply consist of a global average pooling followed by a fully connected layer. I…
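The "last layers" the abstract refers to are the standard classification head shared by most action-recognition backbones. Below is a minimal sketch of that baseline head only (not the filter-bank head the paper proposes), assuming a PyTorch-style 3D backbone that emits a feature map of shape (N, C, T, H, W); the channel count and class count in the usage example are illustrative.

import torch
import torch.nn as nn

class StandardHead(nn.Module):
    """Baseline head: global average pooling followed by a fully connected layer."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)      # average over T, H, W
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (N, C, T, H, W) output of the backbone CNN
        x = self.pool(features).flatten(1)       # -> (N, C)
        return self.fc(x)                        # -> (N, num_classes) class logits

# Hypothetical usage: 2048-channel 3D-ResNet features, 174 output classes
head = StandardHead(in_channels=2048, num_classes=174)
logits = head(torch.randn(2, 2048, 8, 7, 7))
print(logits.shape)  # torch.Size([2, 174])

Because the pooling collapses all spatial and temporal locations into a single vector, every location contributes equally to the prediction; this uniform treatment of the feature map is what the paper argues against.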

Cited by 77 publications (39 citation statements)
References 36 publications (67 reference statements)
“…SAST-EN with different and deeper backbone networks such as 3D-ResNet50, 3D-ResNet101, 3D-ResNet152 and 3D-ResNet101+NL, and add several corresponding experiments in Table 5. Our SAST-EN (R101+NL) achieves comparable performance: 53.1% top-1 accuracy and 82.1% top-5 accuracy, outperforming the recent work of Martinez et al. [51] with a deeper ResNet-152 backbone by 0.3%, STM [31] by 1.7%, and Ghadiyaram et al. [32] by 3.3% in top-5 accuracy, respectively.…”
Section: Table (mentioning)
confidence: 62%
“…Finally, to further demonstrate the effectiveness of our method, we compare our model SAST against our baseline ECO-Lite [17] and other state-of-the-art approaches on the validation set of the Something-Something-V1 dataset. For a fair comparison, we only consider methods that use RGB input and report the different backbones used by each method, following [51]. The results in Table 5 show that our approach SAST-EN (R18), with the 3D-ResNet18 (stages 3-5) backbone, outperforms the baseline ECO-Lite [17] and improves its top-1 accuracy from 46.4% to 47.5%.…”
Section: Results on Something-Something-V1 (mentioning)
confidence: 99%
“…Our approach with a ResNet-18 backbone pre-trained on Kinetics improves over the previous state of the art (with the same settings) by 1.3% in top-1 accuracy (50.8 vs. 49.5, ECO [76]). Third, we further improve performance by training with the deeper ResNet-34 backbone and the larger IG-65M+Kinetics pre-training datasets, substantially increasing top-1 accuracy by 6.7% (V1) and 6.1% (V2) over the baseline model and achieving state-of-the-art performance. Also note that Martinez et al. [37] use a much deeper ResNet-152 backbone to achieve a competitive top-1 accuracy (53.4); while we have not tried it, we expect a similar improvement, given the boosts from 50.8 to 53.0 (V1) and 64.2 to 66.3 (V2) obtained by only changing the backbone from ResNet-18 to ResNet-34, which could further improve our performance.…”
mentioning
confidence: 55%
“…Video classification: Modern, DL-based video classification methods fall largely into two categories: 2D networks [48,53] that operate on 1-5 frame snippets and 3D networks [5,6,7,12,19,31,46,51,52] that operate on 16-128 frames. One of the earliest works of this type, Simonyan and Zisserman [48], trained with only 1-5 frames sampled randomly from the video.…”
Section: Related Work (mentioning)
confidence: 99%