2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018
DOI: 10.1109/cvpr.2018.00124
|View full text |Cite
|
Sign up to set email alerts
|

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations;(2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we expli… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
4
1

Citation Types

1
418
3
2

Year Published

2018
2018
2019
2019

Publication Types

Select...
6
2

Relationship

0
8

Authors

Journals

citations
Cited by 613 publications
(425 citation statements)
references
References 45 publications
1
418
3
2
Order By: Relevance
“…Buch et al [2] introduce semantics constraints for curriculum training in end-to-end temporal action localization. Chao et al [8] adopt Faster R-CNN [30] for action localization task.…”
Section: Related Workmentioning
confidence: 99%
“…Buch et al [2] introduce semantics constraints for curriculum training in end-to-end temporal action localization. Chao et al [8] adopt Faster R-CNN [30] for action localization task.…”
Section: Related Workmentioning
confidence: 99%
“…Temporal Action localization has attracted increasing attention in the last several years [6,18,26,33,34]. Inspired by the success of object detection, most current action detection methods resort to the two-stage pipeline: they first generate a set of 1D temporal proposals and then perform classification and temporal boundary regression on each proposal individually.…”
Section: Introductionmentioning
confidence: 99%
“…It combines 2D convolutional neural network and optical flow to capture appearance and motion features respectively. Recently, as kinds of 3D convolutional neural networks such as C3D [22], P3D [18], I3D [2] and 3D-ResNet [9] appear, adopting 3D convolutional neural network to extract spatio-temporal feature is getting more and more popular [1,2,25,3]. Temporal Action Proposals and Detection.…”
Section: Related Workmentioning
confidence: 99%
“…Temporal Action Proposals and Detection. Since natural videos are always long and untrimmed, temporal action proposals and detection have aroused intensive interest from researchers [6,26,1,25,3,8]. DAP [4] leverages LSTM to encode the video sequence for temporal features.…”
Section: Related Workmentioning
confidence: 99%
See 1 more Smart Citation