2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00931
Sequence Level Semantics Aggregation for Video Object Detection

Abstract: Video object detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods place more emphasis on temporally nearby frames. In this work, we argue that aggr…
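As a rough illustration of the aggregation idea the abstract describes — weighting reference features by semantic similarity rather than temporal proximity — here is a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's actual implementation: the function name, cosine-similarity measure, and softmax weighting are illustrative choices, and per-frame proposal features are assumed to be already extracted.

```python
import numpy as np

def aggregate_features(target, reference, temperature=1.0):
    """Aggregate reference proposal features into target features,
    weighted by cosine similarity (a simplified sketch of
    semantics-based, rather than temporally local, aggregation)."""
    # Normalize rows so dot products become cosine similarities.
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = t @ r.T / temperature             # (n_target, n_ref)
    # Softmax over all reference proposals, regardless of frame order:
    # proposals from anywhere in the sequence can contribute equally.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ reference                    # similarity-weighted sum

# Proposal features from the current frame and from frames sampled
# across the whole sequence (their temporal order does not matter here).
target = np.random.randn(4, 256)
refs = np.random.randn(32, 256)
out = aggregate_features(target, refs)
print(out.shape)  # (4, 256)
```

Because the softmax runs over every reference proposal, a semantically similar proposal from a distant frame can outweigh a degraded one from an adjacent frame — the sequence-level behavior the abstract contrasts with flow- or RNN-based aggregation.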

Cited by 193 publications (196 citation statements)
References 35 publications
“…Notice also that the best methods detected all four players in all or nearly all frames, without requiring video-based object detection techniques [ 42 , 43 , 44 ] which exploit temporal coherence across consecutive frames. We did not apply any temporal filtering to the data, as this would partially hide the actual accuracy of the methods being compared.…”
Section: Discussion
confidence: 99%
“…
Method | Causal? | Backbone | mAP (%) | mAP gain (%)
T-CNN [13] | No | GoogLeNet + VGG + Fast R-CNN | 73.8 | 6.1
MANet [14] | No | ResNet101 + R-FCN | 78.1 | 4.5
FGFA [16] | No | ResNet101 + R-FCN | 78.4 | 5.0
Scale-time lattice [20] | No | ResNet101 + Faster R-CNN | 79.6 | N/A
Object linking [30] | No | ResNet101 + Fast R-CNN | 74.5 | 5.4
Seq-NMS [19] | No | VGG + Faster R-CNN | 52.2 | 7.3
STMN [18] | No | ResNet101 + R-FCN | 80.5 | N/A
STSN [21] | No | ResNet101 + R-FCN | 78.9 | 2.9
RDN [41] | No | ResNet101 + Faster R-CNN | 81.8 | 6.4
SELSA [42] | No | ResNet101 + Faster R-CNN | 80.3 | 6.7
D&T [15] | No | (row truncated in excerpt)

…mance despite the fact that a less powerful detection network is used. Since our method focuses on causal video object detection, where no future frames are allowed, no video-level post-processing is applied.…”
Section: Methods
confidence: 99%
“…In [41], objects' interactions are captured in the spatio-temporal domain. Full-sequence-level feature aggregation is proposed in [42] to generate robust features for video object detection. An external memory is used in [44] to store informative temporal features.…”
Section: B. Video Object Detection
confidence: 99%
“…STMN [22] adopts a spatiotemporal memory module with a spatial alignment mechanism to model long-term temporal appearance and motion dynamics. Besides, RDN [46] and SELSA [47] strengthen region-level features by exploiting the relation/affinity between region proposals across frames…”
Section: B. Object Detection in Videos
confidence: 99%