2018
DOI: 10.1007/978-3-030-01237-3_30

Video Object Detection with an Aligned Spatial-Temporal Memory

Abstract: We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame…
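The abstract describes a recurrent memory that blends the previous memory state with the current frame's features. As a minimal sketch of that idea, the following toy update gates between memory and the incoming feature map; the function name, scalar weights, and the sigmoid gate are illustrative assumptions, not the paper's actual STMM (which uses learned convolutional gates and MatchTrans alignment):

```python
import numpy as np

def stmm_step(memory, feature, w_m, w_f):
    """One hypothetical spatial-temporal memory update: the new memory is a
    gated blend of the previous memory and the current frame's feature map.
    Illustrative only; the actual STMM uses learned convolutional gates."""
    gate = 1.0 / (1.0 + np.exp(-(w_m * memory + w_f * feature)))  # sigmoid gate
    return gate * memory + (1.0 - gate) * feature

# Toy usage: carry a memory across a short "video" of 2x2 feature maps.
memory = np.zeros((2, 2))
for t in range(3):
    frame_feature = np.full((2, 2), float(t + 1))
    memory = stmm_step(memory, frame_feature, w_m=0.5, w_f=0.5)
print(memory.shape)
```

The point of the recurrence is that `memory` accumulates appearance evidence over time rather than treating each frame independently.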

Cited by 181 publications (162 citation statements). References 50 publications.
“…Sampling Strategies for Feature Aggregation: the frame sampling strategy matters for video detection. As previous works [33,36] pointed out, using more frames in feature aggregation during testing yields better results. In addition, [33] samples frames with a uniform stride during testing to improve performance.…”
Section: Effectiveness of SELSA
confidence: 70%
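The uniform-stride sampling mentioned in this citation statement can be sketched in a few lines; the function name and the exact stride rule are assumptions for illustration, since the citing work does not specify them here:

```python
def sample_frames(num_frames, num_samples):
    """Pick frame indices with a uniform stride across the video, a common
    test-time strategy for choosing which frames to aggregate features from.
    Illustrative; the cited works may use a different stride rule."""
    stride = max(1, num_frames // num_samples)
    return list(range(0, num_frames, stride))[:num_samples]

# Example: 5 frames sampled from a 30-frame clip.
print(sample_frames(30, 5))  # [0, 6, 12, 18, 24]
```

Spreading samples across the clip gives the aggregation step more temporal diversity than taking consecutive frames.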
“…STMN [33] used a Spatial-Temporal Memory module as the recurrent operation to pass information through a video. Unlike [33], our method does not need to pass information through memory modules in temporal order. Instead, we form clusters and aggregate features in a multi-shot view to capture the rich information in videos.…”
Section: Object Detection in Videos
confidence: 99%
“…Generalizing still-image detectors to the video domain is not trivial due to the complex spatial and temporal variations in videos, not to mention that object appearances in some frames may be deteriorated by motion blur or occlusion. One common solution to this problem is feature aggregation [1,29,49,53,54,55], which enhances per-frame features by aggregating the features of nearby frames. Specifically, FGFA [54] utilizes optical flow from FlowNet [7] to guide pixel-level motion compensation on the feature maps of adjacent frames for feature aggregation.…”
Section: Related Work
confidence: 99%
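The flow-guided aggregation described above can be sketched as: warp each neighbor's feature map toward the reference frame using a flow field, then average. This is a simplified stand-in, not FGFA itself; the function names, the nearest-neighbor lookup (FGFA uses bilinear warping), and plain averaging (FGFA weights pixels by learned feature similarity) are all assumptions:

```python
import numpy as np

def warp_by_flow(feature, flow):
    """Warp a 2-D feature map toward the reference frame using a per-pixel
    flow field of shape (H, W, 2) holding (dx, dy) displacements.
    Nearest-neighbor lookup for simplicity; FGFA uses bilinear warping."""
    h, w = feature.shape
    warped = np.zeros_like(feature)
    for y in range(h):
        for x in range(w):
            sy = int(round(y + flow[y, x, 1]))
            sx = int(round(x + flow[y, x, 0]))
            if 0 <= sy < h and 0 <= sx < w:
                warped[y, x] = feature[sy, sx]
    return warped

def aggregate(ref_feature, neighbor_features, flows):
    """Average the reference feature with flow-aligned neighbor features.
    FGFA additionally weights each pixel by learned feature similarity."""
    aligned = [warp_by_flow(f, fl) for f, fl in zip(neighbor_features, flows)]
    return np.mean([ref_feature] + aligned, axis=0)

# Toy usage: with zero flow, aggregation of identical frames is a no-op.
f = np.arange(9.0).reshape(3, 3)
zero_flow = np.zeros((3, 3, 2))
out = aggregate(f, [f, f], [zero_flow, zero_flow])
print(np.allclose(out, f))  # True
```

The alignment step is what distinguishes this family of methods from naive frame averaging: without warping, object motion would smear features across spatial locations.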