2020
DOI: 10.48550/arxiv.2010.12573
Preprint
Object-aware Feature Aggregation for Video Object Detection

Abstract: We present an Object-aware Feature Aggregation (OFA) module for video object detection (VID). Our approach is motivated by the intriguing property that video-level object-aware knowledge can be employed as a powerful semantic prior to help object recognition. As a consequence, augmenting features with such prior knowledge can effectively improve the classification and localization performance. To make features get access to more content about the whole video, we first capture the object-aware knowledge of propo…
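The abstract describes augmenting per-frame proposal features with video-level object-aware features. A minimal sketch of that aggregation idea, using cosine-similarity attention between frame proposals and video-level support features (the function names, shapes, and the temperature parameter `tau` are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_proposal_features(query, support, tau=1.0):
    """Augment per-frame proposal features with video-level features
    via cosine-similarity attention (a hypothetical sketch, not the
    paper's OFA module itself).

    query:   (Nq, D) proposal features from the current frame
    support: (Ns, D) object-aware features pooled over the video
    Returns: (Nq, D) aggregated features.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    attn = softmax(q @ s.T / tau, axis=1)  # (Nq, Ns) weights, rows sum to 1
    return attn @ support                  # weighted sum of support features

# The aggregated result is typically fused back into the frame features,
# e.g. by residual addition: query + aggregate_proposal_features(query, support)
```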

Cited by 2 publications (2 citation statements)
References 49 publications (96 reference statements)
“…Methods in [39,72,88-90] aligned and warped adjacent features under the guidance of optical flow. Besides optical flow, some approaches [14,15,20,27,42,53,62,75,80] enhanced object-level features by exploring semantic and spatio-temporal correspondence among the region proposals. PSLA [25] applied self-attention mechanisms in the temporal-spatial domain without relying on extra optical flow.…”
Section: Related Work
confidence: 99%
“…[24] explores an offboard setting for auto-labeling by performing detection on individual frames and aggregating the results from the entire sequence. Inspired by relational networks [12] and their applications [3,9,11,29,33] to 2D video object detection, 3D-MAN [38] proposes to apply attention mechanisms on pooled RoI features for multi-frame alignment and aggregation. However, RoI pooling separates the objects from the context and leads to a loss of detail.…”
Section: Multi-frame Point Cloud Object Detection
confidence: 99%