2020
DOI: 10.48550/arxiv.2010.12573
Preprint
Object-aware Feature Aggregation for Video Object Detection

Abstract: We present an Object-aware Feature Aggregation (OFA) module for video object detection (VID). Our approach is motivated by the intriguing property that video-level object-aware knowledge can be employed as a powerful semantic prior to help object recognition. As a consequence, augmenting features with such prior knowledge can effectively improve the classification and localization performance. To make features get access to more content about the whole video, we first capture the object-aware knowledge of propo…
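The abstract describes augmenting per-frame proposal features with video-level object-aware features. A minimal sketch of that aggregation idea, using cosine-similarity attention between frame proposals and video-level support features (the function names, shapes, and the temperature parameter `tau` are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_proposal_features(query, support, tau=1.0):
    """Augment per-frame proposal features with video-level features
    via cosine-similarity attention (a hypothetical sketch, not the
    paper's OFA module itself).

    query:   (Nq, D) proposal features from the current frame
    support: (Ns, D) object-aware features pooled over the video
    Returns: (Nq, D) aggregated features.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    attn = softmax(q @ s.T / tau, axis=1)  # (Nq, Ns) weights, rows sum to 1
    return attn @ support                  # weighted sum of support features

# The aggregated result is typically fused back into the frame features,
# e.g. by residual addition: query + aggregate_proposal_features(query, support)
```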

Cited by 2 publications (2 citation statements)
References 49 publications (96 reference statements)
“…Methods in [39,72,88-90] aligned and warped adjacent features under the guidance of optical flow. Besides optical flow, some approaches [14,15,20,27,42,53,62,75,80] enhanced object-level features by exploring semantic and spatio-temporal correspondence among the region proposals. PSLA [25] applied self-attention mechanisms in the temporal-spatial domain without relying on extra optical flow.…”
Section: Related Work
confidence: 99%
“…[24] explores an offboard setting for auto-labeling by performing detection on individual frames and aggregating the results from the entire sequence. Inspired by relational networks [12] and their applications [3,9,11,29,33] to 2D video object detection, 3D-MAN [38] proposes to apply attention mechanisms on pooled RoI features for multi-frame alignment and aggregation. However, RoI pooling separates the objects from the context and leads to a loss of detail.…”
Section: Multi-frame Point Cloud Object Detection
confidence: 99%