T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos

Kang, Kai; Li, Hongsheng; Yan, Junjie; Zeng, Xingyu; Yang, Bin; Xiao, Tong; Zhang, Cong; Wang, Zhe; Wang, Ruohui; Wang, Xiaogang; Ouyang, Wanli

doi:10.1109/tcsvt.2017.2736553

Cited by 450 publications

(289 citation statements)

References 47 publications

Supporting

Mentioning

287

Contrasting

Unclassified


“…Several previous works devised various post-processing techniques applied to the results of still image detectors by leveraging temporal information: Kang et al [15,14] proposed to suppress false positive detections via multicontext suppression (MCS) and propagate predicted bounding boxes across frames using the motion calculated by optical flow. Then a temporal convolution neural network is trained to rescore the tubelets generated using visual tracking.…”

Section: Object Detection In Videos (mentioning)

confidence: 99%


Sequence Level Semantics Aggregation for Video Object Detection

Chen, Wang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


Video object detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods place more emphasis on temporally nearby frames. In this work, we argue that aggregating features at the full-sequence level will lead to more discriminative and robust features for video object detection. To achieve this goal, we devise a novel Sequence Level Semantics Aggregation (SELSA) module. We further demonstrate the close relationship between the proposed method and the classic spectral clustering method, providing a novel view for understanding the VID problem. We test the proposed method on the ImageNet VID and the EPIC KITCHENS dataset and achieve new state-of-the-art results. Our method does not need complicated postprocessing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean.



“…Another line of work [14] focuses on utilizing optical flow to extract motion information to facilitate object detection. However, such pre-computed optical flow is neither efficient nor task related.…”

Section: Object Detection In Videos (mentioning)

confidence: 99%

Sequence Level Semantics Aggregation for Video Object Detection

Chen, Wang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

“…Linking single frame detections across the temporal dimension as done by T-CNN [13] constitutes possibly the simplest form of temporal domain exploration. T-CNN essentially runs region-based detectors per frame and enforces motion-based propagation to adjacent frames.…”

Section: Video Object Detection (mentioning)

confidence: 99%

Great Ape Detection in Challenging Jungle Camera Trap Footage via Attention-Based Spatial and Temporal Feature Blending

Yang, Mirmehdi, Burghardt 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


We propose the first multi-frame video object detection framework trained to detect great apes. It is applicable to challenging camera trap footage in complex jungle environments and extends a traditional feature pyramid architecture by adding self-attention driven feature blending in both the spatial as well as the temporal domain. We demonstrate that this extension can detect distinctive species appearance and motion signatures despite significant partial occlusion. We evaluate the framework using 500 camera trap videos of great apes from the Pan African Programme containing 180K frames, which we manually annotated with accurate per-frame animal bounding boxes. These clips contain significant partial occlusions, challenging lighting, dynamic backgrounds, and natural camouflage effects. We show that our approach performs highly robustly and significantly outperforms frame-based detectors. We also perform detailed ablation studies and a validation on the full ILSVRC 2015 VID data corpus to demonstrate wider applicability at adequate performance levels. We conclude that the framework is ready to assist human camera trap inspection efforts. We publish key parts of the code as well as network weights and ground truth annotations with this paper.


“…Therefore, the performance of object detection will affect almost all other computer vision research. A huge amount of effort has been put into its improvements [30,31,32].…”

Section: Introduction (mentioning)

confidence: 99%