Relation Distillation Networks for Video Object Detection

Deng, Jiajun; Pan, Yingwei; Yao, Ting; Zhou, Wengang; Li, Houqiang; Mei, Tao

doi:10.1109/iccv.2019.00712

Cited by 198 publications

(154 citation statements)

References 46 publications

Supporting

Mentioning

152

Contrasting

“…
                          Causal?  Backbone                      mAP(%)  mAP gain(%)
T-CNN [13]                No       GoogLeNet + VGG + Fast-RCNN   73.8    6.1
MANet [14]                No       ResNet101 + R-FCN             78.1    4.5
FGFA [16]                 No       ResNet101 + R-FCN             78.4    5.0
Scale-time lattice [20]   No       ResNet101 + Faster R-CNN      79.6    N/A
Object linking [30]       No       ResNet101 + Fast R-CNN        74.5    5.4
Seq-NMS [19]              No       VGG + Faster R-CNN            52.2    7.3
STMN [18]                 No       ResNet101 + R-FCN             80.5    N/A
STSN [21]                 No       ResNet101 + R-FCN             78.9    2.9
RDN [41]                  No       ResNet101 + Faster R-CNN      81.8    6.4
SELSA [42]                No       ResNet101 + Faster R-CNN      80.3    6.7
D&T [15]                  No       …

…mance despite the fact that a less powerful detection network is used. Since our method focuses on causal video object detection where no future frames are allowed, no video-level post-processing is applied.…”

Section: Methods (mentioning)

confidence: 99%

Video Object Detection With Two-Path Convolutional LSTM Pyramid

Zhang, Kim. 2020. IEEE Access

One of the major challenges in video object detection is drastic scale changes of objects due to camera motion. In this paper, we propose a two-path Convolutional Long Short-Term Memory (convLSTM) pyramid network designed to extract and convey multi-scale temporal contextual information in order to handle object scale changes efficiently. The proposed two-path convLSTM pyramid consists of a stack of multi-input convLSTM modules. It is updated in top-down and bottom-up pathways so that the temporal contextual information for small-to-large and large-to-small scale changes is exploited. The proposed multi-input convLSTM module uses two input feature maps of different resolutions to store and exchange temporal contextual information of different scales between neighboring convLSTM modules. The outputs of the proposed convLSTM pyramid network constitute a feature pyramid where each feature map contains multi-scale temporal contextual information from earlier frames. The proposed convLSTM pyramid can be combined with various still-image object detectors to improve the performance of video object detection. Experimental results on ImageNet VID dataset show that the proposed method achieves state-of-the-art performance and can handle scale changes efficiently in video object detection.

“…Cuboid proposal network and tubelet linking algorithm are proposed in [30] to improve the performance of detecting moving objects in videos. In [41], objects' interactions are captured in spatio-temporal domain. Full-sequence level feature aggregation is proposed in [42] to generate robust features for video object detection.…”

Section: B Video Object Detection (mentioning)

confidence: 99%

Video Object Detection With Two-Path Convolutional LSTM Pyramid

Zhang, Kim. 2020. IEEE Access

“…STMN [22] adopts spatiotemporal memory module with spatial alignment mechanism to model long-term temporal appearance and motion dynamics. Besides, RDN [46] and SELSA [47] strengthen region-level features by exploiting the relation/affinity between region proposals across frames…”

Section: B Object Detection In Videos (mentioning)

confidence: 99%

Single Shot Video Object Detector

Deng, Pan, Yao, et al. 2021. IEEE Trans. Multimedia. Self-citation.

Single shot detectors that are potentially faster and simpler than two-stage detectors tend to be more applicable to object detection in videos. Nevertheless, the extension of such object detectors from image to video is not trivial, especially when appearance deterioration exists in videos, e.g., motion blur or occlusion. A valid question is how to explore temporal coherence across frames for boosting detection. In this paper, we propose to address the problem by enhancing per-frame features through aggregation of neighboring frames. Specifically, we present Single Shot Video Object Detector (SSVD), a new architecture that integrates feature aggregation into a one-stage detector for object detection in videos. Technically, SSVD takes Feature Pyramid Network (FPN) as backbone network to produce multiscale features. Unlike the existing feature aggregation methods, SSVD, on one hand, estimates the motion and aggregates the nearby features along the motion path, and on the other, hallucinates features by directly sampling features from the adjacent frames in a two-stream structure. Extensive experiments are conducted on the ImageNet VID dataset, and competitive results are reported when compared to state-of-the-art approaches. More remarkably, for 448 × 448 input, SSVD achieves 79.2% mAP on ImageNet VID, by processing one frame in 85 ms on an Nvidia Titan X Pascal GPU. The code is available at https://github.com/ddjiajun/SSVD.

“…Relational Reasoning. There has been strong evidences on the use of relational reasoning to support various tasks, e.g., object detection [11,14,15,16], feature learning [17], vision-language [18,19]. For example, [16] plugs non-local operation into the conventional CNN to enable the pixel-level relational interaction within feature maps, and [11] presents an object relation module to model the relations of regions via the interaction among appearance features and geometry.…”

Section: Related Work (mentioning)

confidence: 99%

Core-Text: Improving Scene Text Detection with Contrastive Relational Reasoning

Lin, Pan, Lai, et al. 2021. 2021 IEEE International Conference on Multimedia and Expo (ICME). Self-citation.

Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem, in which only fragments of a text instance (i.e., sub-texts) are localized. In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module, to mitigate that issue. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. This design naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text.
