2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00931
Sequence Level Semantics Aggregation for Video Object Detection

Abstract: Video object detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods place more emphasis on temporally nearby frames. In this work, we argue that aggr…
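As a rough illustration of the aggregation idea the abstract describes — weighting reference features by semantic similarity rather than temporal proximity — here is a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's actual implementation: the function name, cosine-similarity measure, and softmax weighting are illustrative choices, and per-frame proposal features are assumed to be already extracted.

```python
import numpy as np

def aggregate_features(target, reference, temperature=1.0):
    """Aggregate reference proposal features into target features,
    weighted by cosine similarity (a simplified sketch of
    semantics-based, rather than temporally local, aggregation)."""
    # Normalize rows so dot products become cosine similarities.
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sim = t @ r.T / temperature             # (n_target, n_ref)
    # Softmax over all reference proposals, regardless of frame order:
    # proposals from anywhere in the sequence can contribute equally.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ reference                    # similarity-weighted sum

# Proposal features from the current frame and from frames sampled
# across the whole sequence (their temporal order does not matter here).
target = np.random.randn(4, 256)
refs = np.random.randn(32, 256)
out = aggregate_features(target, refs)
print(out.shape)  # (4, 256)
```

Because the softmax runs over every reference proposal, a semantically similar proposal from a distant frame can outweigh a degraded one from an adjacent frame — the sequence-level behavior the abstract contrasts with flow- or RNN-based aggregation.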

Cited by 193 publications (196 citation statements)
References 35 publications
“…Notice also that the best methods detected all four players in all or nearly all frames, without requiring video-based object detection techniques [ 42 , 43 , 44 ] which exploit temporal coherence across consecutive frames. We did not apply any temporal filtering to the data, as this would partially hide the actual accuracy of the methods being compared.…”
Section: Discussion
confidence: 99%
“…
Method | Causal? | Backbone | mAP (%) | mAP gain (%)
T-CNN [13] | No | GoogLeNet + VGG + Fast R-CNN | 73.8 | 6.1
MANet [14] | No | ResNet101 + R-FCN | 78.1 | 4.5
FGFA [16] | No | ResNet101 + R-FCN | 78.4 | 5.0
Scale-time lattice [20] | No | ResNet101 + Faster R-CNN | 79.6 | N/A
Object linking [30] | No | ResNet101 + Fast R-CNN | 74.5 | 5.4
Seq-NMS [19] | No | VGG + Faster R-CNN | 52.2 | 7.3
STMN [18] | No | ResNet101 + R-FCN | 80.5 | N/A
STSN [21] | No | ResNet101 + R-FCN | 78.9 | 2.9
RDN [41] | No | ResNet101 + Faster R-CNN | 81.8 | 6.4
SELSA [42] | No | ResNet101 + Faster R-CNN | 80.3 | 6.7
D&T [15] | No | (row truncated in excerpt)

…mance despite the fact that a less powerful detection network is used. Since our method focuses on causal video object detection, where no future frames are allowed, no video-level post-processing is applied.…”
Section: Methods
confidence: 99%
“…In [41], objects' interactions are captured in the spatio-temporal domain. Full-sequence-level feature aggregation is proposed in [42] to generate robust features for video object detection. An external memory is used in [44] to store informative temporal features.…”
Section: B. Video Object Detection
confidence: 99%
“…STMN [22] adopts a spatiotemporal memory module with a spatial alignment mechanism to model long-term temporal appearance and motion dynamics. Besides, RDN [46] and SELSA [47] strengthen region-level features by exploiting the relation/affinity between region proposals across frames…”
Section: B. Object Detection in Videos
confidence: 99%