Fast Online Object Tracking and Segmentation: A Unifying Approach

Wang, Qiang; Zhang, Li; Bertinetto, Luca; Hu, Weiming; Torr, Philip H. S.

doi:10.1109/cvpr.2019.00142

Cited by 1,280 publications

(929 citation statements)

References 65 publications

Supporting

Mentioning

840

Contrasting

Unclassified


“…When aiming at very high segmentation accuracy, methods generally perform online fine-tuning on the basis of this supervision [3,25,35,40,43,50,62], sometimes exploiting data-augmentation techniques [3,25] or self-supervision [62]. As online fine-tuning can take up to several minutes per video, many recently proposed methods renounce to it and instead aim at a faster online speed (e.g., [7,8,64]). These faster semi-supervised approaches come in many flavours.…”

Section: Related Work (mentioning)

confidence: 99%

Anchor Diffusion for Unsupervised Video Object Segmentation

Zhao, Wang, Bertinetto, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

114


Unsupervised video object segmentation has often been tackled by methods based on recurrent neural networks and optical flow. Despite their complexity, these kinds of approaches tend to favour short-term temporal dependencies and are thus prone to accumulating inaccuracies, which cause drift over time. Moreover, simple (static) image segmentation models, alone, can perform competitively against these methods, which further suggests that the way temporal dependencies are modelled should be reconsidered. Motivated by these observations, in this paper we explore simple yet effective strategies to model long-term temporal dependencies. Inspired by the non-local operators of [70], we introduce a technique to establish dense correspondences between pixel embeddings of a reference "anchor" frame and the current one. This allows the learning of pairwise dependencies at arbitrarily long distances without conditioning on intermediate frames. Without online supervision, our approach can suppress the background and precisely segment the foreground object even in challenging scenarios, while maintaining consistent performance over time. With a mean IoU of 81.7%, our method ranks first on the DAVIS-2016 leaderboard of unsupervised methods, while still being competitive against state-of-the-art online semi-supervised approaches. We further evaluate our method on the FBMS dataset and the ViSal video saliency dataset, showing results competitive with the state of the art.
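The dense anchor-to-current correspondence described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the dot-product similarity, the softmax normalisation, and the function name `dense_correspondence` are assumptions chosen to mirror a generic non-local operation on flattened pixel embeddings.

```python
import math


def dense_correspondence(anchor, current):
    """Mix anchor embeddings for every pixel of the current frame.

    anchor:  list of N_a embedding vectors (one per anchor-frame pixel)
    current: list of N_c embedding vectors (one per current-frame pixel)
    Returns N_c vectors, each a softmax-weighted sum of anchor embeddings.
    Dot-product similarity and softmax weighting are assumptions here.
    """
    mixed_all = []
    for query in current:
        # Similarity of this current-frame pixel to every anchor pixel.
        scores = [sum(a * q for a, q in zip(vec, query)) for vec in anchor]
        # Numerically stable softmax over the anchor pixels.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted combination of anchor embeddings.
        dim = len(query)
        mixed = [sum(w * vec[d] for w, vec in zip(weights, anchor))
                 for d in range(dim)]
        mixed_all.append(mixed)
    return mixed_all
```

Because every current-frame pixel is compared directly with every anchor pixel, no intermediate frame enters the computation, which is the property the abstract highlights.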


“…The similarity-weighted combination of feature is used to predict the final mask. A fully convolutional Siamese network based approach (SiamMask) is proposed in [40]. It computes the depth-wise cross correlation between features of templates in the reference and the current frames.…”

Section: Related Work (mentioning)

confidence: 99%
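As an illustration of the depth-wise cross-correlation mentioned in this statement, the following is a minimal pure-Python sketch. It is not SiamMask's code: the nested-list layout `[channel][row][column]` and the function name `depthwise_xcorr` are assumptions, and a real tracker would use a grouped convolution on tensors instead.

```python
def depthwise_xcorr(search, template):
    """Correlate each template channel with the matching search channel.

    search:   features of the current frame,   shape [C][Hs][Ws]
    template: features of the reference frame, shape [C][Ht][Wt]
    Returns a response map of shape [C][Hs - Ht + 1][Ws - Wt + 1].
    Channels are never mixed, which is what makes it depth-wise.
    """
    response = []
    for s_ch, t_ch in zip(search, template):
        ht, wt = len(t_ch), len(t_ch[0])
        hs, ws = len(s_ch), len(s_ch[0])
        channel_out = []
        for y in range(hs - ht + 1):
            row = []
            for x in range(ws - wt + 1):
                # Sum of element-wise products over the sliding window.
                row.append(sum(s_ch[y + i][x + j] * t_ch[i][j]
                               for i in range(ht) for j in range(wt)))
            channel_out.append(row)
        response.append(channel_out)
    return response
```

The output keeps one response map per channel, so later layers can decide how to combine them.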

“…We first train the class-agnostic binary mask proposal network on COCO. Following the strategy used in [41], we then finetune the proposal network on the combination of COCO and YouTube-VOS with learning rate 0.02, batch size 8 and number of training iteration 200,000.…”

Section: Mask Proposal Generation (mentioning)

confidence: 99%

DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation

Zeng, Liao, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


In this paper, we propose the differentiable mask-matching network (DMM-Net) for solving the video object segmentation problem where the initial object masks are provided. Relying on the Mask R-CNN backbone, we extract mask proposals per frame and formulate the matching between object templates and proposals at one time step as a linear assignment problem where the cost matrix is predicted by a CNN. We propose a differentiable matching layer by unrolling a projected gradient descent algorithm in which the projection exploits Dykstra's algorithm. We prove that under mild conditions, the matching is guaranteed to converge to the optimum. In practice, it performs similarly to the Hungarian algorithm during inference. Meanwhile, we can back-propagate through it to learn the cost matrix. After matching, a refinement head is leveraged to improve the quality of the matched mask. Our DMM-Net achieves competitive results on the largest video object segmentation dataset YouTube-VOS. On DAVIS 2017, DMM-Net achieves the best performance without online learning on the first frames. Without any fine-tuning, DMM-Net performs comparably to state-of-the-art methods on the SegTrack v2 dataset. Finally, our matching layer is very simple to implement; we attach the PyTorch code (< 50 lines) in the supplementary material. Our code is released at https://github.com/ZENGXH/DMM_Net.
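To make the linear assignment formulation in this abstract concrete, here is a small sketch that finds the minimum-cost one-to-one matching between templates (rows) and proposals (columns). It deliberately uses exhaustive search over permutations, not the unrolled projected gradient descent with Dykstra's projection that the abstract describes, and it is not differentiable. The function name `best_assignment` and the square cost matrix are assumptions for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Solve a small square linear assignment problem by brute force.

    cost[i][j] is the cost of matching template i to proposal j.
    Returns (assignment, total_cost), where assignment[i] is the
    proposal index matched to template i. Runtime is O(n!), so this
    is only practical for a handful of objects.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost
```

For example, `best_assignment([[4, 1, 3], [2, 0, 5], [3, 2, 2]])` matches template 0 to proposal 1, template 1 to proposal 0, and template 2 to proposal 2. The Hungarian algorithm reaches the same optimum in polynomial time.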


“…Moreover, a self-attention mechanism was integrated to force the network to capture the non-local features. SiamMask [42] used Siamese networks for object tracking using augmentation loss to produce a binary segmentation mask. In addition, the binary segmentation mask locates the object of interest accurately.…”

Section: Siamese-based Trackers (mentioning)

confidence: 99%

DomainSiam: Domain-Aware Siamese Network for Visual Object Tracking

Abdelpakey, Shehata 2019

Advances in Visual Computing


Visual object tracking is a fundamental task in the field of computer vision. Recently, Siamese trackers have achieved state-of-the-art performance on recent benchmarks. However, Siamese trackers do not fully utilize semantic and objectness information from pre-trained networks that have been trained on the image classification task. Furthermore, the pre-trained Siamese architecture is sparsely activated by the category label which leads to unnecessary calculations and overfitting. In this paper, we propose to learn a Domain-Aware, that is fully utilizing semantic and objectness information while producing a class-agnostic using a ridge regression network. Moreover, to reduce the sparsity problem, we solve the ridge regression problem with a differentiable weighted-dynamic loss function. Our tracker, dubbed DomainSiam, improves the feature learning in the training phase and generalization capability to other domains. Extensive experiments are performed on five tracking benchmarks including OTB2013 and OTB2015 for a validation set; as well as the VOT2017, VOT2018, LaSOT, TrackingNet, and GOT10k for a testing set. DomainSiam achieves state-of-the-art performance on these benchmarks while running at 53 FPS.
