MoNet: Deep Motion Exploitation for Video Object Segmentation

Xiao, Huaxin; Feng, Jiashi; Lin, Guosheng; Liu, Yu; Zhang, Maojun

doi:10.1109/cvpr.2018.00125

Cited by 130 publications

(62 citation statements)

References 33 publications

“…Location-sensitive embeddings used to refine an initial foreground prediction are explored in LSE [9]. MoNet [38] exploits optical flow motion cues by feature alignment and a distance transform layer. Using reinforcement learning to estimate a region of interest to be segmented is explored by Han et al [13].…”

Section: Related Work (mentioning)

confidence: 99%

FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation

Voigtlaender, Chai, Schroff et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as an internal guidance of a convolutional network. Our novel dynamic segmentation head allows us to train the network, including the embedding, end-to-end for the multiple object segmentation task with a cross entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning with a J&F measure of 71.5% on the DAVIS 2017 validation set. We make our code and models available at https://github.com/tensorflow/models/tree/master/research/feelvos.

“…For SVOS methods, the target object(s) is provided in the first frame and tracked automatically [60,8,5,68,2,69,64,71] or interactively by users [1] in the subsequent frames. Numerous algorithms were proposed based on graphical models [54], object proposals [46], supertrajectories [61], etc.…”

Section: Video Object Segmentation (mentioning)

confidence: 99%

See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks

Yang et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.

“…Wang et al [33] proposed a global Gaussian distribution embedding network (G²DeNet), where one multivariate Gaussian, identified as a symmetric positive definite matrix of covariance matrix and mean vector [20], is plugged at network end. MoNet [38] proposed a sub-matrix square-root layer, making G²DeNet to have compact representation. In [3], the first-order information are combined with the second-order one which achieves consistent improvements over the standard bilinear networks on texture recognition.…”

Section: Related Work (mentioning)

confidence: 99%

Global Second-Order Pooling Convolutional Networks

Gao, Xie, Wang et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Deep Convolutional Networks (ConvNets) are fundamental to, besides large-scale visual recognition, a lot of vision tasks. As the primary goal of the ConvNets is to characterize complex boundaries of thousands of classes in a high-dimensional space, it is critical to learn higher-order representations for enhancing non-linear modeling capability. Recently, Global Second-order Pooling (GSoP), plugged at the end of networks, has attracted increasing attentions, achieving much better performance than classical, first-order networks in a variety of vision tasks. However, how to effectively introduce higher-order representation in earlier layers for improving non-linear capability of ConvNets is still an open problem. In this paper, we propose a novel network model introducing GSoP across from lower to higher layers for exploiting holistic image information throughout a network. Given an input 3D tensor outputted by some previous convolutional layer, we perform GSoP to obtain a covariance matrix which, after nonlinear transformation, is used for tensor scaling along channel dimension. Similarly, we can perform GSoP along spatial dimension for tensor scaling as well. In this way, we can make full use of the second-order statistics of the holistic image throughout a network. The proposed networks are thoroughly evaluated on large-scale ImageNet-1K, and experiments have shown that they outperformed non-trivially the counterparts while achieving state-of-the-art results.
