SCOPS: Self-Supervised Co-Part Segmentation

Hung, Wei-Chih; Jampani, Varun; Liu, Sifei; Molchanov, Pavlo; Yang, Ming-Hsuan; Kautz, Jan

doi:10.1109/cvpr.2019.00096

Cited by 127 publications

(160 citation statements)

References 47 publications

Supporting

Mentioning

159

Contrasting


“…Existing datasets [43,88] are relatively small in size, and only provide sparse correspondence ground truths since manually annotating dense ones is prohibitive. In light of this challenge, weakly supervised semantic correspondence are proposed to learn correspondence without correspondence ground truths [25][26][27][28][29][30]. In addition, existing benchmarks and methods have predominantly focused on "object-centric" scenarios where each image is occupied by a major object.…”

Section: Finding Correspondence (mentioning)

confidence: 99%

“…Even though the advantage of learning correspondences and instance segmentation jointly is clear, many state of the art methods do not make use of this approach due to the lack of large scale datasets with both masks and correspondences. To overcome this challenge, weakly supervised methods have been recently introduced to relax the need for costly supervision in both tasks [25][26][27][28][29][30][46][47][48][49]. Our work is aligned with these efforts as we aim to address instance segmentation and semantic correspondence jointly with inexpensive bounding box supervision.…”

Section: Introduction (mentioning)

confidence: 99%


DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision

Lan, Yu, Choy et al. 2021

2021 IEEE/CVF International Conference on Computer Vision (ICCV)


We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision. Specifically, we propose a self-ensembling framework where instance segmentation and semantic correspondence are jointly guided by a structured teacher in addition to the bounding box supervision. The teacher is a structured energy model incorporating a pairwise potential and a cross-image potential to model the pairwise pixel relationships both within and across the boxes. Minimizing the teacher energy simultaneously yields refined object masks and dense correspondences between intra-class objects, which are taken as pseudo-labels to supervise the task network and provide positive/negative correspondence pairs for dense contrastive learning. We show a symbiotic relationship where the two tasks mutually benefit from each other. Our best model achieves 37.9% AP on COCO instance segmentation, surpassing prior weakly supervised methods and is competitive to supervised methods. We also obtain state of the art weakly supervised results on PASCAL VOC12 and PF-PASCAL with real-time inference.


“…Mask Concentration Loss: In order to promote compactness for object masks, we use a geometric concentration loss as in [35]. The tuple (x, y) in Eq.…”

Section: Loss (mentioning)

confidence: 99%

Self-Supervision By Prediction For Object Discovery In Videos

Besbinar, Frossard 2021

2021 IEEE International Conference on Image Processing (ICIP)


Despite their irresistible success, deep learning algorithms still heavily rely on annotated data, and unsupervised settings pose many challenges, such as finding the right inductive bias in diverse scenarios. In this paper, we propose an object-centric model for image sequence representation that uses the prediction task for self-supervision. By disentangling object representation and motion dynamics, our novel compositional structure explicitly handles occlusion and inpaints inferred objects and background for the composition of the predicted frame. Using auxiliary losses to promote spatially and temporally consistent object representations, we train our self-supervised framework without the help of any annotation or pretrained network. Initial experiments confirm that our new pipeline is a promising step towards object-centric video prediction.


“…Learning discriminative image representation in an unsupervised/self-supervised manner has attracted increasing interest (Agrawal, Carreira, and Malik 2015; Doersch, Gupta, and Efros 2015; Xie et al 2021), for it gets rid of the costly manually-labeled data and achieves promising performance on many down-stream tasks (Larsson et al 2019; Hung et al 2019; Doersch and Zisserman 2017). These methods generally design pretext tasks and learn the representation from the label generated by the tasks, such as rotation predicting (Komodakis and Gidaris 2018), jigsaw (Noroozi and Favaro 2016; Kim et al 2018), in-painting (Pathak et al 2016), colorization (Zhang, Isola, and Efros 2016; Larsson, Maire, and Shakhnarovich 2017) and clustering (Noroozi et al 2018; Caron et al 2018).…”

Section: Introduction (mentioning)

confidence: 99%

MixSiam: A Mixture-based Approach to Self-supervised Representation Learning

Guo, Zhao, Lin et al. 2021

Preprint


Recently contrastive learning has shown significant progress in learning visual representations from unlabeled data. The core idea is training the backbone to be invariant to different augmentations of an instance. While most methods only maximize the feature similarity between two augmented data, we further generate more challenging training samples and force the model to keep predicting discriminative representation on these hard samples. In this paper, we propose MixSiam, a mixture-based approach upon the traditional siamese network. On the one hand, we input two augmented images of an instance to the backbone and obtain the discriminative representation by performing an element-wise maximum of two features. On the other hand, we take the mixture of these augmented images as input, and expect the model prediction to be close to the discriminative representation. In this way, the model could access more variant data samples of an instance and keep predicting invariant discriminative representations for them. Thus the learned model is more robust compared to previous contrastive learning methods. Extensive experiments on large-scale datasets show that MixSiam steadily improves the baseline and achieves competitive results with state-of-the-art methods. Our code will be released soon.
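The two branches described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the images are mixed or how closeness is measured, so the pixel-wise convex mix (`lam`), the cosine-distance loss, and the helper names `toy_backbone` and `mixsiam_style_loss` are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def toy_backbone(images, weights):
    """Hypothetical stand-in for the backbone: flatten, linear map, ReLU."""
    flat = images.reshape(images.shape[0], -1)
    return np.maximum(flat @ weights, 0.0)


def mixsiam_style_loss(view_a, view_b, weights, lam=0.5):
    """Cosine distance between the mixed-input prediction and the max-pooled target."""
    # Branch 1: features of the two augmented views, combined by an
    # element-wise maximum to form the "discriminative representation".
    feat_a = toy_backbone(view_a, weights)
    feat_b = toy_backbone(view_b, weights)
    target = np.maximum(feat_a, feat_b)

    # Branch 2: the mixture of the two views is fed through the same backbone.
    # Assumption: a pixel-wise convex combination with coefficient lam.
    mixed = lam * view_a + (1.0 - lam) * view_b
    pred = toy_backbone(mixed, weights)

    # Assumption: "close to" is measured as 1 - cosine similarity.
    cos = np.sum(l2_normalize(pred) * l2_normalize(target), axis=-1)
    return float(np.mean(1.0 - cos))
```

In this sketch, passing the same array for both views makes the mixture equal to either view, so the prediction coincides with the target and the loss is approximately zero. In actual training the target branch would typically be treated as a constant, which is likewise an assumption here.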

