Railroad is not a Train: Saliency as Pseudo-pixel Supervision for Weakly Supervised Semantic Segmentation

Lee, Seungho; Lee, Minhyun; Lee, Jongwuk; Shim, Hyunjung

doi:10.1109/cvpr46437.2021.00545

Cited by 162 publications

(109 citation statements)

References 39 publications


“…Specifically, Ours-L achieves 69.2% and 70.6% mIoU on the PASCAL VOC val set with DeepLabV2 initialized with ImageNet and MS COCO pre-trained weights, respectively, which recover 90.7% and 91.0% of the upper bound of their fully-supervised counterparts. Our methods also achieve comparable performance with recent state-of-the-art WSSS methods using extra saliency maps, such as NSROM (Yao et al, 2021), DRS (Kim et al, 2021), EPS (Lee et al, 2021c), AuxSegNet (Xu et al, 2021), and EDAM (Wu et al, 2021). Our method also outperforms recent methods with superior backbone networks, such as PMM (Li et al, 2021b), which uses Res2Net101 (Gao et al, 2021) as the backbone for semantic segmentation.…”

Section: Image (mentioning)

confidence: 57%


Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Ru, Du, Zhan et al. 2022. Preprint.


Weakly-Supervised Semantic Segmentation (WSSS) methods with image-level labels generally train a classification network to generate the Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods still perform far from satisfactorily because their adopted CAMs 1) typically focus on partial discriminative object regions and 2) usually contain useless background regions. These two problems are attributed to the sole image-level supervision and aggregation of global information when training the classification networks. In this work, we propose the visual words learning module and hybrid pooling approach, and incorporate them in the classification network to mitigate the above problems. In the visual words learning module, we counter the first problem by enforcing the classification network to learn fine-grained visual word labels so that more object extents could be discovered. Specifically, the visual words are learned with a codebook, which could be updated via two proposed strategies, i.e. learning-based strategy and memory-bank strategy. The second drawback of CAMs is alleviated with the proposed hybrid pooling, which incorporates the global average and local discriminative information to simultaneously ensure object completeness and reduce background regions. We

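The abstract above builds on Class Activation Maps (CAMs) produced by a classification network with global average pooling. As a minimal sketch of that standard CAM computation (not the visual words module or the hybrid pooling of the cited preprint), the map for one class can be taken as the weighted sum of the final feature maps, using that class's classifier weights. The function names, array shapes, and random inputs below are illustrative assumptions.

```python
import numpy as np

def raw_cam(features, fc_weights, class_idx):
    """Weighted sum of feature maps for one class.

    features:   (K, H, W) final convolutional feature maps (assumed layout)
    fc_weights: (C, K) weights of a bias-free linear classifier (assumed)
    returns:    (H, W) unnormalized activation map
    """
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def normalized_cam(features, fc_weights, class_idx):
    """Clip negatives and scale to [0, 1], a common post-processing step."""
    cam = np.maximum(raw_cam(features, fc_weights, class_idx), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Hypothetical toy inputs: 8 feature channels, a 4x4 grid, 3 classes.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))
fc_weights = rng.standard_normal((3, 8))

cam = normalized_cam(features, fc_weights, class_idx=1)

# With global average pooling, the class score is the spatial mean of the raw map.
gap_logit = fc_weights[1] @ features.mean(axis=(1, 2))
```

Because the pooled class score is just the spatial average of the raw map, the classifier can reach a high score from a few strongly activated locations, which is consistent with the abstract's remark that CAMs tend to cover only partial discriminative regions.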


“…MS COCO 2014 dataset (Lin et al, 2014) is a large-scale dataset with 81 semantic categories, including the background class. After excluding the images without annotations (Lee et al, 2021c), the MS COCO dataset consists of 82,081 and 40,137 images in train and val set, respectively. Classification Network.…”

Section: Implementation Details (mentioning)

confidence: 99%

Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Ru, Du, Zhan et al. 2022. Preprint.


“…Image-level labels are probably the most popular form of weak supervision, due to their simplicity and the possibility of obtaining them from public datasets or web data. A typical WSSS pipeline begins with generating a pseudo mask, followed by training a new semantic segmentation network [24]. Interpretability techniques such as CAM [38] are often used to infer incomplete pixel-level annotations automatically.…”

Section: Related Work (mentioning)

confidence: 99%

Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples

Zabari, Hoshen 2021. Preprint.


Semantic segmentation is a key computer vision task that has been actively researched for decades. In recent years, supervised methods have reached unprecedented accuracy, however they require many pixel-level annotations for every new class category which is very time-consuming and expensive. Additionally, the ability of current semantic segmentation networks to handle a large number of categories is limited. That means that images containing rare class categories are unlikely to be well segmented by current methods. In this paper we propose a novel approach for creating semantic segmentation masks for every object, without the need for training segmentation networks or seeing any segmentation masks. Our method takes as input the image-level labels of the class categories present in the image; they can be obtained automatically or manually. We utilize a vision-language embedding model (specifically CLIP) to create a rough segmentation map for each class, using model interpretability methods. We refine the maps using a test-time augmentation technique. The output of this stage provides pixel-level pseudo-labels, instead of the manual pixel-level labels required by supervised methods. Given the pseudo-labels, we utilize single-image segmentation techniques to obtain high-quality output segmentation masks. Our method is shown quantitatively and qualitatively to outperform methods that use a similar amount of supervision. Our results are particularly remarkable for images containing rare categories.


“…As an additional guidance for network to pay attention to the entire region of objects, some existing works attempt to devise auxiliary tasks such as sub-category classification [3], self-equivariant regularization with scale variance minimization [49], class-wise co-attention extraction [33,42], anti-adversarial attack [28], and complementary patch loss [60]. Many WSSS methods [15,16,21,30,33,42,55,56] have been proposed to employ the pre-trained saliency detection module, which distinguishes dominant foreground object from its background, as a complementary source of information for enhancing CAMs and generating precise pseudo-pixel labels.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Pixel-level Self-supervision for Weakly Supervised Semantic Segmentation

Yoon, Kweon, Jeong et al. 2021. Preprint.


Existing studies in weakly supervised semantic segmentation (WSSS) have utilized class activation maps (CAMs) to localize the class objects. However, since a classification loss is insufficient for providing precise object regions, CAMs tend to be biased towards discriminative patterns (i.e., sparseness) and do not provide precise object boundary information (i.e., impreciseness). To resolve these limitations, we propose a novel framework (composed of MainNet and SupportNet) that derives pixel-level self-supervision from given image-level supervision. In our framework, with the help of the proposed Regional Contrastive Module (RCM) and Multi-scale Attentive Module (MAM), MainNet is trained by self-supervision from the SupportNet. The RCM extracts two forms of self-supervision from SupportNet: (1) class region masks generated from the CAMs and (2) class-wise prototypes obtained from the features according to the class region masks. Then, every pixel-wise feature of the MainNet is trained by the prototype in a contrastive manner, sharpening the resulting CAMs. The MAM utilizes CAMs inferred at multiple scales from the SupportNet as self-supervision to guide the MainNet. Based on the dissimilarity between the multi-scale CAMs from MainNet and SupportNet, CAMs from the MainNet are trained to expand to the less-discriminative regions. The proposed method shows state-of-the-art WSSS performance both on the train and validation sets on the PASCAL VOC 2012 dataset. For reproducibility, code will be available publicly soon.

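The citation statement above describes saliency maps as a complementary source of information for enhancing CAMs and generating pseudo-pixel labels. The sketch below shows one naive way such a combination could look: pixels with low saliency become background, and the remaining pixels take the class with the strongest CAM among the classes named by the image-level label. This is a simple illustrative heuristic, not the procedure of EPS or of any cited method; the function name, the threshold of 0.5, and the toy arrays are assumptions.

```python
import numpy as np

def pseudo_labels(cams, saliency, present, sal_thresh=0.5):
    """Combine CAMs and a saliency map into a pixel-level pseudo-label map.

    cams:     (C, H, W) per-class activation maps in [0, 1]
    saliency: (H, W) saliency map in [0, 1]
    present:  class indices taken from the image-level label
    returns:  (H, W) integer map, 0 = background, c + 1 = class c
    """
    present = np.asarray(present)
    # Only classes listed in the image-level label compete for each pixel.
    winner = present[np.argmax(cams[present], axis=0)] + 1
    # Non-salient pixels are assigned to background (an assumed rule).
    return np.where(saliency >= sal_thresh, winner, 0)

# Hypothetical 2x2 example with three classes; class 1 is absent from the image.
cams = np.array([
    [[0.9, 0.2], [0.1, 0.3]],   # class 0
    [[1.0, 1.0], [1.0, 1.0]],   # class 1 (ignored, not in the image-level label)
    [[0.1, 0.8], [0.7, 0.2]],   # class 2
])
saliency = np.array([[0.9, 0.8], [0.1, 0.7]])

labels = pseudo_labels(cams, saliency, present=[0, 2])
print(labels)
```

Restricting the argmax to the labeled classes keeps a strongly activated but absent class from leaking into the pseudo-labels, and the saliency threshold is what removes the background pixel in the lower-left corner of this toy example.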
