2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01649

Embedded Discriminative Attention Mechanism for Weakly Supervised Semantic Segmentation

Cited by 103 publications (60 citation statements)
References 35 publications
“…Specifically, Ours-L achieves 69.2% and 70.6% mIoU on the PASCAL VOC val set with DeepLabV2 initialized with ImageNet and MS COCO pre-trained weights, respectively, which recover 90.7% and 91.0% of the upper bound of their fully-supervised counterparts. Our method also achieves comparable performance with recent state-of-the-art WSSS methods using extra saliency maps, such as NSROM (Yao et al., 2021), DRS (Kim et al., 2021), EPS (Lee et al., 2021c), AuxSegNet (Xu et al., 2021), and EDAM (Wu et al., 2021). Our method also outperforms recent methods with superior backbone networks, such as PMM (Li et al., 2021b), which uses Res2Net101 (Gao et al., 2021) as the backbone for semantic segmentation.…”
Section: Image
confidence: 57%
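
As a quick check on the quoted recovery figures, the fully supervised upper bounds implied by the statement follow directly from its numbers. A back-of-the-envelope Python sketch (the upper-bound values themselves are not stated in the excerpt, only implied):

    # Recovery = weakly supervised mIoU / fully supervised upper bound,
    # so the implied upper bound is mIoU / recovery.
    for miou, recovery in [(69.2, 0.907), (70.6, 0.910)]:
        upper = miou / recovery
        print(f"{miou:.1f} mIoU at {recovery:.1%} recovery -> upper bound ~ {upper:.1f} mIoU")

This yields roughly 76.3 and 77.6 mIoU, consistent with typical fully supervised DeepLabV2 results on PASCAL VOC.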
“…To make a fair comparison, we follow SEAM [31], PuzzleCAM [13], and AdvCAM [18] to adopt PSA [2] for initial CAM refinement.…”

The comparison table flattened into this statement reads as follows (apparently PASCAL VOC 2012 val/test mIoU, %; Seg. = segmentation network; V16 = VGG16, R50 = ResNet-50, WR38 = WideResNet-38; the method name of the first row is truncated in the excerpt):

    Method      Venue           Seg.    Backbone   val    test
    —           [28]            V1 ‡    V16        66.2   66.9
    LIID        TPAMI'21 [21]   V2      R50        66.5   67.5
    NSROM       CVPR'21 [35]    V2 ‡    V16        68.3   68.5
    DRS         AAAI'21 [14]    V2 ‡    V16        70.4   70.7
    EPS         CVPR'21 [19]    V2 ‡    WR38       70.9   70.8
    EDAM        CVPR'21 [32]    V2 ‡    WR38       70.9   70.6
    AuxSegNet   ICCV'21 [34]    —       WR38       69.0   68.6

Section: Methods
confidence: 99%
“…This semantic affinity is then applied to refine the generated initial CAMs into pseudo ground-truth masks. Previous works [12,19,21,32] instead use additional saliency maps from a fully supervised saliency detector to refine the generated initial CAMs. The DeepLab [5,6] series of models is typically used to train a semantic segmentation network with the pseudo ground-truth masks.…”
Section: Related Work
confidence: 99%
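
The affinity-based refinement that PSA [2] performs is, at its core, a random walk over pixels: pairwise semantic affinities are row-normalized into a transition matrix, which is applied to the flattened CAM scores for a few iterations so that activation spreads to semantically similar pixels. A minimal NumPy sketch of the idea (the shapes, iteration count, and toy affinity are illustrative assumptions, not the exact PSA settings):

    import numpy as np

    def refine_cam_with_affinity(cam, affinity, n_iters=3):
        """Random-walk refinement of a CAM with a pixel-pairwise affinity matrix.

        cam:      (H, W) class activation map.
        affinity: (H*W, H*W) non-negative semantic affinity between pixel pairs.
        """
        # Row-normalize the affinity into a transition matrix.
        trans = affinity / np.maximum(affinity.sum(axis=1, keepdims=True), 1e-8)
        scores = cam.reshape(-1)
        for _ in range(n_iters):
            scores = trans @ scores  # propagate activation to similar pixels
        return scores.reshape(cam.shape)

    # Toy usage: an identity affinity is a degenerate walk that leaves the CAM unchanged.
    cam = np.random.rand(4, 4)
    assert np.allclose(refine_cam_with_affinity(cam, np.eye(16)), cam)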
“…Several studies exploit region masks as pseudo-labels for semantic segmentation [47,52,59]; however, the proposed method is advantageous in terms of the quality and stability of the self-supervision. Instead of using fixed sources of regional information, such as an off-the-shelf saliency module [47] or a pre-trained classifier [36], we obtain region masks from the CAMs of SupportNet.…”
Section: Regional Contrastive Module (RCM)
confidence: 99%
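
Obtaining region masks from the CAMs of SupportNet, as described above, typically reduces to normalizing each class map and thresholding it. A hedged sketch of that step (the min-max normalization and the 0.4 threshold are illustrative assumptions, not the paper's exact procedure):

    import numpy as np

    def cam_to_region_mask(cam, threshold=0.4):
        """Binarize a single-class CAM into a boolean region mask."""
        # Min-max normalize so the threshold is independent of activation scale.
        cam = cam - cam.min()
        cam = cam / max(cam.max(), 1e-8)
        return cam >= threshold  # (H, W) boolean foreground mask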
“…Moreover, we separate the network providing self-supervision for object localization (SupportNet) from the network learning from that guidance (MainNet), using EMA [18]. This enables more stable training than methods whose backbone is updated by self-supervision from itself [52,59]. Therefore, the acquired self-supervision is not only continually revised as training proceeds but also stably delivered to the MainNet.…”
Section: Regional Contrastive Module (RCM)
confidence: 99%
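
The EMA [18] coupling described here is the standard mean-teacher update: SupportNet's weights track an exponential moving average of MainNet's, so the source of self-supervision evolves slowly instead of chasing the network it supervises. A minimal PyTorch-style sketch (the decay value and function name are assumptions for illustration):

    import torch

    @torch.no_grad()
    def ema_update(support_net, main_net, decay=0.999):
        """SupportNet <- decay * SupportNet + (1 - decay) * MainNet, parameter-wise."""
        for p_s, p_m in zip(support_net.parameters(), main_net.parameters()):
            p_s.mul_(decay).add_(p_m, alpha=1 - decay)

Because only MainNet receives gradients, SupportNet's averaged weights produce region masks that change gradually between iterations, which is what stabilizes the self-supervision.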