Learning Visual Words for Weakly-Supervised Semantic Segmentation

Ru, Lixiang; Du, Bo; Wu, Chen

doi:10.24963/ijcai.2021/136

Cited by 16 publications

(22 citation statements)

References 16 publications

(13 reference statements)


“…
Method | Venue | Backbone | CAMs | +RW
PSA | CVPR'2018 [2] | WR38 | 48.0 | 61.0
SC-CAM | CVPR'2020 [4] | WR38 | 50.9 | 63.4
SEAM | CVPR'2020 [31] | WR38 | 55.4 | 63.6
PuzzleCAM | ICIP'2021 [13] | R50 | 51.5 | 64.7
VWE | IJCAI'2021 [26] | R50 | 52.9 | -
AdvCAM | CVPR'2021 [18] | R50 | 55.6 | 68.0
CLIMS (Ours) | | R50 | 56.6 | 70.5
…”

Section: Methods

mentioning

confidence: 99%

“…
IAL | IJCV'20 [30] | V2 | - | 64.3 | 65.4
SEAM | CVPR'20 [31] | V3 | WR38 | 64.5 | 65.7
BES | ECCV'20 [7] | V2 | R50 | 65.7 | 66.6
SC-CAM | CVPR'20 [4] | V2 ‡ | WR38 | 66.1 | 65.9
CONTA | NeurIPS'20 [37] | V3 | WR38 | 66.1 | 66.7
A²GNN | TPAMI'21 [36] | V2 | WR38 | 66.8 | 67.4
VWE | IJCAI'2021 [26] | V2 ‡ | R50 | 67.2 | 67.3
AdvCAM | CVPR'21 [18] | V2 | R50 | 68.1 | 68.0
Kweon et al | ICCV'21 [17]

Segmentation Network. Given pseudo ground-truth masks, we follow VWE [26], SC-CAM [4] and Adv-CAM [18] to adopt DeepLabV2 with ResNet-101 [10] as the segmentation network. For experiments on PASCAL VOC2012 dataset, we follow the default setting of deeplabpytorch toolkit † to train DeepLabV2 with weights pretrained using MS COCO dataset.…”

mentioning

confidence: 99%


Cross Language Image Matching for Weakly Supervised Semantic Segmentation

Xie, Hou, Yang et al. 2022

Preprint


It has been widely known that CAM (Class Activation Map) usually only activates discriminative object regions and falsely includes lots of object-related backgrounds. As only a fixed set of image-level object labels are available to the WSSS (weakly supervised semantic segmentation) model, it could be very difficult to suppress those diverse background regions consisting of open set objects. In this paper, we propose a novel Cross Language Image Matching (CLIMS) framework, based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress closely-related open background regions. In particular, we design object, background region and text label matching losses to guide the model to excite more reasonable object regions for CAM of each category. In addition, we design a co-occurring background suppression loss to prevent the model from activating closely-related background regions, with a predefined set of class-related background text descriptions. These designs enable the proposed CLIMS to generate a more complete and compact activation map for the target objects. Extensive experiments on the PASCAL VOC2012 dataset show that our CLIMS significantly outperforms the previous state-of-the-art methods. Code will be available at https://github.com/CVI-SZU/CLIMS.
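For readers unfamiliar with the CAMs the abstract refers to, the following is a minimal sketch of how a class activation map is commonly computed from a classifier's last convolutional features. It is not code from CLIMS or from any paper listed here: the function name `class_activation_map`, the array shapes, and the ReLU-plus-max normalization step are illustrative assumptions reflecting common practice in WSSS pipelines.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Sketch of a CAM (names and shapes are assumptions).

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, C) weights of the linear classifier
    """
    # Weighted sum of the feature channels, using the target class's weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    # Assumed post-processing: keep positive evidence and scale to [0, 1].
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy usage with random inputs.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))
fc_weights = rng.standard_normal((3, 8))
cam = class_activation_map(features, fc_weights, class_idx=1)
```

Because the map is driven by whichever channels the classifier weights most heavily, it tends to peak on the most discriminative parts of an object, which is the limitation the abstract describes.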


“…This paper is an improved version of our preliminary work (Ru et al, 2021). Compared with the conference version, this work further improves the learning-based strategy and proposes the memory-bank strategy which could learn visual words better.…”

Section: Image Ours Cams

mentioning

confidence: 99%

“…In Fig. 6, we visualize the generated CAMs and compare them with the results of recent methods, including IRNet (Ahn et al, 2019), SEAM (Wang et al, 2020b), and VWE (our previous work with HP and simple visual words encoder) (Ru et al, 2021). The results of the learning-based strategy (Ours-L) and the memory-bank strategy (Ours-M) are both presented.…”

Section: Image

mentioning

confidence: 99%


Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Ru, Du, Zhan et al. 2022

Preprint

Self Cite


Weakly-Supervised Semantic Segmentation (WSSS) methods with image-level labels generally train a classification network to generate the Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods still perform far from satisfactorily because their adopted CAMs 1) typically focus on partial discriminative object regions and 2) usually contain useless background regions. These two problems are attributed to the sole image-level supervision and aggregation of global information when training the classification networks. In this work, we propose the visual words learning module and hybrid pooling approach, and incorporate them in the classification network to mitigate the above problems. In the visual words learning module, we counter the first problem by enforcing the classification network to learn fine-grained visual word labels so that more object extents could be discovered. Specifically, the visual words are learned with a codebook, which could be updated via two proposed strategies, i.e. learning-based strategy and memory-bank strategy. The second drawback of CAMs is alleviated with the proposed hybrid pooling, which incorporates the global average and local discriminative information to simultaneously ensure object completeness and reduce background regions. We
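The two ideas named in this abstract, visual word labels derived from a codebook and hybrid pooling, can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not give the formulas, so the cosine-similarity assignment, the mixing weight `lam`, the use of global max pooling as the "local discriminative" term, and all function names below are assumptions made purely for illustration.

```python
import numpy as np

def assign_visual_words(features, codebook):
    """Assign each pixel feature to its nearest codebook entry.

    features: (N, D) pixel features; codebook: (K, D) visual words.
    Cosine similarity is an assumed choice of distance.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    c = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    return (f @ c.T).argmax(axis=1)  # (N,) index of the best-matching word

def visual_word_labels(assignments, num_words):
    """Multi-hot image-level label: which visual words occur in the image."""
    labels = np.zeros(num_words, dtype=np.float32)
    labels[np.unique(assignments)] = 1.0
    return labels

def hybrid_pool(score_maps, lam=0.5):
    """Mix global average pooling with global max pooling (assumed form).

    score_maps: (num_classes, H, W); lam is a hypothetical mixing weight.
    """
    gap = score_maps.mean(axis=(1, 2))  # global, favors complete regions
    gmp = score_maps.max(axis=(1, 2))   # local peak, favors discriminative parts
    return lam * gap + (1.0 - lam) * gmp

# Toy usage.
codebook = np.eye(3)
features = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.9, 0.1, 0.0]])
assignments = assign_visual_words(features, codebook)
labels = visual_word_labels(assignments, num_words=3)
pooled = hybrid_pool(np.array([[[0.0, 2.0], [2.0, 4.0]]]))
```

In this sketch the codebook is fixed; the learning-based and memory-bank strategies mentioned in the abstract concern how the codebook is updated and are not modeled here.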
