Background Activation Suppression for Weakly Supervised Object Localization

Wu, Ping-Yu; Zhai, Wei; Cao, Yang

doi:10.48550/arxiv.2112.00580

Cited by 3 publications

(7 citation statements)

References 29 publications

“…At first, our method may not achieve satisfactory results in complicated scenes containing many objects. To improve our method, we could adopt a locate-then-segment framework to locate objects [57] then generate the mask. Secondly, our approach aims at detecting all possible objects in the image and cannot detect the one that best fits the intention.…”

Section: Conclusion and Discussion (mentioning)

confidence: 99%

Phrase-Based Affordance Detection via Cyclic Bilateral Interaction

Lu, Zhai, Luo et al. 2022

Preprint

Self Cite

Lift (dumbbell)
PA: lift, lift up, raise, grab, put down, pick up, take down, push, hold up, uplift, cause to raise, hold high
F: exercise, used for exercise of muscle-building
E: indoor exercise

Pick up (chopsticks)
PA: take and lift upward, hold, grasp, move up and down, hold and lift
F: pass food, kitchen utensil
E: usually appears in kitchen or dining table
AF: usually are made of wood

Rolling (baseball, croquet ball, golf ball, table tennis ball, tennis ball)
PA: rolling, move, can roll, move by rotating, roll over, rotate rapidly, turn round and round, rotate, move fast, spin, whirl, move around an axis or a center, cycle, revolve, change orientation or direction, twirl revolve
AF: spherical

Mix (chopsticks, spoon, whisk)

“…Metrics. Following [29,19,43], for localization, we utilize GT-known localization accuracy (GT-known Loc), Top-1/Top-5 localization accuracy (Top-1/Top-5 Loc), and maximal box accuracy (MaxBoxAccV2) [5] as evaluation metrics. A GT-known Loc prediction is correct when the intersection over union (IoU) of the predicted bounding box and the ground-truth bounding box is 50% or more.…”

Section: Methods (mentioning)

confidence: 99%
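
The GT-known Loc criterion quoted above can be sketched in a few lines of Python. This is only an illustration, not code from any of the cited papers: the box format `(x_min, y_min, x_max, y_max)` and the function names `box_iou` and `gt_known_loc` are assumptions made for the example, and it assumes one predicted box and one ground-truth box per image.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gt_known_loc(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of images whose predicted box has IoU >= thr with the ground truth."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(preds)
```

The class label plays no role here, which is what separates GT-known Loc from the Top-1/Top-5 variants named in the same statement.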

“…In the training phase, the input images are resized to 256×256 and then randomly cropped to 224×224. In the inference phase, following [29,9,36], we adopt ten crop augmentations to obtain classification results and replace random crop with center crop for localization.…”

Section: Methods (mentioning)

confidence: 99%
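
As a rough illustration of the crop geometry in the statement above, the sketch below computes crop windows for a 256×256 image and a 224×224 crop using only the Python standard library. The function names are hypothetical, and the ten-crop layout (four corners plus the center, each with a horizontally flipped copy) is the common convention and an assumption here; the statement itself does not spell it out.

```python
import random
from typing import List, Optional, Tuple

# Assumed crop format for this sketch: (left, top, right, bottom) in pixels.
CropBox = Tuple[int, int, int, int]


def random_crop_box(size: int = 256, crop: int = 224,
                    rng: Optional[random.Random] = None) -> CropBox:
    """Training-time crop: a window placed uniformly at random inside the image."""
    rng = rng or random.Random()
    left = rng.randint(0, size - crop)  # randint is inclusive on both ends
    top = rng.randint(0, size - crop)
    return (left, top, left + crop, top + crop)


def center_crop_box(size: int = 256, crop: int = 224) -> CropBox:
    """Localization-time crop: a window centered in the image."""
    off = (size - crop) // 2
    return (off, off, off + crop, off + crop)


def ten_crop_boxes(size: int = 256, crop: int = 224) -> List[Tuple[CropBox, bool]]:
    """Classification-time crops as (box, flipped) pairs.

    Assumed layout: four corners and the center, then the same five
    windows marked for horizontal flipping.
    """
    m = size - crop
    corners = [(0, 0), (m, 0), (0, m), (m, m)]
    boxes = [(l, t, l + crop, t + crop) for l, t in corners]
    boxes.append(center_crop_box(size, crop))
    return [(b, False) for b in boxes] + [(b, True) for b in boxes]
```

With the default sizes, `center_crop_box()` gives `(16, 16, 240, 240)`, so localization is evaluated on a single fixed window while classification scores would be aggregated over all ten views.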

“…I2C [40] and ISIC [28] consider feature similarities across different objects to achieve more complete and robust localization. ORNet [30], FAM [18] and BAS [29] propose to generate a foreground prediction map (FPM) to implement localization. Unlike FPM-based methods that require complex and heavy structures, and the learning of generator relies on a specific feature layer of the classification network.…”

Section: Related Work (mentioning)

confidence: 99%

Spatial-Aware Token for Weakly Supervised Object Localization

Wu, Zhai, Cao et al. 2023

Preprint

Weakly supervised object localization (WSOL) is a challenging task aiming to localize objects with only image-level supervision. Recent works apply visual transformer to WSOL and achieve significant success by exploiting the long-range feature dependency in self-attention mechanism. However, existing transformer-based methods synthesize the classification feature maps as the localization map, which leads to optimization conflicts between classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for localization task. Then a spatial aware attention module is constructed, which allows spatial token to generate foreground probabilities of different patches by querying and to extract localization knowledge from the classification task. Besides, for the problem of sparse and unbalanced pixel-level supervision obtained from the image-level label, two spatial constraints, including batch area loss and normalization loss, are designed to compensate and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of using only 1 image per class from ImageNet for training, SAT already exceeds the SOTA method by 2.1% GT-known Loc. Code and models are available at https://github.com/wpy1999/SAT.

“…Weakly Supervised Localization (WSOL) Class Activation Map (CAM) explainability methods have been offered in recent years for solving WSOL tasks [87,88,56,43]. Most of these algorithms train a classifier to distinguish between sub-categories of the main object (Birds, Cars, Dog etc), employing a localization loss term for the explainability map [76,78,48,28,52].…”

Section: Related Work (mentioning)

confidence: 99%

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Shaharabany, Tewel, Wolf 2022

Preprint

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate a hypothesis regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows very convincing results in the novel task of weakly-supervised open-world purely visual phrase-grounding presented in our work. For example, on the datasets used for benchmarking phrase-grounding, our method results in a very modest degradation in comparison to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking.
