TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization

Gao, Wei; Wan, Fang; Pan, Xingjia; Peng, Zhiliang; Tian, Qi; Han, Zhenjun; Zhou, Bolei; Ye, Qixiang

doi:10.1109/iccv48922.2021.00288

Cited by 120 publications

(76 citation statements)

References 52 publications


“…We expect to convert tokens into activation maps for each label class (i.e. aware of semantic meaning [16]). For strong-label datasets, we can let the model directly calculate the loss in specific time ranges.…”

Section: Token Semantic Module (mentioning)

confidence: 99%


HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection

Chen, Du, Zhu et al. 2022

Preprint


Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, while relying on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class feature maps, thus enabling the model to perform audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
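The idea of mapping output tokens into per-class maps can be illustrated with a minimal sketch. This is not the HTS-AT or TS-CAM implementation: it assumes a plain per-token linear projection in place of the papers' learned layers, and the names `token_semantic_maps`, `clip_logits`, and `active_frames`, as well as the threshold, are hypothetical.

```python
from typing import List


def token_semantic_maps(tokens: List[List[float]],
                        weights: List[List[float]],
                        bias: List[float]) -> List[List[float]]:
    """Project each token (length D) to C class scores.

    tokens:  T tokens, one per time frame (assumed ordering).
    weights: C rows of length D, one row per class (illustrative parameters).
    Returns a T x C table: one activation value per frame and class.
    """
    maps = []
    for tok in tokens:
        row = [sum(w * x for w, x in zip(w_c, tok)) + b_c
               for w_c, b_c in zip(weights, bias)]
        maps.append(row)
    return maps


def clip_logits(maps: List[List[float]]) -> List[float]:
    """Average each class map over time to get clip-level class scores."""
    num_frames = len(maps)
    num_classes = len(maps[0])
    return [sum(maps[t][c] for t in range(num_frames)) / num_frames
            for c in range(num_classes)]


def active_frames(maps: List[List[float]], class_idx: int,
                  threshold: float) -> List[int]:
    """Return the frame indices where one class map exceeds the threshold."""
    return [t for t, row in enumerate(maps) if row[class_idx] > threshold]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 frames, 2 features
weights = [[1.0, 0.0], [0.0, 1.0]]              # 2 classes
bias = [0.0, 0.0]
maps = token_semantic_maps(tokens, weights, bias)
print(active_frames(maps, 0, 0.5))
```

Because the clip-level score is an average of the per-frame class maps, training on clip labels alone still shapes each frame's activation, which is what makes thresholding the maps usable for localization in time.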


“…This inspires us to design a module that makes every output token of an audio transformer aware of the semantic meaning of events (i.e. a token-semantic module [16]) for supporting more audio tasks (e.g. sound event detection and localization).…”

Section: Introduction (mentioning)

confidence: 99%


“…Similar to [8] Baseline models. To validate our F-CAM method, we compare with recent WSOL methods, including: CAM [57], HaS [34], ACoL [53], SPG [54], ADL [9], CutMix [51], CSTN [22], TS-CAM [13], MEIL [21], DANet [47], SPOL [44], ICL [17], NL-CCAM [49], I²C [55], RCAM [56], GC-Net [20], ADL-TAP [1], GradCAM [32], Grad-Cam++ [7], Smooth-GradCAM++ [25], XGradCAM [12], LayerCAM [15]. For CAM, HaS, ACoL, SPG, ADL, and CutMix, we present the results reported in [8].…”

Section: Implementation Details (mentioning)

confidence: 99%

F-CAM: Full Resolution Class Activation Maps via Guided Parametric Upscaling

Belharbi, Sarraf, Pedersoli et al. 2021

Preprint


Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks, allowing for CNN visualization and interpretation without training on fully annotated image datasets. CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50. Due to convolution and downsampling/pooling operations, these backbones yield low-resolution CAMs with a down-scaling factor of up to 32, making accurate localization more difficult. Interpolation is required to restore full-size CAMs, but it does not consider the statistical properties of the objects, leading to activations with inconsistent boundaries and inaccurate localizations. As an alternative, we introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full-resolution CAMs (F-CAMs). In particular, we propose a trainable decoding architecture that can be connected to any CNN classifier to produce more accurate CAMs. Given an original (low-resolution) CAM, foreground and background pixels are randomly sampled for fine-tuning the decoder. Additional priors such as image statistics and size constraints are also considered to expand and refine object boundaries. Extensive experiments using three CNN backbones and six WSOL baselines on the CUB-200-2011 and OpenImages datasets indicate that our F-CAM method yields a significant improvement in CAM localization accuracy. F-CAM performance is competitive with state-of-the-art WSOL methods, yet it requires fewer computational resources during inference.
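To make the two steps mentioned in this abstract concrete, here is a rough sketch of (a) non-parametric upscaling of a low-resolution CAM and (b) random sampling of foreground and background pixels from it. It is not F-CAM's trainable decoder: nearest-neighbour interpolation stands in for whatever interpolation a baseline uses, and the fixed threshold, the seed, and the function names `upsample_nearest` and `sample_seeds` are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int]


def upsample_nearest(cam: List[List[float]], factor: int) -> List[List[float]]:
    """Upscale a 2D CAM by repeating each value `factor` times per axis."""
    out = []
    for row in cam:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def sample_seeds(cam: List[List[float]], n: int, threshold: float = 0.5,
                 seed: int = 0) -> Tuple[List[Pixel], List[Pixel]]:
    """Randomly pick up to n foreground and n background pixel coordinates.

    A pixel counts as foreground if its activation is >= threshold
    (an illustrative rule, not the criterion used in the paper).
    """
    rng = random.Random(seed)
    coords = [(r, c) for r, row in enumerate(cam) for c in range(len(row))]
    fg = [(r, c) for r, c in coords if cam[r][c] >= threshold]
    bg = [(r, c) for r, c in coords if cam[r][c] < threshold]
    return rng.sample(fg, min(n, len(fg))), rng.sample(bg, min(n, len(bg)))


low_res = [[0.9, 0.1],
           [0.2, 0.8]]
full = upsample_nearest(low_res, 2)      # 2x2 map becomes 4x4
fg_seeds, bg_seeds = sample_seeds(full, 3)
print(len(full), len(full[0]), len(fg_seeds), len(bg_seeds))
```

The blocky output of `upsample_nearest` shows why plain interpolation gives coarse boundaries: every upscaled pixel inherits a value from one low-resolution cell, regardless of the image content around it.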


“…2) Affordance Detection from Machine Learning Perspectives Weakly Supervised, Semi-Supervised, and Unsupervised Affordance Detection: Affordance detection based on supervised learning usually requires large-scale labeled data with pixel-level accurate annotations for training, which are labor-intensive to collect and annotate. Alternatively, weakly supervised, semi-supervised, and unsupervised learning methods are also worth further study for affordance detection (Gao et al. 2021; Nagarajan et al. 2019; Pan et al. 2021; Wang et al. 2021b).…”

Section: Multimodal Affordance Detection (mentioning)

confidence: 99%

One-Shot Object Affordance Detection in the Wild

Zhai, Luo, Zhang et al. 2021

Preprint


Affordance detection refers to identifying the potential action possibilities of objects in an image, which is a crucial ability for robot perception and manipulation. To empower robots with this ability in unseen scenarios, we first study the challenging one-shot affordance detection problem in this paper, i.e., given a support image that depicts the action purpose, all objects in a scene with the common affordance should be detected. To this end, we devise a One-Shot Affordance Detection Network (OSAD-Net) that firstly estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images. Through collaboration learning, OSAD-Net can capture the common characteristics between objects having the same underlying affordance and learn a good adaptation capability for perceiving unseen affordances. Besides, we build a large-scale Purpose-driven Affordance Dataset v2 (PADv2) by collecting and labeling 30k images from 39 affordance and 103 object categories. With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods and may also facilitate downstream vision tasks, such as scene understanding, action recognition, and robot manipulation. Specifically, we conducted comprehensive experiments on PADv2 dataset by including 11 advanced models from several related research fields. Experimental results demonstrate the superiority of our model over previous representative ones in terms of both objec-
