2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00045
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Abstract: Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks including object detection and image segmentation. Unlike Transformers that only use self-attention, Transformers with co-attention require to consider multiple attention maps in parallel in order t…

Cited by 112 publications (54 citation statements)
References 42 publications
“…'Generic Attention Explainability' (GAE) by Chefer et al. [2021a] propagates attention gradients together with gradients from other parts of the network, resulting in state-of-the-art performance in explaining Transformer architectures.…”
Section: Benchmark Methods (mentioning, confidence: 99%)
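The rule summarized in this statement admits a compact description. Below is a minimal sketch of gradient-weighted relevancy propagation for a self-attention-only Transformer, assuming the per-layer attention maps and their gradients with respect to the target score have already been collected; the function name and tensor layout are illustrative, not the authors' code.

```python
import torch

def gae_relevancy(attn_maps, attn_grads):
    """Sketch of GAE-style relevancy propagation for self-attention.

    attn_maps, attn_grads: lists of tensors, one per layer, each of shape
    (heads, tokens, tokens): attention probabilities and their gradients
    with respect to the target class score.
    """
    num_tokens = attn_maps[0].shape[-1]
    # Relevancy starts as the identity: each token is relevant to itself.
    R = torch.eye(num_tokens)
    for A, grad in zip(attn_maps, attn_grads):
        # Gradient-weighted attention, keeping only positive contributions,
        # averaged over the attention heads.
        A_bar = (grad * A).clamp(min=0).mean(dim=0)
        # Accumulate relevancy through this layer.
        R = R + A_bar @ R
    return R
```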
“…We use the pretrained vision transformer [22] ViT-B/32 model of CLIP, which performs global context modeling using self-attention between patches of a given image to capture meaningful features. We use the recent transformer interpretability method by Chefer et al. [7] to extract a relevancy map from the self-attention heads, without any text supervision.…”
Section: Strokes Initialization (mentioning, confidence: 99%)
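To make this statement concrete, here is a hypothetical helper showing how a token-level relevancy matrix from a ViT-B/32 image encoder could be turned into a pixel-level saliency map, assuming a 224x224 input, 32x32 patches, and the [CLS] token in position 0; the function name and layout are assumptions, not taken from either cited paper.

```python
import torch
import torch.nn.functional as F

def relevancy_to_saliency(R, image_size=224, patch_size=32):
    """Hypothetical helper: convert a (tokens, tokens) relevancy matrix
    into a pixel-level saliency map. Token 0 is assumed to be [CLS] and
    the remaining tokens are image patches in raster order."""
    grid = image_size // patch_size            # 7 for a 224 / 32 setup
    # Relevance of each patch to the [CLS] token, dropping [CLS] itself.
    cls_row = R[0, 1:]
    saliency = cls_row.reshape(1, 1, grid, grid)
    # Upsample to the input resolution and normalize to [0, 1].
    saliency = F.interpolate(saliency, size=(image_size, image_size),
                             mode="bilinear", align_corners=False)
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return saliency[0, 0]
```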
“…Recently, a first attempt at explaining predictions by a VL transformer was proposed in [6]. There, the authors constructed a relevancy map using the model's attention layers to track the interactions between modalities.…”
Section: Related Work (mentioning, confidence: 99%)
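For the bi-modal case referenced here, the relevancy bookkeeping has to track both within-modality and cross-modality maps. The sketch below shows one simplified text-side update step; it omits the normalization terms used by the full method, and all names and shapes are illustrative assumptions.

```python
import torch

def bimodal_relevancy_step(R_tt, R_ti, A_self, A_cross):
    """Simplified sketch of one text-side update in a bi-modal encoder.

    R_tt:    (t_tokens, t_tokens) text-to-text relevancy
    R_ti:    (t_tokens, i_tokens) text-to-image relevancy
    A_self:  gradient-weighted text self-attention, (t_tokens, t_tokens)
    A_cross: gradient-weighted text-to-image cross-attention, (t_tokens, i_tokens)
    """
    # Self-attention propagates relevancy within the text modality and
    # carries along whatever cross-modal relevancy has accumulated so far.
    R_tt = R_tt + A_self @ R_tt
    R_ti = R_ti + A_self @ R_ti
    # Cross-attention adds direct text-to-image interactions
    # (the full method additionally weights this by normalized
    # within-modality relevancies).
    R_ti = R_ti + A_cross
    return R_tt, R_ti
```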