2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00084

Transformer Interpretability Beyond Attention Visualization

Cited by 388 publications (261 citation statements)
References 22 publications
“…Finally, the underlying deep models are black-boxes, making it hard to understand why specific images are identified and described as inappropriate. Combining Q16 with explainable AI methods such as [6] to explain the reasons is likely to improve the datasheet.…”
Section: Discussion (citation type: mentioning, confidence: 99%)
“…While this does not provide a deep understanding of the kind of relationships the model has learned [205], it yields some insights as to what it deems important for specific samples [206]. Few works have tried to interpret Transformers further than this for vision [207], and so far within the literature of VTs we only find a limited subset of works that visualize these attention activations for specific samples [66], [69], [99], [130]. The work of [14] consistently finds 6 different patterns in their cross-modal attention, showing, for instance, how some heads learn local, modality-specific, or cross-modal attention.…”
Section: The Road Ahead (citation type: mentioning, confidence: 99%)
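
The kind of interpretation the quoted survey alludes to, going further than raw attention maps, can be sketched as a gradient-weighted attention rollout in the spirit of this paper's Transformer Attribution method. The Python sketch below is a simplified illustration, not the authors' code: the full method additionally applies layer-wise relevance propagation rules, and the hook-captured attentions/gradients inputs and the function name are assumptions for the example.

    import torch

    def attention_relevance(attentions, gradients):
        # attentions, gradients: one (batch, heads, tokens, tokens) tensor per
        # Transformer block, e.g. captured with forward/backward hooks on the
        # attention softmax while back-propagating the target class score.
        num_tokens = attentions[0].shape[-1]
        # Start from the identity: each token is initially self-relevant.
        R = torch.eye(num_tokens).unsqueeze(0)
        for A, G in zip(attentions, gradients):
            # Weight each head's attention map by its gradient, keep the
            # positive part, and average over heads.
            A_bar = (G * A).clamp(min=0).mean(dim=1)
            # Propagate relevance through the block, keeping a skip term.
            R = R + A_bar @ R
        # The [CLS] row, excluding [CLS] itself, scores the image patches.
        return R[:, 0, 1:]

Reshaping the returned per-patch scores back into the patch grid yields a heatmap of the kind the quoted passages describe.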
“…Compared with dense attention, the k-NN attention filters out most irrelevant information from background regions that are similar to the foreground, and successfully concentrates on the most informative foreground regions. Images from different classes are visualized in Figure 4 using the Transformer Attribution method [5] on DeiT-Tiny. It can be seen that the k-NN attention is more concentrated and accurate, especially in situations of cluttered background and occlusion.…”
Section: Visualization (citation type: mentioning, confidence: 99%)
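
For context on the mechanism this quote describes: k-NN attention keeps, for each query, only its top-k most similar keys and masks out the rest before the softmax, so dissimilar background tokens receive exactly zero weight. A minimal sketch under that reading follows; the function name and tensor layout are illustrative assumptions, not the cited paper's code.

    import torch

    def knn_attention(q, k, v, top_k):
        # q, k, v: (batch, heads, tokens, head_dim) tensors.
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale  # (B, H, N, N)
        # The k-th largest score in each query row is the cut-off; every
        # score below it becomes -inf and gets zero weight after softmax.
        kth = scores.topk(top_k, dim=-1).values[..., -1:]
        attn = scores.masked_fill(scores < kth, float("-inf")).softmax(dim=-1)
        return attn @ v

Masking before the softmax, rather than zeroing weights afterwards, keeps the retained weights a proper probability distribution over the selected keys.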