2020
DOI: 10.48550/arxiv.2012.09838
Preprint

Transformer Interpretability Beyond Attention Visualization

Abstract: Self-attention techniques, and specifically Transformers, are dominating the field of text processing and are becoming increasingly popular in computer vision classification tasks. In order to visualize the parts of the image that led to a certain classification, existing methods either rely on the obtained attention maps, or employ heuristic propagation along the attention graph. In this work, we propose a novel way to compute relevancy for Transformer networks. The method assigns local relevance based on the…

Cited by 15 publications (41 citation statements) | References 2 publications
“…This, however, neglects the intermediate attention scores, as well as the other components of the Transformers. As noted by Chefer et al. [5], the computation in each attention head mixes queries, keys, and values and cannot be fully captured by considering only the inner products of queries and keys, which is what is referred to as attention.…”
Section: Related Work (mentioning)
confidence: 99%
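
To make the point about mixing concrete, the following is a minimal NumPy sketch of a single toy attention head (the shapes, names, and random inputs are illustrative assumptions, not taken from the cited papers). It shows that the head output depends on the values V as well as on the query-key attention matrix, so visualizing the attention matrix alone is incomplete.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head attention over n tokens with head dimension d.
n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

A = softmax(Q @ K.T / np.sqrt(d))  # "attention": inner products of queries and keys
out = A @ V                        # the head output also mixes in the values

# A heatmap of A alone ignores how V (and the projections that follow)
# scale or cancel each token's contribution to `out`.
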
“…What is common to all of these is that the mapping from the two inputs to the prediction contains interaction between the two modalities. These interactions often challenge the existing explainability methods that are aimed at attention-based models, since, as far as we can ascertain, all existing Transformer explainability methods (e.g., [5,1]) heavily rely on self-attention, and do not provide adaptations to any other form of attention, which is commonly used in multi-modal Transformers.…”
Section: Introduction (mentioning)
confidence: 99%
“…For the interpretability of the classification model, we adopted a saliency-map visualization method tailored for ViT, suggested by Chefer et al. (2020), which computes relevancy for Transformer networks. Specifically, unlike traditional gradient propagation methods (Selvaraju et al., 2017; Smilkov et al., 2017; Srinivas and Fleuret, 2019) or attribution propagation methods (Bach et al., 2015; Gu et al., 2018), which rely on heuristic propagation along the attention graph or on the obtained attention maps, the method of Chefer et al. (2020) computes local relevance with deep Taylor decomposition and then propagates it through the layers. This relevance propagation method is especially useful for models based on the Transformer architecture, as it handles the challenges posed by self-attention operations and skip connections.…”
Section: Vision Transformer for Classification (mentioning)
confidence: 99%
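
As a rough illustration of this kind of layer-wise relevance propagation in a ViT-style model, the sketch below combines each block's attention map with its gradient and rolls the result across layers. This is a simplified, hypothetical variant in the spirit of such methods, not the exact rule of Chefer et al. (2020); the attn_maps and attn_grads inputs are assumed to have been captured with forward/backward hooks.

import torch

def aggregate_relevance(attn_maps, attn_grads):
    """Toy roll-out of relevance for a ViT-style classifier.

    attn_maps, attn_grads: lists with one tensor per Transformer block,
    each of shape (heads, tokens, tokens), captured via hooks.
    Returns a per-patch relevance vector for the [CLS] token.
    """
    num_tokens = attn_maps[0].shape[-1]
    # The identity accounts for the skip connection around each attention block.
    relevance = torch.eye(num_tokens)
    for A, dA in zip(attn_maps, attn_grads):
        # Keep only positively contributing, gradient-weighted attention,
        # averaged over heads.
        A_bar = (dA * A).clamp(min=0).mean(dim=0)
        relevance = relevance + A_bar @ relevance
    return relevance[0, 1:]  # relevance of image patches w.r.t. [CLS]

The resulting vector can be reshaped into the patch grid and upsampled to obtain a saliency map over the input image.
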
“…As the attention module in the vision transformer computes fully-connected relations among all of the input patches, the computational cost is quadratic in the length of the input sequence. On the other hand, previous works [6,8] have already shown the limited interpretability of the original vision transformer, where the raw attention produced by the architecture sometimes fails to highlight the informative regions of the input images.…”
Section: Introduction (mentioning)
confidence: 96%
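
The quadratic cost mentioned above follows from the attention matrix holding one score per pair of tokens. A small back-of-the-envelope sketch (the 224x224 input and 16x16 patch size are a common ViT configuration, assumed here for illustration, not specified by the citing work):

# ViT-style patching of a 224x224 image into 16x16 patches.
num_patches = (224 // 16) ** 2   # 196 patches
num_tokens = num_patches + 1     # plus the [CLS] token
attn_scores = num_tokens ** 2    # one score per token pair: 38,809 per head, per layer

# Halving the patch size quadruples the number of tokens and therefore
# increases attention compute/memory by roughly 16x.
print(num_tokens, attn_scores)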