Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.24

Gradient-based Analysis of NLP Models is Manipulable

Abstract: Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a FACADE model that overwhelms the gradients w…
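The merging trick described in the abstract can be pictured with a short sketch. The snippet below is not the paper's construction (the authors merge the FACADE model into the target's layers and train it); it is only a simplified illustration of the underlying loophole, with all class and variable names invented for the example: a hypothetical facade branch adds the same per-example scalar to every logit, so the argmax prediction never changes, yet the gradient of any individual logit with respect to the input now flows through the facade as well.

```python
import torch
import torch.nn as nn

class MergedModel(nn.Module):
    """Simplified illustration (not the paper's exact layer-level merge):
    a facade branch adds one scalar per example to every logit. A constant
    shift across classes leaves the argmax, and hence the prediction,
    unchanged, but the gradient of any single logit w.r.t. the input now
    also flows through the facade, which can be trained to dominate it."""

    def __init__(self, target: nn.Module, facade: nn.Module):
        super().__init__()
        self.target = target  # model whose predictions we want to keep
        self.facade = facade  # maps the input to a (batch, 1) scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.target(x)  # (batch, num_classes)
        shift = self.facade(x)   # (batch, 1), broadcast across classes
        return logits + shift    # same predictions, contaminated gradients
```

Note that this particular shift would cancel in a loss-based saliency, since cross-entropy is invariant to a per-example constant added to all logits; handling such saliencies too is part of why the paper's layer-level FACADE construction is more involved than this sketch.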

Cited by 25 publications (16 citation statements) | References 29 publications
“…Sundararajan et al. (2017) show that in practice, gradients are saturated: they may all be close to zero for a well-fitted function, and thus not reflect importance. Adversarial methods can also distort gradient-based saliences while keeping a model's prediction the same (Ghorbani et al., 2019; Wang et al., 2020). We compare greedy rationalization to gradient saliency methods in Section 8.…”
Section: Related Work
confidence: 99%
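For context, the gradient saliency that such attacks distort can be written in a few lines. This is a generic gradient-times-input sketch, not tied to any of the cited papers' implementations; it assumes a hypothetical `model` that maps one example's (seq_len, dim) embedding tensor to the scalar score of the predicted class.

```python
import torch

def gradient_saliency(model, embeddings: torch.Tensor) -> torch.Tensor:
    """Generic gradient-times-input token saliency (sketch).

    Assumes `model(embeddings)` returns the scalar score of the predicted
    class for a single example of shape (seq_len, dim). Returns one
    importance value per token.
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    score = model(embeddings)  # scalar score of the predicted class
    score.backward()           # populates embeddings.grad
    return (embeddings.grad * embeddings).norm(dim=-1)  # (seq_len,)
```

On a well-fitted model these gradients can saturate toward zero (the Sundararajan et al. point above), and on an adversarially merged model they can be steered toward tokens of the attacker's choosing, which is why the quoted works treat raw gradient saliency with caution.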
“…L2E can be applied to any Natural Language Processing task to which an underlying feature-based explanation algorithm can be applied, such as Natural Language Inference and Question Answering (Wang et al., 2020). In this paper, we focus on explaining text classification models.…”
Section: Learning to Explain (L2E)
confidence: 99%
“…Some other popular explainability methods include neuron-based analysis and transfer learning (Rethmeier et al., 2020) and promising gradient-based analysis, which directly reflects the knowledge learned by the model (Wallace et al., 2019). However, it has been recently shown that it is relatively easy to manipulate and corrupt gradient-based explainability methods (Wang et al., 2020).…”
Section: Introduction
confidence: 99%