2022
DOI: 10.1016/j.inffus.2021.11.008
CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations

Cited by 80 publications (61 citation statements)
References 41 publications
“…In this example, we assume to have a pre-trained model (model), a batch of input and output pairs (x_batch, y_batch) and a set of attributions (a_batch). Needless to say, XAI evaluation is intrinsically difficult and there is no one-size-fits-all metric for all tasks - evaluation of explanations must be understood and calibrated from its context: the application, data, model, and intended stakeholders [10,34]. To this end, we designed Quantus to be highly customisable and easily extendable - documentation and examples on how to create new metrics as well as how to customise existing ones are included.…”
Section: Library Design (mentioning)
confidence: 99%
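
The usage pattern described in this excerpt can be illustrated with a short sketch. It is only a minimal sketch under assumptions: the stand-in model and random data replace the "pre-trained model" and batches the excerpt presumes, and the specific metric class (quantus.FaithfulnessCorrelation) and its constructor arguments are assumptions about the Quantus API rather than details given in the excerpt.

import numpy as np
import torch
import quantus

# Tiny stand-in classifier and random batches so the sketch is self-contained.
# In practice, model, x_batch, y_batch and a_batch come from your own pipeline,
# as assumed in the excerpt above.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_batch = np.random.rand(8, 3, 32, 32).astype(np.float32)
y_batch = np.random.randint(0, 10, size=8)
a_batch = np.random.rand(8, 3, 32, 32).astype(np.float32)  # placeholder attributions

# Assumed metric class and arguments; Quantus metric instances are callable on a batch.
metric = quantus.FaithfulnessCorrelation(nr_runs=10, subset_size=64)
scores = metric(model=model, x_batch=x_batch, y_batch=y_batch,
                a_batch=a_batch, device="cpu")
print("mean score:", float(np.mean(scores)))

The same call pattern applies to other metric instances, which is what makes the library easy to extend with custom metrics.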
“…Despite much excitement and activity in the field of eXplainable Artificial Intelligence (XAI) [1,2,3,4,5], the evaluation of explainable methods still remains an unsolved problem [6,7,8,9,10]. Unlike in traditional Machine Learning (ML), the task of explaining inherently lacks "ground-truth" data - there is no universally accepted definition of what constitutes a "correct" explanation and less so, which properties an explanation ought to fulfill [11].…”
Section: Introduction (mentioning)
confidence: 99%
“…A limitation that the synthetic dataset of Arras et al. (2021) also shares. Arras et al. (2021) created a dataset to technically evaluate saliency methods on a visual question answering task. TWO4TWO is the first dataset designed explicitly for human subject evaluations.…”
Section: Related Work (mentioning)
confidence: 99%
“…It can be used for testing the quality of explanations and concept learning. Additionally, [6] proposed the CLEVR-XAI-simple and CLEVR-XAI-complex datasets which provide ground-truth segmentation information for heatmap-based visual explanations. Our CLEVR-X augments the existing CLEVR dataset with explanations, but in contrast to (heatmap-based) visual explanations, we focus on natural language explanations.…”
Section: Related Work (mentioning)
confidence: 99%
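
The ground-truth segmentation masks mentioned in this excerpt can be turned into a simple quantitative score for heatmap explanations. The sketch below computes relevance mass accuracy, one of the measures used in the CLEVR-XAI paper: the fraction of total attribution mass that falls inside the ground-truth mask. Restricting the computation to positive relevance and the variable names are illustrative assumptions.

import numpy as np

def relevance_mass_accuracy(heatmap, gt_mask):
    # heatmap: 2-D array of attribution scores (H, W)
    # gt_mask: 2-D boolean array (H, W), True inside the ground-truth region
    relevance = np.clip(heatmap, 0, None)  # keep positive evidence only (assumption)
    total = relevance.sum()
    if total == 0:
        return 0.0
    return float(relevance[gt_mask].sum() / total)

# Toy example: a 4x4 heatmap whose mass lies entirely inside the masked corner.
heatmap = np.zeros((4, 4)); heatmap[:2, :2] = 1.0
gt_mask = np.zeros((4, 4), dtype=bool); gt_mask[:2, :2] = True
print(relevance_mass_accuracy(heatmap, gt_mask))  # -> 1.0

A score near 1.0 indicates that the explanation concentrates its evidence on the ground-truth region, which is the kind of comparison the CLEVR-XAI-simple and CLEVR-XAI-complex datasets enable.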
“…selected; (3) A CAPTCHA [3] to verify that the user is human; (4) The problem definition consisting of a question and an image; (5) A user qualification step, for which the user has to correctly answer a question about an image. This ensures that the user is able to answer the question in the first place, a necessary condition to participate in our user study; (6) Two explanations from which the user needs to choose one. Example screenshots of the user interface for the user study are shown in Fig.…”
Section: User Study On Explanation Completeness and Relevance (mentioning)
confidence: 99%