Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.95
Do Explanations Help Users Detect Errors in Open-Domain QA? An Evaluation of Spoken vs. Visual Explanations

Abstract: While research on explaining predictions of open-domain QA systems (ODQA) is gaining momentum, most works do not evaluate whether these explanations improve user trust. Furthermore, many users interact with ODQA using voice-assistants, yet prior works exclusively focus on visual displays, risking (as we also show) incorrectly extrapolating the effectiveness of explanations across modalities. To better understand the effectiveness of ODQA explanation strategies in the wild, we conduct user studies that measure…

Cited by 6 publications (3 citation statements)
References: 28 publications
“…On the other hand, multiple studies do find evidence of synergistic human-computer systems. For instance, in the study with the highest ρ ratio in our study, [41] demonstrate how algorithms can improve human decision-making in open-domain question answering tasks. In their experiment, the condition in which humans work alone achieves an accuracy of 57% and the condition in which the algorithm works alone achieves an accuracy of 50%.…”
Section: Study 1: Analysis Of Recent Studies That Evaluate Human-comp... (mentioning)
confidence: 68%
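The quoted comparison invites a quick arithmetic reading. The citing paper's exact definition of ρ is not reproduced in the excerpt; a common way to express such a human–AI synergy ratio (an assumed formulation, not necessarily the cited authors' own) is

\[
\rho = \frac{\mathrm{acc}_{\mathrm{human+AI}}}{\max\left(\mathrm{acc}_{\mathrm{human}},\ \mathrm{acc}_{\mathrm{AI}}\right)}
\]

Under this reading, with human-alone accuracy of 0.57 and algorithm-alone accuracy of 0.50, any joint human–AI condition scoring above 0.57 would give ρ > 1, i.e. the team outperforms either party working alone.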
“…Specifically, we need to know what users do with the model output across multiple interactions (e.g., verify, fact check, revise, accept). For example, González et al (2021) investigate the connection between explanations (D2) and user trust in the context of question answering systems. In their study users are presented with explanations in different modalities and either accept (trust) or reject (don't trust) candidate answers.…”
Section: Trustworthiness and User Trust (mentioning)
confidence: 99%
“…Other work that addresses human-in-the-loop evaluation of interpretability for deep neural models (a) includes Gonzalez and Søgaard (2020) and González et al (2021), but both evaluate interpretability methods with lay people and on non-critical tasks, ignoring (b) and (c). Attempts to evaluate interpretability methods for experts performing critical tasks, have, to the best of our knowledge, been limited to automatic evaluation or evaluation against gold-standard human rationales.…”
Section: Related Work (mentioning)
confidence: 99%