2021
DOI: 10.1145/3479569

Discovering and Validating AI Errors With Crowdsourced Failure Reports

Abstract: AI systems can fail to learn important behaviors, leading to real-world issues like safety concerns and biases. Discovering these systematic failures often requires significant developer attention, from hypothesizing potential edge cases to collecting evidence and validating patterns. To scale and streamline this process, we introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors. We also design and implement Deblinder…
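The abstract describes failure reports as end-user descriptions of how or why a model failed, which developers then aggregate to find systematic errors. As an illustration only, a minimal sketch of what such a report and a simple grouping heuristic might look like; the field names and the tag-counting rule are assumptions, not the paper's Deblinder implementation:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FailureReport:
    """One end-user report of a suspected model failure (illustrative fields only)."""
    input_id: str          # identifier of the input the model mishandled
    model_output: str      # what the model predicted
    expected_output: str   # what the user believes it should have produced
    description: str       # free-text account of how or why the model failed
    tags: tuple            # user-chosen labels, e.g. ("night", "blurry")

def candidate_failure_patterns(reports, min_reports=3):
    """Group reports by tag and surface tags reported often enough to suggest
    a systematic failure rather than a one-off mistake."""
    counts = Counter(tag for r in reports for tag in r.tags)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_reports]
```

Under this sketch, three independent reports sharing a "night" tag would surface "night" as a candidate systematic failure for the developer to validate against held-out data.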


Cited by 39 publications (27 citation statements)
References 63 publications
“…For instance, designers might develop auditing interfaces that automatically surface potentially important instances for everyday auditors to examine further. As a foundation for such interfaces, it may be possible to build upon emerging algorithmic techniques for crowd-in-the-loop detection of "unknown unknowns" in ML models (e.g., [6,56,59,64,87,101]), which automatically surface cases that are more likely to be mislabelled and/or misclassified. These methods focus on surfacing regions of a model's error space in which the model is highly confident yet incorrect [56].…”
Section: Design For Algorithmic Guidancementioning
confidence: 99%
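The methods cited in this statement surface "unknown unknowns": cases the model misclassifies while being highly confident in its wrong prediction. A generic sketch of that idea, assuming predicted class probabilities and crowd-verified reference labels are available; the threshold and data layout are assumptions, not the procedure of any specific cited method:

```python
import numpy as np

def high_confidence_errors(probs, predictions, labels, confidence_threshold=0.9):
    """Surface instances the model misclassifies with high confidence.

    probs:       (n_samples, n_classes) predicted class probabilities
    predictions: (n_samples,) predicted class indices
    labels:      (n_samples,) reference labels (e.g. from crowd verification)
    """
    # Confidence the model assigned to its own prediction.
    confidence = probs[np.arange(len(predictions)), predictions]
    # Wrong prediction AND high confidence = candidate "unknown unknown".
    mask = (predictions != labels) & (confidence >= confidence_threshold)
    idx = np.where(mask)[0]
    # Most confidently wrong cases first, for human review.
    return idx[np.argsort(-confidence[idx])]
```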
“…Adding data [16,45,51,121]; Relabeling data [76]; Reweighting data [12,64,137]; Collecting expert labels [98]; Passive observation [69,84,118]…”
Section: Active Data Collectionmentioning
confidence: 99%
“…Recently, active learning has been studied alongside model transparency, specifically using explanations to assist experts with choosing which points to add to D [51]. Cabrera et al [16] propose an extensive visual analytics system that allows experts to verify and produce examples of crowd-sourced errors, which can be thought of as additional data. Passive observation.…”
Section: Observation To Datasetmentioning
confidence: 99%
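The active-learning step described here, choosing which points experts should label and add to the dataset D, is often instantiated with uncertainty sampling. A generic sketch under that assumption, not the explanation-assisted method of [51] or the crowd-verification system of [16]:

```python
import numpy as np

def select_points_to_label(probs, budget=10):
    """Pick the unlabeled points the model is least sure about, measured by the
    margin between its top two class probabilities; these are candidates for an
    expert to label and add to the training set D."""
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    # Smallest margins first: the most ambiguous points get labeled.
    return np.argsort(margin)[:budget]
```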
“…While data labeling represents the most common use of crowdsourcing for training and evaluating machine learning models, human intelligence can be tapped in a much wider and more creative variety of ways. For example, the crowd might verify output from machine learning models, identify and categorize blind spots (Attenberg et al., 2011; Vandenhof, 2019) and other failure modes (Cabrera et al., 2021), and suggest useful features for a machine learning classifier (Cheng and Bernstein, 2015).…”
Section: Motivation and Backgroundmentioning
confidence: 99%