Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.442
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive tes…

Cited by 474 publications (346 citation statements)
References 21 publications
“…We experimentally validated our findings from saliency maps and GANs by modifying important radiographic features. To detect whether the higher-level features that our saliency maps highlight are major contributors to the model’s classification, we used methods inspired by a behavioral testing approach 44 . For example, saliency maps highlight dataset-specific laterality markers and text within the images.…”
Section: Methods
confidence: 99%
“…In addition to failures on adversarially optimized noise maps, some models fail on simple, commonsense reasoning tasks. Ribeiro et al [149] propose the CheckList evaluation system to test language models on linguistic capabilities such as negation and vocabulary. A solution to these behavioral tests and adversarial examples would be to simply train the model on this task data.…”
Section: Generalization Metrics
confidence: 99%
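The negation capability mentioned in the statement above can be illustrated with a Minimum Functionality Test, one of CheckList's test types. This is a minimal sketch, not the paper's actual test suite: `predict_sentiment`, the templates, and the fill words are all hypothetical stand-ins for a real model and real test data.

```python
# Sketch of a CheckList-style MFT for negation: a negated positive word
# should flip the predicted sentiment to "negative".

def predict_sentiment(text: str) -> str:
    """Toy stand-in for the model under test (keyword rules only)."""
    negated = "not " in text or "n't " in text
    positive = any(w in text for w in ("good", "great", "love"))
    if positive and negated:
        return "negative"
    return "positive" if positive else "negative"

def negation_mft(templates, fills):
    """Fill each template with each word and collect examples where the
    model fails to predict 'negative'."""
    failures = []
    for template in templates:
        for word in fills:
            example = template.format(word)
            if predict_sentiment(example) != "negative":
                failures.append(example)
    return failures

failures = negation_mft(
    templates=["The movie was not {0}.", "I did not {0} this film."],
    fills=["good", "great", "love"],
)
print(f"{len(failures)} failing examples")  # 0 failing examples
```

Templated generation like this lets a single capability be probed with many surface variations, which is the core idea the citing work borrows.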
“…For NLP applications, typical ML testing practices struggle to translate to real-world settings, often overestimating performance capabilities. An effective way to address this is to devise a checklist of linguistic capabilities and test types, as in Ribeiro et al 45 ; interestingly, their test suite was inspired by metamorphic testing, which we suggested earlier in Level 7 for testing systems' AI integrations. A survey by Paleyes et al 32 goes over numerous case studies to discuss challenges in ML deployment.…”
Section: Related Work
confidence: 99%
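The metamorphic-testing connection drawn in the statement above can be sketched as an invariance check: applying a label-preserving perturbation to an input should not change the model's prediction. `predict_label` and `neutral_suffix` here are hypothetical stand-ins, not APIs from any cited work.

```python
# Sketch of a metamorphic (invariance) test: appending a neutral clause
# is a label-preserving perturbation, so predictions should not change.

def predict_label(text: str) -> str:
    """Hypothetical model under test (trivial keyword rule)."""
    return "positive" if "good" in text else "negative"

def invariance_test(inputs, perturb, predict):
    """Return the inputs whose prediction changes after perturbation,
    i.e. violations of the metamorphic relation."""
    return [x for x in inputs if predict(x) != predict(perturb(x))]

def neutral_suffix(text: str) -> str:
    return text + " By the way, I watched it on Tuesday."

broken = invariance_test(
    ["The food was good.", "Service was slow."],
    neutral_suffix,
    predict_label,
)
print(f"{len(broken)} violations")  # 0 violations
```

Because the relation is defined over input pairs rather than labeled examples, no gold annotations are needed, which is what makes this style of testing attractive for deployed NLP systems.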