Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.442
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive tes…

Cited by 474 publications (346 citation statements)
References 21 publications
“…We experimentally validated our findings from saliency maps and GANs by modifying important radiographic features. To detect whether the higher-level features that our saliency maps highlight are major contributors to the model’s classification, we used methods inspired by a behavioral testing approach 44 . For example, saliency maps highlight dataset-specific laterality markers and text within the images.…”
Section: Methods
confidence: 99%
“…In addition to failures on adversarially optimized noise maps, some models fail on simple, commonsense reasoning tasks. Ribeiro et al [149] propose the CheckList evaluation system to test language models on linguistic capabilities such as negation and vocabulary. A solution to these behavioral tests and adversarial examples would be to simply train the model on this task data.…”
Section: Generalization Metrics
confidence: 99%
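The negation capability mentioned in the statement above can be illustrated with a Minimum Functionality Test, one of CheckList's test types. This is a minimal sketch, not the paper's actual test suite: `predict_sentiment`, the templates, and the fill words are all hypothetical stand-ins for a real model and real test data.

```python
# Sketch of a CheckList-style MFT for negation: a negated positive word
# should flip the predicted sentiment to "negative".

def predict_sentiment(text: str) -> str:
    """Toy stand-in for the model under test (keyword rules only)."""
    negated = "not " in text or "n't " in text
    positive = any(w in text for w in ("good", "great", "love"))
    if positive and negated:
        return "negative"
    return "positive" if positive else "negative"

def negation_mft(templates, fills):
    """Fill each template with each word and collect examples where the
    model fails to predict 'negative'."""
    failures = []
    for template in templates:
        for word in fills:
            example = template.format(word)
            if predict_sentiment(example) != "negative":
                failures.append(example)
    return failures

failures = negation_mft(
    templates=["The movie was not {0}.", "I did not {0} this film."],
    fills=["good", "great", "love"],
)
print(f"{len(failures)} failing examples")  # 0 failing examples
```

Templated generation like this lets a single capability be probed with many surface variations, which is the core idea the citing work borrows.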
“…For NLP applications, typical ML testing practices struggle to translate to real-world settings, often overestimating performance capabilities. An effective way to address this is to devise a checklist of linguistic capabilities and test types, as in Ribeiro et al 45 ; interestingly, their test suite was inspired by metamorphic testing, which we suggested earlier in Level 7 for testing systems' AI integrations. A survey by Paleyes et al 32 goes over numerous case studies to discuss challenges in ML deployment.…”
Section: Related Work
confidence: 99%
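The metamorphic-testing connection drawn in the statement above can be sketched as an invariance check: applying a label-preserving perturbation to an input should not change the model's prediction. `predict_label` and `neutral_suffix` here are hypothetical stand-ins, not APIs from any cited work.

```python
# Sketch of a metamorphic (invariance) test: appending a neutral clause
# is a label-preserving perturbation, so predictions should not change.

def predict_label(text: str) -> str:
    """Hypothetical model under test (trivial keyword rule)."""
    return "positive" if "good" in text else "negative"

def invariance_test(inputs, perturb, predict):
    """Return the inputs whose prediction changes after perturbation,
    i.e. violations of the metamorphic relation."""
    return [x for x in inputs if predict(x) != predict(perturb(x))]

def neutral_suffix(text: str) -> str:
    return text + " By the way, I watched it on Tuesday."

broken = invariance_test(
    ["The food was good.", "Service was slow."],
    neutral_suffix,
    predict_label,
)
print(f"{len(broken)} violations")  # 0 violations
```

Because the relation is defined over input pairs rather than labeled examples, no gold annotations are needed, which is what makes this style of testing attractive for deployed NLP systems.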