Testing DNN image classifiers for confusion & bias errors

Tian, Yun; Zhong, Ziyuan; Ordóñez, Vicente; Kaiser, Gail E.; Ray, Baishakhi

doi:10.1145/3377811.3380400

Cited by 33 publications

(31 citation statements)

References 48 publications

“…We further manually classified these 37 papers into three categories: autonomous driving testing [7, 8, 26–33] contains papers related to testing and validation of ADSs. deep learning testing [9, 34–53] includes papers about testing methods and criteria for DL-based systems. deep-learning debugging and repair [54–59] includes papers referring to the deployment and bug fixing of DL systems.…”

Section: Literature Review Methodology (mentioning)

confidence: 99%

Investigation into the state-of-the-practice autonomous driving testing

Lou, Deng, Zheng et al. 2021

Preprint

Autonomous driving shows great potential to reform modern transportation, and its safety is attracting much attention from the public. Autonomous driving systems generally include deep neural networks (DNNs) for gaining better performance (e.g., accuracy on object detection and trajectory prediction). However, compared with traditional software systems, this new paradigm (i.e., program + DNNs) makes software testing more difficult. Recently, the software engineering community has spent significant effort in developing new testing methods for autonomous driving systems. However, it is not clear to what extent those testing methods have addressed the needs of industrial practitioners of autonomous driving. To fill this gap, in this paper, we present the first comprehensive study to identify the current practices and needs of testing autonomous driving systems in industry. We conducted semi-structured interviews with developers from 10 autonomous driving companies and surveyed 100 developers who have worked on autonomous driving systems. Through thematic analysis of interview and questionnaire data, we identified five urgent needs of testing autonomous driving systems from industry. We further analyzed the limitations of existing testing methods to address those needs and proposed several future directions for software testing researchers.

“…Zhang et al [78] and Aggarwal et al [2] attempted to test model fairness. Tian et al [69] are focused on testing the confusion and bias errors in DNNs. In this work, DNN testing is used for model reuse detection.…”

Section: Test Input Generation For DNN (mentioning)

confidence: 99%

ModelDiff: testing-based DNN similarity comparison for model reuse detection

Zhang, Liu et al. 2021

Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

The knowledge of a deep learning model may be transferred to a student model, leading to intellectual property infringement or vulnerability propagation. Detecting such knowledge reuse is nontrivial because the suspect models may not be white-box accessible and/or may serve different tasks. In this paper, we propose ModelDiff, a testing-based approach to deep learning model similarity comparison. Instead of directly comparing the weights, activations, or outputs of two models, we compare their behavioral patterns on the same set of test inputs. Specifically, the behavioral pattern of a model is represented as a decision distance vector (DDV), in which each element is the distance between the model's reactions to a pair of inputs. The knowledge similarity between two models is measured with the cosine similarity between their DDVs. To evaluate ModelDiff, we created a benchmark that contains 144 pairs of models that cover most popular model reuse methods, including transfer learning, model compression, and model stealing. Our method achieved 91.7% correctness on the benchmark, which demonstrates the effectiveness of using ModelDiff for model reuse detection. A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models. CCS CONCEPTS • Security and privacy → Software and application security; Digital rights management; • Software and its engineering → Software post-development issues.
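
The DDV comparison described in this abstract can be illustrated with a minimal sketch. It is not the ModelDiff implementation: the abstract does not say which distance is used between a model's reactions to two inputs, so cosine distance between output vectors is an assumption here, and the toy models (`teacher`, `scaled_copy`, `unrelated`), the input pairs, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], List[float]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decision_distance_vector(
    model: Model, input_pairs: Sequence[Tuple[Vector, Vector]]
) -> List[float]:
    """One element per input pair: the distance between the model's outputs
    on the two inputs. Cosine distance is an assumed choice of metric."""
    return [1.0 - cosine_similarity(model(x1), model(x2)) for x1, x2 in input_pairs]


def model_similarity(
    model_a: Model, model_b: Model, input_pairs: Sequence[Tuple[Vector, Vector]]
) -> float:
    """Cosine similarity between the two models' DDVs on the same input pairs."""
    ddv_a = decision_distance_vector(model_a, input_pairs)
    ddv_b = decision_distance_vector(model_b, input_pairs)
    return cosine_similarity(ddv_a, ddv_b)


# Hypothetical toy "models" standing in for real DNNs.
def teacher(x: Vector) -> List[float]:
    return [x[0] + x[1], x[0] - x[1]]


def scaled_copy(x: Vector) -> List[float]:
    # Same behavior as teacher up to a positive scale factor.
    return [2.0 * o for o in teacher(x)]


def unrelated(x: Vector) -> List[float]:
    return [x[0] * x[0], x[1]]


# Hypothetical test input pairs shared by all models.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 2.0], [2.0, 1.0])]

print("teacher vs scaled_copy:", model_similarity(teacher, scaled_copy, pairs))
print("teacher vs unrelated:  ", model_similarity(teacher, unrelated, pairs))
```

Because the comparison uses only model outputs on shared inputs, a sketch like this needs no access to weights or activations, which matches the black-box setting the abstract describes.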

“…Fairness Testing: Recent approaches on fairness testing [2], [12], [32], [33], [40] are not directly applicable for fairness testing of NLP software. These approaches are mostly focused on the (causal) fairness testing of credit rating or computer vision systems.…”

Section: Related Work (mentioning)

confidence: 99%

Astraea: Grammar-based Fairness Testing

Soremekun, Udeshi, Chattopadhyay 2020

Preprint

Software often produces biased outputs. In particular, machine learning (ML) based software are known to produce erroneous predictions when processing discriminatory inputs. Such unfair program behavior can be caused by societal bias. In the last few years, Amazon, Microsoft and Google have provided software services that produce unfair outputs, mostly due to societal bias (e.g. gender or race). In such events, developers are saddled with the task of conducting fairness testing. Fairness testing is challenging; developers are tasked with generating discriminatory inputs that reveal and explain biases. We propose a grammar-based fairness testing approach (called ASTRAEA) which leverages context-free grammars to generate discriminatory inputs that reveal fairness violations in software systems. Using probabilistic grammars, ASTRAEA also provides fault diagnosis by isolating the cause of observed software bias. ASTRAEA's diagnoses facilitate the improvement of ML fairness. ASTRAEA was evaluated on 18 software systems that provide three major natural language processing (NLP) services. In our evaluation, ASTRAEA generated fairness violations at a rate of about 18%. ASTRAEA generated over 573K discriminatory test cases and found over 102K fairness violations. Furthermore, ASTRAEA improves software fairness by about 76% via model retraining, on average.
