Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020)
DOI: 10.1145/3368089.3409754

Is neuron coverage a meaningful measure for testing deep neural networks?

Abstract: Recent effort to test deep learning systems has produced an intuitive and compelling test criterion called neuron coverage (NC), which resembles the notion of traditional code coverage. NC measures the proportion of neurons activated in a neural network, and it is implicitly assumed that increasing NC improves the quality of a test suite. In an attempt to automatically generate a test suite that increases NC, we design a novel diversity-promoting regularizer that can be plugged into existing adversarial attack …
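To make the coverage definition concrete, below is a minimal sketch of how neuron coverage is commonly computed: a neuron counts as covered if its activation exceeds a threshold on at least one input of the test suite. The toy random-weight network, the layer shapes, and the threshold of 0.0 are illustrative assumptions, not the paper's actual models or settings.

```python
import numpy as np

# Illustrative assumption: a tiny fully connected ReLU network with random
# weights stands in for the model under test; the activation threshold
# (0.0, i.e. "the neuron fired") is likewise an assumption.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 32)), rng.standard_normal((32, 16))]

def hidden_activations(x):
    """Return the post-ReLU activations of every hidden neuron for one input."""
    acts, h = [], x
    for w in weights:
        h = np.maximum(h @ w, 0.0)   # ReLU layer
        acts.append(h)
    return np.concatenate(acts)      # flatten all hidden neurons

def neuron_coverage(test_suite, threshold=0.0):
    """Fraction of neurons whose activation exceeds `threshold`
    on at least one input of the test suite."""
    covered = None
    for x in test_suite:
        fired = hidden_activations(x) > threshold
        covered = fired if covered is None else (covered | fired)
    return covered.mean()

# Usage: neuron coverage of a random 100-input test suite.
suite = rng.standard_normal((100, 64))
print(f"neuron coverage: {neuron_coverage(suite):.2%}")
```

The quantity is a single ratio over the whole test suite, which is why generating inputs that each activate previously uncovered neurons (e.g. via the diversity-promoting regularizer mentioned in the abstract) is the natural way to drive it upward.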

Cited by 124 publications (100 citation statements); references 35 publications.
“…Adversarial attacks often achieve higher robustness improvement than all three neuron coverage-guided fuzzing algorithms for simpler datasets such as MNIST, Fashion-MNIST and SVHN. This casts a shadow on the usefulness of the test cases generated by neuron coverage-guided fuzzing algorithms in improving model robustness and is consistent with [6], [13], [23].…”
Section: RQ2: How Effective Is Our FOL Metric for Test Case Selection? (supporting)
confidence: 62%
“…Along with the testing metrics, many test case generation algorithms are also proposed, including gradient-guided perturbation [30], [46], black-box [42] and metric-guided fuzzing [12], [21], [43]. However, these testing works lack rigorous evaluation of their usefulness in improving model robustness (although most of them claim so) and have been shown to be ineffective in multiple recent works [6], [13], [23]. Multiple metrics have been proposed in the machine learning community to quantify the robustness of DL models as well [2], [40], [41], [44].…”
Section: Related Work (mentioning)
confidence: 99%
“…Due to the popularity of DL models and the critical importance of their reliability, a growing body of research effort has been dedicated to testing DL models, with a focus on adversarial attacks [14,21,32,[46][47][48] for model robustness, the discussion of various metrics for DL model testing [36,39,43,52,69], and testing DL models for specific applications [63,71,78]. Meanwhile, both running and testing DL models inevitably involve the underlying DL libraries, which serve as central pieces of infrastructure for building, training, optimizing and deploying DL models.…”
Section: Introduction (mentioning)
confidence: 99%