2023
DOI: 10.1109/tse.2023.3243522
Black-Box Testing of Deep Neural Networks through Test Case Diversity

Abstract: Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNNs. Despite very active research on DNN coverage, several rece…

Cited by 24 publications (17 citation statements) | References 86 publications
“…This adaptive selection strategy for test cases has proven effective for numerical input spaces [13]. For the input space of DL systems, its applicability was also confirmed by some recent studies [20-22]…”
Section: Lightweight ART for DL Systems (supporting)
confidence: 53%
“…[13] For the input space of DL systems, its applicability was also confirmed by some recent studies [20-22]. Diversity, as a fundamental principle of test case selection [14], can be expressed in various ways. Categorization is one such approach.…”
Section: The Strategy of Test Case Selection (mentioning)
confidence: 99%
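The adaptive selection strategy quoted above follows the idea of adaptive random testing: among a pool of random candidates, execute the one farthest from all tests run so far. A minimal sketch for a numerical input space is given below; the 2-D inputs, candidate-pool size, and Euclidean distance are illustrative assumptions, not details taken from the cited studies.

```python
import math
import random

def adaptive_random_select(candidates, executed, k=10):
    """Adaptive random testing (sketch): sample up to k random candidates
    and return the one with the largest minimum Euclidean distance to the
    already-executed tests, i.e. the most "novel" candidate."""
    pool = random.sample(candidates, min(k, len(candidates)))

    def min_dist(c):
        # Distance to the nearest executed test; infinite if none run yet.
        return min(math.dist(c, e) for e in executed) if executed else float("inf")

    return max(pool, key=min_dist)

# Illustrative usage on assumed 2-D numerical inputs.
candidates = [(0.1, 0.1), (0.9, 0.9), (0.45, 0.5)]
executed = [(0.0, 0.0)]
next_test = adaptive_random_select(candidates, executed, k=3)
```

With all three candidates in the pool, the point farthest from the executed test at the origin, (0.9, 0.9), is selected next.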
“…Furthermore, the performance of most coverage measures is assessed using adversarial inputs, thus focussing on the robustness of the model instead of correctness. Irrespective of the claimed sensitivity of these measures to adversarial inputs, studies have failed to find a significant correlation between these coverage measures and their fault detection capability [5,18,48].…”
mentioning
confidence: 99%
“…Another widely used adequacy measure for traditional and AI-based systems is test suite diversity computed on test inputs or outputs [5,13,29,50]. The diversity metrics are designed based on the intuition that similar test cases exercise similar parts of the source code or training examples, thus revealing the same faults.…”
mentioning
confidence: 99%
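One simple way to instantiate the diversity intuition above is to score a test suite by the average pairwise distance between its test inputs: the more spread out the inputs, the more likely they exercise different behaviours. The sketch below uses numeric feature vectors and Euclidean distance as illustrative assumptions; the cited works use various (often richer) diversity metrics.

```python
import itertools
import math

def avg_pairwise_distance(suite):
    """Diversity score (sketch): mean Euclidean distance over all pairs of
    test inputs in the suite. Higher values indicate a more diverse suite;
    a suite of identical inputs scores 0."""
    pairs = list(itertools.combinations(suite, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Illustrative usage: a spread-out pair vs. a degenerate suite.
diverse_score = avg_pairwise_distance([(0.0, 0.0), (3.0, 4.0)])
degenerate_score = avg_pairwise_distance([(1.0, 1.0), (1.0, 1.0)])
```

Such a score can be used directly for test selection, e.g. greedily adding the input that most increases suite diversity under a test budget.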