Aggressive random testing tools ("fuzzers") are impressively effective at finding compiler bugs. For example, a single test-case generator has resulted in more than 1,700 bugs reported for a single JavaScript engine. However, fuzzers can be frustrating to use: they indiscriminately and repeatedly find bugs that may not be severe enough to fix right away. Currently, users filter out undesirable test cases using ad hoc methods such as disallowing problematic features in tests and grepping test results. This paper formulates and addresses the fuzzer taming problem: given a potentially large number of random test cases that trigger failures, order them such that diverse, interesting test cases are highly ranked. Our evaluation shows our ability to solve the fuzzer taming problem for 3,799 test cases triggering 46 bugs in a C compiler and 2,603 test cases triggering 28 bugs in a JavaScript engine.
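One plausible way to realize this kind of ranking is a furthest-point-first ordering over a pairwise distance between test cases, so that the most dissimilar (and thus likely bug-diverse) cases surface early. The sketch below uses a token-level edit distance; both the distance metric and the function names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: rank failing test cases for fuzzer taming so that
# dissimilar cases appear near the top of the list.

def distance(a: str, b: str) -> int:
    """Token-level edit distance between two test cases (assumed metric)."""
    ta, tb = a.split(), b.split()
    prev = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        cur = [i]
        for j, y in enumerate(tb, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def furthest_point_first(tests: list[str]) -> list[str]:
    """Greedy ordering: each next test maximizes its minimum distance
    to the tests already ranked, keeping the top of the list diverse."""
    if not tests:
        return []
    ranked, remaining = [tests[0]], tests[1:]
    while remaining:
        nxt = max(remaining, key=lambda t: min(distance(t, r) for r in ranked))
        ranked.append(nxt)
        remaining.remove(nxt)
    return ranked
```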
A variety of ternary nanoheterostructures composed of Pt nanoparticles (NPs), SnOx species, and anatase TiO2 are elaborately designed to explore the effect of interfacial electron transfer on photocatalytic H2 evolution from a biofuel-water solution. Among the numerous factors controlling H2 evolution, the significance of the Pt sites is highlighted by tuning the loading procedure of Pt NPs and SnOx species over TiO2. A synergistic enhancement of H2 evolution can be achieved over the Pt/SnOx/TiO2 heterostructures formed by anchoring Pt NPs at atomically isolated Sn-oxo sites, whereas the Pt/TiO2/SnOx counterparts prepared by grafting single-site Sn-oxo species on Pt/TiO2 show a marked decrease in the rate of H2 evolution. The characterization results clearly reveal that the synergy of Pt NPs and SnOx species originates from the vectorial electron transfer TiO2 → SnOx → Pt occurring on the former, while the decrease observed for the latter results from competitive electron transfer from TiO2 to both SnOx and Pt NPs.
How do you test a program when only a single user, with no expertise in software testing, is able to determine if the program is performing correctly? Such programs are common today in the form of machine-learned classifiers. We consider the problem of testing this common kind of machine-generated program when the only oracle is an end user: e.g., only you can determine if your email is properly filed. We present test selection methods that provide very good failure rates even for small test suites, and show that these methods work both in large-scale random experiments using a "gold standard" and in studies with real users. Our methods are inexpensive and largely algorithm-independent. Key to our methods is an exploitation of properties of classifiers that is not possible in traditional software testing. Our results suggest that it is plausible for time-pressured end users to interactively detect failures, even very hard-to-find failures, without wading through a large number of successful (and thus less useful) tests. We additionally show that some methods are able to find the arguably most difficult-to-detect faults of classifiers: cases where machine learning algorithms have high confidence in an incorrect result.
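As one hedged illustration of how classifier properties can drive test selection, the sketch below ranks unlabeled items by the classifier's confidence in its own prediction and asks the user to label the least confident ones first. The `predict_proba` interface follows scikit-learn conventions and is assumed here for illustration; this is one possible strategy, not necessarily the paper's method.

```python
# Hedged sketch: confidence-based test selection for an end-user oracle.
# Assumes a scikit-learn-style classifier exposing predict_proba.

def select_tests(classifier, unlabeled, budget):
    """Return `budget` items with the lowest top-class probability,
    i.e., the items the classifier is least confident about."""
    scored = []
    for item in unlabeled:
        probs = classifier.predict_proba([item])[0]
        scored.append((max(probs), item))   # confidence in predicted class
    scored.sort(key=lambda pair: pair[0])   # least confident first
    return [item for _, item in scored[:budget]]
```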
A fundamental question in software testing research is how to compare test suites, often as a means for comparing the test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible. This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2, …, Tn have coverage values c1, c2, …, cn for C and c′1, c′2, …, c′n for C′, is it better to compare suites based on c1, c2, …, cn or based on c′1, c′2, …, c′n? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria, as well as for other researchers evaluating coverage criteria for research use.
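The evaluation question above can be made concrete with a small, hedged sketch: given per-suite coverage under two criteria and a ground-truth effectiveness measure such as mutation score, compare how strongly each criterion's coverage values correlate with effectiveness. The numbers below are purely illustrative, and Kendall's tau is just one of several correlation measures such a study might use.

```python
# Hedged sketch: which criterion's coverage better predicts suite
# effectiveness? All data below is made up for illustration.
from scipy.stats import kendalltau

branch_cov    = [0.42, 0.55, 0.61, 0.70, 0.73]  # coverage under criterion C
path_cov      = [0.12, 0.25, 0.22, 0.40, 0.38]  # coverage under criterion C'
effectiveness = [0.30, 0.48, 0.50, 0.66, 0.64]  # e.g., fraction of mutants killed

tau_c, _  = kendalltau(branch_cov, effectiveness)
tau_cp, _ = kendalltau(path_cov, effectiveness)
print(f"criterion C  vs effectiveness: tau = {tau_c:.2f}")
print(f"criterion C' vs effectiveness: tau = {tau_cp:.2f}")
# The criterion whose coverage correlates more strongly with
# effectiveness is the better one to use when comparing suites.
```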
In random testing, it is often desirable to produce a "quick test": an extremely inexpensive test suite that can serve as a frequently applied regression suite and allow the benefits of random testing to be obtained even in very slow or oversubscribed test environments. Delta debugging is an algorithm that, given a failing test case, produces a smaller test case that also fails and typically executes much more quickly. Delta debugging of random tests can produce effective regression suites for previously detected faults, but such suites often have little power for detecting new faults, and in some cases provide poor code coverage. This paper proposes extending delta debugging by simplifying tests with respect to code coverage, an instance of a generalization of delta debugging that we call cause reduction. We show that test suites reduced in this fashion can provide very effective quick tests for real-world programs. For Mozilla's SpiderMonkey JavaScript engine, the reduced suite is more effective at finding software faults, even if its reduced runtime is not considered. The effectiveness of a reduction-based quick test persists through major changes to the software under test.
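A hedged sketch of the cause-reduction idea follows: a standard delta-debugging-style loop in which the preserved property is the test's code coverage rather than its failure. The helper `run_with_coverage` is hypothetical and stands in for executing the program under test and collecting its covered statements; the simplified loop below is not the paper's exact algorithm.

```python
# Hedged sketch: delta debugging generalized to preserve coverage
# ("cause reduction") instead of preserving failure.

def ddmin(test, preserves):
    """Greedily shrink `test` (a list of parts) while `preserves` holds."""
    n = 2
    while len(test) >= 2:
        chunk = max(1, len(test) // n)
        reduced = False
        for i in range(0, len(test), chunk):
            candidate = test[:i] + test[i + chunk:]
            if candidate and preserves(candidate):
                test, n, reduced = candidate, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(test):
                break
            n = min(n * 2, len(test))
    return test

def cause_reduce(test, run_with_coverage):
    """Reduce `test` while keeping (at least) the coverage it achieves."""
    target = run_with_coverage(test)          # set of covered statements
    return ddmin(test, lambda t: run_with_coverage(t) >= target)
```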
Scaling symbolic execution to large programs or programs with complex inputs remains difficult due to path explosion and complex constraints, as well as external method calls. Additionally, creating an effective test structure with symbolic inputs can be difficult. A popular symbolic execution strategy in practice is to perform symbolic execution not "from scratch" but based on existing test cases. This paper proposes that the effectiveness of this approach to symbolic execution can be enhanced by (1) reducing the size of seed test cases and (2) prioritizing seed test cases to maximize exploration efficiency. The proposed test case reduction strategy is based on a recently introduced generalization of delta debugging, and our prioritization techniques include novel methods that, for this purpose, can outperform some traditional regression testing algorithms. We show that applying these methods can significantly improve the effectiveness of symbolic execution based on existing test cases.
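For illustration, the sketch below shows a traditional additional-coverage greedy prioritizer of the kind the abstract compares against, ordering (already reduced) seed tests by how much new coverage each contributes before symbolic execution starts from them. `coverage_of` is a hypothetical helper mapping a seed to its set of covered branches; the paper's own novel prioritization methods may differ.

```python
# Hedged sketch: greedy additional-coverage prioritization of seed tests,
# a traditional regression-testing baseline rather than the paper's method.

def prioritize_seeds(seeds, coverage_of):
    """Order seeds so each next seed adds the most not-yet-covered branches."""
    covered, ordered, remaining = set(), [], list(seeds)
    while remaining:
        best = max(remaining, key=lambda s: len(coverage_of(s) - covered))
        ordered.append(best)
        covered |= coverage_of(best)
        remaining.remove(best)
    return ordered
```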