Deep neural networks (DNNs) are increasingly being adopted for sensing and control functions in a variety of safety- and mission-critical systems such as self-driving cars, autonomous air vehicles, medical diagnostics, and industrial robotics. Failures of such systems can lead to loss of life or property, which necessitates stringent verification and validation to provide high assurance. Though formal verification approaches are being investigated, testing remains the primary technique for assessing the dependability of such systems. Due to the nature of the tasks handled by DNNs, the cost of obtaining test oracle data (the expected output, a.k.a. label, for a given input) is high, which significantly limits the amount and quality of testing that can be performed. Thus, prioritizing input data for testing DNNs in meaningful ways to reduce the cost of labeling can go a long way in increasing testing efficacy. This paper proposes using gauges of the DNN's sentiment, derived from the computation performed by the model, as a means to identify inputs that are likely to reveal weaknesses. We empirically assess the efficacy of three such sentiment measures for prioritization (confidence, uncertainty, and surprise) and compare their effectiveness in terms of their fault-revealing capability and retraining effectiveness. The results indicate that sentiment measures can effectively flag inputs that expose unacceptable DNN behavior. For MNIST models, the average percentage of inputs correctly flagged ranged from 88% to 94.8%.
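To make the prioritization idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that "confidence" can be approximated by the highest softmax probability of a classifier's output, which is one common proxy; the abstract does not give the exact definitions of the three measures. The names `softmax`, `confidence`, `prioritize`, and `example_logits` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Assumed proxy for confidence: the highest class probability."""
    return max(softmax(logits))


def prioritize(batch):
    """Return input indices ordered from least to most confident.

    Low-confidence inputs come first, so they are the first candidates
    for manual labeling.
    """
    return sorted(range(len(batch)), key=lambda i: confidence(batch[i]))


# Hypothetical model outputs for three unlabeled inputs.
example_logits = [
    [4.0, 0.5, 0.1],  # clearly favors class 0
    [1.0, 1.1, 0.9],  # nearly uniform, so low confidence
    [0.2, 2.5, 0.3],  # fairly clearly favors class 1
]
order = prioritize(example_logits)
print(order)  # [1, 2, 0]
```

Under this assumption, the nearly uniform output is ranked first for labeling. An uncertainty or surprise measure could be swapped in by replacing the `key` function.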
Good tests are important in software development, but it can be hard to tell whether tests will reveal future faults that are themselves unknown. Mutation analysis, which checks whether tests reveal inserted changes in a program, is a strong measure of test suite adequacy, but common source- or compiler-level approaches to mutation testing are not applicable to software available only in binary form. We explore mutation analysis as an application of the reassembleable disassembly approach to binary rewriting, building a tool for x86 binaries on top of the previously developed Uroboros system, and apply it to the C benchmarks from SPEC CPU 2006 and to five examples of embedded control software. The results demonstrate that our approach works effectively across these software domains: as expected, tests designed for performance benchmarking reveal fewer mutants than tests generated to achieve high code coverage, but mutation scores also reflect differences in test origins and in features such as code size and fault tolerance. Our binary-level tool also achieves results comparable to source-level mutation analysis despite supporting a more limited set of mutation operators. More generally, we argue that our experience shows how reassembleable disassembly is a valuable approach for constructing novel binary rewriting tools.
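The notion of a mutation score can be illustrated with a small sketch. This is a source-level toy in Python, not the binary-level tool described above: it only shows how a score is computed as the fraction of mutants that a test suite detects ("kills"). The function `original`, the three mutants, and the test suites are all hypothetical.

```python
def original(a, b):
    """Program under test: multiply when a is positive, else return b."""
    return a * b if a > 0 else b


# Each mutant applies one small change to the original program.
mutants = {
    "mul_to_add": lambda a, b: a + b if a > 0 else b,
    "gt_to_ge": lambda a, b: a * b if a >= 0 else b,
    "gt_to_lt": lambda a, b: a * b if a < 0 else b,
}

# Test suite as (arguments, expected result) pairs.
tests = [((2, 3), 6), ((-1, 5), 5)]


def passes(func, suite):
    """True if func returns the expected result for every test."""
    return all(func(*args) == expected for args, expected in suite)


def mutation_score(candidates, suite):
    """Return the fraction of mutants killed and the names of those killed."""
    killed = [name for name, m in candidates.items() if not passes(m, suite)]
    return len(killed) / len(candidates), killed


score, killed = mutation_score(mutants, tests)

# Adding a boundary test at a == 0 also kills the surviving mutant.
stronger_tests = tests + [((0, 4), 4)]
stronger_score, _ = mutation_score(mutants, stronger_tests)
```

Here the first suite kills two of three mutants, and the boundary test raises the score to 1.0, which mirrors the observation that coverage-oriented tests tend to reveal more mutants.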
Path-merging is a known technique for accelerating symbolic execution. One such technique, named “veritesting” by Avgerinos et al., uses summaries of bounded control-flow regions and has been shown to accelerate symbolic execution of binary code. However, when applied to symbolic execution of Java code, veritesting needs to be extended to summarize dynamically dispatched methods and exceptional control flow. Such an extension of veritesting has been implemented in Java Ranger, which is built as an extension of Symbolic PathFinder, a symbolic executor for Java bytecode. In this paper, we briefly describe the architecture of Java Ranger and describe its setup for SV-COMP 2020.
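The benefit of path-merging can be sketched with a toy model. This is an illustration under simplifying assumptions, not Java Ranger's or Symbolic PathFinder's actual design: it contrasts forking at every two-way branch with summarizing an if/else region as a single if-then-else term. All names (`fork_paths`, `merge_region`, `evaluate`) are hypothetical.

```python
def fork_paths(n_branches):
    """Count the paths explored when every two-way branch forks the state."""
    paths = [()]
    for _ in range(n_branches):
        paths = [p + (taken,) for p in paths for taken in (True, False)]
    return len(paths)


def merge_region(cond, then_val, else_val):
    """Summarize an if/else region as one if-then-else (ite) term."""
    return ("ite", cond, then_val, else_val)


def evaluate(term, env):
    """Evaluate a summary term under a concrete assignment of inputs."""
    if isinstance(term, tuple) and term[0] == "ite":
        _, cond, then_val, else_val = term
        chosen = then_val if cond(env) else else_val
        return evaluate(chosen, env)
    return term


# Region "if (x > 0) y = 1; else y = 2;" becomes a single term for y,
# so one execution path carries both outcomes instead of two forked paths.
summary = merge_region(lambda env: env["x"] > 0, 1, 2)

forked = fork_paths(10)  # ten sequential branches without merging
```

With ten sequential branches, forking explores 1024 paths, whereas merging each region keeps a single path whose values are `ite` terms. A real executor would hand such terms to a constraint solver rather than evaluate them concretely.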