2022
DOI: 10.48550/arxiv.2201.06589
Preprint

Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source

Abstract: Deep learning (DL) systems can make our lives much easier and are thus gaining more and more attention from both academia and industry. Meanwhile, bugs in DL systems can be disastrous and can even threaten human lives in safety-critical applications. To date, a huge body of research effort has been dedicated to testing DL models. However, interestingly, there is still limited work on testing the underlying DL libraries, which are the foundation for building, optimizing, and running DL models. One potent…

Cited by 4 publications (11 citation statements)
References 43 publications (71 reference statements)
“…Metrics. We mainly target the following metrics for evaluation: • Code coverage: Following prior fuzzing work [7,8,59], we trace source-level branch coverage for both the entire systems and their pass-only components, measuring 1) total coverage, which counts all hit branches; and 2) unique coverage, which counts unique branches ("hard" branches) that other baselines cannot cover. • Bug counting: Following prior work [29,59,60], we use the number of independent patches as the number of detected bugs, except that we directly count the number of bug reports for closed-source systems (i.e., TensorRT) and unfixed ones.…”
Section: Methods (mentioning)
confidence: 99%
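The two coverage metrics quoted above can be computed directly once each fuzzer's run is summarized as a set of covered branches. The sketch below is illustrative only, assuming branches are identified as (file, branch_id) pairs collected by a tool such as gcov or llvm-cov; none of the names come from the cited paper.

```python
# Hedged sketch of "total coverage" and "unique coverage" as described in the
# quoted citation statement. Branch identifiers and inputs are hypothetical.

def total_coverage(branches: set) -> int:
    """Total coverage: count all branches hit by one tool."""
    return len(branches)

def unique_coverage(tool_branches: set, baseline_branches: list) -> int:
    """Unique coverage: branches ("hard" branches) hit by this tool
    but missed by every baseline."""
    covered_by_baselines = set().union(*baseline_branches) if baseline_branches else set()
    return len(tool_branches - covered_by_baselines)

# Illustrative usage with made-up branch sets.
ours = {("pass_a.cc", 3), ("pass_a.cc", 7), ("pass_b.cc", 1)}
baseline1 = {("pass_a.cc", 3)}
baseline2 = {("pass_b.cc", 1), ("pass_c.cc", 5)}

print(total_coverage(ours))                           # 3 branches hit in total
print(unique_coverage(ours, [baseline1, baseline2]))  # 1 branch no baseline covers
```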
“…There are several challenges facing the basic approach of fuzzing and differential testing, which are not addressed by prior work [33,57,59]. Next, we illustrate these challenges using concrete examples.…”
Section: Challenges In Finding DL Compiler Bugs (mentioning)
confidence: 99%
“…The functionality of these equivalent graphs is verified to be identical. FreeFuzz takes the approach of mining usages of functions from open source code and carries out differential testing between similar concepts (e.g., CPU vs GPU computation should yield the same results) [33]. EAGLE and FreeFuzz are able to detect bugs, but another issue with debugging machine learning libraries is that they are frequently updated, and thus the expected functionality of various methods may change over time.…”
Section: Machine Learning Libraries (mentioning)
confidence: 99%
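The CPU-vs-GPU oracle attributed to FreeFuzz in the statement above can be illustrated with a short differential-testing harness: run the same API call on both backends with identical inputs and flag results that diverge beyond a numerical tolerance. This is only a sketch of the general idea, not FreeFuzz's actual implementation; the chosen APIs, tolerances, and helper names are assumptions.

```python
# Hedged sketch of CPU-vs-GPU differential testing for a DL library (PyTorch
# used here for illustration). Any mismatch beyond tolerance is flagged as a
# potential inconsistency bug between the two backends.

import torch


def differential_test(api, *args, rtol=1e-4, atol=1e-4):
    """Run `api` on CPU and GPU with identical tensor inputs and compare."""
    cpu_out = api(*[a.to("cpu") if torch.is_tensor(a) else a for a in args])
    gpu_out = api(*[a.to("cuda") if torch.is_tensor(a) else a for a in args])
    if not torch.allclose(cpu_out, gpu_out.cpu(), rtol=rtol, atol=atol):
        raise AssertionError(f"CPU/GPU result mismatch for {api.__name__}")


if torch.cuda.is_available():
    x = torch.randn(4, 8)
    # Both backends should yield (numerically) the same results.
    differential_test(torch.nn.functional.relu, x)
    differential_test(torch.tanh, x)
```

The tolerance parameters matter in practice: floating-point results legitimately differ slightly across backends, so an oracle like this must distinguish acceptable numerical noise from genuine inconsistency bugs.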