Thousands of vulnerabilities are discovered in programs every day, which is extremely harmful to software security. Thus, discovering vulnerabilities in projects has become a central issue. With the sustained growth of software complexity and code size, manual code auditing has become time-consuming and labor-intensive. With more open source programs available and a high degree of code formalization, it is possible to learn features from source code to guide vulnerability discovery. In this paper, we present a lightweight-assisted vulnerability discovery method using a deep neural network (LAVDNN) to detect weaknesses and to provide guidance for manual auditing. The method leverages function names as semantic features to uncover weak functions in large-scale open source programs. First, we extract function names and classify them into weak and benign datasets. Then, we construct deep neural networks and compare the performance of different models. According to the experimental results, our method performs well for both C/C++ and Python programs, with the F2-score reaching 0.91 and 0.915, respectively. Finally, we evaluate the method by comparing it with other approaches on the libraries FFmpeg 0.6 and LibTIFF 4.0.6. The results show that LAVDNN can narrow the range of functions to be analyzed and report more weak functions without any prior vulnerability information. As a lightweight-assisted tool, LAVDNN significantly reduces the false positive rate and rarely misses weak functions.
INDEX TERMS Code auditing, deep neural network, source code, vulnerability discovery.
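The idea of treating function names as semantic features can be illustrated with a minimal sketch. The abstract does not say how names are preprocessed, so the splitting rules below (separators, camelCase boundaries, digit runs) and the helper name `split_function_name` are assumptions made for illustration, not the LAVDNN pipeline.

```python
import re
from typing import List


def split_function_name(name: str) -> List[str]:
    """Split a C/C++ or Python function name into lowercase word tokens.

    Hypothetical preprocessing step: the tokens could then be fed to a
    neural classifier that labels a function as weak or benign.
    """
    tokens: List[str] = []
    # First break on underscores and any other non-alphanumeric separator.
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then break on acronym runs, camelCase words, and digit runs.
        tokens.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
        )
    return [t.lower() for t in tokens]


if __name__ == "__main__":
    for example in ("av_read_frame", "TIFFReadDirectory", "__init__"):
        print(example, "->", split_function_name(example))
```

Splitting names into word-level tokens keeps the vocabulary small and lets names from different projects share tokens such as "read" or "parse"; a character-level encoding would be an equally plausible choice.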
Dynamic taint analysis is a popular dynamic software analysis method. Marking the key segment of a program function by dynamic taint analysis is an important part of software vulnerability research. Key segment marking usually relies on control flow taint analysis; however, several specific program structures may cause key segment marking to fail because of control flow dependence and the overtainting and undertainting problems. In this paper, we propose a novel method to mark a key segment accurately and efficiently with deep learning. First, we fit the program function execution to a continuous function with a convolutional network, and then roughly mark the key segment using derivative information of the fitted neural network. Finally, we mark the key segment of the specific program function completely and accurately with a filtering and diffusion algorithm. We developed the key segment marking tool NeuralTaint on this principle. We design an experiment to select the specific neural network structure of NeuralTaint. Our extensive evaluations demonstrate that NeuralTaint significantly outperforms two state-of-the-art traditional dynamic taint analysis tools on seven popular real-world programs.
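The rough marking step, which uses derivative information of a fitted continuous function, can be sketched as follows. This is only a toy under stated assumptions: `model` is a hypothetical stand-in for the fitted convolutional network, the derivative is approximated by central finite differences instead of backpropagation, and the threshold is arbitrary. The later filtering and diffusion step is not shown.

```python
import math
from typing import Callable, List, Sequence


def model(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a fitted, differentiable network.

    Only input positions 1 and 3 influence the output here.
    """
    weights = [0.0, 2.0, 0.0, -1.5]
    return math.tanh(sum(w * v for w, v in zip(weights, x)))


def input_gradient(
    f: Callable[[Sequence[float]], float], x: Sequence[float], eps: float = 1e-4
) -> List[float]:
    """Approximate df/dx_i for every input position by central differences."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


def rough_key_segment(
    f: Callable[[Sequence[float]], float], x: Sequence[float], threshold: float = 0.1
) -> List[int]:
    """Return input positions whose gradient magnitude exceeds the threshold."""
    return [i for i, g in enumerate(input_gradient(f, x)) if abs(g) > threshold]


if __name__ == "__main__":
    sample = [0.1, 0.2, 0.3, 0.1]  # normalized input bytes (illustrative)
    print(rough_key_segment(model, sample))
```

Positions with a near-zero derivative do not change the fitted output when perturbed, so they are left unmarked; this is the intuition behind using gradients in place of explicit control flow tracking.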
The number of software vulnerabilities is increasing year by year. In the era of big data, data-processing software with many users attracts more attention from hackers. It is essential to improve the efficiency of discovering vulnerabilities in data-processing software. We noticed that, in the process of discovering vulnerabilities, some problems of existing techniques such as fuzzing, symbolic execution, and taint analysis are more or less related to data-processing functions. In fuzzing, there are two types of sanity checks in the target program: NCC (non-critical check) and CC (critical check). It is usually challenging to bypass such sanity checks, which leads to low code coverage during fuzzing. In symbolic execution, the constraint solver still struggles with the constraints of complex algorithms. In taint analysis, over-tainting and under-tainting remain the key factors affecting the accuracy of the results. Therefore, to solve the above problems, it is necessary to identify data-processing functions. Based on identifying data-processing functions, we can identify those sanity checks, ease the solving of complex constraints, and understand how taint propagates, to assist in software vulnerability discovery and analysis. This paper proposes a method called DPFI (data-processing function identification) for identifying data-processing functions with deep neural networks. We collected 37000 functions from GitHub and implemented the method on the data set with several neural networks, among which the CNN performed best, with an F1-score of 0.90. We then applied the trained model to CGC (Cyber Grand Challenge) data and real software for testing. For CGC, we obtained 448 functions from 20 programs, of which 35 were identified as data-processing functions. For real software, such as FFmpeg, 7zip, and jpeg, the precision all reached 0.90 and the F1-score was above 0.87.
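For readers unfamiliar with the reported metrics, the sketch below computes precision, recall, and the F-beta score from confusion counts. The formulas are the standard definitions; the counts in the usage example are made up for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def fbeta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the F1-score; beta > 1 weights recall more heavily.
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


if __name__ == "__main__":
    # Illustrative counts only, not taken from the evaluation.
    p, r = precision_recall(tp=9, fp=1, fn=1)
    print(p, r, fbeta_score(p, r))
```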
The key segment of a program input is the specific part of the input that has a significant effect on the execution of a target function. Marking key segments plays an important role in software security analysis. Traditional dynamic analysis methods cannot mark key segments correctly because of the control flow dependency problem. The root cause of this problem is that implicit flow analysis cannot cover all the behavior of the code fragment in a branch, especially when the code snippet contains unexpected jump behavior. A neural network can learn to fit the behavior of the program given proper training data. In this paper, we introduce an attention-based neural network to mark the key segments of program input accurately and efficiently. We propose an attention-based two-part network structure and use this network to map program inputs to the execution of the target code. Then we propose a two-step training method that trains the network to calculate the importance of each input component to the execution of the target function. Finally, we mark the key segments by statistical analysis. We implement this method and develop a key segment marking tool, AttentionMark. Experiments on four real-world software programs show that AttentionMark outperforms NeuralTaint and a traditional dynamic analysis tool in key segment marking.
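A minimal sketch of turning per-component importance scores into marked key segments is given below. It assumes the network has already produced one raw attention score per input component; the softmax normalization is standard, but the mean-plus-k-standard-deviations cutoff is an assumed stand-in for the statistical analysis mentioned in the abstract, whose details are not given.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mark_key_segments(scores: Sequence[float], k: float = 1.0) -> List[int]:
    """Mark components whose attention weight exceeds mean + k * std.

    The cutoff rule is a hypothetical choice made for this sketch.
    """
    weights = softmax(scores)
    mean = sum(weights) / len(weights)
    var = sum((w - mean) ** 2 for w in weights) / len(weights)
    cutoff = mean + k * math.sqrt(var)
    return [i for i, w in enumerate(weights) if w > cutoff]


if __name__ == "__main__":
    # Illustrative raw scores for six input components.
    raw_scores = [0.1, 0.2, 3.0, 2.8, 0.0, 0.1]
    print(mark_key_segments(raw_scores))
```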