DIRE: A Neural Approach to Decompiled Identifier Naming

Lacomis, Jeremy; Yin, Pengcheng; Schwartz, Edward J.; Allamanis, Miltiadis; Le Goues, Claire; Neubig, Graham; Vasilescu, Bogdan

doi:10.1109/ase.2019.00064

Cited by 58 publications

(62 citation statements)

References 33 publications


“…Code completion [32,34,36,52] is one of the most widely explored topics. Machine learning models of code are also used to predict names of variables and functions [2,3,8,9,12], with applications to deobfuscation [51,59] and decompilation [22,30,41]. Significant effort has been made towards automatically generating documentation from code or vice versa [8,11,24,35].…”

Section: Related Work (mentioning)

confidence: 99%

Typilus: neural type hints

Allamanis, Barr, Ducousso, et al. 2020

Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation

Self Cite


Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program's structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace (a continuous relaxation of the discrete space of types) and how to embed the type properties of a symbol (i.e. identifier) into it. Importantly, our model can employ one-shot learning to predict an open vocabulary of types, including rare and user-defined ones. We realise our approach in Typilus for Python that combines the TypeSpace with an optional type checker. We show that Typilus accurately predicts types. Typilus confidently predicts types for 70% of all annotatable symbols; when it predicts a type, that type optionally type checks 95% of the time. Typilus can also find incorrect type annotations; two important and popular open source libraries, fairseq and allennlp, accepted our pull requests that fixed the annotation errors Typilus discovered.

CCS Concepts: • Computing methodologies → Machine learning; • Software and its engineering → Language features.


“…A binary function can have structural and textual properties. Being inspired by word2vec, different techniques have been proposed by Ding et al [17] and others [18], [19], [20], [21], [22], [23], [24] to model the structural and textual aspects of a binary function as an embedding vector or to compute the similarity using deep neural networks.…”

Section: Deep Learning (mentioning)

confidence: 99%

BinDiff_NN : Learning Distributed Representation of Assembly for Robust Binary Diffing Against Semantic Differences

Ullah 2022

IEEE Trans. Software Eng.


Binary diffing is the process of discovering the differences and similarities in functionality between two binary programs. Previous research approaches binary diffing as a function matching problem: an initial 1:1 mapping between functions is formulated, and a sequence matching ratio is then computed to classify two functions as an exact match, a partial match, or no match. The accuracy of existing techniques is best only when detecting exact matches, and they are not efficient in detecting partially changed functions, especially those with minor patches. These drawbacks are due to two major challenges: (i) in the 1:1 mapping phase, a strict policy is used to match function features; (ii) in the classification phase, an assembly snippet is treated as normal text, and sequence matching is used for similarity comparison. An instruction has a unique structure, i.e., mnemonics and registers have specific positions in an instruction and also have a semantic relationship, which makes assembly code different from general text. Sequence matching performs best for general text, but it fails to detect structural and semantic changes at the instruction level; thus, its use for classification produces many false results. In this research, we address these challenges with a two-fold solution. For the 1:1 mapping phase, we propose computationally inexpensive features, which are compared with distance-based selection criteria to map similar functions and filter unmatched functions. For the classification phase, we propose a Siamese binary-classification neural network in which each branch is an attention-based distributed learning embedding neural network that learns the semantic similarity among assembly instructions and learns to highlight the changes at the instruction level, and a final-stage fully connected layer learns to accurately classify two 1:1 mapped functions as either an exact or a partial match. We used x86 kernel binaries for training and achieved ∼99% classification accuracy, which is higher than existing binary diffing techniques and tools.
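The limitation of sequence matching that this abstract describes can be illustrated with a minimal sketch. It assumes Python's standard `difflib.SequenceMatcher` as a stand-in for the sequence matching ratio; the abstract does not say which matcher existing tools use. The assembly snippets and the helper name `matching_ratio` are hypothetical, invented only for illustration.

```python
from difflib import SequenceMatcher


def matching_ratio(func_a, func_b):
    """Return a similarity score in [0, 1] for two functions,
    each given as a list of instruction strings (hypothetical helper)."""
    # autojunk=False disables difflib's popularity heuristic so that
    # every instruction is compared as plain text.
    return SequenceMatcher(None, func_a, func_b, autojunk=False).ratio()


# Invented x86-style snippets, not taken from any real binary.
original = ["push rbp", "mov rbp, rsp", "mov eax, edi",
            "add eax, 1", "pop rbp", "ret"]

# Semantic change: the addition becomes a subtraction.
patched = ["push rbp", "mov rbp, rsp", "mov eax, edi",
           "sub eax, 1", "pop rbp", "ret"]

# Roughly equivalent rewrite: a different instruction computing the same value.
rewritten = ["push rbp", "mov rbp, rsp", "mov eax, edi",
             "inc eax", "pop rbp", "ret"]

print(matching_ratio(original, patched))
print(matching_ratio(original, rewritten))
```

Both comparisons produce the same ratio, because the matcher only sees that one of six text lines differs. It cannot tell that one edit changes the computed value while the other largely preserves it, which is the kind of instruction-level distinction the abstract says a learned embedding is meant to capture.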


“…So these conventional modeling methods need to be adjusted. DIRE [21] used both lexical information obtained from the tokenized code as well as structural information obtained from the corresponding ASTs to recover variable names. David et al [22] combined static analysis with encoder-decoder-based models to predict procedure names in stripped binaries.…”

Section: Related Work (mentioning)

confidence: 99%

Automated Data-Processing Function Identification Using Deep Neural Network

Kuang, Wang, et al. 2020

IEEE Access


The number of software vulnerabilities is increasing year by year. In the era of big data, data-processing software with many users attracts more attention from hackers. It is essential to improve the efficiency of discovering vulnerabilities in data-processing software. We noticed that, in the process of discovering vulnerabilities, some problems of existing techniques such as fuzzing, symbolic execution, and taint analysis are related to data-processing functions to a greater or lesser degree. In fuzzing, there are two types of sanity checks on the target program: NCC (non-critical check) and CC (critical check). It is usually challenging to bypass such sanity checks, which leads to low code coverage during fuzzing. In symbolic execution, the constraint solver still struggles with the constraints of complex algorithms. In taint analysis, over-tainting and under-tainting are the key factors affecting the accuracy of the results. Therefore, to solve the above problems, it is necessary to identify data-processing functions. By identifying data-processing functions, we can identify those sanity checks, ease the solving of complex constraints, and understand how taint propagates, to assist in software vulnerability discovery and analysis. This paper proposes a method called DPFI (data-processing function identification) for identifying data-processing functions with deep neural networks. We collected 37000 functions from GitHub and implemented the method on the data set with several neural networks, among which the CNN performed best, with an F1-score of 0.90. We then applied the trained model to CGC (Cyber Grand Challenge) data and real software for testing. For CGC, we got 448 functions in 20 programs, of which 35 were identified as data-processing functions. For real software, such as FFmpeg, 7zip, and jpeg, the precision rate all reached 0.90 and the F1-score was above 0.87.
