XDA: Accurate, Robust Disassembly with Transfer Learning

Pei, Kexin; Guan, Jonas; Williams-King, David; Yang, Junfeng; Jana, Suman

doi:10.48550/arxiv.2010.00770

Cited by 4 publications

(14 citation statements)

References 42 publications

“…We want to clarify that like [23], [24], [25], [26] we do not target the function boundary identification task. Ghidra & Hexrays already do this at 90%+ accuracy.…”

Section: Inline Function Recovery

mentioning

confidence: 99%

Learning to Find Usages of Library Functions in Optimized Binaries

Ahmed, Devanbu, Sawant

2021

Preprint

Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, are important and time-consuming tasks. There is a great deal of interest in tools to "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central optimization step in creating binaries is inlining functions. Recovering these inlined functions from optimized binaries is an essential task that most state-of-the-art decompiler tools try to do but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering inlined functions. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with inlined functions. We augment this large but limited labeled dataset with a pre-training step, which learns the decompiled code statistics from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in inlined function recovery, especially at higher levels of optimization.

“…Similarly, many previous studies leverage neural networks to learn binary or assembly code representation. They perform well on binary-based downstream analysis tasks, including code clone detection [39], malicious code detection [2], and disassembly [170]. The application of deep learning in software analysis and software reverse engineering significantly reduces human resources and time costs, no matter from the view of developers or analysts.…”

mentioning

confidence: 99%

“…The application of deep learning in software analysis and software reverse engineering significantly reduces human resources and time costs, no matter from the view of developers or analysts. In addition, compared to traditional tools, the faster speed of deep neural-based disassembly approaches [170] makes them a powerful engine for downstream models like malware classification. It is meaningful to study how to make neural network (NN) models work well in software reverse engineering and software analysis.…”

mentioning

confidence: 99%

Boosting Neural Networks to Decompile Optimized Binaries

Cao, Liang, Chen, et al. 2022

Proceedings of the 38th Annual Computer Security Applications Conference

Decompilation aims to transform a low-level program language (LPL) (e.g., binary file) into its functionally-equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.

“…[76], [115], [175], [176], [197], function boundary detection [32], [42], [59], [62], [176], [197], static similarity detection [49], [73], [99], [107], [109], [113], [126], [130], [160], [169], type recovery [19], and full decompilation [28], [44], [101]. Each of these capabilities is in turn crucial for downstream security tasks such as malware analysis [51], [67], [81], [122] and software hardening via control-flow-integrity (CFI) enforcement, artificial diversification, or debloating when source code is not available.…”

Section: Introduction

mentioning

confidence: 99%

“…Neural binary analyses (NBAs) are seemingly well-matched to the problem domain, where inference is necessary due to the lossy compilation process. Recent work has shown great promise for performing accurate disassembly [176], [197], function boundary detection [42], [62], [176], [197], and static binary similarity detection [49], [73], [99], [109], [113], [126], [130], [160], [169] that is simultaneously more efficient than deterministic methods.…”

Section: Introduction

mentioning

confidence: 99%

Towards rigorous evaluation of binary testing and analysis

Bundt

Towards Rigorous Evaluation of Binary Testing and Analysis by Joshua Bundt

Computer security research is an ever-evolving field that aims to make technology more secure. Attackers constantly seek out vulnerabilities in systems, and defenders strive to introduce new controls to prevent these attacks. Attack research typically involves demonstrating the validity of an attack through a proof of concept. In contrast, defense research requires a higher level of rigor to substantiate that defenses are secure under various conditions and against a willful adversary. In this thesis, we examine the state of rigor in a specific area of defense research: binary testing and analysis. Binary testing and analysis encompasses the tasks and techniques required to evaluate binary code, which is the machine-readable representation of software programs, in order to understand program behavior, identify vulnerabilities, and ensure correctness and security. To assess the robustness of the current techniques and to provide a more rigorous methodology, we first examine the utility of synthetic bug generation as a solution to the scarcity of real bugs for fuzz testing evaluation. We conducted a large-scale measurement study evaluating existing synthetic bug generators with eight fuzzers on 20 software libraries and found that synthetic bugs are easier to discover than organic bugs and the most popular synthetic bug benchmark, LAVA-M, exhibits fundamental flaws that make it unsuitable to recommend for future research. Second, we propose a new workflow to enable humans to more effectively assist fuzz testing through compartment analysis. An empirical study of seven software libraries revealed that compartment analysis can significantly improve a fuzzing campaign even when conducted after a few hours of fuzzing. Finally, we consider the fragility of neural network binary disassemblers at the task of function boundary detection. When comparing traditional disassemblers to neural binary disassemblers, we found the latter to be vulnerable to adversarial attacks which allows the attacker to degrade function boundary detection. In response, we proposed an expanded set of benchmarks and adversarial techniques to provide a better evaluation of neural binary disassemblers. Throughout this dissertation, we propose and demonstrate improved methodologies for rigorously examining and assessing binary testing and analysis efficacy.

Acknowledgements

The PhD journey starts with someone agreeing to take a long term risk on you despite knowing little more than your resume. For taking the initial risk, I would like to thank my advisor Wil Robertson who has guided me from start to finish while providing advice, unwaverable optimism, and tremendous patience. A special thanks to Tim Leek who took on mentoring me long before I started at Northeastern and continues to provide keen insight and frank opinions that shape my research efforts. I would also like to thank the rest of the members of my committee Pete Manolios, Guevara Noubir, and Davide Balzarotti who pr...
