Compiler-Agnostic Function Detection in Binaries

Andriesse, Dennis; Slowinska, Asia; Bos, Herbert

doi:10.1109/eurosp.2017.11

Cited by 78 publications

(74 citation statements)

References 23 publications

“…Even if the program comes with metadata identifying the code sections, compiler optimizations make static analysis harder [11]: Often, compilers embed small chunks of data in the instruction stream. Microsoft Visual Studio includes data and padding bytes between instructions when producing x86 and x86-64 code [12], and ARM code often contains jump tables and large constants embedded in the instruction stream [13]. This "inline" data, if wrongly identified as an instruction (or vice-versa), leads to an erroneous analysis.…”

Section: Introduction (mentioning)

confidence: 99%

ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation

Nicolao

Pogliani

Polino

et al. 2018

Detection of Intrusions and Malware, and Vulnerability Assessment

Static binary analysis techniques are widely used to reconstruct the behavior and discover vulnerabilities in software when source code is not available. To avoid errors due to mis-interpreting data as machine instructions (or vice-versa), disassemblers and static analysis tools must precisely infer the boundaries between code and data. However, this information is often not readily available. Worse, compilers may embed small chunks of data inside the code section. Most state of the art approaches to separate code and data are rooted on recursive traversal disassembly, with severe limitations when dealing with indirect control instructions. We propose ELISA, a technique to separate code from data and ease the static analysis of executable files. ELISA leverages supervised sequential learning techniques to locate the code section(s) boundaries of header-less binary files, and to predict the instruction boundaries inside the identified code section. As a preliminary step, if the Instruction Set Architecture (ISA) of the binary is unknown, ELISA leverages a logistic regression model to identify the correct ISA from the file content. We provide a comprehensive evaluation on a dataset of executables compiled for different ISAs, and we show that our method is capable to identify code sections with a byte-level accuracy (F1 score) ranging from 98.13% to over 99.9% depending on the ISA. Fine-grained separation of code from embedded data on x86, x86-64 and ARM executables is accomplished with an accuracy of over 99.9%.

“…In this section, firstly, we present the experimental results of our proposed Code Action Network for the machine instruction level (CAN-M) and the byte level (CAN-B) compared with other baselines including IDA, ByteWeight (BW) no-RFCR, ByteWeight (BW) [2], the Bidirectional RNN (BRNN) [12] and Nucleus [1]. Secondly, we perform error analysis to qualitatively investigate our proposed methods.…”

Section: Methods (mentioning)

confidence: 99%

“…We also compared the average predictive performance for case by case including the function start, function bound and function scope identifications of our CAN-M and CAN-B using the hidden size of 256 and LSTM cell with the Bidirectional RNN, ByteWeight, and Nucleus in both Linux and Windows platforms. For Nucleus [1], we reported the experimental results reported in that paper. The experimental results in Table 2 indicate that our CAN-M and CAN-B again outperformed the baselines, while CAN-M obtained the highest predictive performances in all measures (Recall, Precision and F1 score).…”

Section: Code Action Network Versus Bidirectional RNN ByteWeight And (mentioning)

confidence: 99%

“…However, later research in [14] argued that this task is non-trivial and complex in some specific cases wherein it is too challenging for heuristics-based methods to discover all function boundaries. Other influential works and tools that rely on signature database and structural graphs include IDA Pro, Dyninst, (Binary Analysis Platform) BAP, and Nucleus [1]. Andriesse et al [1] has recently proposed a new signature-less approach to function detection for stripped binaries named Nucleus which is based on structural Control Flow Graph analysis.…”

Section: Introduction (mentioning)

confidence: 99%

“…Other influential works and tools that rely on signature database and structural graphs include IDA Pro, Dyninst, (Binary Analysis Platform) BAP, and Nucleus [1]. Andriesse et al [1] has recently proposed a new signature-less approach to function detection for stripped binaries named Nucleus which is based on structural Control Flow Graph analysis. More specifically, Nucleus identifies functions in the intraprocedural control flow graph (ICFG) by analyzing the control flow between basic blocks, based on the observation that intraprocedural control flow tends to use different types and patterns of control flow instructions than inter-procedural control flow.…”

Section: Introduction (mentioning)

confidence: 99%

Code Action Network for Binary Function Scope Identification

Nguyen

Le³

et al. 2020

Advances in Knowledge Discovery and Data Mining

Function identification is a preliminary step in binary analysis for many applications from malware detection, common vulnerability detection and binary instrumentation to name a few. In this paper, we propose the Code Action Network (CAN) whose key idea is to encode the task of function scope identification to a sequence of three action states NI (i.e., next inclusion), NE (i.e., next exclusion), and FE (i.e., function end) to efficiently and effectively tackle function scope identification, the hardest and most crucial task in function identification. A bidirectional Recurrent Neural Network is trained to match binary programs with their sequence of action states. To work out function scopes in a binary, this binary is first fed to a trained CAN to output its sequence of action states which can be further decoded to know the function scopes in the binary. We undertake extensive experiments to compare our proposed method with other state-of-the-art baselines. Experimental results demonstrate that our proposed method outperforms the state-of-the-art baselines in terms of predictive performance on real-world datasets which include binaries from well-known libraries.
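
The decoding step described in this abstract can be illustrated with a minimal sketch. The abstract names the three action states (NI, NE, FE) but does not give their exact semantics, so the rules below are assumptions made for illustration: NI adds the next item (instruction or byte) to the current function and advances, NE skips the next item, and FE closes the current function without consuming an item. The function name `decode_scopes` is hypothetical and is not taken from the CAN paper.

```python
from typing import List

# Action states named in the abstract; the semantics below are assumed.
NI, NE, FE = "NI", "NE", "FE"


def decode_scopes(actions: List[str]) -> List[List[int]]:
    """Decode a sequence of action states into function scopes.

    Assumed semantics (not confirmed by the source):
      NI: include the next item in the current function, then advance.
      NE: exclude the next item, then advance.
      FE: close the current function; no item is consumed.
    Returns one list of item indices per decoded function.
    """
    scopes: List[List[int]] = []
    current: List[int] = []
    position = 0  # index of the next item to be consumed

    for action in actions:
        if action == NI:
            current.append(position)
            position += 1
        elif action == NE:
            position += 1
        elif action == FE:
            if current:  # ignore an FE that closes nothing
                scopes.append(current)
                current = []
        else:
            raise ValueError(f"unknown action state: {action!r}")

    return scopes


if __name__ == "__main__":
    example = [NI, NI, NE, NI, FE, NE, NI, NI, FE]
    print(decode_scopes(example))
```

Under these assumptions a function scope may be non-contiguous: an NE between two NI actions leaves a gap inside the scope, which is one way to represent padding or data that sits between a function's instructions.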