Lannan Luo scite author profile

Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a fruitful area focused on processing text of various natural languages. We notice that binary code analysis and NLP share many analogical topics, such as semantics extraction, classification, and code/text comparison. This work thus borrows ideas from NLP to address two important code similarity comparison problems. (I) Given a pair of basic blocks of different instruction set architectures (ISAs), determining whether their semantics is similar; and (II) given a piece of code of interest, determining if it is contained in another piece of code of a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection.Despite the evident importance of Problem I, existing solutions are either inefficient or imprecise. Inspired by Neural Machine Translation (NMT), which is a new approach that tackles text across natural languages very well, we regard instructions as words and basic blocks as sentences, and propose a novel cross-(assembly)-lingual deep learning approach to solving Problem I, attaining high efficiency and precision. Many solutions have been proposed to determine whether two pieces of code, e.g., functions, are equivalent (called the equivalence problem), which is different from Problem II (called the containment problem). Resolving the cross-architecture code containment problem is a new and more challenging endeavor. Employing our technique for crossarchitecture basic-block comparison, we propose the first solution to Problem II. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. The case studies applying the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to largescale binary code analysis.

show abstract

Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Luo

Jiang

et al. 2014

171

155

View full text Add to dashboard Cite

Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult.In this paper, we propose a binary-oriented, obfuscationresilient method based on a new concept, longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantics equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantics similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype and our experimental results show that our method is effective and practical when applied to real-world software.

show abstract

Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software and Algorithm Plagiarism Detection

Luo

Jiang

et al. 2017

IIEEE Trans. Software Eng.

View full text Add to dashboard Cite

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Redmond¹,

Luo²,

Zeng³

2019

View full text Add to dashboard Cite

Given a closed-source program, such as most of proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for various architectures, making cross-architecture binary code analysis increasingly important. A binary, after being disassembled, is expressed in an assembly languages. Thus, recent work starts exploring Natural Language Processing (NLP) inspired binary code analysis. In NLP, words are usually represented in high-dimensional vectors (i.e., embeddings) to facilitate further processing, which is one of the most common and critical steps in many NLP tasks. We regard instructions as words in NLPinspired binary code analysis, and aim to represent instructions as embeddings as well.To facilitate cross-architecture binary code analysis, our goal is that similar instructions, regardless of their architectures, have embeddings close to each other. To this end, we propose a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture, but also their semantic relationships across architectures. To the best of our knowledge, this is the first work on building crossarchitecture instruction embedding model. As a showcase, we apply the model to resolving one of the most fundamental problems for binary code similarity comparison-semantics-based basic block comparison, and the solution outperforms the code statistics based approach. It demonstrates that it is promising to apply the model to other cross-architecture binary code analysis tasks.

show abstract

A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples

Zeng

et al. 2019

View full text Add to dashboard Cite

Adversarial examples (AEs) are crafted by adding human-imperceptible perturbations to inputs such that a machine-learning based classifier incorrectly labels them. They have become a severe threat to the trustworthiness of machine learning. While AEs in the image domain have been well studied, audio AEs are less investigated. Recently, multiple techniques are proposed to generate audio AEs, which makes countermeasures against them urgent. Our experiments show that, given an audio AE, the transcription results by Automatic Speech Recognition (ASR) systems differ significantly (that is, poor transferability), as different ASR systems use different architectures, parameters, and training datasets. Based on this fact and inspired by Multiversion Programming, we propose a novel audio AE detection approach MVP-EARS, which utilizes the diverse off-the-shelf ASRs to determine whether an audio is an AE. We build the largest audio AE dataset to our knowledge, and the evaluation shows that the detection accuracy reaches 99.88%.While transferable audio AEs are difficult to generate at this moment, they may become a reality in future. We further adapt the idea above to proactively train the detection system for coping with transferable audio AEs. Thus, the proactive detection system is one giant step ahead of attackers working on transferable AEs.

show abstract

Resilient decentralized Android application repackaging detection using logic bombs

Zeng

Luo

Qian

et al. 2018

View full text Add to dashboard Cite

Repackage-Proofing Android Apps

Luo

et al. 2016

View full text Add to dashboard Cite

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Redmond¹,

Luo²,

Zeng³

2018

Preprint

View full text Add to dashboard Cite

12 3 4

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Lannan Luo

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software and Algorithm Plagiarism Detection

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples

Resilient decentralized Android application repackaging detection using logic bombs

Repackage-Proofing Android Apps

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Contact Info

Product

Resources

About