Suggesting meaningful variable names for decompiled code: a machine translation approach

Jaffe, Alan

doi:10.1145/3106237.3121274

Cited by 13 publications

(23 citation statements)

References 14 publications


“…Introduction: Deep Learning (DL) has been used to support a vast variety of code-related tasks. Some examples include automatic bug fixing [1]-[4], learning generic code changes [5], code migration [6], [7], code summarization [8]-[11], pseudo-code generation [12], code deobfuscation [13], [14], injection of code mutants [15], automatic generation of assert statements [16], and code completion [17]-[21]. These works customize DL models proposed in the Natural Language Processing (NLP) field to support the previously listed tasks.…”

mentioning

confidence: 99%

Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks

Mastropaolo

Scalabrino

Cooper

et al. 2021

2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)

165


Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset using a self-supervised task (e.g., filling masked words in sentences). Once the model is pre-trained, it is fine-tuned on smaller and specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune such a model by reusing datasets used in four previous works that used DL techniques to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared the performance of this single model with the results reported in the four original papers proposing DL-based solutions for those four tasks. We show that our T5 model, exploiting additional data for the self-supervised pre-training phase, can achieve performance improvements over the four baselines.
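The self-supervised pre-training task mentioned in the abstract above (filling masked words) can be illustrated with a small sketch. This is not the authors' pipeline: the function name `mask_spans`, the hand-picked spans, and the `<extra_id_N>` sentinel spelling (borrowed from a common T5 tokenizer convention) are all assumptions made for illustration, and the format is simplified.

```python
def mask_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel.

    Returns (model_input, model_target) as strings. Hypothetical helper:
    spans are chosen by hand here, whereas real pre-training samples them.
    """
    masked, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel spelling
        masked.extend(tokens[cursor:start])  # keep tokens before the span
        masked.append(sentinel)              # hide the span in the input
        target.append(sentinel)              # target repeats the sentinel...
        target.extend(tokens[start:end])     # ...followed by the hidden tokens
        cursor = end
    masked.extend(tokens[cursor:])           # keep tokens after the last span
    return " ".join(masked), " ".join(target)


tokens = "int total = add ( a , b ) ;".split()
model_input, model_target = mask_spans(tokens, [(1, 2), (4, 9)])
print(model_input)
print(model_target)
```

The model sees the input with spans hidden and is trained to emit the target, so the same text-to-text interface can later be reused for fine-tuning tasks such as bug fixing or comment generation.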


“…This is achieved by using two encoders, a lexical encoder (Section III-B1) and a structural encoder (Section III-B2), to separately capture the lexical and structural signals in the decompiled code. As we will show, this combination of lexical and structural information allows DIRE to outperform techniques that rely on lexical information alone [21].…”

Section: A Overview

mentioning

confidence: 86%

“…The remaining step, mapping developer-chosen names to variable IDs, is the core challenge in automatic corpus generation. Following our previous approach [21], we leverage the decompiler's ability to incorporate developer-chosen identifier names into decompiled code when DWARF debugging symbols [26] are present in the binary. However, this alone is not sufficient to identify which developer-chosen name maps to a particular variable ID generated in the first step.…”

Section: Generation Of Training Data

mentioning

confidence: 99%

“…One solution to these problems proposed by prior work is to post-process the decompiler output using heuristics to align decompiler-assigned and developer-assigned names [21]. However, this technique can only correctly align 72.8% of variable names, therefore limiting the overall accuracy of any subsequent model trained on this data.…”

Section: Generation Of Training Data

mentioning

confidence: 99%

“…While this is conceptually straight-forward, the two outputs are not simply α-renamings, making the process of calculating these alignments far from trivial. Prior work identified alignments based entirely on heuristics [21]. In contrast, we observe that the set of instruction addresses that access each variable uniquely identifies that variable, and this can be used to generate accurate alignments (Section IV).…”

Section: Introduction

mentioning

confidence: 97%
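The alignment idea in the statement above (the set of instruction addresses that access a variable identifies it) can be sketched as follows. This is a minimal illustration, not DIRE's implementation: the function `align_by_addresses`, the dictionary layout, and the sample addresses are hypothetical, and it assumes address sets have already been extracted from both the stripped and the debug-symbol decompilations.

```python
def align_by_addresses(decompiler_vars, debug_vars):
    """Map decompiler placeholder names to developer-chosen names.

    Both arguments map a variable name to the set of instruction addresses
    that access it. Two variables are aligned only when their address sets
    are identical (an assumption of this sketch).
    """
    # Index developer-chosen names by their (hashable) address set.
    by_addresses = {frozenset(addrs): name for name, addrs in debug_vars.items()}
    alignment = {}
    for placeholder, addrs in decompiler_vars.items():
        name = by_addresses.get(frozenset(addrs))
        if name is not None:  # leave unmatched placeholders unaligned
            alignment[placeholder] = name
    return alignment


# Hypothetical data: v3 has no counterpart in the debug build.
decompiler_vars = {"v1": {0x401000, 0x401008}, "v2": {0x401010}, "v3": {0x401020}}
debug_vars = {"length": {0x401000, 0x401008}, "index": {0x401010}}
alignment = align_by_addresses(decompiler_vars, debug_vars)
print(alignment)
```

Matching on address sets avoids comparing the two decompiler outputs textually, which is what makes heuristic alignment error-prone when the outputs are not simple renamings of each other.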


DIRE: A Neural Approach to Decompiled Identifier Naming

Lacomis

Yin

Schwartz

et al. 2019

2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)


The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.


The Strengths and Behavioral Quirks of Java Bytecode Decompilers

Harrand

Soto-Valero

Monperrus

et al. 2019

2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM)


During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, the decompilation process, which aims at producing source code from bytecode, must establish some strategies to reconstruct the information that has been lost. Modern Java decompilers tend to use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled has a direct impact on the quality of the source code produced by decompilers. We study the effectiveness of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. This study relies on a benchmark set of 14 real-world open-source software projects to be decompiled (2041 classes in total). Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. Even the highest ranking decompiler in this study produces syntactically correct output for 84% of classes of our dataset and semantically equivalent code output for 78% of classes.

