Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts

Bavishi, Rohan; Pradel, Michael; Sen, Koushik

doi:10.48550/arxiv.1809.05193

Cited by 16 publications

(23 citation statements)

References 36 publications

“…In a recent work, CONTEXT2NAME [4] attempted to assign meaningful names to the identifiers based on the context of minified JavaScript codes. They were able to successfully predict 47.5% of meaningful identifiers on 15,000 minified codes using recurrent neural networks.…”

Section: Neural Models (mentioning)

confidence: 99%

“…While researchers have taken steps in predicting variable names in high-level programming languages [1,3,4,26,41,45], it is worth noting that inferring variable names in decompiled binary code poses a unique set of challenges. High-level programming languages like Java, Python, and JavaScript are syntactically rich: Variable types are preserved in these languages, while they are usually eliminated in binaries.…”

Section: Introduction (mentioning)

confidence: 99%

Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling

Banerjee, Pal, Wang et al. 2021

Preprint

Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine. While modern decompilers can reconstruct and recover much information that is discarded during compilation, inferring variable names is still extremely difficult. Inspired by recent advances in natural language processing, we propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT. Our solution takes raw decompiler output, the less semantically meaningful code, as input, and enriches it using our proposed finetuning technique, Constrained Masked Language Modeling. Using Constrained Masked Language Modeling introduces the challenge of predicting the number of masked tokens for the original variable name. We address this count of token prediction challenge with our post-processing algorithm. Compared to the state-of-the-art approaches, our trained VarBERT model is simpler and of much better performance. We evaluated our model on an existing large-scale data set with 164,632 binaries and showed that it can predict variable names identical to the ones present in the original source code up to 84.15% of the time.

CCS Concepts: Security and privacy → Software reverse engineering; Computing methodologies → Natural language processing; Machine learning; Model development and analysis.

“…Vasilescu et al [70] describe an approach to recover original names from minified JavaScript programs based on statistical machine translation (SMT). Bavishi et al [11] accomplish this using a deep learning-based technique. Jaffe et al [37] generate meaningful variable names for decompiled code by combining a translation model trained on a parallel corpus with a language model trained on unmodified C code.…”

Section: Related Work (mentioning)

confidence: 99%

“…We find that adding words found in strings and comments appears to have little impact on BPE 5K and 10K, both of which slightly increase the size of the corpus by 1-2%. A vocabulary of 10K words is more than 1,000 times smaller than the initial configuration (11,357,210), at the cost of increasing the number of tokens in the corpus by a factor of 1.7.…”

Section: Byte-pair Encoding (mentioning)

confidence: 99%

Modeling Vocabulary for Big Code Machine Learning

Babii, Janes, Robbes 2019

Preprint

When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can lead to not being able to train models at all, others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are not often fully described. This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, allowing to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.

“…However, we are the first to focus on name-value inconsistencies, whereas prior work targets other kinds of problems. Nalin also relates to learned models that predict missing identifier names [12,17,48]. Our work differs by analyzing code with names supposed to be meaningful, instead of targeting obfuscated or compiled code.…”

Section: Introduction (mentioning)

confidence: 99%

Nalin: Learning from Runtime Behavior to Find Name-Value Inconsistencies in Jupyter Notebooks

Patra, Pradel 2021

Preprint

Self Cite

Variable names are important to understand and maintain code. If a variable name and the value stored in the variable do not match, then the program suffers from a name-value inconsistency, which is due to one of two situations that developers may want to fix: Either a correct value is referred to through a misleading name, which negatively affects code understandability and maintainability, or the correct name is bound to a wrong value, which may cause unexpected runtime behavior. Finding name-value inconsistencies is hard because it requires an understanding of the meaning of names and knowledge about the values assigned to a variable at runtime. This paper presents Nalin, a technique to automatically detect name-value inconsistencies. The approach combines a dynamic analysis that tracks assignments of values to names with a neural machine learning model that predicts whether a name and a value fit together. To the best of our knowledge, this is the first work to formulate the problem of finding coding issues as a classification problem over names and runtime values. We apply Nalin to 106,652 real-world Python programs, where meaningful names are particularly important due to the absence of statically declared types. Our results show that the classifier detects name-value inconsistencies with high accuracy, that the warnings reported by Nalin have a precision of 80% and a recall of 76% w.r.t. a ground truth created in a user study, and that our approach complements existing techniques for finding coding issues.

CCS Concepts: Software and its engineering → Software maintenance tools; Software post-development issues.
