Proceedings of the 44th International Conference on Software Engineering 2022
DOI: 10.1145/3510003.3510162
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Cited by 19 publications (10 citation statements)
References 40 publications
“…Previous studies [25,29,44] spend considerable effort building large datasets that cover as wide a variety of source code as possible, because the effectiveness of their models depends heavily on the training set they are fed.…”
Section: Dataset
confidence: 99%
“…Metrics. Unlike previous studies [25,29,44] that focus on specific types of optimizations, DeGPT involves several optimizations, including structure simplification, appending comments, and variable renaming. Moreover, some of these optimizations are applied to decompiler output for the first time.…”
Section: Dataset
confidence: 99%
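As a rough illustration of the "variable renaming on decompiler output" optimization mentioned in the statement above, the sketch below renames placeholder identifiers in a decompiled function. The decompiled snippet, the rename map, and the rename_variables helper are hypothetical examples for illustration only; they are not taken from DeGPT or its evaluation.

```python
# Hypothetical sketch of variable renaming applied to decompiler output.
# The snippet, the rename map, and rename_variables are illustrative only.
import re

decompiled = """
int sub_401000(int a1, int a2) {
    int v1 = a1 + a2;
    int v2 = v1 * 2;
    return v2;
}
"""

# A name map such as a renaming model might predict (assumed values).
rename_map = {"a1": "lhs", "a2": "rhs", "v1": "total", "v2": "doubled"}

def rename_variables(code: str, mapping: dict) -> str:
    """Replace whole-word occurrences of placeholder identifiers."""
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

print(rename_variables(decompiled, rename_map))
```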
“…Machine learning models are widely used in binary program analysis tasks (Pei et al., 2020; 2021a; Jin et al., 2022; Chen et al., 2022b; Xu et al., 2023b;a; Wang et al., 2022). However, these models are typically designed for specific downstream tasks such as binary code similarity detection (Pei et al., 2020; Xu et al., 2023a; Wang et al., 2022), variable name prediction (Chen et al., 2022b; Xu et al., 2023b), and binary code type inference (Pei et al., 2021a). In contrast, Nova+ is a pre-trained binary code model that generalizes to various downstream tasks and is shown to outperform existing state-of-the-art techniques on three downstream tasks.…”
Section: Binary Code Models
confidence: 99%
“…Recent studies have shown that state-of-the-art models heavily rely on variables [13,28], specific tokens [29], and even structures [30]. Chen et al. [31] focus on semantic representations of program variables, and study how well models can learn similarity between variables that have similar meanings (e.g., minimum and minimal). Ding et al. [32] explore the problem of learning functional similarities (and dissimilarities) between code fragments, towards which they rename variables to inject variable-misuse bugs in order to generate buggy programs that are structurally similar to benign ones.…”
Section: Related Work
confidence: 99%
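The variable-similarity idea attributed to Chen et al. [31] in the statement above can be illustrated with a small sketch. This is not the VarCLR model; it is a hypothetical stand-in that embeds variable names as character-trigram bags and scores pairs with cosine similarity, so related names such as minimum and minimal score higher than unrelated ones.

```python
# Hypothetical variable-name similarity sketch (not the VarCLR model):
# character-trigram bags + cosine similarity as a stand-in for a learned
# variable-name encoder.
import math
from collections import Counter

def char_ngrams(name: str, n: int = 3) -> Counter:
    """Split a variable name into overlapping character n-grams."""
    padded = f"#{name.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-n-gram vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def variable_similarity(v1: str, v2: str) -> float:
    return cosine(char_ngrams(v1), char_ngrams(v2))

print(variable_similarity("minimum", "minimal"))    # relatively high
print(variable_similarity("minimum", "file_path"))  # near zero
```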