VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses

Nong, Yu; Fang, Richard; Yi, Guangbei; Zhao, Kunsong; Luo, Xiapu; Chen, Feng; Cai, Haipeng

doi:10.1145/3597503.3639116

Cited by 3 publications

References 35 publications

Learning to Detect and Localize Multilingual Bugs

Yang, Nong, Zhang et al. 2024

Proc. ACM Softw. Eng.

Increasing studies have shown bugs in multi-language software as a critical loophole in modern software quality assurance, especially those induced by language interactions (i.e., multilingual bugs). Yet existing tool support for bug detection/localization remains largely limited to single-language software, despite the long-standing prevalence of multi-language systems in various real-world software domains. Extant static/dynamic analysis and deep learning (DL) based approaches all face major challenges in addressing multilingual bugs. In this paper, we present xLoc, a DL-based technique/tool for detecting and localizing multilingual bugs. Motivated by results of our bug-characteristics study on top locations of multilingual bugs, xLoc first learns the general knowledge relevant to differentiating various multilingual control-flow structures. This is achieved by pre-training a Transformer model with customized position encoding against novel objectives. Then, xLoc learns task-specific knowledge for the task of multilingual bug detection/localization, through another new position encoding scheme (based on cross-language API vicinity) that allows for the model to attend particularly to control-flow constructs that bear most multilingual bugs during fine-tuning. We have implemented xLoc for Python-C software and curated a dataset of 3,770 buggy and 15,884 non-buggy Python-C samples, which enabled our extensive evaluation of xLoc against two state-of-the-art baselines: fine-tuned CodeT5 and zero-shot ChatGPT. Our results show that xLoc achieved 94.98% F1 and 87.24%@Top-1 accuracy, which are significantly (up to 162.88% and 511.75%) higher than the baselines. Ablation studies further confirmed significant contributions of each of the novel design elements in xLoc. With respective bug-location characteristics and labeled bug datasets for fine-tuning, our design may be applied to other language combinations beyond Python-C.

VinJ: An Automated Tool for Large-Scale Software Vulnerability Data Generation

Nong, Yang, Chen et al. 2024

Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering

Improving VulRepair’s Perfect Prediction by Leveraging the LION Optimizer

Kishiyama, Lee, Yang 2024

Applied Sciences

In current software applications, numerous vulnerabilities may be present. Attackers attempt to exploit these vulnerabilities, leading to security breaches, unauthorized entry, data theft, or the incapacitation of computer systems. Instead of addressing software or hardware vulnerabilities at a later stage, it is better to address them immediately or during the development phase. Tools such as AIBugHunter provide solutions designed to tackle software issues by predicting, categorizing, and fixing coding vulnerabilities. Essentially, developers can see where their code is susceptible to attacks and obtain details about the nature and severity of these vulnerabilities. AIBugHunter incorporates VulRepair to detect and repair vulnerabilities. VulRepair currently predicts patches for vulnerable functions at 44%. To be truly effective, this number needs to be increased. This study examines VulRepair to see whether the 44% perfect prediction can be increased. VulRepair is based on T5 and uses both natural language and programming languages during its pretraining phase, along with byte pair encoding. T5 is a text-to-text transfer transformer model with an encoder and decoder as part of its neural network. It outperforms other models such as VRepair and CodeBERT. However, the hyperparameters may not be optimized due to the development of new optimizers. We reviewed a deep neural network (DNN) optimizer developed by Google in 2023. This optimizer, the Evolved Sign Momentum (LION), is available in PyTorch. We applied LION to VulRepair and tested its influence on the hyperparameters. After adjusting the hyperparameters, we obtained a 56% perfect prediction, which exceeds the value of the VulRepair report of 44%. This means that VulRepair can repair more vulnerabilities and avoid more attacks. As far as we know, our approach utilizing an alternative to AdamW, the standard optimizer, has not been previously applied to enhance VulRepair and similar models.
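For readers unfamiliar with the optimizer named in this abstract, the following is a minimal plain-Python sketch of the sign-momentum update rule as it is commonly described for LION. It is not the PyTorch implementation used in the study, and the function name `lion_step`, the list-of-floats parameter layout, and all hyperparameter values are assumptions chosen for illustration only.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One illustrative LION-style update over flat lists of floats.

    Hypothetical helper: the defaults here are assumptions, not the
    hyperparameters reported by the study.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is only the sign of an interpolation
        # between the running momentum and the current gradient.
        direction = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay is applied alongside the signed step.
        new_params.append(p - lr * (direction + weight_decay * p))
        # The momentum is tracked with a second, slower interpolation.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum
```

Because each step uses only the sign of the interpolated gradient, every parameter moves by the same magnitude `lr` (plus weight decay), which is one reason learning rate and weight decay typically need retuning when switching from AdamW.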
