Shellcode_IA32: A Dataset for Automatic Shellcode Generation

Liguori, Pietro; Al-Hossami, Erfan; Cotroneo, Domenico; Natella, Roberto; Čukić, Bojan; Shaikh, Samira

doi:10.18653/v1/2021.nlp4prog-1.7

Cited by 11 publications

(8 citation statements)

References 19 publications


“…BLEU is used to evaluate code generation systems since many prior works in code generation formulated the problem as a machine translation problem of translating English to code snippets (e.g. (Liguori et al, 2021a)). Both exact match and averaged token level BLEU scores have been extensively used in evaluating code generation models (Liguori et al, 2021a,b;Oda et al, 2015b;Ling et al, 2016;Gemmell et al, 2020).…”

Section: Discussion (mentioning)

confidence: 99%

“…Furthermore, code generation enables users to build code more effectively and efficiently and enhance the overall software engineering process. Code software engineering (Miltner et al, 2019;, robotics (Kuhlmann et al, 2004), and cyber-security (You et al, 2017;Liguori et al, 2021a;Frempong et al, 2021; (a) An assembly code generation task. The task is to generate the assembly code that is then compiled into shellcode (small pieces of code used as a payload to exploit software vulnerabilities) using the natural language descriptions on the right.…”

Section: Overview (mentioning)

confidence: 99%

“…The dataset contains multi-line snippets mapping onto one intent. Lines 4-5, 6-7-8, 9-10, 11-12 are multi-line snippets (Liguori et al, 2021a).…”

Section: Overview (mentioning)

confidence: 99%

“…Recent works in code generation have been used to assist data scientists to perform data visualization and file manipulation, generating bash commands (Lin et al, 2018), generate exploits (Liguori et al, 2021b,a;Frempong et al, 2021;Liguori et al, 2022), solve interview-level programming questions (Hendrycks et al, 2021), manipulate data, and generate code snippets from natural language descriptions in many programming languages. These programming languages include but are not limited to Python (Xu et al, 2020a;Ling et al, 2016;Liguori et al, 2021b), Java (Ling et al, 2016), SQL (Zhong et al, 2017), Excel macro commands (Gulwani and Marron, 2014), Assembly (Liguori et al, 2021a,b, 2022), and JavaScript (Frempong et al, 2021).…”

Section: Overview (mentioning)

confidence: 99%

“…Automatic exploit generation is defined as an offensive security technique in which software exploits are automatically generated to explore and test critical vulnerabilities before malicious attackers discover such vulnerabilities (Avgerinos et al, 2011). With the goal of offensive security in mind, Liguori et al (Liguori et al, 2021a) share a dataset, Shellcode_IA32, for shellcode generation from natural language descriptions. Shellcodes are compiled from assembly programs for the 32-bit version of the x86 Intel Architecture and contain a payload that is used in exploiting software vulnerabilities.…”

Section: Datasets (mentioning)

confidence: 99%


A Survey on Artificial Intelligence for Source Code: A Dialogue Systems Perspective

Al-Hossami, Shaikh

2022

Preprint

Self Cite


In this survey paper, we overview major deep learning methods used in Natural Language Processing (NLP) and source code over the last 35 years. Next, we present a survey of the applications of Artificial Intelligence (AI) for source code, also known as Code Intelligence (CI) and Programming Language Processing (PLP). We survey over 287 publications and present a software-engineering centered taxonomy for CI placing each of the works into one category describing how it best assists the software development cycle. Then, we overview the field of conversational assistants and their applications in software engineering and education. Lastly, we highlight research opportunities at the intersection of AI for code and conversational assistants and provide future directions for researching conversational assistants with CI capabilities.


Can we generate shellcodes via natural language? An empirical study

et al. 2022

Self Cite


Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated using natural language. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors.
