Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

Simko, Lucy; Zettlemoyer, Luke; Kohno, Tadayoshi

doi:10.1515/popets-2018-0007

Cited by 26 publications

(25 citation statements)

References 14 publications


“…The applications of AA are vast and include: assigning authorship to literature/text, and ascertaining the demography of an author (e.g., age, gender, native language) (López-Monroy et al, 2020). AA can also be applied to predicting author(s) of source code (Simko et al, 2018), chatbot detection, and even detecting authors intentionally trying to mask their writing style (Juola, 2012; Sánchez-Junquera et al, 2020). Finally, our work bears similarity to (Manjavacas et al, 2017), which investigates the stylistic properties of different neural text generation techniques (i.e., Ngram-based and RNN-based).…”

Section: Applications Of Authorship Attribution (mentioning)

confidence: 99%

Authorship Attribution for Neural Text Generation

Uchendu, Le, Shu, et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)


In recent years, the task of generating realistic short and long texts have made tremendous advancements. In particular, several recently proposed neural network-based language models have demonstrated their astonishing capabilities to generate texts that are challenging to distinguish from human-written texts with the naked eye. Despite many benefits and utilities of such neural methods, in some applications, being able to tell the "author" of a text in question becomes critically important. In this work, in the context of this Turing Test, we investigate the so-called authorship attribution problem in three versions: (1) given two texts T1 and T2, are both generated by the same method or not? (2) is the given text T written by a human or machine? (3) given a text T and k candidate neural methods, can we single out the method (among k alternatives) that generated T? Against one human-written and eight machine-generated texts (i.e., CTRL, GPT, GPT2, GROVER, XLM, XLNET, PPLM, FAIR), we empirically experiment with the performance of various models in three problems. By and large, we find that most generators still generate texts significantly different from human-written ones, thereby making three problems easier to solve. However, the qualities of texts generated by GPT2, GROVER, and FAIR are better, often confusing machine classifiers in solving three problems. All codes and datasets of our experiments are


“…The data is shared publicly to enable further research in this area. This dataset is a good representative of real-world code examples, as opposed to the competitive coding examples, used by most of the other works [24,3,7,1]. The problems with large competitive coding datasets that give near-perfect accuracy are thoroughly discussed in [5].…”

Section: Dataset (mentioning)

confidence: 99%

ICodeNet - A Hierarchical Neural Network Approach For Source Code Author Identification

Bora, Awalgaonkar, Palve, et al. 2021

2021 13th International Conference on Machine Learning and Computing


With the open-source revolution, source codes are now more easily accessible than ever. This has, however, made it easier for malicious users and institutions to copy the code without giving regards to the license, or credit to the original author. Therefore, source code author identification is a critical task with paramount importance. In this paper, we propose ICodeNet, a hierarchical neural network that can be used for source code file-level tasks. The ICodeNet processes source code in image format and is employed for the task of per file author identification. The ICodeNet consists of an ImageNet trained VGG encoder followed by a shallow neural network. The shallow network is based either on CNN or LSTM. Different variations of models are evaluated on a source code author classification dataset. We have also compared our image-based hierarchical neural network model with simple image-based CNN architecture and text-based CNN and LSTM models to highlight its novelty and efficiency.


“…The overall goal of such a software is to help to identify the authors of malicious software. This domain has been very active in the last years [7,11,19]. Our tool is designed to identify coding style pattern used by PDF producer tools to detect PDF producer tool.…”

Section: Related Work (mentioning)

confidence: 99%

Robust PDF Files Forensics Using Coding Style

Adhatarao, Lauradoux 2021

Preprint


Identifying how a file has been created is often interesting in security. It can be used by both attackers and defenders. Attackers can exploit this information to tune their attacks and defenders can understand how a malicious file has been created after an incident. In this work, we want to identify how a PDF file has been created. This problem is important because PDF files are extremely popular: many organizations publish PDF files online and malicious PDF files are commonly used by attackers. Our approach to detect which software has been used to produce a PDF file is based on coding style: given patterns that are only created by certain PDF producers. We have analyzed the coding style of 900 PDF files produced using 11 PDF producers on 3 different Operating Systems. We have obtained a set of 192 rules which can be used to identify 11 PDF producers. We have tested our detection tool on 508836 PDF files published on scientific preprints servers. Our tool is able to detect certain producers with an accuracy of 100%. Its overall detection is still high (74%). We were able to apply our tool to identify how online PDF services work and to spot inconsistency.
