2009 1st International Symposium on Search Based Software Engineering
DOI: 10.1109/ssbse.2009.18
On the Use of Discretized Source Code Metrics for Author Identification


Cited by 25 publications (26 citation statements) | References 18 publications
“…As discussed in Section 2.2.2, the seven classifier algorithms represented by the machine classifier approaches are case-based reasoning [15], decision trees [18], discriminant analysis variants [12,14,15], nearest-neighbour search [17,19], neural networks [15], Bayesian networks [16], and voting feature intervals [16]. These approaches were published between 1994 [3] and 2009 [19] using either custom-built programs or off-the-shelf software. Our implementation uses the closest available classifier in the Weka machine learning toolkit [41] for each classifier algorithm identified in the literature, as listed in Table VII. In all cases, we used the default Weka parameters for the chosen classifiers, except for the k-nearest-neighbour classifier, which defaults to k = 1; there we used k = 20, which represents 33% of the instances for one run on COLL-T and a lower proportion for the other collections.…”
Section: Machine Learning Algorithms in Weka
confidence: 99%
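The nearest-neighbour classification the statement above describes can be illustrated with a minimal from-scratch sketch (a majority vote over the k closest training instances); the toy metric vectors and labels are invented for illustration and are not the Weka `IBk` implementation or the data the citing authors used:

```python
from collections import Counter

def knn_classify(train, query, k):
    """k-nearest-neighbour majority vote.
    `train` is a list of (feature_vector, author) pairs; the query is
    assigned the most common author among its k closest neighbours."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(author for _, author in nearest).most_common(1)[0][0]

# Hypothetical discretized source-code-metric vectors for two authors.
train = [([1, 1], "A"), ([1, 2], "A"), ([2, 1], "A"),
         ([8, 8], "B"), ([8, 9], "B"), ([9, 8], "B")]
print(knn_classify(train, [2, 2], k=3))  # A
print(knn_classify(train, [8, 7], k=3))  # B
```

Choosing k as a fixed fraction of the training set (as the citing authors did with k = 20 for 60 instances) trades sensitivity to individual samples against smoothing over author classes.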
“…Then, statistical analysis, machine learning, or similarity measurement methods are used to classify work samples. This paper considers the machine classifier contributions of Krsul, MacDonell, Ding, Kothari, Lange, Elenbogen, and Shevertalov.…”
Section: Introduction
confidence: 99%
“…In reported results, the system achieved an accuracy of 60% for range-based discretization, 70% for frequency-based discretization, and 65% with no discretization. In [9], the author used a variety of metrics, including the number of occurrences of each data type, the cyclomatic complexity, the quantity and quality of comments, the types of variables, and the layout of the code. They are also working on the IDENTIFIED toolkit for automatic extraction of these metrics.…”
Section: Related Work
confidence: 99%
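The two discretization schemes whose accuracies the statement above compares, range-based (equal-width bins) and frequency-based (equal-depth bins), can be sketched as follows; the metric values and bin counts are illustrative, not taken from the cited experiments:

```python
def equal_width_bins(values, n_bins):
    """Range-based discretization: split [min, max] into n_bins
    equal-width intervals and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # guard against a constant metric
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Frequency-based discretization: rank the values so that each
    bin receives (roughly) the same number of observations."""
    order = sorted(range(len(values)), key=values.__getitem__)
    per_bin = len(values) / n_bins
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = min(int(rank / per_bin), n_bins - 1)
    return bins

metrics = [1, 2, 3, 4, 100]              # one outlier skews the range
print(equal_width_bins(metrics, 2))      # [0, 0, 0, 0, 1]
print(equal_frequency_bins(metrics, 2))  # [0, 0, 0, 1, 1]
```

The example shows why the two schemes can yield different accuracies: a single outlying metric value collapses almost everything into one equal-width bin, while equal-frequency binning keeps the bins balanced.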
“…In the research process, the code of each author was processed by feature extraction and SVM training. SVM's powerful pattern-recognition capabilities [12] are used to detect software homology. This provides effective help for malware forensics (author tracking) and for resolving copyright disputes [13].…”
Section: SVM (Support Vector Machine Theory)
confidence: 99%
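The feature-extraction-plus-SVM pipeline described above can be sketched with a from-scratch linear SVM trained by hinge-loss sub-gradient descent, a simplified stand-in for the SVM library a real study would use; the two "style" features and the data are invented for illustration:

```python
def train_linear_svm(X, y, epochs=500, lam=0.01, lr=0.1):
    """Minimal linear SVM: sub-gradient descent on the hinge loss
    with L2 regularization. X: feature vectors, y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: push towards this point
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # satisfied: only apply the regularizer
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical per-author style features, e.g. (comment ratio,
# normalized mean identifier length), one class per author.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [-1, -1, 1, 1]
```

On this linearly separable toy set the trainer recovers a separating hyperplane; a production authorship study would instead extract many metric features per sample and use a tuned SVM implementation.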