Syntax tree fingerprinting for source code similarity detection

Chilowicz, Michel; Duris, Étienne; Roussel, Gilles

doi:10.1109/icpc.2009.5090050

Cited by 70 publications

(36 citation statements)

References 22 publications

(19 reference statements)


“…Abstract syntax tree representations could allow more sophisticate patterns of pre-processing of the representation for better abstraction and normalization of the code, a topic that has been neglected in this article. We are investigating some new techniques in this way [21,32,23] that could also consider the function call graphs of the projects. for the computed similarity metrics between the original project and the obfuscated versions.…”

Section: Results (mentioning)

confidence: 99%

“…It explains the choice of a suffix array as an indexing structure rather than a suffix tree. As introduced by [18], some tools, like CCFinderX [19] or Phoenix [20], have successfully used suffix indexation structures to find duplication in source code using a tokenized form or sibling abstracted syntax sub-trees [21].…”

Section: Studying the Factorized Graph Nodes And Its Inferred Metrics (mentioning)

confidence: 99%


Viewing functions as token sequences to highlight similarities in source code

Chilowicz

Duris

Roussel

2013

Science of Computer Programming

Self Cite


The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. This latter can be a challenging problem since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique not only based on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of codes to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities with their nesting relations. It is useful to infer metrics quantifying similarity at a function level.


“…Each source code is converted into a parse tree and its contents are translated into token sequence by applying inorder traversal. Chilowicz et al [40] also incorporates parse-tree approach. Yet, their work generates token sequence based on fingerprinting mechanism instead of inorder traversal.…”

Section: Related Work (mentioning)

confidence: 99%
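The contrast drawn in the statement above, between traversal-order token sequences and fingerprints computed on the tree itself, can be illustrated with a minimal sketch. It is not the method of Chilowicz et al.; it is a generic bottom-up subtree hashing scheme, assuming Python's standard `ast` module as the parser. The function names `subtree_fingerprints` and `shared_subtrees` and the `min_size` threshold are hypothetical choices made for illustration.

```python
import ast
import hashlib


def subtree_fingerprints(source: str, min_size: int = 3) -> dict[str, int]:
    """Map fingerprint -> node count for every subtree with at least min_size nodes.

    Assumption: only node *types* are hashed, so identifier names and
    literal values are abstracted away (renaming a variable changes nothing).
    """
    found: dict[str, int] = {}

    def visit(node: ast.AST) -> tuple[str, int]:
        # Fingerprint the children first, then combine them with the node type.
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        payload = type(node).__name__ + "(" + ",".join(fp for fp, _ in children) + ")"
        fingerprint = hashlib.sha1(payload.encode()).hexdigest()
        size = 1 + sum(child_size for _, child_size in children)
        if size >= min_size:
            found[fingerprint] = size
        return fingerprint, size

    visit(ast.parse(source))
    return found


def shared_subtrees(a: str, b: str, min_size: int = 3) -> int:
    """Count distinct subtree fingerprints that occur in both sources."""
    return len(subtree_fingerprints(a, min_size).keys() & subtree_fingerprints(b, min_size).keys())
```

Because each fingerprint depends only on the node type and the fingerprints of its children, two functions that differ only in identifier names produce the same set of fingerprints, while changing an operator alters the fingerprint of every enclosing subtree.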

“…Several works incorporate additional preprocessing to generate more declarative lexical token sequence [38,39,40]. Chilowicz et al [38] incorporates function factorization when generating lexical token sequence.…”

Section: Related Work (mentioning)

confidence: 99%

Detecting Source Code Plagiarism on .NET Programming Languages using Low-level Representation and Adaptive Local Alignment

Rabbani

Karnalim

2017

JIOS


Even though there are various source code plagiarism detection approaches, only a few works which are focused on low-level representation for deducting similarity. Most of them are only focused on lexical token sequence extracted from source code. In our point of view, low-level representation is more beneficial than lexical token since its form is more compact than the source code itself. It only considers semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection which rely on low-level representation. For a case study, we focus our work on .NET programming languages with Common Intermediate Language as its low-level representation. In addition, we also incorporate Adaptive Local Alignment for detecting similarity. According to Lim et al, this algorithm outperforms code similarity state-of-the-art algorithm (i.e. Greedy String Tiling) in term of effectiveness. According to our evaluation which involves various plagiarism attacks, our approach is more effective and efficient when compared with standard lexical-token approach.


“…Traditional machine learning approaches largely depend on human feature engineering, e.g., [17] for bug detection, [18] for clone detection. Such feature engineering is label-consuming and ad hoc to a specific task.…”

Section: Motivation a From Machine Learning To Deep Learning (mentioning)

confidence: 99%

Building Program Vector Representations for Deep Learning

Peng

Mou

et al.

2015

Knowledge Science, Engineering and Management


Deep learning has made significant breakthroughs in various fields of artificial intelligence. Advantages of deep learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, it is still virtually impossible to use deep learning to analyze programs since deep architectures cannot be trained effectively with pure back propagation. In this pioneering paper, we propose the "coding criterion" to build program vector representations, which are the premise of deep learning for program analysis. Our representation learning approach directly makes deep learning a reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based on the experiments, the coding criterion is successful in building program representations. To evaluate whether deep learning is beneficial for program analysis, we feed the representations to deep neural networks, and achieve higher accuracy in the program classification task than "shallow" methods, such as logistic regression and the support vector machine. This result confirms the feasibility of deep learning to analyze programs. It also gives primary evidence of its success in this new field. We believe deep learning will become an outstanding technique for program analysis in the near future.
