CCLearner: A Deep Learning-Based Clone Detection Approach

Li, Liuqing; Feng, He; Zhuang, Wenjie; Meng, Na; Ryder, Barbara G.

doi:10.1109/icsme.2017.46

Cited by 157 publications

(101 citation statements)

References 29 publications


“…The embedding vectors are concatenated and then used to compute a prediction score for the patch. Different from existing deep learning techniques working on the source code [16], [17], [24], [36], [44], [66], [68], our hierarchical deep learning-based architecture takes into account the structure of code changes (i.e., files, hunks, lines) and the sequential nature of source code (by considering each line of code as a sequence of words) to predict stable patches in the Linux kernel.…”

Section: Results (mentioning)

confidence: 99%


PatchNet: Hierarchical Deep Learning-Based Stable Patch Identification for the Linux Kernel

Hoang, Lawall, Tian, et al. 2021

IEEE Trans. Software Eng.


Linux kernel stable versions serve the needs of users who value stability of the kernel over new features. The quality of such stable versions depends on the initiative of kernel developers and maintainers to propagate bug fixing patches to the stable versions. Thus, it is desirable to consider to what extent this process can be automated. A previous approach relies on words from commit messages and a small set of manually constructed code features. This approach, however, shows only moderate accuracy. In this paper, we investigate whether deep learning can provide a more accurate solution. We propose PatchNet, a hierarchical deep learning-based approach capable of automatically extracting features from commit messages and commit code and using them to identify stable patches. PatchNet contains a deep hierarchical structure that mirrors the hierarchical and sequential structure of commit code, making it distinctive from the existing deep learning models on source code. Experiments on 82,403 recent Linux patches confirm the superiority of PatchNet against various state-of-the-art baselines, including the one recently-adopted by Linux kernel maintainers.


“…Learning code representation. CCLearner [44] learns a deep neural network classifier from clone pairs and non clone pairs to detect clones. To represent code, it extracts features based on different categories (reserved words, operators, etc.)…”

Section: Related Work (mentioning)

confidence: 99%
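The category-based representation described in the statement above can be illustrated with a minimal sketch. It is not CCLearner's implementation: the reserved-word list, the four categories, the tokenizer regex, and the similarity formula are all assumptions chosen for illustration. The idea shown is that each method pair is reduced to one similarity score per token category, and the resulting vector is what a classifier would be trained on with clone and non-clone labels.

```python
import re
from collections import Counter

# Hypothetical, abbreviated reserved-word list (assumption, not from the paper).
RESERVED = {"if", "else", "for", "while", "return", "int", "new", "class"}
# Simplistic tokenizer: identifiers/keywords, integer literals, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]")
CATEGORIES = ("reserved", "identifier", "literal", "operator")


def categorize(code):
    """Count token occurrences separately for each category."""
    counts = {name: Counter() for name in CATEGORIES}
    for tok in TOKEN_RE.findall(code):
        if tok in RESERVED:
            counts["reserved"][tok] += 1
        elif tok[0].isdigit():
            counts["literal"][tok] += 1
        elif tok[0].isalpha() or tok[0] == "_":
            counts["identifier"][tok] += 1
        else:
            counts["operator"][tok] += 1
    return counts


def category_similarity(a, b):
    """Return 1 minus the normalized frequency difference (assumed formula)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # neither snippet uses this category
    diff = sum(abs(a[t] - b[t]) for t in set(a) | set(b))
    return 1.0 - diff / total


def feature_vector(code_a, code_b):
    """One similarity score per category; input for a pair classifier."""
    ca, cb = categorize(code_a), categorize(code_b)
    return [category_similarity(ca[name], cb[name]) for name in CATEGORIES]


same = feature_vector("int a = 1;", "int a = 1;")
renamed = feature_vector("int a = 1;", "int b = 1;")
```

In this sketch, renaming a variable lowers only the identifier score and leaves the other categories untouched, which is the kind of signal a classifier can learn to weigh when deciding whether a pair is a clone.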


“…However, it lacks the analysis of the syntax and semantics of the code, and the detection effect of type-3 and type-4 cloning is not ideal. At present, the Token-based code clone detection method mainly include CCFinder [4], CP-Miner [21], CCAligner [22], and CCLearner [23].…”

Section: Research On Token-based Detection Methods (mentioning)

confidence: 99%

Study of Clone Code Detection Method

Wang, Li, Hou 2019

Proceedings of the 3rd International Conference on Computer Engineering, Information Science & Application Technology (ICCI


In the process of software development and maintenance, developers often use "copy-paste" or use the development framework, so that a large number of clone codes appear in the software system. In order to eliminate clone code and reduce the negative effects of clone code, researchers have proposed many excellent methods for clone code detection. This paper first introduces the significance of the research on clone code detection, and sorts out the related concepts of clone code detection. Then, according to the clone code detection technology, the existing techniques can be categorized to Text-based, Token-based, Tree-based, Metric-based and Graph-based categories. The five categories, respectively introduce the corresponding existing methods or tools, and summarize their advantages and disadvantages. Finally summarize the key problems in the current clone code detection research.


“…The benchmark contains 2.9 million files with 8 million manually validated clone pairs of Type-1 up to Type-4. The BigCloneBench data set was used for clone evaluation and scalability test in several large-scale clone detection and clone search studies (Li et al, 2017; Sajnani et al, 2016; Svajlenko and Roy, 2015). Lastly, for the evaluation of Siamese's incremental update, we relied on a set of publicly available 130,719 GitHub Java projects.…”

Section: Data Sets (mentioning)

confidence: 99%

Siamese: scalable and incremental code clone search via multiple code representations

Ragkhitwetsagul, Krinke 2019

Empir Software Eng


This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of million lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases including online code clone detection and clone search with automated license analysis.

