Oreo: detection of clones in the twilight zone

Saini, Vaibhav; Farmahinifarahani, Farima; Lu, Yao; Baldi, Pierre; Lopes, Cristina Videira

doi:10.1145/3236024.3236026

Cited by 138 publications

(99 citation statements)

References 56 publications

“…BigCloneBench is a benchmark which contains different types of manually validated clones in the repository IJaDataset-2.0 [21] and it defines clone types by syntactic similarity as described in Section II. The framework BigCloneEval [22] summarizes recall performance for different clone types of clone detectors automatically and it is widely used in previous work [4], [6]. We configured the BigCloneEval with minimum clone size 6 lines and 50 tokens which are consistent with the standard minimum clone size.…”

Section: Discussion (mentioning)

confidence: 99%

“…Deckard [5] builds the characteristic vectors from abstract syntax tree (AST) to detect clones, but suffers from low precision and recall rate. Deep learning methods such as Oreo [6] encode software metrics into semantic vectors and achieve good results, but they mainly focus on semantic clones. For these considerations, we present a tool aimed at detecting large-variance code clones called LVMapper.…”

Section: Introduction (mentioning)

confidence: 99%

LVMapper: A Large-Variance Clone Detector Using Sequencing Alignment Approach

Wang, Yin, et al. 2020

IEEE Access

To detect large-variance code clones (i.e. clones with relatively more differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. It will make promotion and changes to some software applications such as bug detection, code completion, software analysis, etc. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications called large-gap clones. Our contribution is to develop a novel and effective detection approach of large-variance clones to more general cases for not only the concentrated code modifications but also the scattered code modifications. A detector named LVMapper is proposed, borrowing and changing the approach of sequencing alignment in bioinformatics which can find two similar sequences with more differences. The ability of LVMapper was tested on both self-synthetic datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools including CCAligner. Furthermore, our new tool also presents good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.
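
The abstract above says LVMapper borrows and changes sequence alignment from bioinformatics, but it does not spell out the algorithm. As a rough illustration of the general idea only, the sketch below runs a textbook Smith-Waterman local alignment over two token sequences. The function name, the scoring values, and the token lists are assumptions made for this example and are not taken from LVMapper.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between token lists a and b.

    Scoring values are illustrative assumptions, not LVMapper's parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # skip a token of a
                curr[j - 1] + gap,   # skip a token of b
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical token sequences that differ in one identifier.
left = ["int", "x", "=", "a", "+", "b", ";"]
right = ["int", "y", "=", "a", "+", "b", ";"]

print(local_alignment_score(left, left))   # 14: seven matches
print(local_alignment_score(left, right))  # 11: six matches, one mismatch
```

A high score relative to the sequence lengths suggests the two fragments share a long, mostly matching region even when edits are scattered, which is the intuition behind alignment-based clone detection.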

“…More details about these subcategories can be found elsewhere [19]. Action Token: Action tokens of a method are the tokens corresponding to the methods called and class fields accessed by that method [8]. Additionally, the array accesses made by a method are also special Action tokens namely ArrayAccess and ArrayAccessBinary, where array access of kind arr[i] is an ArrayAccess Action token and arr[i+1] is an ArrayAccessBinary Action token.…”

Section: A Definitions (mentioning)

confidence: 99%
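
The Action token definition quoted above can be illustrated with a small sketch. It is a regex-based approximation written for this example, not the extractor used by Oreo or by the citing paper. A real implementation would work on a parsed Java AST and would also record class field accesses, which this sketch omits. The names `action_tokens`, `CALL_RE`, and `INDEX_RE` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical approximation: a real extractor would use a Java parser.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")             # name followed by "("
INDEX_RE = re.compile(r"\b[A-Za-z_]\w*\s*\[([^\[\]]*)\]")  # name[index]
BINARY_OP_RE = re.compile(r"[+\-*/%]")                     # operator inside the index
KEYWORDS = {"if", "for", "while", "switch", "return", "catch", "new"}


def action_tokens(method_body: str) -> Counter:
    """Count called method names and the two special array-access tokens."""
    tokens = Counter()
    for name in CALL_RE.findall(method_body):
        if name not in KEYWORDS:  # control-flow keywords are not calls
            tokens[name] += 1
    for index_expr in INDEX_RE.findall(method_body):
        # arr[i] -> ArrayAccess, arr[i+1] -> ArrayAccessBinary
        binary = BINARY_OP_RE.search(index_expr) is not None
        tokens["ArrayAccessBinary" if binary else "ArrayAccess"] += 1
    return tokens


# Hypothetical Java method body used only to exercise the sketch.
body = "int s = arr[i] + arr[i+1]; total = Math.max(s, limit); log(total);"
print(action_tokens(body))
```

For the sample body, the sketch counts one `ArrayAccess`, one `ArrayAccessBinary`, and one call each to `max` and `log`.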

“…Hence, we use 24 method level software metrics shown in Table I for Type II resolution. The details of these metrics can be found elsewhere [8], [23]. A detailed explanation about the application of Action tokens and software metrics in clone detection can be found in [8].…”

Section: Automatic Resolution Of Type II Clones (mentioning)

confidence: 99%

Towards Automating Precision Studies of Clone Detectors

Saini, Farmahinifarahani, et al. 2019

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

Self Cite

Current research in clone detection suffers from poor ecosystems for evaluating precision of clone detection tools. Corpora of labeled clones are scarce and incomplete, making evaluation labor intensive and idiosyncratic, and limiting intertool comparison. Precision-assessment tools are simply lacking. We present a semi-automated approach to facilitate precision studies of clone detection tools. The approach merges automatic mechanisms of clone classification with manual validation of clone pairs. We demonstrate that the proposed automatic approach has a very high precision and it significantly reduces the number of clone pairs that need human validation during precision experiments. Moreover, we aggregate the individual effort of multiple teams into a single evolving dataset of labeled clone pairs, creating an important asset for software clone research.

Nearest‐neighbor, BERT‐based, scalable clone detection: A practical approach for large‐scale industrial code bases

Ahmed, Patten, Han, et al. 2024

Softw Pract Exp

Hidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We present SSCD, a BERT‐based clone detection approach that targets high recall of Type 3 and Type 4 clones at a very large scale (in line with our industrial partner's requirements). It computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbor search. Thus, SSCD avoids the pairwise‐comparison bottleneck of other neural network approaches, while also using a parallel, GPU‐accelerated search to tackle scalability. This article describes the approach, proposing and evaluating several refinements to improve Type 3/4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison of the approach against SourcererCC and Oreo, the only other neural‐network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in‐situ evaluation on our industrial collaborator's code base that assesses the original technique, the impact of the proposed refinements and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU‐accelerated traditional clone detection approach, is a little better than SSCD for T1/T2 clones, but substantially worse for T3/T4 clones. Thus, SSCD is both scalable to industrial code sizes, and comparatively more accurate than existing approaches for difficult T3/T4 clone searching. 
In‐situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD effectiveness.
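
The abstract above describes computing one embedding per code fragment and then finding similar fragments with a nearest neighbor search. The sketch below shows that retrieval step in its simplest form. It assumes the embeddings already exist (the toy vectors stand in for BERT outputs), it uses a brute-force cosine-similarity scan in place of SSCD's parallel, GPU-accelerated search, and the function names, the `k` parameter, and the `threshold` value are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest_neighbors(embeddings, k=1, threshold=0.9):
    """Map each fragment id to its top-k neighbors at or above the threshold.

    Brute-force scan for illustration only; an indexed or GPU-backed search
    would replace the inner loop at scale.
    """
    unit = {fid: normalize(vec) for fid, vec in embeddings.items()}
    result = {}
    for fid, vec in unit.items():
        scored = [
            (other, sum(a * b for a, b in zip(vec, unit[other])))
            for other in unit
            if other != fid
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[fid] = [(o, s) for o, s in scored[:k] if s >= threshold]
    return result


# Toy embeddings (assumed values), keyed by hypothetical fragment ids.
toy = {
    "A.sum": [1.0, 0.0, 0.2],
    "B.total": [0.9, 0.1, 0.25],
    "C.parse": [0.0, 1.0, 0.0],
}
hits = nearest_neighbors(toy)
print(hits)
```

In this toy run, `A.sum` and `B.total` are reported as each other's nearest neighbor, while `C.parse` has no neighbor above the threshold and so yields no clone candidate.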
