How are functionally similar code clones syntactically different? An empirical study and a benchmark

Wagner, Stefan; Abdulkhaleq, Asim; Bogicevic, Ivan; Ostberg, Jan-Peter; Ramadani, Jasmin

doi:10.7717/peerj-cs.49

Cited by 18 publications

(10 citation statements)

References 24 publications


“…We sampled 10 solutions from 30 different problems for a total of 44,850 function pairs. Since these solutions all passed the automated test suite from Google they can be considered as type-4 clones [18], [19]. Out of all possible pairs, 1,350 are clones (solutions which were submitted to the same problem).…”

Section: Methods (mentioning)

confidence: 99%
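The pair counts in the statement above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the solution identifiers are hypothetical placeholders, and it assumes, as the quote says, that a pair counts as a clone exactly when both solutions were submitted to the same problem.

```python
from itertools import combinations
from math import comb

PROBLEMS = 30               # number of problems sampled (from the quote)
SOLUTIONS_PER_PROBLEM = 10  # solutions sampled per problem (from the quote)

# Hypothetical identifiers: (problem index, solution index).
solutions = [
    (problem, solution)
    for problem in range(PROBLEMS)
    for solution in range(SOLUTIONS_PER_PROBLEM)
]

# Every unordered pair of distinct solutions.
all_pairs = list(combinations(solutions, 2))

# A pair is treated as a clone when both solutions target the same problem.
clone_pairs = [(a, b) for a, b in all_pairs if a[0] == b[0]]

total_pairs = len(all_pairs)
total_clones = len(clone_pairs)

# Closed-form cross-check of the enumeration.
expected_pairs = comb(PROBLEMS * SOLUTIONS_PER_PROBLEM, 2)
expected_clones = PROBLEMS * comb(SOLUTIONS_PER_PROBLEM, 2)

print(total_pairs, total_clones)
```

Under these assumptions, clone pairs are about 3% of all pairs, so a detector evaluated on this sample faces a heavily imbalanced classification task.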


Improving Syntactical Clone Detection Methods through the Use of an Intermediate Representation

Caldeira

Sakamoto

Washizaki

et al. 2020

2020 IEEE 14th International Workshop on Software Clones (IWSC)


Detection of type-3 and type-4 clones remains a difficult task. Current methods are complex, both on a conceptual and computational level. Similarly, their usage requires substantial implementation efforts. Instead of creating yet another method, it might be more productive to combine the simplicity of syntactic approaches with the abstractions granted by intermediate representations (IR). To this end, we devised a C-like IR based on LLVM and ran NiCad on it (LLNiCad). To establish whether the clone detection capabilities of syntactic approaches can be improved through an IR, we compared NiCad and LLNiCad on three open source projects taken from Krutz's benchmark and a subset of Google Code Jam (GCJ) solutions. In our results, the f1-score of LLNiCad consistently outperforms that of NiCad. Indeed, for all clone types in Krutz's benchmark, LLNiCad has an f1-score that is 37% higher than NiCad's, with both better precision and recall. For type-4 clones in our GCJ benchmark, the f1-score of LLNiCad also outperforms that of CCCD (a semantic clone detector) by 44%. These findings suggest that IRs are beneficial for improving clone detection and that they have a larger impact on type-3 and type-4 clones.



“…• Tools built with type-1, and -2 in mind use syntactic [5] or lexical similarities [3] to detect clones. By definition, these methods cannot detect semantic similarities if the syntax used to implement them is different [19]. Their performance for type-4 clones is thus lackluster.…”

Section: Introduction (mentioning)

confidence: 99%



“…PDG based methods can detect complex Type 3 clones, e.g., Listings 1 and 2. However, the compared PDG sub-graphs are a representation of the source code; thereby, the approaches still rely on syntactic similarity [43].…”

Section: Related Work (mentioning)

confidence: 99%

Towards Semantic Clone Detection via Probabilistic Software Modeling

Thaller

Linsbauer

Egyed

2020

2020 IEEE 14th International Workshop on Software Clones (IWSC)


Semantic clone detection is the process of finding program elements with similar or equal runtime behavior. An example is detecting the semantic equality between the recursive and iterative implementations of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. This boundary was tested over the last years with interesting new approaches. This work contributes a semantic clone detection approach that detects clones with 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements to find behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. It uses the likelihood between model elements as a distance metric. Then, it employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The output of SCD-PSM consists of pairs of program elements (i.e., methods), their distance, and a decision on whether they are clones or not. SCD-PSM yields excellent results with a Matthews Correlation Coefficient greater than 0.9. These results are obtained on classical semantic clone detection problems such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.
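The factorial example named in the abstract can be made concrete. The sketch below is not SCD-PSM and does not use probabilistic models; it only shows two implementations that share almost no syntax yet agree at runtime, with a naive input-output comparison standing in for a real detector. The function names and the helper behave_equally are assumptions made for illustration.

```python
from typing import Callable, Iterable


def factorial_recursive(n: int) -> int:
    # Recursive formulation: n! = n * (n - 1)!, with 0! = 1! = 1.
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_iterative(n: int) -> int:
    # Iterative formulation: multiply the running product by 2..n.
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def behave_equally(
    f: Callable[[int], int],
    g: Callable[[int], int],
    inputs: Iterable[int],
) -> bool:
    # Hypothetical helper: compares observed outputs on sample inputs only.
    # Agreement on a sample suggests, but does not prove, semantic equality.
    return all(f(x) == g(x) for x in inputs)


same_behavior = behave_equally(factorial_recursive, factorial_iterative, range(20))
print(same_behavior)
```

A text-, token-, or syntax-based detector sees a conditional expression with a self-call in one function and a loop with an accumulator in the other, which is why such pairs fall outside what those detectors can find.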


“…Finally, Wagner et al. found that less than 16% of FSC pairs have actual syntactic similarities [3]. They provide a benchmark for FSCs which has, to the best of our knowledge, not yet been used to test FSC detection approaches.…”

Section: Related Work (mentioning)

confidence: 99%

“…The problem with classic approaches based on text, tokens, or syntax is that they cannot find clones with a completely different structure. We include those in the so called Functionally Similar Clones (FSCs) [3]. FSCs have the same or similar functionality but were generally created independently.…”

Section: Introduction (mentioning)

confidence: 99%

Are there functionally similar code clones in practice?

Käfer

Wagner

Koschke

2018

2018 IEEE 12th International Workshop on Software Clones (IWSC)

Self Cite


Having similar code fragments, also called clones, in software systems can lead to unnecessary comprehension, review and change efforts. Syntactically similar clones can often be encountered in practice. The same is not clear for only functionally similar clones (FSCs). We conducted an exploratory survey among developers to investigate whether they encounter functionally similar clones in practice and whether their inclination to remove them differs from their inclination to remove syntactically similar clones. Of the 34 developers answering the survey, 31 have experienced FSCs in their professional work, and 24 have experienced problems caused by FSCs. We found no difference in the inclination and reasoning for removing FSCs and syntactically similar clones. FSCs exist in practice and should be investigated to bring clone detectors to the same quality as for syntactically similar clones, because being able to detect them allows developers to manage and potentially remove them.

