Hitesh Sajnani scite author profile

Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication, only 6% of the files are distinct. Java, on the other hand, has the least duplication, 60% of files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DéjàVu, a publicly available map of code duplicates in GitHub repositories. CCS Concepts: • Information systems → Near-duplicate and plagiarism detection; • Software and its engineering → Ultra-large-scale systems;

SourcererCC: Scalable and Accurate Clone Detection

Sajnani, Saini, Roy, et al. 2021

A study on the lifecycle of flaky tests

Lam, Muşlu, Sajnani, et al. 2020

File cloning in open source Java projects: The good, the bad, and the ugly

Ossher, Sajnani, Lopes. 2011

SourcererCC

et al. 2016

Despite a decade of active research, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.
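The idea of an inverted index combined with a token-ordering filter can be illustrated with a small sketch. This is not SourcererCC's implementation: it assumes each code block is reduced to a set of tokens (rather than a bag), that two blocks count as clones when they share at least a fraction `theta` of the larger block's tokens, and that the filter is standard prefix filtering over tokens sorted rarest-first. The names `prefix_length`, `detect_clones`, and `theta` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def prefix_length(n_tokens, theta):
    """Number of leading tokens to index for a block of n_tokens tokens.

    Under a global token ordering, two blocks that share at least
    ceil(theta * max(size)) tokens must share a token in these prefixes.
    """
    return n_tokens - math.ceil(theta * n_tokens) + 1


def detect_clones(blocks, theta=0.8):
    """Return a set of (earlier_id, later_id) pairs judged to be clones.

    blocks: dict mapping a block id to its list of tokens (assumed input shape).
    """
    # Global document frequency of each token, used to order tokens rarest-first.
    freq = Counter(tok for toks in blocks.values() for tok in set(toks))
    ordered = {
        bid: sorted(set(toks), key=lambda t: (freq[t], t))
        for bid, toks in blocks.items()
    }

    index = defaultdict(list)  # inverted index: token -> ids of blocks indexed so far
    clones = set()
    for bid in sorted(ordered):
        toks = ordered[bid]
        if not toks:
            continue
        prefix = toks[:prefix_length(len(toks), theta)]

        # Query the index with prefix tokens only to collect candidates.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verify each candidate with a full token-overlap comparison.
        tokset = set(toks)
        for other in candidates:
            other_set = set(ordered[other])
            needed = math.ceil(theta * max(len(tokset), len(other_set)))
            if len(tokset & other_set) >= needed:
                clones.add((other, bid))

        # Only the prefix is indexed, which keeps the index small.
        for tok in prefix:
            index[tok].append(bid)
    return clones
```

In this sketch only the prefix tokens of each block enter the index, so both the index size and the number of block-to-block comparisons shrink, while the final overlap check decides whether a candidate is actually reported.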

Hitesh Sajnani

DéjàVu: a map of code duplicates on GitHub
