2017
DOI: 10.1145/3133908

DéjàVu: a map of code duplicates on GitHub

Abstract: Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication…

Citations: cited by 147 publications (95 citation statements)
References: 25 publications
“…It has been shown that file duplication among GitHub projects, mainly targeting popular libraries copied into many projects, is a common phenomenon [41], which we also observed for the projects we analyzed (see Table 1). For example, many projects contained sqlite3.c, which corresponds to the database with the same name (which uses the rdtsc instruction), SDL_endian.h for the SDL library (which uses inline assembly for endianness conversions), and inffas86.c (which implements a compression algorithm using inline assembly).…”
Section: Rq22
Citation type: supporting
confidence: 68%
“…To filter duplicates, we used file names, directory names (such as "node_modules"), and md5 of files. In Java and Python, which do not commit dependencies, duplication is less severe (as also observed by Lopes et al. [29]). Furthermore, in our setting, we took the top-ranked and most popular projects, in which we observed duplication to be less of a problem (Lopes et al. [29] measured duplication across all the code in GitHub).…”
Section: Experimental Setting
Citation type: mentioning
confidence: 52%
“…Java required an order of magnitude more data than the other languages: we had to keep enlarging our Java dataset to achieve results that were close to the other languages. Following recent work which found a large amount of code duplication in GitHub [29], we devoted much effort into filtering duplicates from our dataset, and especially the JavaScript dataset. To filter duplicates, we used file names, directory names (such as "node_modules"), and md5 of files.…”
Section: Experimental Setting
Citation type: mentioning
confidence: 99%
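The filtering step quoted in this statement (file names, directory names such as "node_modules", and md5 of files) can be illustrated with a minimal sketch. This is not the citing authors' implementation: the function names, the `skip_dirs` and `skip_file_names` parameters, and the rule of keeping the first file seen per hash are all assumptions made for illustration.

```python
import hashlib
import os
from typing import Iterable, List, Set


def md5_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(
    root: str,
    skip_dirs: Iterable[str] = ("node_modules",),  # assumed default
    skip_file_names: Iterable[str] = (),           # hypothetical name filter
) -> List[str]:
    """Walk `root` and keep one path per distinct file content.

    Directories named in `skip_dirs` are pruned, files named in
    `skip_file_names` are ignored, and among files with the same MD5
    only the first one encountered is kept (an assumed tie-break).
    """
    skip_dir_set = set(skip_dirs)
    skip_name_set = set(skip_file_names)
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending further.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dir_set)
        for name in sorted(filenames):
            if name in skip_name_set:
                continue
            path = os.path.join(dirpath, name)
            file_hash = md5_of_file(path)
            if file_hash in seen_hashes:
                continue  # exact duplicate of a file already kept
            seen_hashes.add(file_hash)
            kept.append(path)
    return kept
```

Calling `unique_files("path/to/checkout")` returns the surviving paths. Hashing whole files only catches byte-identical copies; files that differ by whitespace or comments would need a token-level comparison instead.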
“…We trained models at all three levels of import granularity (i.e., class, package and library-level) and could clearly observe that, for the task of clustering libraries based on semantic similarity, package- and library-level imports produced better non-trivial clusters. In this work we therefore focus on library-level imports.…”
Section: B Assessing the Quality Of The Trained Models
Citation type: mentioning
confidence: 98%