Proceedings of the 17th International Conference on Mining Software Repositories 2020
DOI: 10.1145/3379597.3387499

A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits

Abstract: To understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that allows easy creation of clones, resulting in numerous repositories that are almost entirely based on some parent repository from which they were cloned. Git commits are unlikely to be produced independently in unrelated repositories, so shared commits offer a way to group cloned repositories. We use World of …

Citations: cited by 22 publications (10 citation statements)
References: 12 publications
“…This is part of our future work, together with the replication of the study on other sets of projects. In this sense, we are especially interested in studying how these observations evolve when moving to project ecosystems (Blincoe et al 2015; Mockus et al 2020) instead of single projects.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…The latest S version contains 9,192,143,411 unique blobs, 2,326,066,436 commits, and 135,162,320 distinct repositories collected from open source communities including GitHub, Bitbucket, and GitLab identified on August 28, 2020 and retrieved by September 18, 2020 [12]. A distinct repository is determined as the "most central" repository to represent a group of repositories found with the Louvain community detection algorithm [13]. By using distinct repositories, many cloned projects (forks) can be avoided when performing an analysis.…”
Section: Dockerfile Data
Citation type: mentioning
confidence: 99%
“…To address this, we use the dataset published in [46], which applies the Louvain community detection algorithm to a massive graph consisting of links between commits and projects in WoC (because two projects are highly unlikely to share the same exact commit unless they are clones). We leverage that work to combine commits from the forked projects and ensure that we do not count the same project-related information multiple times due to these forks/clones.…”
Section: A Data Source: World of Code
Citation type: mentioning
confidence: 99%
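
The citation statements above describe grouping repositories through shared commits and picking one repository to stand for each group. The following is a minimal sketch of that idea, not the paper's method: it substitutes plain connected components (union-find) for the Louvain community detection named in the statements, and it treats "most central" as "has the most commits", which is an assumption made only for illustration. The function names, repository names, and commit identifiers are all hypothetical.

```python
from collections import defaultdict


def group_repos_by_shared_commits(repo_commits):
    """Group repositories that are linked, directly or transitively,
    by at least one shared commit hash.

    Simplified stand-in (connected components via union-find),
    NOT the Louvain algorithm mentioned in the citation statements.
    """
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Walk up to the root, halving the path as we go.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    first_repo_with_commit = {}
    for repo, commits in repo_commits.items():
        for commit in commits:
            # Remember the first repository seen with this commit,
            # then merge the current repository into its group.
            owner = first_repo_with_commit.setdefault(commit, repo)
            root_a, root_b = find(owner), find(repo)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for repo in repo_commits:
        groups[find(repo)].add(repo)
    return list(groups.values())


def pick_representative(group, repo_commits):
    """Hypothetical proxy for the 'most central' repository:
    the one with the most commits, ties broken alphabetically."""
    return max(sorted(group), key=lambda repo: len(repo_commits[repo]))


# Hypothetical toy data: repository name -> set of commit hashes.
EXAMPLE = {
    "alice/tool": {"c1", "c2", "c3", "c4"},
    "bob/tool-fork": {"c1", "c2"},
    "carol/tool-fork": {"c3", "c9"},
    "dave/other": {"d1"},
}

if __name__ == "__main__":
    for group in group_repos_by_shared_commits(EXAMPLE):
        print(sorted(group), "->", pick_representative(group, EXAMPLE))
```

Connected components merge any two repositories joined by a chain of shared commits, so this sketch may group more aggressively than a community detection algorithm would on real data.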