A Dataset for GitHub Repository Deduplication: Extended Description

Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris

doi:10.48550/arxiv.2002.02314

Cited by 1 publication

(1 citation statement)

References 0 publications

“…It has three stages: project discovery, data retrieval, and reorganization as shown in Figure 1, which is typical of most big data systems, that use the layered data approach where the initial layers accumulate and process raw data and the later layers produce cleaned/augmented data. We also perform data augmentation on the collected data, focusing on tasks like fork resolution [67] and author identity resolution [6,35]. The paper describes a rapidly evolving WoC prototype with some aspects of the system evolving over time.…”

Section: Building the Woc Infrastructure (mentioning)

confidence: 99%

World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data

Bogart, Amreen, et al. 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Self Cite

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.
