2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) 2019
DOI: 10.1109/msr.2019.00032

Crossflow: A Framework for Distributed Mining of Software Repositories

Abstract: Large-scale software repository mining typically requires substantial storage and computational resources, and often involves a large number of calls to (rate-limited) APIs such as those of GitHub and StackOverflow. This creates a growing need for distributed execution of repository mining programs to which remote collaborators can contribute computational and storage resources, as well as API quotas (ideally without sharing API access tokens or credentials). In this paper we introduce CROSSFLOW, a novel frame…

Cited by 9 publications (8 citation statements)
References 4 publications

“…This paper extends the preliminary work in [9] and provides a more comprehensive description of Crossflow, focusing on new capabilities such as polyglot support. It also reports on empirical evaluation of Crossflow against an existing repository mining tool, using a case-study from the literature.…”
Section: Takedown (mentioning)
confidence: 72%
“…In [9], we presented a short (4-page) overview of an early version of Crossflow, a Java-based framework for development and distributed execution of multi-step software repository mining programs (workflows). This preliminary work focused on motivating the need for a tool facilitating distributed execution of software repository mining programs that would allow remote collaborators to contribute their local computational and storage resources.…”
Section: Takedown (mentioning)
confidence: 99%
“…From an implementation perspective, to deal with the size of repositories to be cloned (which could be in the order of GBs) and minimise the number of expensive repository cloning operations, it would be helpful to ensure that cloned repositories could be saved for later use without the need to re-download their contents. To reuse the downloaded resources, repeated computations involving the same files would be required to be allocated to the same worker nodes, namely the ones that already possess them. [Figure 1: MSR pipeline specified in Crossflow [12]] By reducing the download costs, especially for large resources such as GitHub repositories, we could see a significant increase in the speed of workflow execution.…”
Section: Motivating Example (mentioning)
confidence: 99%
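
The locality idea in the statement above (send repeated jobs on the same repository to a worker that already holds a clone) can be illustrated with a minimal sketch. This is not Crossflow's scheduler or API: the `Worker` class, the `allocate` and `run_jobs` functions, and the repository names are hypothetical, and falling back to the least-loaded worker is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker node that remembers which repositories it has cloned."""
    name: str
    cloned: set = field(default_factory=set)
    clone_count: int = 0  # number of (expensive) clone operations performed
    jobs_done: int = 0

    def run(self, repo: str) -> None:
        # Clone only if this worker does not already hold the repository.
        if repo not in self.cloned:
            self.cloned.add(repo)
            self.clone_count += 1
        self.jobs_done += 1


def allocate(repo: str, workers: list) -> Worker:
    """Prefer a worker that already holds `repo`; otherwise take the least-loaded one."""
    for worker in workers:
        if repo in worker.cloned:
            return worker
    return min(workers, key=lambda w: w.jobs_done)


def run_jobs(jobs: list, workers: list) -> int:
    """Run every job and return the total number of clone operations."""
    for repo in jobs:
        allocate(repo, workers).run(repo)
    return sum(w.clone_count for w in workers)
```

With two workers and the job stream `a, b, a, a, b`, each repository is cloned once rather than once per job, which is the saving the quoted passage is after.
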
“…The objective of this paper is to propose a novel approach to data-aware scheduling that will allow distributed worker nodes a degree of independence and responsibility in the task allocation process. Our approach is implemented on top of Crossflow [12], which is a distributed stream-processing engine. The next section explains the motivation behind locality-aware scheduling, whereas in Section 3 we analyze different techniques proposed in the literature for placing the work close to the data.…”
Section: Introduction (mentioning)
confidence: 99%
“…This process is made even more difficult for MSR researchers focused on Git data because the retrieval process requires using difficult APIs or manual crawling of Git repositories. Many MSR papers currently use these methods to perform data retrieval and the difficulties with the retrieval process is well documented [2], [4].…”
Section: Introduction (mentioning)
confidence: 99%