Crossflow: A Framework for Distributed Mining of Software Repositories

Kolovos, Dimitrios S.; Neubauer, Patrick; Barmpis, Konstantinos; Matragkas, Nicholas; Paige, Richard F.

doi:10.1109/msr.2019.00032

Cited by 9 publications

(8 citation statements)

References 4 publications

“…This paper extends the preliminary work in [9] and provides a more comprehensive description of Crossflow, focusing on new capabilities such as polyglot support. It also reports on empirical evaluation of Crossflow against an existing repository mining tool, using a case-study from the literature.…”

Section: Takedown (mentioning)

confidence: 72%

“…In [9], we presented a short (4-page) overview of an early version of Crossflow, a Java-based framework for development and distributed execution of multi-step software repository mining programs (workflows). This preliminary work focused on motivating the need for a tool facilitating distributed execution of software repository mining programs that would allow remote collaborators to contribute their local computational and storage resources.…”

Section: Takedown (mentioning)

confidence: 99%

Polyglot and Distributed Software Repository Mining with Crossflow

Barmpis, Neubauer, et al. 2020

Proceedings of the 17th International Conference on Mining Software Repositories

Self Cite

Mining software repositories at a large scale typically requires substantial computational and storage resources. This creates an increasing need for repository mining programs to be executed in a distributed manner, such that remote collaborators can contribute local computational and storage resources. In this paper we present Crossflow, a novel framework for building polyglot distributed repository mining programs. We demonstrate how Crossflow offers delegation of mining jobs to remote workers and can cache their results, how such workers are able to implement advanced behavior like load balancing and rejecting jobs they either cannot perform or would execute sub-optimally, and how workers of the same analysis program can be written in different programming languages like Java and Python, executing only relevant parts of the program described in that language.

CCS Concepts: • Information systems → Data mining; • Software and its engineering → Concurrent programming structures.

“…From an implementation perspective, to deal with the size of repositories to be cloned (which could be in the order of GBs) and minimise the number of expensive repository cloning operations, it would be helpful to ensure that cloned repositories could be saved for later use without the need to re-download their contents. To reuse the downloaded resources, repeated computations involving the same files would be required to be allocated to the same worker nodes, namely the ones that already possess them. [Figure 1: MSR pipeline specified in Crossflow [12]] By reducing the download costs, especially for large resources such as GitHub repositories, we could see a significant increase in the speed of workflow execution.…”

Section: Motivating Example (mentioning)

confidence: 99%

“…The objective of this paper is to propose a novel approach to data-aware scheduling that will allow distributed worker nodes a degree of independence and responsibility in the task allocation process. Our approach is implemented on top of Crossflow [12], which is a distributed stream-processing engine. The next section explains the motivation behind locality-aware scheduling, whereas in Section 3 we analyze different techniques proposed in the literature for placing the work close to the data.…”

Section: Introduction (mentioning)

confidence: 99%

Distributed Data Locality-Aware Job Allocation

Markovic, Kolovos, Soares Indrusiak 2023

Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analys

Self Cite

Scheduling tasks close to their associated data is crucial in distributed systems to minimize network traffic and latency. Some Big Data frameworks like Apache Spark employ locality functions and job allocation algorithms to minimize network traffic and execution times. However, these frameworks rely on centralized mechanisms, where the master node determines data locality by allocating tasks to available workers with minimal data transfer time, ignoring variances in worker configurations and availability. To address these limitations, we propose a decentralized approach to locality-driven scheduling that grants workers autonomy in the job allocation process while factoring in workers' configurations, such as network and CPU speed differences. Our approach is developed and evaluated on Crossflow, a distributed stream processing platform with data-aware independent worker nodes. Preliminary evaluation experiments indicate that our approach can yield up to 3.57x faster execution times when compared to the baseline centralized approach where the master controls data locality.

CCS Concepts: • Computing methodologies → Distributed algorithms; • Software and its engineering → Development frameworks and environments.

“…This process is made even more difficult for MSR researchers focused on Git data because the retrieval process requires using difficult APIs or manual crawling of Git repositories. Many MSR papers currently use these methods to perform data retrieval and the difficulties with the retrieval process is well documented [2], [4].…”

Section: Introduction (mentioning)

confidence: 99%

More Effective Software Repository Mining

Tutko, Henley, Mockus 2020

Preprint

Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such "convenience samples" generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain the generalizability of their findings by introducing an interface that addresses the issues with obtaining representative samples. Results: To do that we explore how to exploit the World of Code system to make software repository sampling and analysis much more accessible. Specifically, we present a resource for Mining Software Repository researchers that is intended to simplify data sampling and retrieval workflow and, through that, increase the validity and completeness of data. Conclusions: This system has the potential to provide researchers a resource that greatly eases the difficulty of data retrieval and addresses many of the currently standing issues with data sampling.
