Distributed version control systems support the pull request strategy, which is used to register external contributions in collaborative software projects. The data present in a pull request can provide insight into the factors that influence the acceptance or rejection of contributions in open source projects. Furthermore, discovering knowledge about pull requests makes it possible to confirm or refute existing hypotheses and helps software developers and project managers guide their actions. This work proposes the use of data mining, more specifically the extraction of association rules, to find patterns that influence the acceptance (merge) of a pull request. The results suggest that: (i) association rules make it possible to identify which factors increase the likelihood of a pull request merge; (ii) identifying the attributes that influence the merge reveals important knowledge about the pull request model; and (iii) association rules make it possible to determine which factors contribute to a faster merge.
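To make the idea of an association rule concrete, the following is a minimal sketch of the three standard rule metrics (support, confidence, and lift) computed for a rule of the form "attribute implies merge". The attribute names (`few_commits`, `core_follower`, `merged`), the toy data, and the function `rule_metrics` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from typing import FrozenSet, List, Tuple


def rule_metrics(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Count pull requests containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the rule's confidence with the base rate of the consequent.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical toy data: each set lists the attributes observed on one pull request.
pull_requests = [
    frozenset({"few_commits", "core_follower", "merged"}),
    frozenset({"few_commits", "merged"}),
    frozenset({"many_commits", "merged"}),
    frozenset({"many_commits"}),
    frozenset({"few_commits", "core_follower", "merged"}),
    frozenset({"many_commits", "core_follower"}),
]

support, confidence, lift = rule_metrics(
    pull_requests, frozenset({"few_commits"}), frozenset({"merged"})
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 indicates that pull requests matching the antecedent are merged more often than the overall merge rate would suggest, which is the kind of pattern such rules are meant to surface.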
A new collaboration approach is becoming increasingly common in open-source projects: the pull request model. In this kind of collaboration, developers who do not belong to the core team of a project can submit contributions to the core team. In projects that receive many pull requests, assigning developers to analyze them is a difficult task. In this work, we propose using data mining techniques, more specifically classification strategies, to suggest the most appropriate developers to analyze a contribution under the pull request model. The experiments were conducted using 21 open source projects, each one characterized by 14 attributes. The first set of experiments aimed at indicating just one developer to analyze the pull request. The predictive accuracy obtained ranged from 22.45% to 68.27%. The Random Forest classifier achieved the best result in 76% of the projects. In the second set of experiments, we conclude that, when suggesting three developers to analyze a pull request, the chance of identifying the developer who actually analyzed the pull request ranged from 47.33% to 95.47%.
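The difference between suggesting one developer and suggesting three can be illustrated with a top-k accuracy measure: a suggestion counts as a hit when the developer who actually analyzed the pull request appears among the first k ranked candidates. The sketch below assumes the classifier already produces a ranked list per pull request; the function `top_k_accuracy`, the developer names, and the data are hypothetical.

```python
from typing import List, Sequence


def top_k_accuracy(
    ranked_suggestions: Sequence[Sequence[str]],
    actual_reviewers: Sequence[str],
    k: int,
) -> float:
    """Fraction of pull requests whose actual reviewer is among the top-k suggestions."""
    hits = sum(
        1
        for ranked, actual in zip(ranked_suggestions, actual_reviewers)
        if actual in ranked[:k]
    )
    return hits / len(actual_reviewers)


# Hypothetical ranked suggestions (most likely reviewer first), one list per pull request.
ranked: List[List[str]] = [
    ["ana", "bob", "cy"],
    ["bob", "ana", "cy"],
    ["cy", "dee", "ana"],
    ["dee", "cy", "bob"],
]
# Hypothetical ground truth: the developer who actually analyzed each pull request.
actual = ["ana", "cy", "ana", "ana"]

top1 = top_k_accuracy(ranked, actual, k=1)
top3 = top_k_accuracy(ranked, actual, k=3)
print(f"top-1={top1:.2f} top-3={top3:.2f}")
```

Widening the suggestion list can only keep or raise the hit rate, which is consistent with the higher ranges reported when three developers are suggested instead of one.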
When external contributors want to collaborate with an open-source project, they fork the repository, make changes, and send a pull request to the core team. However, the lifetime of a pull request, defined as the time interval between its opening and its closing, varies widely, potentially affecting contributor engagement. In this context, understanding the root causes of pull request lifetime is important to both the external contributors and the core team. The former can adopt strategies that increase the chances of a fast review, while the latter can establish priorities in the reviewing process, reducing the backlog of pending tasks and improving software quality. In this work, we mined association rules from 97,463 pull requests from 30 projects in order to find characteristics that have affected pull request lifetime. In addition, we present a qualitative analysis that helps explain the patterns discovered from the association rules. The results indicate that: (i) contributions with shorter lifetimes tend to be accepted; (ii) structural characteristics, such as the number of commits, changed files, and lines of code, influence pull request lifetime, either in isolation or in combination; (iii) the files changed and the directories to which they belong can be robust predictors of pull request lifetime; (iv) the profile of external contributors and their social relationships influence lifetime; and (v) the number of comments in a pull request, as well as the developer responsible for the review, are important predictors of its lifetime.
A recent survey of industrial projects has shown that providing developers with an estimate of pull request lifetime helps to speed up the completion of pull requests. Previous work has explored pull request lifetime prediction in open-source projects using regression techniques, but with a broad margin of error. The first objective of our work was to reduce the average error rate of the predictions obtained so far with regression techniques. We performed experiments with different regression techniques and achieved a significant decrease in the mean error rate. The second objective of our work was to obtain a more effective and useful predictive model that classifies pull requests into five discrete time intervals. We proposed new predictive attributes for estimating the time intervals and employed attribute selection strategies to identify subsets of attributes that could improve the predictive behavior of the classifiers. Compared with the literature, our classification approach achieved the best accuracy in all 20 projects evaluated. The average accuracy in predicting pull request lifetime was 45.28%, with an average normalized improvement of 14.68% over the majority class and 6.49% over the state of the art.