Duplicate Pull Request Detection

Wang, Qingye; Xu, Bowen; Xia, Xin; Wang, Ting; Li, Shanping

doi:10.1145/3361242.3361254

Cited by 23 publications

(16 citation statements)

References 35 publications


“…Although few studies detect duplicate PRs in social coding platforms, these studies are classified into two branches [5]:…”

Section: Detecting Duplicate Pull-requests (mentioning)

confidence: 99%

“…The third research question analyzes the efficiency of the proposed approach against the most recent and relevant works. These works take two branches: (I) pull-request retrieval [9,21], and (II) pull-request classification [5,8]. To address this research question, we perform qualitative analysis as we see in the subsequent subsection.…”

Section: Rq1 (mentioning)

confidence: 99%

“…We organize existing works related to our proposal into two categories. The first category includes approaches that retrieve a ranked list of PRs for a given new PR [9,21] while the approaches of the second category assign a label (duplicated or not duplicated) for a given new PR using a ML algorithm [5,8]. The approaches of the first category struggle with the problem, on one hand, of how to adjust the threshold value such that these approaches only retrieve the duplicate or similar PRs and exclude others.…”

Section: Saving Reviewing Efforts (Rq2) (mentioning)

confidence: 99%
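
The two categories in the statement above differ mainly in what they return for a new PR. The sketch below is only an illustration of that difference, not the method of any cited work: the function names, the bag-of-words cosine similarity, and the 0.5 threshold are all assumptions, and the fixed threshold merely stands in for the trained ML classifier the second category would use.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens of a PR title or description."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_pr, existing_prs):
    """Retrieval view: (score, pr_id) pairs, most similar first."""
    query = Counter(tokens(new_pr))
    scored = [
        (cosine(query, Counter(tokens(text))), pr_id)
        for pr_id, text in existing_prs.items()
    ]
    return sorted(scored, reverse=True)


def label_duplicate(new_pr, existing_prs, threshold=0.5):
    """Classification view: one yes/no label for the new PR.

    Assumption: a fixed similarity threshold replaces a trained model.
    """
    ranked = rank_candidates(new_pr, existing_prs)
    return bool(ranked) and ranked[0][0] >= threshold


# Hypothetical PR titles, for illustration only.
existing = {
    101: "Fix null pointer in config loader",
    102: "Add dark mode to settings page",
    103: "Update README badges",
}
ranking = rank_candidates("Fix null pointer exception in config loader", existing)
is_duplicate = label_duplicate("Fix null pointer exception in config loader", existing)
```

The retrieval function leaves the cut-off decision to the reader of the ranked list, which is the threshold problem the statement describes; the classification function hides that decision behind a single label.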

“…In a social coding platforms such as GitHub, contributors (developers) frequently use Pull-Request (PR) mechanisms to submit their code changes to reviewers or owners of a given software project (repository) [1,2]. These changes include development activities (e.g., adding new functional features) [3,4], fixing errors in an existing project [5] or for improvements (in terms of performance, usability, reliability, and so on). The contributors are volunteers, who are distributed geographically around the world, and they implicitly collaborate together to work on a repository [1].…”

Section: Introduction (mentioning)

confidence: 99%

“…Existing works related to the research problem addressed in this article take two directions. The first direction refers to the research works interested in detecting duplicate PRs [1,5,8,9]. However, this research direction considers only duplicate PRs in pairs and does not take into account the similar PRs as a group.…”

Section: Introduction (mentioning)

confidence: 99%


Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

2022


Context: In a social coding platform such as GitHub, contributors frequently use the pull-request mechanism to submit code changes to the reviewers of a given repository. In general, these code changes either add a new feature or fix an existing bug. However, the mechanism is distributed, so different contributors may unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may then be reviewed in parallel by different reviewers, which wastes reviewing time and effort and complicates the collaboration process. Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to choose with less time and effort. In this article, we propose to group similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team. This proposal saves reviewing effort and time. Method: We first extract descriptive textual information from pull-request content to link similar pull-requests together. We then use the extracted information to find similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) are used to group similar pull-requests together. Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results on the well-known metrics for this task, precision and recall. Furthermore, it helps save reviewer time and effort.
Conclusion: According to the obtained results, the K-Means algorithm achieves 94% and 91% average precision and recall, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves 93% and 98% average precision and recall, respectively. Moreover, the proposed approach saves reviewing time and effort, on average between 67% and 91% with the K-Means algorithm and between 67% and 83% with the agglomerative hierarchical clustering algorithm.
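
The three steps in the Method paragraph (extract text, compute similarities, cluster) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: Jaccard similarity over word sets, average linkage, the 0.3 stopping threshold, and every function name here are choices made for the example, and only the agglomerative variant is shown.

```python
import re


def token_set(text):
    """Step 1: reduce a PR's text to a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Step 2: similarity of two token sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(prs, min_similarity=0.3):
    """Step 3: bottom-up clustering with average linkage.

    Starts with one cluster per PR and repeatedly merges the most
    similar pair until no pair reaches min_similarity (an assumed
    stopping rule).
    """
    features = {pr_id: token_set(text) for pr_id, text in prs.items()}
    clusters = [[pr_id] for pr_id in prs]

    def linkage(c1, c2):
        # Mean pairwise similarity between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard(features[x], features[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        best_score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score < min_similarity:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical PR titles, for illustration only.
prs = {
    1: "Fix crash when config file is missing",
    2: "Fix crash on missing config file",
    3: "Add French translation",
    4: "Add French translation for settings",
}
groups = agglomerate(prs)
```

Each resulting group could then be routed to one reviewer or reviewing team, which is the assignment the Objective paragraph proposes.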


ReBack: recommending backports in social coding environments

Chakroborti, Schneider, Roy

2024

Autom Softw Eng


Pull request latency explained: an empirical overview

Zhang, Wang, et al.

2022

Empir Software Eng


Pull request latency evaluation is an essential application of effort evaluation in the pull-based development scenario. It can help reviewers sort the pull request queue, remind developers about the review processing time, speed up the review process, and accelerate software development. There is a lack of work that systematically organizes the factors that affect pull request latency. Also, no related work discusses how these characteristics differ and vary across scenarios and contexts. In this paper, we collected relevant factors through a literature review. We then assessed their relative importance in five scenarios and six different contexts using a mixed-effects linear regression model. The most important factors differ between scenarios. The length of the description is most important when pull requests are submitted. The existence of comments is most important when closing pull requests, when using CI tools, and when the contributor and the integrator are different. When comments exist, the latency of the first comment is the most important. Meanwhile, the influence of factors may change in different contexts. For example, the number of commits in a pull request has a more significant impact on pull request latency when closing than when submitting, due to changes in contributions brought about by the review process. Both human and bot comments are positively correlated with pull request latency. In contrast, the bot's first comments are more strongly correlated with latency, but the number of comments is less correlated. Future research and tool implementation need to consider the impact of different contexts. Researchers can conduct related studies based on our publicly available datasets and replication scripts.
