Summarising big data: public GitHub dataset for software engineering challenges

Şeker, Abdulkadir; Diri, Banu; Arslan, Halil; Amasyali, Fatih

doi:10.17776/csj.728932

Cited by 1 publication

(2 citation statements)

References 9 publications

“…The Repos and all other collections include records that related to the Users collection. All details regarding the creation of the dataset are provided in the source study of the dataset [35].…”

Section: Paper Id Number Of User Number Of Project Ratio (~) (mentioning)

confidence: 99%

New Developer Metrics for Open Source Software Development Challenges: An Empirical Study of Project Recommendation Systems

2021

Self Cite

Software collaboration platforms, where millions of developers from diverse locations can contribute to common open-source projects, have recently become popular. On these platforms, various information obtained from developer activities can be used as developer metrics to solve a variety of challenges. In this study, we proposed new developer metrics extracted from the issue, commit, and pull request activities of developers on GitHub. We created developer metrics from individual activities and also combined certain activities according to common traits. To evaluate these metrics, we created an item-based project recommendation system. To validate this system, we calculated similarity scores using two methods and assessed top-n hit scores using two different approaches. Across all scores obtained with these methods, the most successful metrics were binary_issue_related, issue_commented, binary_pr_related, and issue_opened. To verify our results, we compared our metrics with a metric from a very similar study and found that most of our metrics gave better scores than that metric. In conclusion, the issue feature is more crucial for GitHub than other features. Moreover, commenting activity in projects can be as valuable as code contributions. Most of the binary metrics, which were generated regardless of the number of activities, also showed remarkable results. In this context, we presented improvable and noteworthy developer metrics that can be used for a wide range of open-source software development challenges, such as user characterization, project recommendation, and code review assignment.

“…Thus, the results of the related studies are controversial in terms of real platform data (because of working on a smaller dataset). Therefore, in this study, we used a public dataset called GitDataSCP (https://github.com/kadirseker00/GitDataSCP) that is reflective of the sparsity problem inherent in the nature of GitHub [35]. Table 2.…”

Section: Introduction (mentioning)

confidence: 99%