Cataloging GitHub Repositories

Sharma, Abhishek; Thung, Ferdian; Kochhar, Pavneet Singh; Sulistya, Agus; Lo, David

doi:10.1145/3084226.3084287

Cited by 44 publications

(32 citation statements)

References 19 publications


“…such as McMillan et al [1]. Others focus on categorizing software repositories and readme files, which helps to better perceive a massive pile of data and grasp the content faster (Prana et al [5], Sharma and Thung et al [2], …”

Section: A Recommendation Of Software Repositories (mentioning)

confidence: 99%

“…GitHub creates showcases where they manually catalog a set of repositories on a certain topic. Sharma et al [2] semiautomatically expanded such showcases. Using 10K repositories with readme files, they first extract the most descriptive section in the readme file by selecting the one with the highest cosine similarity value with the repository short description on the top of the repository landing page on GitHub.…”

Section: B Cataloging Software Repositories (mentioning)

confidence: 99%
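The section-selection step described in the statement above (pick the readme section with the highest cosine similarity to the repository's short description) can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and a simple lowercase tokenizer; the cited paper's actual preprocessing and weighting are not given here, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two term-frequency Counters."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_descriptive_section(sections, description):
    """Return the readme section most similar to the short description."""
    desc_vec = Counter(tokenize(description))
    return max(
        sections,
        key=lambda s: cosine_similarity(Counter(tokenize(s)), desc_vec),
    )


# Hypothetical repository data, for illustration only.
description = "A fast JSON parser for Python"
sections = [
    "Installation: run pip install fastjson to get started",
    "fastjson is a fast JSON parser and serializer for Python programs",
    "License: MIT",
]
best = most_descriptive_section(sections, description)
```

In this toy input only the second section shares any terms with the description, so it is the one selected. A TF-IDF weighting could be swapped in for the raw counts without changing the selection logic.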


Quantifying Synergy between Software Projects using README Files Only (S)

El¹

2021

Proceedings of the 33rd International Conference on Software Engineering and Knowledge Engineering


Software version control platforms, such as GitHub, host millions of open-source software projects. Due to their diversity, these projects are an appealing realm for discovering software trends. In our work, we seek to quantify synergy between software projects by connecting them via their similar as well as different software features. Our approach is based on Literature-Based Discovery (LBD), originally developed to uncover implicit knowledge in scientific literature databases by linking them through transitive connections. We tested our approach by conducting experiments on 13,264 open-source Python projects on GitHub. Evaluation, based on human ratings of a subset of 90 project pairs, shows that our developed models are capable of identifying potential synergy between software projects by solely relying on their short descriptions (i.e., readme files).


“…Thus, we consider only projects that have been starred by at least 20 developers. Such a number of stars has been used in some studies [7,82] as a sign of a decent project. The collected dataset and the CrossSim tool are available online for public usage [68].…”

Section: CrossSim Dataset (mentioning)

confidence: 99%

Detecting Java software similarities by using different clustering techniques

Capiluppi, Ruscio, Rocco, et al.

2020

Information and Software Technology


Background: Research on empirical software engineering has increasingly been conducted by analysing and measuring vast amounts of software systems. Hundreds, thousands and even millions of systems have been (and are) considered by researchers, and often within the same study, in order to test theories, demonstrate approaches or run prediction models. A much less investigated aspect is whether the collected metrics might be context-specific, or whether systems should be better analysed in clusters. Objective: The objectives of this study are (i) to define a set of clustering techniques that might be used to group similar software systems, and (ii) to evaluate whether a suite of well-known object-oriented metrics is context-specific, and its values differ along the defined clusters. Method: We group software systems based on three different clustering techniques, and we collect the values of the metrics suite in each cluster. We then test whether clusters are statistically different between each other, using the Kolmogorov-Smirnov (KS) hypothesis testing. Results: Our results show that, for two of the used techniques, the KS null hypothesis (i.e., the clusters come from the same population) is rejected for
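The two-sample KS comparison mentioned in this abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' procedure: it computes the KS statistic directly and uses the standard large-sample critical value instead of an exact p-value, and the function names and metric values are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (this handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def rejects_same_population(sample_a, sample_b, alpha=0.05):
    """Reject the null hypothesis using the asymptotic critical value.

    Assumption: the large-sample approximation
    c(alpha) * sqrt((n + m) / (n * m)), with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    """
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Hypothetical metric values for two clusters of systems.
cluster_1 = list(range(50))
cluster_2 = list(range(100, 150))
different = rejects_same_population(cluster_1, cluster_2)
```

For small clusters the asymptotic threshold is only approximate, so an exact test from a statistics library would be the safer choice in practice.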


“…Based on these heuristics, they build a recommendation system named RepoPal and compare it with the state-of-the-art approach CLAN using one thousand repositories on GitHub. Sharma et al collect 10,000 popular projects on GitHub and propose a cataloging system to group similar projects into categories [33]. They automatically extract descriptive segments from readme files and apply LDA-GA, a state-of-the-art topic modeling algorithm that combines Latent Dirichlet Allocation (LDA) and Genetic Algorithm (GA), to identify categories.…”

Section: B Large Scale Studies On GitHub (mentioning)

confidence: 99%

Code Coverage and Postrelease Defects: A Large-Scale Study on Open Source Projects

Kochhar, Lawall, et al.

2017

IEEE Trans. Rel.

Self Cite


Testing is a pivotal activity in ensuring the quality of software. Code coverage is a common metric used as a yardstick to measure the efficacy and adequacy of testing. However, does higher coverage actually lead to a decline in post-release bugs? Do files that have higher test coverage actually have fewer bug reports? The direct relationship between code coverage and actual bug reports has not yet been analysed via a comprehensive empirical study on real bugs. Past studies only involve a few software systems or artificially injected bugs (mutants). In this empirical study, we examine these questions in the context of open-source software projects based on their actual reported bugs. We analyze 100 large open-source Java projects and measure the code coverage of the test cases that come along with these projects. We collect real bugs logged in the issue tracking system after the release of the software and analyse the correlations between code coverage and these bugs. We also collect other metrics such as cyclomatic complexity and lines of code, which are used to normalize the number of bugs and coverage to correlate with other metrics as well as use these metrics in regression analysis. Our results show that coverage has an insignificant correlation with the number of bugs that are found after the release of the software at the project level, and no such correlation at the file level.
