What is wrong with topic modeling? And how to fix it using search-based software engineering

Agrawal, Amritanshu; Wei, Fu; Menzies, Tim

doi:10.1016/j.infsof.2018.02.005

Cited by 179 publications

(240 citation statements)

References 63 publications

Supporting

Mentioning

224

Contrasting

Unclassified


“…We used a fixed number of clusters (k = 10) to offer an easily understandable overview of the area and allow a sample of papers for the qualitative analysis to remain small, rather than generating large number of fine-grained clusters. We used the genetic algorithm Differential Evolution to tune LDA hyperparameters alpha and beta as suggested by Agrawal et al (2016).…”

Section: Analyzing Literature With Text Mining (mentioning)

confidence: 99%

Guest editorial for special section on success and failure in software engineering

et al. 2017


Many papers investigate success and failure of software projects from diverse perspectives, leading to a myriad of antecedents, causes, correlates, factors and predictors of success and failure. This body of research has not yet produced a solid, empirically grounded body of evidence enabling actionable practices for increasing success and avoiding failure in software projects. The need for more evidence motivates this special issue, which includes four articles that contribute to our understanding of how software project success and failure relate to topics such as: requirements engineering, user satisfaction, start-up pivots and retrospective discussions. We moreover present a brief systematic review to both situate the accepted articles in existing literature and to explore enduring methodological and conceptual challenges in this area, including developing sound instruments for measuring success, representative sampling without population lists and creating both empirically sound and practically actionable taxonomies of success antecedents.


“…Our initial exploration suggests that text-based unsupervised methods such as topic modeling and clustering are not effective in detecting the information types. The performances of such methods are sensitive to the parameter settings [2] and are highly dependant on the distribution of the terms which can not be generalized across domains [6]. Consequently, in this section, we explore the possibility of utilizing supervised techniques to detect information types of sentences in issue Fig.…”

Section: Automated Information Type Detection (mentioning)

confidence: 99%

Analysis and Detection of Information Types of Open Source Software Issue Discussions

Arya, Wang, Guo, et al. 2019

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)


Most modern Issue Tracking Systems (ITSs) for open source software (OSS) projects allow users to add comments to issues. Over time, these comments accumulate into discussion threads embedded with rich information about the software project, which can potentially satisfy the diverse needs of OSS stakeholders. However, discovering and retrieving relevant information from the discussion threads is a challenging task, especially when the discussions are lengthy and the number of issues in ITSs are vast. In this paper, we address this challenge by identifying the information types presented in OSS issue discussions. Through qualitative content analysis of 15 complex issue threads across three projects hosted on GitHub, we uncovered 16 information types and created a labeled corpus containing 4656 sentences. Our investigation of supervised, automated classification techniques indicated that, when prior knowledge about the issue is available, Random Forest can effectively detect most sentence types using conversational features such as the sentence length and its position. When classifying sentences from new issues, Logistic Regression can yield satisfactory performance using textual features for certain information types, while falling short on others. Our work represents a nontrivial first step towards tools and techniques for identifying and obtaining the rich information recorded in the ITSs to support various software engineering activities and to satisfy the diverse needs of OSS stakeholders.

Index Terms: collaborative software engineering, issue tracking system, issue discussion analysis


“…Source code embeddings can contribute in inferring semantic inconsistencies and assist tasks such as semantic bug localization and recommendations for semantic bug fix. Robust Topic Modeling: Agrawal et al [38] present a comprehensive review of topic modeling studies in software engineering. Following the paradigm of natural language word embeddings, pretrained source code embeddings provide background knowledge that can further enhance existing methods.…”

Section: A Opportunities (mentioning)

confidence: 99%

Semantic Source Code Models Using Identifier Embeddings

Efstathiou, Spinellis 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)


The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models.
