2018
DOI: 10.1016/j.infsof.2018.02.005

What is wrong with topic modeling? And how to fix it using search-based software engineering

Abstract: Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeling technique is Latent Dirichlet Allocation (LDA). When run on different datasets, LDA suffers from "order effects", i.e., different topics are generated if the order of the training data is shuffled. Such order effects introduce a systematic error into any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classific…

Cited by 179 publications
(240 citation statements)
References 63 publications
“…We used a fixed number of clusters (k = 10) to offer an easily understandable overview of the area and allow a sample of papers for the qualitative analysis to remain small, rather than generating large number of fine-grained clusters. We used the genetic algorithm Differential Evolution to tune LDA hyperparameters alpha and beta as suggested by Agrawal et al (2016).…”
Section: Analyzing Literature With Text Mining (mentioning)
confidence: 99%
“…Our initial exploration suggests that text-based unsupervised methods such as topic modeling and clustering are not effective in detecting the information types. The performances of such methods are sensitive to the parameter settings [2] and are highly dependant on the distribution of the terms which can not be generalized across domains [6]. Consequently, in this section, we explore the possibility of utilizing supervised techniques to detect information types of sentences in issue Fig.…”
Section: Automated Information Type Detection (mentioning)
confidence: 99%
“…Source code embeddings can contribute in inferring semantic inconsistencies and assist tasks such as semantic bug localization and recommendations for semantic bug fix. Robust Topic Modeling: Agrawal et al [38] present a comprehensive review of topic modeling studies in software engineering. Following the paradigm of natural language word embeddings, pretrained source code embeddings provide background knowledge that can further enhance existing methods.…”
Section: A Opportunities (mentioning)
confidence: 99%