2013
DOI: 10.1016/j.scico.2013.03.015

Using heuristics to estimate an appropriate number of latent topics in source code analysis

Abstract: Latent Dirichlet Allocation (LDA) is a data clustering algorithm that performs especially well for text documents. In natural-language applications it automatically finds groups of related words (called "latent topics") and clusters the documents into sets that are about the same "topic". LDA has also been applied to source code, where the documents are natural source code units such as methods or classes, and the words are the keywords, operators, and programmer-defined names in the code. The problem of deter…

Cited by 33 publications (30 citation statements)
References 32 publications (41 reference statements)

Citation statements
“…Compared to natural language text, Hindle et al [22] reported that text extracted from source code is much more repetitive and predictable. Based on the characteristic of source code and prior researches [23], [24], the preprocess procedure was customized through experimentation and brought the following five preprocess steps:…”
Section: Preprocessing Procedures (mentioning)
confidence: 99%
“…According to the research of Grant et al [24], in this paper, we tried to put original identifier names into the source of topic modeling too. For instance, an identifier name called "addMenu", was first split into "add" and "menu", and then "addmenu", "add" and "menu" were put into the source of topic modeling.…”
Section: Preprocessing Procedures (mentioning)
confidence: 99%
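
The identifier handling described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the citing authors' implementation: the function name `identifier_tokens` and the splitting rules (camelCase boundaries and underscores) are assumptions made for the example.

```python
import re


def identifier_tokens(name):
    """Return the lowercased original identifier followed by its lowercased parts.

    Hypothetical helper: the splitting rules below (underscores and
    camelCase boundaries) are assumptions for illustration only.
    """
    parts = []
    for chunk in name.split("_"):
        # Acronym runs (e.g. "HTTP"), capitalized or lowercase words, digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    parts = [p.lower() for p in parts]
    if len(parts) <= 1:
        # Nothing to split: emit the identifier once, not twice.
        return parts
    return [name.lower()] + parts


print(identifier_tokens("addMenu"))  # ['addmenu', 'add', 'menu']
```

Keeping the unsplit name alongside its parts means the topic model sees both the specific identifier and the more general vocabulary it is built from.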
“…The cosine similarity, a standard similarity metric (described in practice in [6]), is calculated between the remaining functions in a prediction trial and each of the other functions in the software system. For example, in a system with 100 functions, a changelist modifying four functions will result in four prediction trials of three functions each, and each prediction trial with three functions will be compared to 97 other functions.…”
Section: Extracting Prediction Trials From Maintenance History (mentioning)
confidence: 99%
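
The arithmetic in the example above (four trials of three functions, each compared with 97 others) can be checked with a short sketch. It assumes that each trial is formed by leaving one function of the changelist out, which is one reading of the quoted statement rather than a confirmed detail; the names `prediction_trials` and `cosine_similarity` are hypothetical, and the vectors passed to `cosine_similarity` stand in for whatever per-function representation the citing paper uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def prediction_trials(changelist, all_functions):
    """Build one trial per changelist function by leaving that function out.

    Assumption: leave-one-out construction. Each trial is a tuple of
    (held-out function, remaining functions, candidate functions).
    """
    trials = []
    for held_out in changelist:
        remaining = [f for f in changelist if f != held_out]
        # Candidates are all functions in the system outside the trial.
        candidates = [f for f in all_functions if f not in remaining]
        trials.append((held_out, remaining, candidates))
    return trials


functions = [f"f{i}" for i in range(100)]
trials = prediction_trials(functions[:4], functions)
print(len(trials), len(trials[0][1]), len(trials[0][2]))  # 4 3 97
```

Under this reading, the held-out function is among the 97 candidates, so a trial can be scored by how highly the similarity ranking places it.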
“…Latent Dirichlet Allocation (LDA) [2] is a generative model that is being applied to a growing number of software engineering (SE) problems [1,3,5,6,8,9,12,13,14]. By estimating the distributions of latent topics describing a text corpus constructed from source code and source-code related artifacts, LDA models can aid in program comprehension as they identify patterns both within the code and between the code and its related artifacts.…”
Section: Introduction (mentioning)
confidence: 99%