Using heuristics to estimate an appropriate number of latent topics in source code analysis

Grant, Scott; Cordy, James R.; Skillicorn, David B.

doi:10.1016/j.scico.2013.03.015

Cited by 33 publications (30 citation statements)

References 32 publications (41 reference statements)


“…Compared to natural language text, Hindle et al [22] reported that text extracted from source code is much more repetitive and predictable. Based on the characteristic of source code and prior researches [23], [24], the preprocess procedure was customized through experimentation and brought the following five preprocess steps:…”

Section: Preprocessing Procedures (mentioning)

confidence: 99%


JSEA: A Program Comprehension Tool Adopting LDA-based Topic Modeling

Wang, Liu (2017), IJACSA


Abstract: Understanding a large number of source code is a big challenge for software development teams in software maintenance process. Using topic models is a promising way to automatically discover feature and structure from textual software assets, and thus support developers comprehending programs on software maintenance. To explore the application of applying topic modeling to software engineering practice, we proposed JSEA (Java Software Engineers Assistant), an interactive program comprehension tool adopting LDA-based topic modeling, to support developers during performing software maintenance tasks. JSEA utilizes essential information automatically generated from Java source code to establish a project overview and to bring search capability for software engineers. The results of our preliminary experimentation suggest the practicality of JSEA.


“…According to the research of Grant et al [24], in this paper, we tried to put original identifier names into the source of topic modeling too. For instance, an identifier name called "addMenu", was first split into "add" and "menu", and then "addmenu", "add" and "menu" were put into the source of topic modeling.…”

Section: Preprocessing Procedures (mentioning)

confidence: 99%

JSEA: A Program Comprehension Tool Adopting LDA-based Topic Modeling

Wang, Liu (2017), IJACSA
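The identifier handling described in the citation statement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JSEA implementation: the function names `split_identifier` and `tokens_for_topic_model` are hypothetical, and the camelCase/underscore splitting rule is a common convention rather than something the quoted text specifies.

```python
import re
from typing import List


def split_identifier(name: str) -> List[str]:
    """Split a camelCase or snake_case identifier into lowercase sub-words.

    Assumption: words are separated by underscores, other non-word
    characters, or case transitions. The quoted text does not say
    which splitting rule JSEA actually uses.
    """
    parts: List[str] = []
    for chunk in re.split(r"[_\W]+", name):
        # Alternatives, in order: an acronym followed by a capitalised word,
        # an optionally capitalised lowercase word, a trailing acronym, digits.
        parts.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
        )
    return [p.lower() for p in parts]


def tokens_for_topic_model(name: str) -> List[str]:
    """Return the lowercased original identifier followed by its sub-words."""
    words = split_identifier(name)
    if len(words) < 2:
        # A single-word identifier is not repeated.
        return words
    return [name.lower()] + words


print(tokens_for_topic_model("addMenu"))  # ['addmenu', 'add', 'menu']
```

Keeping the unsplit form alongside the sub-words lets the topic model see both the compound term and its parts, which is the behaviour the statement describes for "addMenu".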


“…The cosine similarity, a standard similarity metric (described in practice in [6]), is calculated between the remaining functions in a prediction trial and each of the other functions in the software system. For example, in a system with 100 functions, a changelist modifying four functions will result in four prediction trials of three functions each, and each prediction trial with three functions will be compared to 97 other functions.…”

Section: Extracting Prediction Trials From Maintenance History (mentioning)

confidence: 99%

Examining the relationship between topic model similarity and software maintenance

Grant, Cordy (2014), 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE)

Self Cite


Abstract: Software maintenance is the last phase of software development, and typically one of the most time-consuming. One reason for this is the difficulty in finding related source code fragments. A high-level understanding of the source code is necessary to make decisions about which source code fragments should be modified together, for example, in the context of fixing a bug. Even with a similarity metric available, understanding what it means to measure similarity in the first place is important; if a technique suggests that two source code fragments are related, is there a human-oriented way of explaining that relation? In this paper, we attempt to identify a concrete link between software maintenance and the similarity metrics provided by latent topic models. We show that similarity in topic models is related to the likelihood that source code fragments will be modified together in the future, and that an awareness of similar source code can make software maintenance easier.
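The arithmetic in the citation statement for this paper (four prediction trials of three functions each, each compared with 97 other functions) can be checked with a small Python sketch. It assumes a leave-one-out reading of "prediction trial" and represents each function as a plain list of topic weights. The names `prediction_trials`, `other_functions` and `trial_similarities`, and the toy vectors, are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Iterator, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prediction_trials(changelist: List[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (held_out, remaining) pairs, leaving one changed function out at a time."""
    for held_out in changelist:
        yield held_out, [f for f in changelist if f != held_out]


def other_functions(all_functions: List[str], trial: List[str]) -> List[str]:
    """Every function in the system that is not part of the trial itself."""
    in_trial = set(trial)
    return [f for f in all_functions if f not in in_trial]


def trial_similarities(
    vectors: Dict[str, Sequence[float]], trial: List[str], all_functions: List[str]
) -> Dict[Tuple[str, str], float]:
    """Similarity between each trial function and each function outside the trial."""
    return {
        (t, o): cosine_similarity(vectors[t], vectors[o])
        for t in trial
        for o in other_functions(all_functions, trial)
    }


# Toy system: 100 functions with made-up two-dimensional topic vectors.
functions = [f"f{i}" for i in range(100)]
vectors = {name: [float(i + 1), 1.0] for i, name in enumerate(functions)}
changelist = functions[:4]

trials = list(prediction_trials(changelist))
held_out, remaining = trials[0]
print(len(trials), len(remaining), len(other_functions(functions, remaining)))
```

With this setup the counts match the quoted example: four trials, three functions per trial, and 97 comparison targets per trial (the held-out function is among them). How the resulting similarities are aggregated into a prediction is not stated in the quoted text, so the sketch stops at the pairwise values.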


“…Latent Dirichlet Allocation (LDA) [2] is a generative model that is being applied to a growing number of software engineering (SE) problems [1,3,5,6,8,9,12,13,14]. By estimating the distributions of latent topics describing a text corpus constructed from source code and source-code related artifacts, LDA models can aid in program comprehension as they identify patterns both within the code and between the code and its related artifacts.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding LDA in source code analysis

Binkley, Heinz, Lawrie, et al. (2014), Proceedings of the 22nd International Conference on Program Comprehension


Latent Dirichlet Allocation (LDA) has seen increasing use in the understanding of source code and its related artifacts in part because of its impressive modeling power. However, this expressive power comes at a cost: the technique includes several tuning parameters whose impact on the resulting LDA model must be carefully considered. An obvious example is the burn-in period; too short a burn-in period leaves excessive echoes of the initial uniform distribution. The aim of this work is to provide insights into the tuning parameter's impact. Doing so improves the comprehension of both, 1) researchers who look to exploit the power of LDA in their research and 2) those who interpret the output of LDA-using tools. It is important to recognize that the goal of this work is not to establish values for the tuning parameters because there is no universal best setting. Rather, appropriate settings depend on the problem being solved, the input corpus (in this case, typically words from the source code and its supporting artifacts), and the needs of the engineer performing the analysis. This work's primary goal is to aid software engineers in their understanding of the LDA tuning parameters by demonstrating numerically and graphically the relationship between the tuning parameters and the LDA output. A secondary goal is to enable more informed setting of the parameters. Results obtained using both production source code and a synthetic corpus underscore the need for a solid understanding of how to configure LDA's tuning parameters.
