Semantic clustering: Identifying topics in source code

Kuhn, Adrian; Ducasse, Stéphane; Gîrba, Tudor

doi:10.1016/j.infsof.2006.10.017

Cited by 391 publications

(257 citation statements)

References 36 publications

Supporting

Mentioning: 254

Contrasting

Unclassified


“…Therefore, a code search engine can facilitate searching in specific topics, or support filtering results by topics. Results from topic modeling on source code show that it is possible to extract meaningful topics from source code automatically (Baldi et al 2008; Kuhn et al 2007). Code search engines should leverage such information.…”

Section: Topic Modeling (mentioning)

confidence: 99%

Analyzing and mining a code search engine usage log

Bajracharya

Lopes

2010

Empir Software Eng


This paper presents an analysis of a year-long usage log of Koders, the first commercially available Internet-Scale code search engine (http://www.koders.com). The usage log comprises about ten million activities from more than three million users. Analysis of the usage data shows that despite attracting a large number of visitors, Koders has very sparse usage and lacks regular usage from many of its users. When compared to Web search, search behavior in Koders showed many similar patterns. A topic modeling analysis of the usage data shows what topics users of Koders are looking for. Observations on the prevalence of these topics among the users, and observations on how search and download activities vary across topics, lead to the conclusion that users who find code search engines usable are those who already know to a high level of specificity what to look for. This paper also presents a general categorization of these topics that provides insights on the different ways code search engine users express their queries. It identifies various forms of queries in Koders's log and the kinds of results addressed by the queries. It also provides several suggestions for improvements in code search engines based on the analysis of usage, topics, and query forms. The work presented in this paper is the first of its kind that reveals several insights on the usage of an Internet-Scale code search engine.



“…As expected, the equations suggest too few clusters for small systems, but are otherwise remarkably robust across a range of system sizes, and for systems written in different languages. This equation can be compared to an estimate given by Kuhn et al in [10]. For an m × n document-term matrix, where m is the number of documents (classes, instead of methods or functions) and n is the total number of terms over all documents, the authors suggest using a value of (m × n)^0.2.…”

Section: Results (mentioning)

confidence: 99%

“…Many authors propose somewhere in the range of 200 to 300 topics [7,8], and a recent study showed "islands of stability" around 300 to 500 topics for document sets in the millions, with performance degrading outside of that range [9]. Kuhn et al suggest using smaller topic values, noting that a smaller number of topics may be warranted for analyzing software corpora because the document count is smaller than typical natural-language corpora [10]. However, source code documents are classes in their research, whereas for us documents are methods or functions.…”

Section: Introduction (mentioning)

confidence: 99%

Using heuristics to estimate an appropriate number of latent topics in source code analysis

Grant

Cordy

Skillicorn

2013

Science of Computer Programming


Latent Dirichlet Allocation (LDA) is a data clustering algorithm that performs especially well for text documents. In natural-language applications it automatically finds groups of related words (called "latent topics") and clusters the documents into sets that are about the same "topic". LDA has also been applied to source code, where the documents are natural source code units such as methods or classes, and the words are the keywords, operators, and programmer-defined names in the code. The problem of determining a topic count that most appropriately describes a set of source code documents is an open problem. We address this empirically by constructing clusterings with different numbers of topics for a large number of software systems, and then use a pair of measures based on source code locality and topic model similarity to assess how well the topic structure identifies related source code units. Results suggest that the topic count required can be closely approximated using the number of software code fragments in the system. We extend these results to recommend appropriate topic counts for arbitrary software systems based on an analysis of a set of open source systems.
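The estimate attributed to Kuhn et al in the citation statements for this entry can be sketched in a few lines. This is a minimal illustration, assuming the garbled exponent reads (m × n)^0.2 and that the result is rounded to the nearest integer with a floor of one; the function name `estimate_topic_count` and the rounding rule are assumptions made for the example and do not come from either paper.

```python
def estimate_topic_count(num_documents: int, num_terms: int) -> int:
    """Rough topic-count estimate (m * n) ** 0.2 for an m x n document-term matrix.

    num_documents (m) and num_terms (n) are the matrix dimensions.
    Rounding to the nearest integer, with a minimum of 1, is an assumption.
    """
    if num_documents <= 0 or num_terms <= 0:
        raise ValueError("both matrix dimensions must be positive")
    return max(1, round((num_documents * num_terms) ** 0.2))


if __name__ == "__main__":
    # Hypothetical system: 100 classes and 1000 distinct terms.
    print(estimate_topic_count(100, 1000))
```

Because the exponent is small, the estimate grows slowly with system size, which fits the remark quoted above that a smaller number of topics may be warranted for software corpora.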


“…Poshyvanyk and Marcus (2006) Fluri et al use a set-based similarity metric to explore how comments and code evolve over time (Fluri et al 2007). Kuhn et al (2007) proposed the use of IR techniques to exploit linguistic information found in source code, such as identifier (i.e., class or method) names and comments. Revelle et al (2011) define new feature coupling metrics based on structural and textual source code information.…”

Section: Related Work (mentioning)

confidence: 99%

An empirical study on the interplay between semantic coupling and co-change of software classes

2017


Software systems continuously evolve to accommodate new features, and interoperability relationships between artifacts point to increasingly relevant software change impacts. During maintenance, developers must ensure that related entities are updated to be consistent with these changes. Studies in the static change impact analysis domain have identified that a combination of source code and lexical information outperforms using each one when adopted independently. However, the extraction of lexical information and the measure of how loosely or closely related two software artifacts are, considering the semantic information embedded in their comments and identifiers, has been carried out using somewhat complex information retrieval (IR) techniques. The interplay between software semantic and change relationship strengths has also not been extensively studied. This work aims to fill both gaps by comparing the effectiveness of measuring semantic coupling of OO software classes using (i) simple identifier based techniques and (ii) the word corpora of the entire classes in a software system. Afterwards, we empirically investigate the interplay between semantic and change coupling. The empirical results show that: (1) identifier based methods have more computational efficiency but cannot always be used interchangeably with corpora-based methods of computing semantic coupling of classes and (2) there is no correlation between semantic and change coupling. Furthermore, we found that (3) there is a directional relationship between the two, as over 70% of the semantic dependencies are also linked by change coupling but not vice versa.
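To make the idea of an identifier-based semantic coupling measure concrete, here is a small sketch. It is not the method of the paper above: it assumes that identifiers are split on camelCase and snake_case boundaries, that each class is represented as a bag of the resulting terms, and that coupling is scored with cosine similarity. The names `split_identifier`, `term_vector`, and `cosine_similarity`, and the sample identifiers, are all hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(identifier: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase terms."""
    # Matches acronyms (HTTP), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def term_vector(identifiers: list[str]) -> Counter:
    """Bag of terms for one class, built from its identifier names."""
    return Counter(
        term for ident in identifiers for term in split_identifier(ident)
    )


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical identifier lists for two classes.
    class_a = term_vector(["UserAccount", "getUserName"])
    class_b = term_vector(["user_name", "setName"])
    print(round(cosine_similarity(class_a, class_b), 3))
```

A score near 1 means the two classes share most of their vocabulary, and a score of 0 means they share none. Corpus-based techniques would typically add comments, term weighting, or dimensionality reduction on top of such raw counts.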

