Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modelled most, (3) data pre-processing and modeling parameters vary quite a bit and are often vaguely reported, and (4) manual topic naming (such as deducting names based on frequent words in a topic) is common.
Technical debt is a metaphor that measures the additional effort needed to continue to add more features in a software due to its inherent decrease in code quality. Most software systems suffer from technical debt at some point so that dedicated tools and metrics have been developed to monitor such debt. Alongside tools, appropriate engineering practices must be put in place by the development team to keep that debt at an acceptable level. In this empirical study, we observed and surveyed Scrum development teams composed of experienced students in order to understand their quality-related processes on a year-long academic project. We found that (1) students do use static analysis tools of many forms, but their actual usage is limited due to time pressure; (2) retrospective and non-constraining feedback on code quality has little to no effect, even when given regularly during the course of the project; and (3) junior developers value composite quality indicators (e.g., maintainability, reliability in SonarQube), even if they do not fully understand their meaning. From our findings, we propose a series of recommendations, both technical and methodological, on how to train junior developers to understand and manage technical debt.
CCS CONCEPTS• Software and its engineering → Agile software development; Maintaining software; • Social and professional topics → Software engineering education;
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.