Predicting technical debt from commit contents: reproduction and extension with automated feature selection

Rantala, Leevi; Mäntylä, Martti

doi:10.1007/s11219-020-09520-3

Cited by 9 publications

(3 citation statements)

References 55 publications

“…One of the threats to construct validity in the study concerns the potentially different interpretations of discussed topics between interviewees and researchers. Because we focus on SATD in this study and most … Code Comments [6], [7], [12], [14], [15], [38]–[66]; Issue Trackers [3], [12], [16]; Commit Messages [12]; Pull Requests [12]; Automated Differentiation Between Fixed and Unfixed SATD -; Automated Tracing Between SATD in Different Sources [11], [12], [36], [37] and Code and Related Development Tasks -; Automated SATD Prioritization [9], [67], …”

Section: Threats To Validity 61 Construct Validity

mentioning

confidence: 99%

Self-Admitted Technical Debt in the Embedded Systems Industry: An Exploratory Case Study

Soliman

Avgeriou

et al. 2023

IEEE Trans. Software Eng.

Technical debt denotes shortcuts taken during software development, mostly for the sake of expedience. When such shortcuts are admitted explicitly by developers (e.g., writing a TODO/Fixme comment), they are termed as Self-Admitted Technical Debt or SATD. There has been a fair amount of work studying SATD management in Open Source projects, but SATD in industry is relatively unexplored. At the same time, there is no work focusing on developers' perspectives towards SATD and its management. To address this, we conducted an exploratory case study in cooperation with an industrial partner to study how they think of SATD and how they manage it. Specifically, we collected data by identifying and characterizing SATD in different sources (issues, source code comments, and commits) and carried out a series of interviews with 12 software practitioners. The results show: 1) the core characteristics of SATD in industrial projects; 2) developers' attitudes towards identified SATD and statistics; 3) triggers for practitioners to introduce and repay SATD; 4) relations between SATD in different sources; 5) practices used to manage SATD; 6) challenges and tooling ideas for SATD management.

“…For the commit messages, we used the same dataset that was used in the study described in [18]. This dataset consists of 73,625 messages, of which 1,876 are classified as SATD.…”

Section: Commits Messages

mentioning

confidence: 99%

“…Rantala and Mäntylä [18] replicating and extending the work introduced by Yan et al [16], they used 1876 commits messages extracted from five repositories (Camel, Log4J, Hadoop, Gerrit, and Tomcat) that were pre-labeled as SATD, and three techniques of NLP (bag-of-words, latent Dirichlet allocation, and word embedding), to predict self-admitted technical debt from commit messages. The main contribution of this study, the bag-of-words technique, is the best performance with a median (AUC 0.7411).…”

mentioning

confidence: 99%

Self-admitted technical debt classification using natural language processing word embeddings

Sabbah

Hanani

2023

IJECE

Recent studies show that it is possible to detect technical debt automatically from source code comments intentionally created by developers, a phenomenon known as self-admitted technical debt. This study proposes a system by which a comment or commit is classified as one of five debt types, namely, requirement, design, defect, test, and documentation. In addition to the traditional term frequency-inverse document frequency (TF-IDF), several word embeddings methods produced by different pre-trained language models were used for feature extraction, such as Word2Vec, GloVe, bidirectional encoder representations from transformers (BERT), and FastText. The generated features were used to train a set of classifiers including naive Bayes (NB), random forest (RF), support vector machines (SVM), and two configurations of convolutional neural network (CNN). Two datasets were used to train and test the proposed systems. Our collected dataset (A-dataset) includes a total of 1,513 comments and commits manually labeled. Additionally, a dataset, consisting of 4,071 labeled comments, used in previous studies (M-dataset) was also used in this study. The RF classifier achieved an accuracy of 0.822 with A-dataset and 0.820 with the M-dataset. CNN with A-dataset achieved an accuracy of 0.838 using BERT features. With M-dataset, the CNN achieves an accuracy of 0.809 and 0.812 with BERT and Word2Vec, respectively.

Self-admitted technical debt in R: detection and causes

et al. 2022

Self-Admitted Technical Debt (SATD) is primarily studied in Object-Oriented (OO) languages and traditionally commercial software. However, scientific software coded in dynamically-typed languages such as R differs in paradigm, and the source code comments’ semantics are different (i.e., more aligned with algorithms and statistics when compared to traditional software). Additionally, many Software Engineering topics are understudied in scientific software development, with SATD detection remaining a challenge for this domain. This gap adds complexity since prior works determined SATD in scientific software does not adjust to many of the keywords identified for OO SATD, possibly hindering its automated detection. Therefore, we investigated how classification models (traditional machine learning, deep neural networks, and deep neural Pre-Trained Language Models (PTMs)) automatically detect SATD in R packages. This study aims to study the capabilities of these models to classify different TD types in this domain and manually analyze the causes of each in a representative sample. Our results show that PTMs (i.e., RoBERTa) outperform other models and work well when the number of comments labelled as a particular SATD type has low occurrences. We also found that some SATD types are more challenging to detect. We manually identified sixteen causes, including eight new causes detected by our study. The most common cause was failure to remember, in agreement with previous studies. These findings will help the R package authors automatically identify SATD in their source code and improve their code quality. In the future, checklists for R developers can also be developed by scientific communities such as rOpenSci to guarantee a higher quality of packages before submission.
