2017
DOI: 10.1109/tse.2016.2584050
An Empirical Comparison of Model Validation Techniques for Defect Prediction Models

Cited by 443 publications (296 citation statements)
References 96 publications
“…A further threat in this category relates to the validation methodology employed. As shown by Tantithamthavorn et al. [114], ten-fold cross-validation can yield unstable results because of the effect of random splitting. To deal with this issue, we repeated the 10-fold cross-validation 100 times; in this way, we greatly reduced the instability introduced by the validation strategy.…”
Section: Threats To Conclusion Validity
Mentioning confidence: 99%
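The repeated cross-validation scheme described in this statement can be sketched as follows. This is a minimal illustration assuming scikit-learn, a synthetic dataset, and a random-forest classifier; the cited study's actual models, data, and metrics differ.

# Minimal sketch of 100 x 10-fold cross-validation, as in the quoted
# statement. Dataset and classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 10 folds repeated 100 times with different random splits; averaging the
# 1000 resulting estimates dampens the variance caused by any single split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="roc_auc", cv=cv)

print(f"mean AUC = {scores.mean():.3f}, sd = {scores.std():.3f}")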
“…Table III reports the parameters we tuned for each classifier. For tuning, we followed a GridSearch approach [31] with tuneLength = 5, i.e., a maximum of five different values evaluated for each parameter [32], [33].…”
Section: Classification Settings
Mentioning confidence: 99%
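The quoted setup uses caret's tuneLength, which caps the number of candidate values tried per hyperparameter. A rough Python analogue, assuming scikit-learn and illustrative parameter grids (not the grids from the cited Table III), might look like:

# Hedged analogue of GridSearch with tuneLength = 5: each tuned
# hyperparameter is given at most five candidate values. The classifier
# and grids below are assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200, 400, 800],  # 5 candidate values
    "max_depth": [2, 4, 8, 16, None],          # 5 candidate values
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=10)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))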
“…To compare the performance of the best-answer prediction models, as suggested by Tantithamthavorn et al. (2017), we use the Scott-Knott ESD test, which groups the models into statistically distinct clusters with a non-negligible difference, at level α = 0.01. The grouping is performed on the mean AUC values (i.e., the mean AUC of the 10 × 10-fold runs for each prediction model).…”
Section: Best-answer Prediction Within Stack Overflow
Mentioning confidence: 99%
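To illustrate the idea behind the Scott-Knott ESD ranking used above, here is a deliberately simplified sketch: models are sorted by mean AUC and recursively split where the between-group sum of squares is maximal, but a split is kept only when Cohen's d between the two halves is non-negligible (|d| >= 0.2). The real test also applies a statistical significance check at α = 0.01, which this sketch omits; the AUC samples, threshold handling, and merging rule are assumptions, not the authors' ScottKnottESD R package.

import numpy as np

def cohens_d(a, b):
    # Pooled-standard-deviation effect size between two samples.
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return 0.0 if pooled == 0 else (np.mean(a) - np.mean(b)) / pooled

def sk_esd(groups):
    # groups: list of (name, AUC array), pre-sorted by mean, descending.
    if len(groups) == 1:
        return [groups]
    means = np.array([np.mean(g[1]) for g in groups])
    grand = means.mean()
    # Find the cut that maximizes the between-block sum of squares.
    best_cut, best_b = 1, -1.0
    for cut in range(1, len(groups)):
        left, right = means[:cut], means[cut:]
        b = (len(left) * (left.mean() - grand) ** 2 +
             len(right) * (right.mean() - grand) ** 2)
        if b > best_b:
            best_cut, best_b = cut, b
    left_obs = np.concatenate([g[1] for g in groups[:best_cut]])
    right_obs = np.concatenate([g[1] for g in groups[best_cut:]])
    if abs(cohens_d(left_obs, right_obs)) < 0.2:  # negligible difference
        return [groups]                           # merge into one rank
    return sk_esd(groups[:best_cut]) + sk_esd(groups[best_cut:])

# Synthetic AUC distributions for three hypothetical models.
rng = np.random.default_rng(1)
models = sorted(
    [("rf", rng.normal(0.800, 0.02, 100)),
     ("lr", rng.normal(0.798, 0.02, 100)),
     ("nb", rng.normal(0.700, 0.02, 100))],
    key=lambda g: np.mean(g[1]), reverse=True)
for rank, cluster in enumerate(sk_esd(models), start=1):
    print(rank, [name for name, _ in cluster])

Here "rf" and "lr" end up in the same rank because their difference is negligible in effect size, while "nb" forms a distinct lower-ranked cluster.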
“…Yet, given the size of the datasets used in the study and the number of different classifiers compared, we defer the evaluation of feature selection techniques in the domain of best-answer prediction to future work. Finally, we are aware that recent work on defect prediction has shown that the choice of model validation technique (i.e., repeated cross-validation, in this case) may impact the performance estimate (Tantithamthavorn et al. 2017). Given that the datasets used in the study are publicly available, this limitation might be addressed in future independent replications.…”
Section: Threats To Validity
Mentioning confidence: 99%