2019
DOI: 10.1016/j.infsof.2019.04.005

“Bad smells” in software analytics papers

Abstract: CONTEXT: There has been a rapid growth in the use of data analytics to underpin evidence-based software engineering. However, the combination of complex techniques, diverse reporting standards and poorly understood underlying phenomena is causing some concern as to the reliability of studies. OBJECTIVE: Our goal is to provide guidance for producers and consumers of software analytics studies (computational experiments and correlation studies). METHOD: We propose using "bad smells", i.e., surface indications of…

Cited by 29 publications (24 citation statements). References 103 publications (140 reference statements).
“…The predictive accuracy of the defect prediction model heavily relies on the modelling pipelines of defect prediction models [4,22,56,73,74,76,78]. To accurately predict defective areas of code, prior studies conducted a comprehensive evaluation to identify the best technique of the modelling pipelines for defect models.…”
Section: The Modelling Pipeline of Defect Prediction Models (mentioning)
confidence: 99%
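The modelling pipeline referenced in this statement typically chains data preprocessing, model training, and evaluation. Below is a minimal sketch in Python, assuming scikit-learn; the synthetic features and defect labels are hypothetical placeholders, not data from the cited studies.

# A minimal sketch of a defect prediction modelling pipeline, assuming
# scikit-learn. Features and labels are synthetic, for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # e.g., size/churn/complexity metrics
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)   # hypothetical defect labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)                    # preprocessing stage of the pipeline
clf = RandomForestClassifier(random_state=0)           # modelling stage
clf.fit(scaler.transform(X_tr), y_tr)

# Evaluation stage: rank-based accuracy on held-out modules.
auc = roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])
print(f"AUC: {auc:.2f}")

Each stage (scaler, classifier, evaluation metric) is one of the pipeline choices whose selection the cited studies evaluate.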
“…6,14,15,19-23 Recently, researchers have also highlighted that the lack of reference models can be considered as an additional cause for inconsistent findings, and for this reason, the adoption of benchmarking mechanisms is highly recommended as a good practice in decision making.3,6,23,33-35,42-45 The proposed reference models can be categorized into "worst-case scenarios"6,23,33 and "reasonably well standards"34,35 serving different purposes in the evaluation process. A worst-case model (e.g., the Median model,33 the Mean model,23 or the Random model6) can be used in practice when the objective is to assess whether a newly proposed model can be considered as a promising candidate for the estimation task of forthcoming projects.…”
Section: Related Work and Contribution (mentioning)
confidence: 99%
“…The latter is an important requirement for any new candidate characterizing the quality of the derived solutions, since the inability of a proposed model to predict better than a naïve approach raises significant doubts about its practical value.45 In other words, worst-case reference models serve as a straightforward sanity check, since they provide borderline standards that any proposed model should outperform in order to qualify as a potentially useful option.35,45 Moreover, Menzies et al.14 point out that there is a necessity for introducing "pruning" mechanisms, since practitioners face the problem of being overwhelmed by a growing number of possibly useless SDEE methods.…”
Section: Related Work and Contribution (mentioning)
confidence: 99%
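The sanity check described in these two statements can be made concrete: a candidate effort estimator should outperform a naive worst-case baseline such as the Median model. Below is a minimal sketch, assuming scikit-learn; the candidate model, features, and effort values are illustrative assumptions, not the cited papers' setups.

# A minimal sketch of the "worst-case" sanity check: a candidate
# estimator must beat a naive Median baseline. Data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(200, 3))             # hypothetical project features
y = 3 * X[:, 0] + rng.normal(scale=2, size=200)   # hypothetical effort values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Worst-case baseline: predict the median training effort for every project.
baseline_mae = mean_absolute_error(y_te, np.full_like(y_te, np.median(y_tr)))

# Candidate model: any newly proposed estimator would go here.
candidate = LinearRegression().fit(X_tr, y_tr)
candidate_mae = mean_absolute_error(y_te, candidate.predict(X_te))

# Sanity check: a useful candidate should have lower error than the baseline.
print(f"median baseline MAE={baseline_mae:.2f}, candidate MAE={candidate_mae:.2f}")

A candidate that fails to beat this borderline standard would, per the quoted argument, raise significant doubts about its practical value.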
“…There has been rapid growth in the use of data analytics to support evidence-based software engineering [14,20]. Modern software development relies on short feedback cycles as a way to provide flexibility and rapid adaptation to market fluctuations.…”
Section: Introduction (mentioning)
confidence: 99%