2022
DOI: 10.1016/j.infsof.2021.106736
A comparison of machine learning algorithms on design smell detection using balanced and imbalanced dataset: A study of God class

Cited by 28 publications (47 citation statements)
References 20 publications
“…Therefore, no conclusive empirical evidence from their experimental results showed that using SMOTE, RUS, and ROS could significantly positively impact machine learning-based CSD models. Alkharabsheh et al. (2022) used machine learning classifiers (i.e., LDA, Quadratic Discriminant Analysis (QDA), NB, Multi-Layer Perceptron (MLP), SVM, DT, GB, CatBoost, Light Gradient Boosting Machine (LGBM), XGBoost, XGBoost with Random Forest (XGBRF), AdaBoost, Bagging, RF, Extra Trees (ET), KNN, Nearest Centroid (NC), Gaussian Process (GP), Ridge, LR, Perceptron, Passive Aggressive (PA), and Stochastic Gradient Descent (SGD)) to test whether applying SMOTE would improve God Class detection performance. Their results showed that SMOTE did not improve God Class detection performance.…”
Section: Imbalanced Learning for CSD
confidence: 99%
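The SMOTE technique discussed above synthesizes new minority-class (smelly) instances by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal sketch of that interpolation step, written from the published description of SMOTE rather than from the cited study's code (function name and parameters are illustrative):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples
    by interpolating between each picked sample and one of its k
    nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    neigh = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per sample
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                          # pick a minority sample
        j = neigh[i, rng.integers(min(k, n - 1))]    # pick one neighbour
        gap = rng.random()                           # interpolation factor
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(new)
```

Because each synthetic point lies on the segment between two real minority samples, SMOTE adds variety that plain duplication (ROS) lacks; whether that helps detection is exactly what the quoted study evaluates.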
“…The former produces a superset of the original code smell data sets by duplicating existing smelly instances or creating new smelly instances from existing ones, while the latter produces a subset of the original code smell data sets by eliminating non-smelly instances. To maintain consistency with previous studies (Pecorelli et al., 2020; Alkharabsheh et al., 2022) and common practice, we set the default smelly ratio to 0.5, resulting in an equal number of smelly and non-smelly instances in the balanced data sets.…”
Section: Data Resampling
confidence: 99%
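The two resampling directions described above (oversampling smelly instances versus undersampling non-smelly ones, both targeting a 0.5 smelly ratio) can be sketched as follows. This is an illustrative reimplementation under the quoted description, not the cited study's actual code:

```python
import numpy as np

def balance_to_ratio(X, y, method="under", rng=None):
    """Bring the smelly (y == 1) ratio to 0.5, either by randomly
    duplicating smelly instances (oversampling) or by randomly
    dropping non-smelly instances (undersampling)."""
    rng = np.random.default_rng(rng)
    smelly = np.where(y == 1)[0]
    clean = np.where(y == 0)[0]
    if method == "over":
        # duplicate smelly instances until they match the non-smelly count
        extra = rng.choice(smelly, size=len(clean) - len(smelly), replace=True)
        idx = np.concatenate([clean, smelly, extra])
    else:
        # keep only as many non-smelly instances as there are smelly ones
        kept = rng.choice(clean, size=len(smelly), replace=False)
        idx = np.concatenate([kept, smelly])
    idx = rng.permutation(idx)      # shuffle so classes are interleaved
    return X[idx], y[idx]
```

Oversampling preserves every original instance at the cost of duplicates; undersampling discards majority-class information but keeps the data set small, which is why studies typically report both.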
“…The main role of software metrics is to estimate and measure characteristics of systems such as size, complexity, inheritance, and encapsulation [14,15]. The selected metrics form a large set of object-oriented metrics that are treated as independent variables, as shown in Table 1.…”
Section: Introduction
confidence: 99%