Improving performance with hybrid feature selection and ensemble machine learning techniques for code smell detection (2021)
DOI: 10.1016/j.scico.2021.102713

Cited by 40 publications (17 citation statements). References 85 publications.
“…As shown in Figure 4 and Figure 5, SMOTE does achieve significant improvement over the None technique on Data Class, God Class, and Long Method across our data sets, and obtains non-significant improvement on Feature Envy. Therefore, researchers and practitioners may still consider using SMOTE as a preprocessing method in line with previous studies Akhter et al (2021); Alkharabsheh et al (2021); Gupta et al (2021); Jain and Saha (2021); Stefano et al (2021); Khleel and Nehéz (2022); Kovačević et al (2022); Nanda and Chhabra (2022); Yedida and Menzies (2022), but should also consider exploring other techniques that may be more effective. Our results in Section 5.3 demonstrate that SMOTE does not consistently achieve the best performance on all four data sets, and the top-performing data resampling technique outperforms SMOTE by 2.63%-17.73% in terms of MCC.…”
Section: Discussion
confidence: 67%
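To make the preprocessing step this statement describes concrete, here is a minimal sketch of SMOTE resampling ahead of a code smell classifier, evaluated with MCC. It assumes the scikit-learn and imbalanced-learn libraries; the CSV path, the "is_smelly" label column, and the random forest classifier are hypothetical placeholders, not the cited authors' actual setup.

```python
# Minimal sketch: SMOTE as a preprocessing step before training a code smell
# classifier, with MCC as the evaluation metric. The file name and column
# name below are hypothetical placeholders, not the study's actual data.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

df = pd.read_csv("god_class.csv")                     # hypothetical data set
X, y = df.drop(columns=["is_smelly"]), df["is_smelly"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Resample only the training split so the test distribution stays untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print("MCC:", matthews_corrcoef(y_test, clf.predict(X_test)))
```

Restricting SMOTE to the training split is the standard precaution: oversampling before the split would leak synthetic neighbors of test points into training and inflate the reported MCC.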
“…Our findings are based on data sets provided by Fontana et al (2016), which are derived from 74 systems in the Qualitas corpus. While these code smell data sets are widely used in recent studies Nucci et al (2018); Jain and Saha (2021), we cannot guarantee that our conclusions will hold true for other data sets. Current research Azeem et al (2019); Pecorelli et al (2020); Alkharabsheh et al (2022) in CSD tends to treat code smells as a binary classification problem, meaning that a code block is either classified as having a particular smell or not having that smell.…”
Section: Threats To Validity
confidence: 86%
“…These methods are very computationally expensive and often unrealistic if the feature space is vast. (iii) Embedded methods: in these methods, feature selection is part of building the ML algorithm itself. These methods select the best possible feature subset according to the ML model to be implemented [41]. In this study, we applied embedded methods because they are faster and less computationally expensive than the other methods and fit within the ML models; a feature scaling technique was also applied to bring the features to the same standard.…”
Section: 3. Data Pre-processing and Features Selection
confidence: 99%
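As an illustration of the embedded approach this statement contrasts with wrapper methods, below is a minimal sketch assuming scikit-learn: scaling feeds an L1-penalized model whose own coefficients select the surviving features during fitting. The estimator, its hyperparameters, and the synthetic data are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of embedded feature selection combined with feature scaling.
# The estimator and the synthetic data are illustrative; the study's exact
# configuration is not specified here.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=42)   # stand-in for a code smell data set

pipe = Pipeline([
    ("scale", StandardScaler()),              # put all features on one scale
    # Embedded selection: the L1-penalized model's own coefficients decide
    # which features survive, as a side effect of fitting the model itself.
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
kept = pipe.named_steps["select"].get_support().sum()
print(f"features kept by embedded selection: {kept} of {X.shape[1]}")
```

Unlike a wrapper method, which retrains the model for every candidate feature subset, the selection here costs a single fit, which matches the statement's point about embedded methods being faster and cheaper.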