Revisiting the Class Imbalance Issue in Software Defect Prediction

Sohan, Fahimuzzman; Kabir, Alamgir; Jabiullah, Md. Ismail; Rahman, Sheikh Shah Mohammad Motiur

doi:10.1109/ecace.2019.8679382

Cited by 12 publications

(3 citation statements)

References 32 publications


“…Data resampling techniques were used to tackle data imbalance problems in the data sets. These sampling techniques are widely used in machine learning–based prediction models in different areas [24]. Our first analysis was done without the data resampling technique, where the four machine learning algorithms were applied directly to the data sets.…”

Section: Results (mentioning)

confidence: 99%

Predicting Risk of Stroke From Lab Tests Using Machine Learning Algorithms: Development and Evaluation of Prediction Models

Alanazi, Abdou, Luo

2021

JMIR Form Res


Background: Stroke, a cerebrovascular disease, is one of the major causes of death. It causes significant health and financial burdens for both patients and health care systems. One of the important risk factors for stroke is health-related behavior, which is becoming an increasingly important focus of prevention. Many machine learning models have been built to predict the risk of stroke or to automatically diagnose stroke, using predictors such as lifestyle factors or radiological imaging. However, there have been no models built using data from lab tests.

Objective: The aim of this study was to apply computational methods using machine learning techniques to predict stroke from lab test data.

Methods: We used the National Health and Nutrition Examination Survey data sets with three different data selection methods (ie, without data resampling, with data imputation, and with data resampling) to develop predictive models. We used four machine learning classifiers and six performance measures to evaluate the performance of the models.

Results: We found that accurate and sensitive machine learning models can be created to predict stroke from lab test data. Our results show that the data resampling approach performed the best compared to the other two data selection techniques. Prediction with the random forest algorithm, which was the best algorithm tested, achieved an accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve of 0.96, 0.97, 0.96, 0.75, 0.99, and 0.97, respectively, when all of the attributes were used.

Conclusions: The predictive model, built using data from lab tests, was easy to use and had high accuracy. In future studies, we aim to use data that reflect different types of stroke and to explore the data to build a prediction model for each type.
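As an illustration of how five of the six measures named in this abstract relate to one another, the sketch below computes them from a binary confusion matrix. The counts are hypothetical and are not taken from the study; they only show that sensitivity and specificity can both be high while positive predictive value stays noticeably lower when the positive class is rare. Area under the curve is left out because it needs ranked prediction scores rather than counts. The function name `confusion_metrics` is an assumption made for this example.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return five count-based measures from a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "ppv": tp / (tp + fp),           # positive predictive value (precision)
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Hypothetical counts: 1000 samples, of which only 100 are positive.
metrics = confusion_metrics(tp=97, fp=36, tn=864, fn=3)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts, only 3 of 100 positives are missed, yet the 36 false positives drawn from the much larger negative class are enough to pull positive predictive value well below sensitivity.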


“…They evaluated twenty seven data sets, using seven classifiers on seven types of input metrics and various imbalanced learning methods and concluded that imbalanced learning could be considered only for moderate or highly imbalanced software defect prediction datasets. Sohan et al [37] conducted a study to know the inconsistency in the performance among imbalanced dataset and balanced dataset. In this study, eight public data sets were examined with seven classification methods to conclude that the imbalance nature of defective and non-defective classes plays a major role in SDP and among seven classifiers, the voting results in best performer among the classifiers.…”

Section: Related Work (mentioning)

confidence: 99%

Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance

2020


Software defect prediction (SDP) is the technique used to predict the occurrences of defects in the early stages of software development process. Early prediction of defects will reduce the overall cost of software and also increase its reliability. Most of the defect prediction methods proposed in the literature suffer from the class imbalance problem. In this paper, a novel class imbalance reduction (CIR) algorithm is proposed to create a symmetry between the defect and non-defect records in the imbalance datasets by considering distribution properties of the datasets and is compared with SMOTE (synthetic minority oversampling technique), a built-in package of many machine learning tools that is considered a benchmark in handling class imbalance problems, and with K-Means SMOTE. We conducted the experiment on forty open source software defect datasets from the PRedictOr Models In Software Engineering (PROMISE) repository using eight different classifiers and evaluated with six performance measures. The results show that the proposed CIR method shows improved performance over SMOTE and K-Means SMOTE.


“…[8] [9]. Inappropriately [10] [11], Uneven data distribution presents a significant difficulty for the SDP procedure, lowering the quality of the learning model as a result. Due to the asymmetry of the situation, there are fewer malfunctioning modules than there are functional ones.…”

Section: Introduction (mentioning)

confidence: 99%

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

Goyal, Sinha

2023

IJRITCC


Predicting when and where bugs will appear in software may help improve quality and save on software testing expenses. Predicting bugs in individual modules of software by utilizing machine learning methods. There are, however, two major problems with the software defect prediction dataset: Social stratification (there are many fewer faulty modules than non-defective ones), and noisy characteristics (a result of irrelevant features) that make accurate predictions difficult. The performance of the machine learning model will suffer greatly if these two issues arise. Overfitting will occur, and biassed classification findings will be the end consequence. In this research, we suggest using machine learning approaches to enhance the usefulness of the CatBoost and Gradient Boost classifiers while predicting software flaws. Both the Random Over Sampler and Mutual info classification methods address the class imbalance and feature selection issues inherent in software fault prediction. Eleven datasets from NASA's data repository, "Promise," were utilised in this study. Using 10-fold cross-validation, we classified these 11 datasets and found that our suggested technique outperformed the baseline by a significant margin. The proposed methods have been evaluated based on their abilities to anticipate software defects using the most important indices available: Accuracy, Precision, Recall, F1 score, ROC values, RMSE, MSE, and MAE parameters. For all 11 datasets evaluated, the suggested methods outperform baseline classifiers by a significant margin. We compared our model to other methods of flaw identification and found that it outperformed them all. The computational detection rate of the suggested model is higher than that of conventional models, as shown by the experiments.
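For readers unfamiliar with the random oversampling idea this abstract refers to, a minimal sketch follows. It is a generic illustration, not the authors' pipeline or any particular library's implementation: minority-class rows are duplicated at random, with replacement, until every class matches the size of the largest one. The function name `random_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class that duplicates can be drawn from.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 2 defective modules (label 1), 8 non-defective (label 0).
rows = [[i] for i in range(10)]
labels = [1, 1] + [0] * 8

bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

When this kind of resampling is combined with cross-validation, it is usually applied to the training folds only, so that duplicated rows do not leak into the data used for evaluation.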
