LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

Wu, Yumei; Yao, Jingxiu; Chang, Shuo; Liu, Bin

doi:10.3390/app10238324

Cited by 8 publications

(9 citation statements)

References 51 publications

(63 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The optimum values were searched in the range of 1 and 30 (Kang and Ryu, 2019) and 100 and 1,200 (Baker et al, 2020a), respectively. Also, variance smoothing (10 −x , x between 3 and 9) was the only parameter to be considered in the determination of optimum model configuration for NB (Soni et al, 2020;Wu et al, 2020). In the identification of the KNN structure, three hyperparameters of the algorithm were scanned, the number of neighbors and the leaf size with the ranges of 1-30 for both parameters (Bykov et al, 2019;Zhang et al, 2020a, b) and four different distance metrics, i.e.…”

Section: Classification Resultsmentioning

confidence: 99%

See 1 more Smart Citation

Prediction of construction accident outcomes based on an imbalanced dataset through integrated resampling techniques and machine learning methods

Koç

Ekmekçioğlu²,

Gürgün³

2022

ECAM

View full text Add to dashboard Cite

PurposeCentral to the entire discipline of construction safety management is the concept of construction accidents. Although distinctive progress has been made in safety management applications over the last decades, construction industry still accounts for a considerable percentage of all workplace fatalities across the world. This study aims to predict occupational accident outcomes based on national data using machine learning (ML) methods coupled with several resampling strategies.Design/methodology/approachOccupational accident dataset recorded in Turkey was collected. To deal with the class imbalance issue between the number of nonfatal and fatal accidents, the dataset was pre-processed with random under-sampling (RUS), random over-sampling (ROS) and synthetic minority over-sampling technique (SMOTE). In addition, random forest (RF), Naïve Bayes (NB), K-Nearest neighbor (KNN) and artificial neural networks (ANNs) were employed as ML methods to predict accident outcomes.FindingsThe results highlighted that the RF outperformed other methods when the dataset was preprocessed with RUS. The permutation importance results obtained through the RF exhibited that the number of past accidents in the company, worker's age, material used, number of workers in the company, accident year, and time of the accident were the most significant attributes.Practical implicationsThe proposed framework can be used in construction sites on a monthly-basis to detect workers who have a high probability to experience fatal accidents, which can be a valuable decision-making input for safety professionals to reduce the number of fatal accidents.Social implicationsPractitioners and occupational health and safety (OHS) departments of construction firms can focus on the most important attributes identified by analysis results to enhance the workers' quality of life and well-being.Originality/valueThe literature on accident outcome predictions is limited in terms of dealing with imbalanced dataset through integrated resampling techniques and ML methods in the construction safety domain. A novel utilization plan was proposed and enhanced by the analysis results.

show abstract

Section: Classification Resultsmentioning

confidence: 99%

“…Also, variance smoothing (10−x, x between 3 and 9) was the only parameter to be considered in the determination of optimum model configuration for NB (Soni et al. , 2020; Wu et al. , 2020).…”

Section: Resultsmentioning

confidence: 99%

Prediction of construction accident outcomes based on an imbalanced dataset through integrated resampling techniques and machine learning methods

Koç

Ekmekçioğlu²,

Gürgün³

2022

ECAM

View full text Add to dashboard Cite

show abstract

“…A set of conditions for ant lion hunting can be presented with the following and can be seen in Figure 7 [15]. Hunters (ants) move randomly in the search space.…”

Section: Ant Milk Optimization Algorithmmentioning

confidence: 99%

“…Figure (15): Behavioral status of group sailing fish [25] 4.5 Plant-inspired meta-exploratory algorithms…”

Section: Wall Optimization Algorithmmentioning

confidence: 99%

Investigating and comparing the performance of meta-heuristic algorithms in feature selection and software fault prediction

Arshaghi

2022

Preprint

View full text Add to dashboard Cite

Meta‑Heuristic algorithms are optimization techniques that provide the optimal solution through processes of repeated exploration and exploitation of the entire search space. Feature selection is also an important and prominent process in the field of machine learning that reduces data dimensions. This paper examines and compares nature-inspired meta-heuristic algorithms for feature selection to increase the accuracy of software fault prediction. Researchers cannot easily select meta-heuristic algorithms as a suitable method for their research due to their great variety and multiplicity. In this paper, by describing the feature selection techniques and its methods, the application of meta-heuristic algorithms in different fields, such as swarm intelligence and binary methods of these algorithms has been investigated. Also, by introducing 18 meta-heuristic algorithms in 6 different categories and evaluating each of them, a suitable analysis has been provided to researchers so that they can easily and with the highest efficiency choose the appropriate algorithm and method of their work. In the papers presented so far, meta-heuristic algorithms have been studied from only one aspect, while in this article, while studying different types of research, they have tried to study and evaluate them from different aspects. The effectiveness of the combination of three meta-heuristic algorithms, developed butterfly flame, bee colony and developed wall, was tested on 20 data sets. the proposed method in 17 datasets was able to improve the result of 7 datasets.

show abstract

“…An extensive body of research on software defect prediction based on ML models exists. The literature approaches explore defective prediction models from many perspectives [28][29][30][31][32][33][34][35][36][37]. One of the most well-known datasets used in many of those studies is the NASA MDP open datasets [38,39].…”

Section: Introductionmentioning

confidence: 99%

Making More with Less: Improving Software Testing Outcomes Using a Cross-Project and Cross-Language ML Classifier Based on Cost-Sensitive Training

Nascimento,

Shimanuki,

Dias

2024

Applied Sciences

View full text Add to dashboard Cite

As digitalization expands across all sectors, the economic toll of software defects on the U.S. economy reaches up to $2.41 trillion annually. High-profile incidents like the Boeing 787-Max 8 crash have shown the devastating potential of these defects, highlighting the critical importance of software testing within quality assurance frameworks. However, due to its complexity and resource intensity, the exhaustive nature of comprehensive testing often surpasses budget constraints. This research utilizes a machine learning (ML) model to enhance software testing decisions by pinpointing areas most susceptible to defects and optimizing scarce resource allocation. Previous studies have shown promising results using cost-sensitive training to refine ML models, improving predictive accuracy by reducing false negatives through addressing class imbalances in defect prediction datasets. This approach facilitates more targeted and effective testing efforts. Nevertheless, these models’ in-company generalizability across different projects (cross-project) and programming languages (cross-language) remained untested. This study validates the approach’s applicability across diverse development environments by integrating various datasets from distinct projects into a unified dataset, using a more interpretable ML technique. The results demonstrate that ML can support software testing decisions, enabling teams to identify up to 7× more defective modules compared to benchmark with the same testing effort.

show abstract

LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

Cited by 8 publications

References 51 publications

Prediction of construction accident outcomes based on an imbalanced dataset through integrated resampling techniques and machine learning methods

Prediction of construction accident outcomes based on an imbalanced dataset through integrated resampling techniques and machine learning methods

Investigating and comparing the performance of meta-heuristic algorithms in feature selection and software fault prediction

Making More with Less: Improving Software Testing Outcomes Using a Cross-Project and Cross-Language ML Classifier Based on Cost-Sensitive Training

Contact Info

Product

Resources

About