Customized Instance Random Undersampling to Increase Knowledge Management for Multiclass Imbalanced Data Classification

Tusell-Rey, Claudia C.; Camacho-Nieto, Oscar; Yáñez-Márquez, Cornelio; Villuendas-Rey, Yenny

doi:10.3390/su142114398

Cited by 2 publications

(1 citation statement)

References 60 publications


“…Such methods are not relevant for mixed and partial data. Because there are more instances in the dataset due to oversampling processes, the computational execution time of the decision-making algorithms increases [37]. Moreover, the biggest drawback of oversampling is that it increases the likelihood of overfitting by using identical replicas of previous cases.…”

Section: Proposed Methods (mentioning)

confidence: 99%

F-RUS-RF: A Hybrid Machine Learning Approach for Cancer Detection in Older Adults

Javeed, Dallora, Saleem, et al. 2023

Preprint


Background: Globally, cancer is the second-leading cause of mortality, behind cardiovascular diseases. Although cancer affects people of all ages, most cases occur among those in their fifth or sixth decade of life; hence, the chance of developing cancer grows significantly with age. Early prediction of cancer and its risk factors is crucial because it increases survival rates. Motivated by this fact, we conducted this study on a Swedish older adult sample, where the proposed machine learning (ML) model not only predicted cancer but also identified risk factors for cancer in older adults.

Results: The newly proposed model comprises two modules. The first module uses an F-score statistical model to rank the variables from the acquired dataset, which consists of 75 variables, and the second module serves as a classifier. For the classification task, we deployed the random forest (RF) algorithm, and the hyperparameters of the RF model were optimized with a genetic algorithm. The highly significant variables determined in the first module are fed into the second module for cancer prediction. During the study, we observed that the classes in the dataset were highly imbalanced. To avoid bias in the ML model, we deployed a random undersampling approach to balance the classes in the dataset. The components of the proposed model are combined into a single unit that functions as a "black box." The newly constructed model for cancer prediction was named F-RUS-RF. The highest accuracy achieved by the F-RUS-RF model for cancer prediction while using only the top six ranked variables was 86.15%, with sensitivity and specificity of 92.25% and 85.14%, respectively.

Conclusions: The proposed F-RUS-RF model helped us predict cancer and identify its risk factors in older adults. From the total of 75 variables in the dataset, the six most significant variables, those that actually cause cancer in older adults, were determined by the proposed F-RUS-RF model. By taking care of these risk factors, we can reduce the risk of cancer in older adults.
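The random undersampling step and the sensitivity/specificity figures mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_undersample` and `sensitivity_specificity` are hypothetical, the sketch assumes every class is cut down to the size of the rarest class, and it leaves out the F-score ranking, the random forest, and the genetic-algorithm tuning entirely.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class.

    Assumption for illustration: all classes are balanced to the minority
    class size. The abstract does not state the exact ratio that was used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        # Sample without replacement, so no row is ever duplicated.
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy data: 2 positive rows and 8 negative rows.
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
X_bal, y_bal = random_undersample(X, y, seed=1)
print(Counter(y_bal))  # each class now has 2 rows
```

Unlike oversampling, this approach shrinks the dataset and creates no replicated rows, at the cost of discarding majority-class information.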

