An Empirical Assessment of Performance of Data Balancing Techniques in Classification Task

Jadhav, Anil; Mostafa, Samih M.; Elmannai, Hela; Karim, Faten Khalid

doi:10.3390/app12083928

Cited by 14 publications

(7 citation statements)

References 40 publications


“…Unbalanced datasets are relevant and commonly observed in pathology detection problems that can significantly impact the classification performance of machine learning models. Several solutions have been proposed to deal with unbalanced datasets ( 44 , 45 ) and the problem was solved by data resampling at the pre-processing data level. The basic idea of unbalance is to resample the original dataset, either by oversampling the smallest class or subsampling the largest class until the class sizes are approximately the same.…”

Section: Data Balancing Techniques (mentioning)

confidence: 99%

“…The weakness of this method is that if the dataset is large, it can introduce a significant additional computational load and the duplication of information due to the oversampling of the minority class instances, which can lead to the overfitting of the model. However, this method retains all important information, unlike the US method ( 44 ).…”

Section: Data Balancing Techniques (mentioning)

confidence: 99%
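
The two resampling strategies described in the statements above (oversampling the smallest class, or undersampling the largest, until class sizes match) can be illustrated with a minimal sketch. This is not the code used by the cited authors; the function names `random_oversample` and `random_undersample` are hypothetical, and the example assumes plain random selection with the standard library only.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows under their class label, preserving first-seen class order."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling with replacement: these copies are the duplicated information
        # that the second statement warns can lead to overfitting.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling without replacement: rows not selected are discarded,
        # which is the information loss attributed to undersampling.
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (made up for illustration): 8 majority rows, 2 minority rows.
rows = list(range(10))
labels = [0] * 8 + [1] * 2

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # both classes now have 8 rows
print(Counter(under_labels))  # both classes now have 2 rows
```

In practice, resampling of this kind would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.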


Machine learning models based on clinical indices and cardiotocographic features for discriminating asphyxia fetuses—Porto retrospective intrapartum study

Ribeiro, Nunes, Castro et al. 2023, Front. Public Health


Introduction: Perinatal asphyxia is one of the most frequent causes of neonatal mortality, affecting approximately four million newborns worldwide each year and causing the death of one million individuals. One of the main reasons for these high incidences is the lack of consensual methods of early diagnosis for this pathology. Estimating risk-appropriate health care for mother and baby is essential for increasing the quality of the health care system. Thus, it is necessary to investigate models that improve the prediction of perinatal asphyxia. Access to the cardiotocographic signals (CTGs) in conjunction with various clinical parameters can be crucial for the development of a successful model.

Objectives: This exploratory work aims to develop predictive models of perinatal asphyxia based on clinical parameters and fetal heart rate (fHR) indices.

Methods: Single-gestation data from a retrospective unicentric study from Centro Hospitalar e Universitário do Porto de São João (CHUSJ) between 2010 and 2018 were probed. The CTGs were acquired and analyzed by Omniview-SisPorto, estimating several fHR features. The clinical variables were obtained from the electronic clinical records stored by ObsCare. Entropy and compression characterized the complexity of the fHR time series. The contribution of these variables to the prediction of perinatal asphyxia was probed by binary logistic regression (BLR) and Naive-Bayes (NB) models.

Results: The data consisted of 517 cases, with 15 pathological cases. The asphyxia prediction models showed promising results, with an area under the receiver operator characteristic curve (AUC) >70%. In NB approaches, the best models combined clinical and SisPorto features. The best model was the univariate BLR with the variable compression ratio scale 2 (CR2) and an AUC of 94.93% [94.55; 95.31%].

Conclusion: Both BLR and Bayesian models have advantages and disadvantages. The model with the best performance predicting perinatal asphyxia was the univariate BLR with the CR2 variable, demonstrating the importance of non-linear indices in perinatal asphyxia detection. Future studies should explore decision support systems to detect sepsis, including clinical and CTGs features (linear and non-linear).


“…Data balancing [53,54] is crucial to addressing class imbalance and making sure that machine learning models are impartial, legitimate, and powerful. It increases the performance of the model, averts bias, enhances generalizability, facilitates better learning of features, prevents overfitting, and increases the model's stability to change in concept.…”

Section: Data Balancing For Classification (mentioning)

confidence: 99%

ActieCoach: Personalized Recommendation Generation in Activity eCoaching with Meta-Heuristic Approach

Chatterjee, Pahari, Prinz et al. 2023, Preprint


Background: Automated coaches (eCoach) can help people lead a healthy lifestyle (e.g., reduction of sedentary bouts) with continuous health status monitoring and personalized recommendation generation with artificial intelligence (AI). Semantic ontology can play a crucial role in knowledge representation, data integration, and information retrieval.

Methods: This study proposes a semantic ontology model to annotate the AI predictions, forecasting outcomes, and personal preferences to conceptualize a personalized recommendation generation model with a hybrid approach. This study considers a mixed activity projection method that takes individual activity insights from the univariate time-series prediction and ensemble multi-class classification approaches. We have introduced a way to improve the prediction result with a residual error minimization (REM) technique and make it meaningful in recommendation presentation with a Naïve-based interval prediction approach. We have integrated the activity prediction results in an ontology for semantic interpretation. SPARQL Protocol and RDF Query Language (SPARQL) queries have generated personalized recommendations in an understandable format. Moreover, we have evaluated the performance of the time-series prediction and classification models against standard metrics on both imbalanced and balanced public PMData and private MOX2-5 activity datasets. We have used Adaptive Synthetic (ADASYN) to generate synthetic data from the minority classes to avoid bias. The activity datasets were collected from healthy adults (n=16 for public datasets; n=15 for private datasets). The standard ensemble algorithms have been used to investigate the possibility of classifying daily physical activity levels into the following activity classes: sedentary (0), low active (1), active (2), highly active (3), and rigorous active (4). The daily step count, low physical activity (LPA), medium physical activity (MPA), and vigorous physical activity (VPA) serve as input for the classification models. Subsequently, we re-verify the classifiers on the private MOX2-5 dataset. The performance of the ontology has been assessed with reasoning and SPARQL query execution time. Additionally, we have verified our ontology for effective recommendation generation.

Results: We have tested several standard AI algorithms and selected the best-performing model with optimized configuration for our use case by empirical testing. We have found that the autoregression model with the REM method outperforms the autoregression model without the REM method for both datasets. The Gradient Boost (GB) classifier outperforms other classifiers with a mean accuracy score of 98.00% and 99.00% for imbalanced PMData and MOX2-5 datasets, respectively, and 98.30% and 99.80% for balanced PMData and MOX2-5 datasets, respectively. The Hermit reasoner performs better than other ontology reasoners under defined settings. Our proposed algorithm shows a direction to combine the AI prediction forecasting results in an ontology to generate personalized activity recommendations in eCoaching.

Conclusion: The proposed method combining step-prediction, activity-level classification techniques, and personal preference information with semantic rules is an asset for generating personalized recommendations.


“…There are 450176 urls in the dataset. Imbalanced dataset affects the classification process which gives a skewed result [13]. To avoid such issue, the experiment uses 10000 benign and 10000 malicious urls.…”

Section: A Raw Dataset (mentioning)

confidence: 99%

Structural Analysis of URL For Malicious URL Detection Using Machine Learning

A. Saleem Raja, S. Peerbashab, Y. Mohammed Iqbal et al. 2023, JOAASR


Malicious websites are intentionally created websites that aid online criminals in carrying out illicit actions. They commit crimes like installing malware on the victim's computer, stealing private data from the victim's system, and exposing the victim online. Malicious code can also be found on legitimate websites. Therefore, locating such a website in cyberspace is a difficult operation that demands the utilization of an automated detection tool. Currently, machine learning/deep learning technologies are employed to detect such malicious websites. However, the problem persists since the attack vector is constantly changing. Most research solutions use a limited number of URL lexical features, DNS information, global ranking information, and webpage content features. Combining several derived features involves computation time and security risk. Additionally, the dataset's minimal features don't maximize its potential. This paper exclusively uses URLs to address this problem and blends linguistic and vectorized URL features. The complete potential of the URL is utilized through vectorization. Six machine learning algorithms are examined. The results indicate that the proposed approach performs better for the count vectorizer with the random forest algorithm.
