2022
DOI: 10.33395/sinkron.v7i4.11792
Stratified K-fold cross validation optimization on machine learning for prediction

Abstract: Cervical cancer is the second most common malignant tumor in women, with 341,000 deaths worldwide in 2020, almost 80% of which occurred in developing countries. One of the causes is infection with Human papillomavirus (HPV) types 16 and 18. The increasing incidence of cervical cancer in Indonesia means the disease must be treated seriously, as it is one of the main causes of death. In addition to the virus, external factors can also be a cause. The high mortality rate in patients is caused by the patient's awa…

Cited by 10 publications (6 citation statements)
References 15 publications
“…Five-fold cross-validation was used, dividing the dataset into five subdivisions and taking four subdivisions each time as the training set and the remaining subdivision as the test set. Stratified KFold (Widodo et al., 2022) was used to perform the 5-fold cross-validation in this study. To avoid feature leakage, feature selection was applied only to the training dataset.…”
Section: Methods (mentioning)
confidence: 99%
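A minimal sketch of the workflow this statement describes, assuming a scikit-learn setup: stratified 5-fold cross-validation in which feature selection is re-fitted on each training split only, so nothing from the test split leaks into the selected features. The dataset and classifier here are placeholders, not those of the cited works.

```python
# Stratified 5-fold CV with leakage-free feature selection (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Putting scaling and feature selection inside a Pipeline guarantees that
# both are fitted per training fold only, avoiding feature leakage.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```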
“…Model performance measurement is carried out using a confusion matrix and 10-fold cross-validation. The confusion matrix helps to understand how well a classification model predicts correctly, while k-fold cross-validation helps measure the overall performance of the model by avoiding the bias caused by a single random separation of test and training data [22], [23].…”
Section: Methods (mentioning)
confidence: 99%
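A short sketch of the evaluation pair mentioned here, again assuming scikit-learn with a placeholder dataset and classifier: a confusion matrix computed on one held-out split, plus 10-fold cross-validation to average performance over many partitions rather than relying on a single random split.

```python
# Confusion matrix on one split + 10-fold CV accuracy (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
clf = RandomForestClassifier(random_state=0)

# Confusion matrix on a single stratified train/test split:
# rows are true classes, columns are predicted classes.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))

# 10-fold cross-validation averages accuracy over ten different partitions,
# reducing the bias of any one random train/test separation.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean())
```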
“…Stratified K-Fold Cross-Validation (SKCV) is an extension of KCV in which the class distribution of the original data is taken into consideration when sampling [18]. Accordingly, SKCV is preferred over KCV in the case of unbalanced class distributions [19]. In our experiments, we used SKCV, specifically 10-fold cross-validation, to split the data into training and testing sets, while computing the average accuracy over the different folds.…”
Section: K-Fold Cross-Validation (KCV) (mentioning)
confidence: 99%
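A brief sketch of SKCV as described in this statement, assuming scikit-learn and a synthetic imbalanced dataset (not the data of the cited work): StratifiedKFold keeps the class ratio of the full dataset in every fold, and the average accuracy is taken over the ten folds.

```python
# 10-fold stratified CV on an imbalanced problem (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced problem: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

clf = LogisticRegression(max_iter=1000)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    # Each train/test fold preserves the ~9:1 class ratio of the full data,
    # which is what distinguishes SKCV from plain KCV.
    clf.fit(X[train_idx], y[train_idx])
    fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("Average accuracy over 10 stratified folds:", np.mean(fold_acc))
```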