The Effect of Training and Testing Process on Machine Learning in Biomedical Datasets

Uçar, Muhammed Kürşad; Nour, Majid; Sindi, Hatem; Polat, Kemal

doi:10.1155/2020/2836236

Cited by 74 publications

(41 citation statements)

References 35 publications

(53 reference statements)


“…The dataset is separated into 80% for the training database and 20% for the testing database to perform the COVID-19 classification according to the country under the Python computational environment. The training-testing ratio is selected according to data correlation and performance criteria to achieve high algorithm accuracy [36]. It was found that the performance criteria can be maximized when the training data is greater than the testing data [37].…”

Section: Results (mentioning)

confidence: 99%

“…After the data labeling stage, all labeled data is randomly divided with a proportion of 8:2 into a training set and a testing set. In machine learning algorithms, the value of training and testing is a significant factor in deciding the performance level [36]. If the features and the label have a high correlation, the training-testing ratio is 50%-50%.…”

Section: Data Division (mentioning)

confidence: 99%

“…Therefore, the accuracy rate is a critical indicator in the training and testing process and classification success. In machine learning algorithms, if the testing data increases, the accuracy is expected to decrease [36].…”

Section: Performance Metrics (mentioning)

confidence: 99%
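The random 8:2 division into training and testing sets described in the citation statements above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the citing authors' code; the function name `split_train_test`, the `seed` parameter, and the 200-sample toy input are assumptions made for the example.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test).

    `train_fraction` and `seed` are illustrative parameters, not values
    taken from the cited papers.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data: 200 sample indices standing in for labeled records.
train, test = split_train_test(range(200), train_fraction=0.8, seed=42)
print(len(train), len(test))
```

Changing `train_fraction` (for example to 0.5 or 0.9) gives the other ratios mentioned in the statements; fixing `seed` keeps the comparison between ratios repeatable.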


Computational predictions for protein sequences of COVID-19 virus via machine learning algorithms

Afify

Zanaty

2021

Med Biol Eng Comput


The rapid spread of coronavirus disease (COVID-19) has become a worldwide pandemic and affected more than 15 million patients reported in 27 countries. Therefore, the computational biology carrying this virus that correlates with the human population urgently needs to be understood. In this paper, the classification of the human protein sequences of COVID-19, according to the country, is presented based on machine learning algorithms. The proposed model is based on distinguishing 9238 sequences using three stages, including data preprocessing, data labeling, and classification. In the first stage, data preprocessing's function converts the amino acids of COVID-19 protein sequences into eight groups of numbers based on the amino acids' volume and dipole. It is based on the conjoint triad (CT) method. In the second stage, there are two methods for labeling data from 27 countries from 0 to 26. The first method is based on selecting one number for each country according to the code numbers of countries, while the second method is based on binary elements for each country. According to their countries, machine learning algorithms are used to discover different COVID-19 protein sequences in the last stage. The obtained results demonstrate 100% accuracy, 100% sensitivity, and 90% specificity via the country-based binary labeling method with a linear support vector machine (SVM) classifier. Furthermore, with significant infection data, the USA is more prone to correct classification compared to other countries with fewer data. The unbalanced data for COVID-19 protein sequences is considered a major issue, especially as the US's available data represents 76% of a total of 9238 sequences. The proposed model will act as a prediction tool for the COVID-19 protein sequences in different countries.


“…These data consist of 200 images, including 100 for coronavirus and 100 for non-coronavirus images collected from different patients [24]. The effect of training and testing data on medical images is based on the success rate of the CAD system [25]. Based on experts, when the training data is less than 50%, the test results will be failed to achieve a good classifier.…”

Section: Data Description Phase (mentioning)

confidence: 99%

An Automated CAD System of CT Chest Images for COVID-19 Based on Genetic Algorithm and K-Nearest Neighbor Classifier

Afify

Darwish

Mohammed

et al. 2020

ISI


The detection of COVID-19 from computed tomography (CT) scans suffers from inaccuracies due to difficult data acquisition and radiologist errors. Therefore, a fully automated computer-aided detection (CAD) system is proposed to distinguish coronavirus from non-coronavirus images. In this paper, a total of 200 coronavirus and non-coronavirus images are employed, with 90% used for training and 10% for testing. The proposed system comprises five stages. In the first stage, the images are preprocessed by thresholding-based lung segmentation. Afterward, feature extraction is performed on the segmented images, and a genetic algorithm is applied to the sixty-four extracted features to select the best ones. In the final stage, K-nearest neighbor (KNN) and decision tree classifiers are applied for COVID-19 classification. The findings indicate that the KNN classifier with K=3 detects COVID-19 with an accuracy of 100% on CT images, whereas the decision tree achieves 95% accuracy. The system is intended to facilitate the radiologist's role in the prediction of COVID-19 images and to be valuable to the research community working on automated COVID-19 image prediction.


“…Feature selection algorithms are often used in the machine learning field to improve the performance of systems [9][10][11]. In the field of machine learning, datasets are used in a variety of sizes and types [12][13][14][15].…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Approach to Ensemble Classifiers: FsBoost-Based Subspace Method

Noor

Uçar

Polat

et al. 2020

Mathematical Problems in Engineering

Self Cite


In this article, an algorithm is proposed for creating an ensemble classifier. The name of the algorithm is the F-score subspace method (FsBoost). According to this method, the features are selected with the F-score and classified with different or the same classifiers. In the next step, the ensemble classifier is created. Two versions that are named FsBoost.V1 and FsBoost.V2 have been developed based on classification by the same or different classifiers. According to the results obtained, the results are consistent with the literature. Besides, a higher accuracy rate is obtained compared with many algorithms in the literature. The algorithm is fast because it has a few steps. It is thought that the algorithm will be successful due to these advantages.
