Background:
Predicting the enzyme class of an unknown protein is one of the challenging tasks in bioinformatics. The number of known proteins grows day by day, which makes clinical verification and classification difficult; as a result, enzyme class prediction offers a new opportunity for bioinformatics researchers. Machine learning classification techniques help with protein classification and prediction, but it is imperative to know which technique is best suited to protein classification. This study used human protein data extracted from the UniProtKB databank. In total, 4368 protein records with 45 identified features were used for the experimental analysis.
Objective:
The prime objective of this article is to find an appropriate classification technique for classifying reviewed as well as unreviewed human enzyme-class protein data, and to determine the significance of different features in protein classification and prediction.
Method:
In this article, ten of the most significant classification techniques (CRT, QUEST, CHAID, C5.0, ANN, SVM, Bayesian, Random Forest, XGBoost and CatBoost) were used to classify the data and to assess the importance of features. To validate the results of the different classification techniques, accuracy, precision, recall, F-measure, sensitivity, specificity, MCC, ROC and AUROC were used. All experiments were carried out with SPSS Clementine and Python.
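As an illustration of the validation measures named above, the following minimal sketch computes several of them from the four cells of a binary (one-vs-rest) confusion matrix. It is not the authors' code: the function name `binary_metrics` and the example counts are assumptions made here for illustration, and the formulas are the standard textbook definitions.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from one-vs-rest confusion counts (hypothetical helper)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall and sensitivity are the same quantity: TP / (TP + FN).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Illustrative counts only; these are not results from the study.
scores = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For a multi-class problem such as enzyme classes, one common approach is to compute these counts per class (that class versus all others) and report the metrics class by class.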
Result:
The classification techniques discussed above give different results, and the data were found to be imbalanced for classes C4, C5 and C6. As a result, all of the classification techniques give acceptable accuracy, above 60%, for these classes, but their precision is very low or negligible. The experimental results highlight that Random Forest gives the highest accuracy and AUROC of all, 96.84% and 0.945 respectively, and also has high precision and recall.
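The gap between accuracy and precision on an imbalanced class can be seen in a small synthetic example. The sketch below uses made-up labels (not the study's data) in which a rare class, labelled "C4" here purely for illustration, makes up 2% of the samples; the helper `one_vs_rest_scores` is a hypothetical name introduced for this example.

```python
def one_vs_rest_scores(y_true, y_pred, target):
    """Accuracy, precision and recall for `target` versus all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Synthetic, imbalanced labels: 980 majority samples and 20 rare "C4" samples.
y_true = ["C1"] * 980 + ["C4"] * 20
# Hypothetical predictions: 18 majority samples are mislabelled as "C4",
# and only 2 of the 20 true "C4" samples are recovered.
y_pred = ["C1"] * 962 + ["C4"] * 18 + ["C4"] * 2 + ["C1"] * 18

acc, prec, rec = one_vs_rest_scores(y_true, y_pred, "C4")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Because the true negatives dominate the count, accuracy stays high even though most predictions of the rare class are wrong, which is why precision and recall are reported alongside accuracy.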
Conclusion:
The experiments conducted and analyzed in this article highlight that the Random Forest classification technique can be used for human enzyme protein classification and prediction.
Abstract: In computational biology, either profile-based or sequence-based protein data has been used to identify meaningful and accurate features for protein function prediction.
Predicting the enzyme class of an unknown protein is one of the most active research topics of the current era, and machine learning and statistical classification techniques have been used for this purpose. In this article, we used six machine learning and statistical classification techniques (CRT, QUEST, CHAID, C5.0, ANN and SVM) to classify 4314 human protein sequence records. These data were extracted from the UniProtKB databank with the help of the PROFEAT server and are categorized into seven classes. To handle the high-dimensional protein sequence data, which contain some missing values, SPSS was used for classification and for estimating the performance of the classification techniques. The experimental results highlight that the data for classes C4, C5, C6 and C7 are imbalanced, which affects the overall performance of the classification techniques. This article provides an extensive comparative analysis of different classification techniques on sequence-based protein data. The experimental analysis highlights that the SVM and C5.0 classification techniques give better results than the others and can be used for protein classification and prediction.