Predicting HIV-1 Protease Cleavage Sites With Positive-Unlabeled Learning

Li, Zhenfeng; Hu, Lun; Tang, Zehai; Zhao, Cheng

doi:10.3389/fgene.2021.658078

Cited by 12 publications

(20 citation statements)

References 31 publications


“…Furthermore, using diverse cross-validated performance metrics is considered good practice and can objectively reveal the true performance of a model rather than depending on just one metric that could be biased towards a subset of the dataset. This approach reduces the risk of overfitting [6,18,35]. Our models' performance was also evaluated on an independent testing set which was not previously exposed to the models to give a true account of the models' predictive strength.…”

Section: Results

mentioning

confidence: 99%


Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors

et al. 2022


Background: In most parts of the world, especially in underdeveloped countries, acquired immunodeficiency syndrome (AIDS) remains a major cause of death, disability, and unfavorable economic outcomes. This has necessitated intensive research to develop effective therapeutic agents for the treatment of human immunodeficiency virus (HIV) infection, which is responsible for AIDS. Peptide cleavage by HIV-1 protease is an essential step in the replication of HIV-1. Thus, correct and timely prediction of the cleavage site of HIV-1 protease can significantly speed up and optimize the drug discovery process of novel HIV-1 protease inhibitors. In this work, we built and compared the performance of selected machine learning models for the prediction of HIV-1 protease cleavage site utilizing a hybrid of octapeptide sequence information comprising bond composition, amino acid binary profile (AABP), and physicochemical properties as numerical descriptors serving as input variables for some selected machine learning algorithms. Our work differs from antecedent studies exploring the same subject in the combination of octapeptide descriptors and method used. Instead of using various subsets of the dataset for training and testing the models, we combined the dataset, applied a 3-way data split, and then used a "stratified" 10-fold cross-validation technique alongside the testing set to evaluate the models.

Results: Among the 8 models evaluated in the "stratified" 10-fold CV experiment, logistic regression, multi-layer perceptron classifier, linear discriminant analysis, gradient boosting classifier, Naive Bayes classifier, and decision tree classifier with AUC, F-score, and B. Acc. scores in the ranges of 0.91–0.96, 0.81–0.88, and 80.1–86.4%, respectively, have the closest predictive performance to the state-of-the-art model (AUC 0.96, F-score 0.80 and B. Acc. ~ 80.0%). By contrast, the perceptron classifier and K-nearest neighbors had statistically lower performance (AUC 0.77–0.82, F-score 0.53–0.69, and B. Acc. 60.0–68.5%) at p < 0.05. On the other hand, logistic regression and multi-layer perceptron classifier (AUC of 0.97, F-score > 0.89, and B. Acc. > 90.0%) had the best performance on further evaluation on the testing set, though linear discriminant analysis, gradient boosting classifier, and Naive Bayes classifier also performed well (AUC > 0.94, F-score > 0.87, and B. Acc. > 86.0%).

Conclusions: Logistic regression and multi-layer perceptron classifiers have comparable predictive performances to the state-of-the-art model when octapeptide sequence descriptors consisting of AABP, bond composition and standard physicochemical properties are used as input variables. In our future work, we hope to develop a standalone software for HIV-1 protease cleavage site prediction utilizing the logistic regression algorithm and the aforementioned octapeptide sequence descriptors.
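The abstract above names two building blocks that are easy to illustrate: the amino acid binary profile (AABP) and stratified 10-fold cross-validation. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes AABP means a one-hot encoding over the 20 standard residues, and it uses a simple round-robin fold assignment; the function names `binary_profile` and `stratified_folds` and the example octamer are chosen here for illustration only.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def binary_profile(octamer):
    """One-hot encode an 8-residue peptide into an 8 x 20 = 160-element vector.

    Assumption: AABP is a plain one-hot encoding per position.
    """
    if len(octamer) != 8:
        raise ValueError("expected an octapeptide (8 residues)")
    vector = []
    for residue in octamer:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


if __name__ == "__main__":
    features = binary_profile("SQNYPIVQ")  # illustrative octamer
    print(len(features), sum(features))
    folds = stratified_folds([1] * 20 + [0] * 80, k=10)
    print([len(fold) for fold in folds])
```

Each fold then serves once as the validation set while the remaining nine are used for training. The bond composition and physicochemical descriptors mentioned in the abstract are not covered by this sketch.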



“…As a result, feature selection is critical in classification tasks [35]. The number and type of peptide feature descriptors selected determine to a great extent the performance of a model [6]. Amino acids are the building blocks of peptides and proteins.…”

Section: Feature Extraction/Vector Construction

mentioning

confidence: 99%

Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors

et al. 2022


“…As a popular metric for binary classification problems, F-measure indicates the harmonic mean of Precision and Recall. The details of computing F-measure can be found in [31].…”

Section: Methods

mentioning

confidence: 99%
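The citation statement above describes F-measure as the harmonic mean of Precision and Recall. A minimal sketch of that computation from confusion-matrix counts follows. The function name `f_measure` and the convention of returning 0.0 when a ratio is undefined are assumptions made here, not details taken from the cited papers.

```python
def f_measure(tp, fp, fn):
    """F-measure as the harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives. Undefined ratios are reported as 0.0 (an assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 8 cleavage sites found, 2 false alarms, 2 sites missed.
    print(f_measure(tp=8, fp=2, fn=2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F-measure by maximizing recall alone on an imbalanced dataset.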

“…Combining the knowledge from experimental studies, a multitask learning model was recently developed based on multi-kernel [30], and it utilizes the dependencies among various related tasks to build a stronger predictive model for HIV-1 protease cleavage sites prediction. Since noise can be introduced by mislabeling cleavable octamers as negative instances, PU-HIV [31] considers unknown substrate sites as unlabeled samples, and makes use of positive-unlabeled learning to effectively predict HIV-1 protease cleavage sites.…”

Section: Related Work

mentioning

confidence: 99%

“…With the development of machine learning techniques in bioinformatics [11], a variety of machine learning-based methods have been developed to effectively predict the existence of HIV-1 protease cleavage sites in the substrates [12–31]. They usually regard the prediction problem as a typical binary classification task, which is then achieved with a two-step procedure.…”

mentioning

confidence: 99%


Effectively predicting HIV-1 protease cleavage sites by using an ensemble learning approach

Tang

et al. 2022

BMC Bioinformatics


Background: The site information of substrates that can be cleaved by human immunodeficiency virus 1 proteases (HIV-1 PRs) is of great significance for designing effective inhibitors against HIV-1 viruses. A variety of machine learning-based algorithms have been developed to predict HIV-1 PR cleavage sites by extracting relevant features from substrate sequences. However, relying only on sequence information is not sufficient to ensure a promising performance due to the uncertainty in the way of separating the datasets used for training and testing. Moreover, the existence of noisy data, i.e., false positive and false negative cleavage sites, could negatively influence the accuracy performance.

Results: In this work, an ensemble learning algorithm for predicting HIV-1 PR cleavage sites, namely EM-HIV, is proposed by training a set of weak learners, i.e., biased support vector machine classifiers, with the asymmetric bagging strategy. By doing so, the impact of data imbalance and noisy data can be alleviated. Besides, in order to make full use of substrate sequences, the features used by EM-HIV are collected from three different coding schemes, including amino acid identities, chemical properties and variable-length coevolutionary patterns, for the purpose of constructing more relevant feature vectors of octamers. Experiment results on three independent benchmark datasets demonstrate that EM-HIV outperforms the state-of-the-art prediction algorithm in terms of several evaluation metrics. Hence, EM-HIV can be regarded as a useful tool to accurately predict HIV-1 PR cleavage sites.
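The asymmetric bagging strategy mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions rather than the EM-HIV implementation: it assumes each bag keeps every positive octamer and bootstraps an equally sized sample of negatives, and that the weak learners' 0/1 outputs are combined by majority vote. Training the biased support vector machines is omitted, and the names `asymmetric_bags` and `majority_vote` are hypothetical.

```python
import random


def asymmetric_bags(positives, negatives, n_bags=5, seed=0):
    """Build training bags that bootstrap only the (larger) negative class.

    Assumption: each bag holds all positives plus as many negatives,
    drawn with replacement, as there are positives.
    """
    if not negatives:
        raise ValueError("need at least one negative sample")
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        sampled_negatives = [rng.choice(negatives) for _ in positives]
        bags.append((list(positives), sampled_negatives))
    return bags


def majority_vote(predictions):
    """Combine 0/1 predictions from the weak learners; ties go to class 0."""
    return 1 if sum(predictions) * 2 > len(predictions) else 0


if __name__ == "__main__":
    pos = ["P1", "P2"]                    # placeholder cleavable octamers
    neg = ["N1", "N2", "N3", "N4", "N5"]  # placeholder non-cleavable octamers
    for bag_pos, bag_neg in asymmetric_bags(pos, neg, n_bags=3):
        print(bag_pos, bag_neg)
    print(majority_vote([1, 1, 0]))
```

Because every bag is balanced, each weak learner sees the minority class in full, which is how this kind of sampling is usually expected to reduce the effect of class imbalance.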


MRLDTI: A Meta-path-Based Representation Learning Model for Drug-Target Interaction Prediction

Zhao

et al. 2022

Intelligent Computing Theories and Application

