A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction

Dai, Zhijun; Wang, Lifeng; Chen, Yuan; Wang, Haiyan; Bai, Lianyang; Yuan, Zheming

doi:10.1007/s00726-014-1667-5

Cited by 6 publications

(8 citation statements)

References 32 publications


“…Therefore, the significant improvement in model performance was achieved by feature selection because plenty of irrelevant features were eliminated. 17 …”

Section: Discussion (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%

“…comprising the peptides and properties of the entire peptides (electronegativity, sequence information, solubility, molecular weight, topological information, etc.). Then, feature selection and modeling methods are combined to connect the structure information and bioactivity. More than 80 amino acid descriptors (AADs) extracted from properties of amino acids by principal component analysis (PCA) were presented to characterize peptide structures and encode the peptides. However, directly using these AADs usually led to undesirable model performance since most of them were not intended for the antioxidant activity modeling (e.g., T-scale for angiotensin-converting enzyme inhibitory activity). …”

Section: Introduction (mentioning)

confidence: 99%

“…Machine learning methods have been successfully applied for feature selection and model development in QSAR studies on peptide bioactivity (e.g., angiotensin-converting enzyme inhibitory activity). A total of 566 numerical values of amino acid including physicochemical properties and biochemical properties of amino acids and pairs of amino acids have been available in the AAIndex database. This makes it possible to use feature selection to find the important variables for bioactivity prediction compared with using AADs from PCA where the principal components were composed of various original variables.…”

Section: Introduction (mentioning)

confidence: 99%


Comprehensive Evaluation and Comparison of Machine Learning Methods in QSAR Modeling of Antioxidant Tripeptides

Wang (2022), ACS Omega

Due to their multiple beneficial effects, antioxidant peptides have attracted increasing interest. Currently, the screening and identification of bioactive peptides, including antioxidative peptides based on wet-chemistry methods are time-consuming and highly rely on many advanced instruments and trained personnel. Quantitative structure–activity relationship (QSAR) analysis as an in silico method can be more efficient and cost-effective. However, model performance of QSAR studies on antioxidant peptides was still poor due to limited attempts in model development approaches. The objective of this study was to compare popular machine learning methods for antioxidant activity modeling and screening of tripeptides and identify the critical amino acid features that determine the antioxidant activity. 533 numerical indices of amino acids were adopted to characterize 130 tripeptides with known antioxidant activity from the published literature, and then 7 feature selection strategies plus pairwise correlation were used to screen the most important indices for antioxidant activity and model building. 14 machine learning methods were used to build models based on the feature selection strategies, respectively. Among the 98 models, non-linear regression methods tended to perform better, and the best model with an R² (test) of 0.847 and RMSE (test) of 0.627 for tripeptide antioxidants was obtained by combining random forest for feature selection and tree-based extreme gradient boost regression for model development. Based on the predicted antioxidant values of 7870 unknown tripeptides, potentially high antioxidant activity tripeptides all have a tyrosine, tryptophan, or cysteine residue at the C-terminal position. Furthermore, the predicted antioxidant activity of six synthesized tripeptides was confirmed through experimental determination, and for the first time, the cysteine or tyrosine residue at the C-terminal was found to be critical to the antioxidant activity based on both QSAR models and experimental observations.
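The screening numbers in this abstract can be checked with a small Python sketch. It assumes the 20 standard amino acids and a per-position concatenation of index values as the encoding. The abstract does not state the encoding, so `encode` and `toy_table` are hypothetical, and the table values are made up for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def all_tripeptides():
    """Enumerate every possible tripeptide sequence (20**3 of them)."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def encode(peptide, index_table):
    """Concatenate per-residue index values, position by position (assumed scheme)."""
    features = []
    for residue in peptide:
        features.extend(index_table[residue])
    return features


# Hypothetical table with two made-up indices per residue; a real study would
# load hundreds of amino acid indices here instead.
toy_table = {aa: [float(i), float(i % 3)] for i, aa in enumerate(AMINO_ACIDS)}

tripeptides = all_tripeptides()
n_unknown = len(tripeptides) - 130  # 130 tripeptides have measured activity
c_terminal_hits = [p for p in tripeptides if p[-1] in "YWC"]

print(len(tripeptides), n_unknown, len(c_terminal_hits))
print(encode("ACD", toy_table))
```

The count of 7870 unknown tripeptides is consistent with 20³ = 8000 possible sequences minus the 130 with known activity. With 533 indices per residue, the same concatenation would give 3 × 533 = 1599 features per tripeptide under this assumed encoding.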


“…This study uses the molecule descriptor calculation software, PCLIENT, to calculate thousands of physiochemical parameters for every small molecule compound of alcohol. 15 Optimum descriptors subset is obtained by a feature selection pipeline containing three step searching strategies: (i) select statistically significant features that imply nonlinear correlation with biotoxicity of chemical compounds using MIC based univariate filter; (ii) refine feature subset by support vector regression based backward elimination (SVR-BE); 16 (iii) obtain optimal subset via a forward selection process that integrated minimal redundancy maximal relevance, MIC and SVR. A QSAR model is finally built on the training set with the reserved descriptors, and then to predict biotoxicities of Rana temporaria in the test set.…”

Section: Introduction (mentioning)

confidence: 99%
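Step (ii) of the pipeline quoted above is a wrapper-style backward elimination. A minimal Python sketch of that idea follows, under stated assumptions: `score` is a hypothetical callable (higher is better) standing in for whatever model evaluation is used, such as a cross-validated SVR. The toy scorer and the drop rule (remove a feature whenever the score does not fall) are illustrative choices, not the cited authors' implementation.

```python
from typing import Callable, List, Sequence

Score = Callable[[List[int]], float]  # maps a feature subset to a quality score


def backward_elimination(features: Sequence[int], score: Score) -> List[int]:
    """Greedily drop features whose removal does not lower the score."""
    kept = list(features)
    best = score(kept)
    improved = True
    while improved and len(kept) > 1:
        improved = False
        for f in list(kept):
            trial = [g for g in kept if g != f]
            s = score(trial)
            if s >= best:  # removal did not hurt: accept it and rescan
                kept, best, improved = trial, s, True
                break
    return kept


# Toy stand-in for a model-based score: informative features add 10 points,
# every other feature costs 1 point. A real pipeline would train and
# evaluate a regression model on the subset instead.
INFORMATIVE = {0, 2}


def toy_score(subset: List[int]) -> float:
    useful = sum(1 for f in subset if f in INFORMATIVE)
    noise = len(subset) - useful
    return 10 * useful - noise


selected = backward_elimination(range(4), toy_score)
print(selected)  # [0, 2]
```

Passing the scorer in as a callable keeps the search loop independent of the regression model, so the same loop could wrap any evaluator.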

Maximal Information Coefficient and Support Vector Regression Based Nonlinear Feature Selection and QSAR Modeling on Toxicity of Alcohol Compounds to Tadpoles of Rana temporaria

Wang, Xing, Zhou et al. (2018), J. Braz. Chem. Soc. (self-citation)

Efficient evaluation of biotoxicity of organics is of vital significance to resource utilization and environmental protection. In this study, toxicity of 110 alcohol compounds to tadpoles of Rana temporaria is adopted as the dependent variable and 1388 physiochemical parameters (features) calculated by PCLIENT are used for representing each compound. A feature selection pipeline with three steps is developed to refine the feature subset: 282 features that significantly correlated with biotoxicity of chemical compounds are preliminarily selected via the maximum information coefficient (MIC); 138 descriptors that have positive contribution to the model's performance are reserved after a support vector regression (SVR) based backward elimination; 18 descriptors are finally selected via a forward selection process that integrated minimal redundancy maximal relevance (mRMR), MIC and SVR. In terms of feature subsets with different numbers of variables, quantitative structure activity relationship (QSAR) models are built using multiple linear regression (MLR), partial least square regression (PLS) and SVR, respectively. The independent prediction evaluation index, Q², increases from −74.787, 0.824 and 0.868 to 0.892, 0.878 and 0.940, for the three regression models, respectively. Results suggest that nonlinear feature selection methods involved in MIC and SVR can effectively eliminate irrelevant descriptors. SVR outperforms classical statistical models to QSAR modeling on high-dimensional data containing nonlinear relationship between features. The methods proposed in this study have a potential application in the QSAR research field such as biotoxicity compounds.
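The strongly negative Q² reported for MLR before feature selection may look surprising. The abstract does not give the formula, so the Python sketch below assumes one common definition, Q² = 1 − SS_res / SS_tot, computed on the test set with the test-set mean as the baseline. Some authors use the training-set mean instead.

```python
def q2(y_true, y_pred):
    """Predictive squared correlation: 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


observed = [1, 2, 3]
print(q2(observed, [1, 2, 3]))    # perfect predictions
print(q2(observed, [2, 2, 2]))    # always predicting the mean
print(q2(observed, [11, 2, -7]))  # badly overfitted predictions
```

Under this definition, Q² has no lower bound: a model whose test errors are much larger than the spread of the observed values scores far below zero, which is the typical signature of an overfitted linear model with many more descriptors than samples.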


Improving depression prediction using a novel feature selection algorithm coupled with context-aware analysis

Dai, Zhou et al. (2021), Journal of Affective Disorders
