2021
DOI: 10.26434/chemrxiv-2021-4d420
Preprint

Protein pKa prediction by tree-based machine learning

Abstract: We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction …

Cited by 2 publications (3 citation statements)
References 107 publications (148 reference statements)
“…As IEF is still used for separation of modified peptides there is a potential to develop models that can predict the pI of modified peptides. For protein-level IEF under native conditions, a methodology similar to that recently used to calculate protein pKa values [73] using AlphaFold may be adopted.…”
Section: Enzymatic Digestion
Citation type: mentioning
confidence: 99%
“…Chen et al. trained tree-based machine learning models, such as XGBoost or LightGBM, on experimental data, and their best model exhibited an RMSE of 0.69 [30]. To compare pKAI with these models and illustrate the data leakage problem at hand, we have refined our pKAI model by training it on same data split reported in ref 30. This new model seems to have an unparalleled performance (RMSE of 0.32 and MAE of 0.21).…”
Section: Journal Of Chemical
Citation type: mentioning
confidence: 99%
“…[5] Recently, traditional ML models have been trained on ∼1500 experimental pKa values [29,30]. However, testing the real-world performances of such methods is difficult, as there is a high degree of similarity among available experimental data. Our larger data set translates into more diversity in terms of protein and residue types and, more importantly, a wider variety of residue environments.…”
Section: Introduction
Citation type: mentioning
confidence: 99%