2017
DOI: 10.1021/acs.jcim.7b00166

Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to Four-Concentration IC50s for Realistically Novel Compounds

Abstract: While conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets, they prove lacking in actual practice. Analysis of 18 historical virtual screens showed that random test sets are far more similar to their training sets than are the compounds project teams actually order. A new, cluster-based "realistic" training/test set split, which mirrors the chemical novelty of real-life virtual screens, recapitulates the poor predictive power of RFR…
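To illustrate the idea of a cluster-based split mentioned in the abstract, here is a minimal Python sketch. It is not the authors' procedure, which the truncated abstract does not spell out. The function name `cluster_split`, the `train_fraction` parameter, the toy cluster labels, and the rule of sending the largest clusters to training are all assumptions made for illustration. The only property it demonstrates is that whole clusters are held out, so no test compound shares a cluster with a training compound.

```python
from collections import defaultdict


def cluster_split(cluster_ids, train_fraction=0.75):
    """Split compound indices so that no cluster appears in both sets.

    cluster_ids[i] is the (precomputed) cluster label of compound i.
    Hypothetical rule: fill the training set with the largest clusters
    until it reaches roughly train_fraction of all compounds; every
    remaining cluster goes to the test set.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    # Largest clusters first (an assumption, not taken from the source).
    ordered = sorted(clusters.values(), key=len, reverse=True)
    target = train_fraction * len(cluster_ids)

    train, test = [], []
    for members in ordered:
        if len(train) < target:
            train.extend(members)
        else:
            test.extend(members)
    return sorted(train), sorted(test)


# Toy data: 12 compounds in 6 clusters (labels are made up).
cluster_ids = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "E", "F"]
train_idx, test_idx = cluster_split(cluster_ids)

train_clusters = {cluster_ids[i] for i in train_idx}
test_clusters = {cluster_ids[i] for i in test_idx}
print("train clusters:", sorted(train_clusters))
print("test clusters:", sorted(test_clusters))
```

A random split of the same data would usually place members of cluster A in both sets, which is the kind of train/test similarity the abstract argues inflates apparent accuracy.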

Citations: cited by 64 publications (107 citation statements)
References: 22 publications (29 reference statements)
“…2) Although no activity information was available in house for CLK2, a ligand‐based approach to hit‐list generation was followed. An internally developed machine learning technique (pQSAR) based on random forests and partial‐least squares was employed to build a virtual screening model. To train the model, activity data of roughly 3000 compounds against a closely related kinase was used as a surrogate for CLK2 because it was known from a few compounds that selectivity between the kinases was low.…”
Section: Results (mentioning)
confidence: 99%
“…Commonly used machine learning algorithms in QSAR, such as random forest (RF), show high interpolation power (i.e., they perform accurately within their applicability domain). However, their performance in extrapolation (i.e., when applied to molecules outside their applicability domain) is limited, due to the method of prediction used [47]. That is, the predicted value is given as the average value of data from the training set at each leaf.…”
Section: Introduction (mentioning)
confidence: 99%
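The leaf-averaging behaviour described in the statement above can be shown with a small, pure-Python sketch. It fits a single one-dimensional regression tree rather than a full random forest; a forest averages many such trees, so the same bound applies. The helper names `build_tree` and `predict`, the `min_leaf` setting, and the toy data (y = 2x) are illustrative assumptions and are not taken from the cited work.

```python
from statistics import mean


def build_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree; each leaf stores the mean training target."""
    if len(xs) < 2 * min_leaf or len(set(xs)) == 1:
        return {"leaf": mean(ys)}

    pairs = sorted(zip(xs, ys))
    best = None
    # Try every split position that leaves at least min_leaf points per side.
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, i)

    if best is None:
        return {"leaf": mean(ys)}

    i = best[1]
    return {
        "threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
        "left": build_tree([x for x, _ in pairs[:i]],
                           [y for _, y in pairs[:i]], min_leaf),
        "right": build_tree([x for x, _ in pairs[i:]],
                            [y for _, y in pairs[i:]], min_leaf),
    }


def predict(tree, x):
    """Walk down to a leaf and return its stored training mean."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Toy training data following y = 2x on x = 0..9.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x for x in train_x]
tree = build_tree(train_x, train_y)

inside = predict(tree, 4.0)     # query inside the training range
outside = predict(tree, 100.0)  # query far outside the training range
print("inside:", inside, "outside:", outside, "true value at 100:", 200.0)
```

Because every prediction is a mean of training targets, the query at x = 100 lands in the same leaf as the largest training x and can never exceed the largest training target, however far outside the training range it lies.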
“…The use of well documented and amenable workflow management platforms like KNIME facilitates the construction of consistent, reproducible, and transferable protocols. The workflows can be transferred between, for example, workstations, users, and sites, and can be re‐run: i) as is, for example, when large data transfer is not feasible, or when new database versions are released; ii) with different configurations of the nodes, for example, changing ligand activity cut‐offs (Figure ), input ligands (Figures , , ), protein targets (Figure ); iii) with additional/modified nodes to obtain complementary information, for example, including annotations from other databases, further analyzing results, or performing machine learning on the obtained data. Pre‐configured meta nodes or workflow blocks can be easily reused because the same data collection, preparation, processing and analysis steps might be required in various workflows for different purposes. KNIME contains a rich and continuously growing set of cheminformatics nodes to handle and process chemical and biological data in multiple formats.…”
Section: Discussion (mentioning)
confidence: 99%