Small Random Forest Models for Effective Chemogenomic Active Learning

Rakers, Christin; Reker, Daniel; Brown, J.

doi:10.2751/jcac.18.124

Cited by 16 publications

(16 citation statements)

References 65 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…These emerging methods are collectively known as chemogenomic active learning. Rakers, Reker, and Brown further demonstrated that model complexity built on qHTS data could be reduced by more than half of existing estimates.…”

Section: Introductionsupporting

confidence: 90%

“…The subsequent CGAL model development is based on a collection of decision trees known as a random forest that are actively trained using a specified number of iterations or data samples. Based on previous analyses of the tradeoff between the number of trees and resulting performance, the number of trees used here was fixed at 100 …”

Section: Resultsmentioning

confidence: 99%

“…Overall, the interaction spaces of CYPs and NHRs, comprising 6432 and 10 601 data points, respectively, were successfully modeled in chemogenomic active learning experiments based on only a fraction of available data. In comparison with previous studies that were conducted based on more than 100 000 protein–ligand interaction samples, results presented here indicate applicability of active learning even in sparse data scenarios.…”

Section: Resultsmentioning

confidence: 99%

“…The descriptor choice was based on previously published comparative experiments . Model training was initiated by random selection of one CPI and one non‐CPI (Figure ) and, based on previous research, models were built using 100 trees with sqrt(4496)=67 descriptors considered per tree node split. Training was terminated after 5000 iterations.…”

Section: Methodsmentioning

confidence: 99%

See 3 more Smart Citations

Chemogenomic Active Learning's Domain of Applicability on Small, Sparse qHTS Matrices: A Study Using Cytochrome P450 and Nuclear Hormone Receptor Families

et al. 2018

Self Cite

View full text Add to dashboard Cite

Computational models for predicting the activity of small molecules against targets are now routinely developed and used in academia and industry, partially due to public bioactivity databases. While models based on bigger datasets are the trend, recent studies such as chemogenomic active learning have shown that only a fraction of data is needed for effective models in many cases. In this article, the chemogenomic active learning method is discussed and used to newly analyze public databases containing nuclear hormone receptor and cytochrome P450 enzyme family bioactivity. In addition to existing results on kinases and G-protein coupled receptors, results here demonstrate the active learning methodology's effectiveness on extracting informative ligand-target pairs in sparse data scenarios. Experiments to assess the domain of the applicability demonstrate the influence of ligand profiles of similar targets within the family.

show abstract

Section: Introductionsupporting

confidence: 90%

Section: Resultsmentioning

confidence: 99%

Section: Resultsmentioning

confidence: 99%

Section: Methodsmentioning

confidence: 99%

See 2 more Smart Citations

Chemogenomic Active Learning's Domain of Applicability on Small, Sparse qHTS Matrices: A Study Using Cytochrome P450 and Nuclear Hormone Receptor Families

et al. 2018

Self Cite

View full text Add to dashboard Cite

show abstract

“…Computational experiments may consider the philosophy discussed herein to perform repeated executions of a model‐predict experiment such that less than half of the data per class is subsampled for model selection with the majority remainder used for prediction. Recent modeling methods have shown that often only a fraction of a dataset is sufficient to build a predictive model ,,. As in the prior studies, if distribution of prediction performances can be shown to be normally distributed by the Kolmogorov‐Smirnov test, we can consider using such a fact to forecast the chances of success in a true prospective experiment.…”

Section: Conclusion and Future Outlookmentioning

confidence: 98%

Classifiers and their Metrics Quantified

Brown

2018

Molecular Informatics

View full text Add to dashboard Cite

Molecular modeling frequently constructs classification models for the prediction of two‐class entities, such as compound bio(in)activity, chemical property (non)existence, protein (non)interaction, and so forth. The models are evaluated using well known metrics such as accuracy or true positive rates. However, these frequently used metrics applied to retrospective and/or artificially generated prediction datasets can potentially overestimate true performance in actual prospective experiments. Here, we systematically consider metric value surface generation as a consequence of data balance, and propose the computation of an inverse cumulative distribution function taken over a metric surface. The proposed distribution analysis can aid in the selection of metrics when formulating study design. In addition to theoretical analyses, a practical example in chemogenomic virtual screening highlights the care required in metric selection and interpretation.

show abstract

Selection of Informative Examples in Chemogenomic Datasets

Reker

Brown

2018

Methods in Molecular Biology

View full text Add to dashboard Cite

Small Random Forest Models for Effective Chemogenomic Active Learning

Cited by 16 publications

References 65 publications

Chemogenomic Active Learning's Domain of Applicability on Small, Sparse qHTS Matrices: A Study Using Cytochrome P450 and Nuclear Hormone Receptor Families

Chemogenomic Active Learning's Domain of Applicability on Small, Sparse qHTS Matrices: A Study Using Cytochrome P450 and Nuclear Hormone Receptor Families

Classifiers and their Metrics Quantified

Selection of Informative Examples in Chemogenomic Datasets

Contact Info

Product

Resources

About