2022
DOI: 10.1021/acs.jmedchem.2c00460

TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions

Abstract: Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Hereby, we developed a new approach named Topology-based and Conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate de…

Cited by 19 publications
(24 citation statements)
References 48 publications
“…More consideration should be given to the curation of the training data sets and to introducing regularization techniques to current VS protocol. Computationally generated decoy poses and training data sets should be generated carefully to prevent the model from having noncausal bias, an issue where the model learns specific data patterns instead of meaningful ligand–protein interactions. Training data sets tend to have many positive labels that lack diversity and have few or biased negative labels. Non- and poor-binders are under-reported, and computationally selected nonbinders should be verified experimentally. There is a lack of quality and standardized data for some targets. Methods such as data set debiasing and introducing specific models to classify and rank actives from inactives are promising regularization techniques to tackle this problem.…”
Section: Concluding Remarks and Perspective (mentioning)
confidence: 99%
“…Computationally generated decoy poses and training data sets should be generated carefully to prevent the model from having noncausal bias, an issue where the model learns specific data patterns instead of meaningful ligand–protein interactions. 256,257 Training data sets tend to have many positive labels that lack diversity and have few or biased negative labels. 83,88,258,259 Non- and poor-binders are under-reported, and computationally selected nonbinders should be verified experimentally.…”
Section: Concluding Remarks and Perspective (mentioning)
confidence: 99%
“…Similarly, TocoDecoy generates unbiased and expandable datasets for training and benchmarking scoring functions based on machine learning. This tool generates property-matched decoy sets in combination with decoy conformation sets having low docking scores to mitigate bias [136]. However, property-matched decoy generation is prone to falsely increase the enrichment and does not represent the chemical space expected in a large library.…”
Section: Datasets (mentioning)
confidence: 99%
“…Machine-learning-based scoring functions (MLSFs) have attracted extensive attention due to their potentially improved accuracy in binding affinity prediction and structure-based virtual screening (SBVS) compared with classical SFs. Development of accurate MLSFs for SBVS against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most existing datasets for the development of MLSFs were originally designed for traditional SFs and may suffer from hidden biases (artificial enrichment, analogue bias, domain bias, and noncausal bias) and data insufficiency. To address this issue, we have developed a new approach named topology-based and conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs …”
Section: Introduction (mentioning)
confidence: 99%
“…6−8 To address this issue, we have developed a new approach named topology-based and conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. 9 To further simplify the application of TocoDecoy, we developed a new database named topology-based and conformation-based decoys database (ToCoDDB) based on the generation protocol of TocoDecoy. Currently, ToCoDDB not only provides a large number of pregenerated target-specific unbiased datasets, but also supports the generation of unbiased and expandable datasets for training and benchmarking MLSFs.…”
Section: Introduction (mentioning)
confidence: 99%