2020
DOI: 10.3389/fphar.2020.00069

Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets

Abstract: Predicting protein-ligand interactions using artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models unequivocally suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases on the PDBbind and DUD-E datasets. We examined the model performance of atomic convolutional neural network (ACNN) on the PDBbind core set and achieved a Pearson R² of 0.73 between experimental and predicted binding affinitie…
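The Pearson R² quoted in the abstract measures how well predicted binding affinities track the experimental ones. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function names and the affinity values in the usage lines are illustrative assumptions.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.ndim != 1 or y_true.shape != y_pred.shape:
        raise ValueError("inputs must be 1-D and of equal length")
    if y_true.size < 2:
        raise ValueError("need at least two data points")
    # Center both series, then divide the covariance term by the
    # product of the two standard-deviation terms.
    dt = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    denom = np.sqrt((dt ** 2).sum() * (dp ** 2).sum())
    if denom == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return float((dt * dp).sum() / denom)


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation, the quantity reported as R²."""
    return pearson_r(y_true, y_pred) ** 2


# Hypothetical affinities (made-up numbers, for illustration only).
experimental = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 5.9, 7.4, 7.9]
print(f"Pearson R^2 = {pearson_r2(experimental, predicted):.2f}")
```

Because R² is the square of a correlation, it is insensitive to a constant offset or rescaling of the predictions, so a high value alone does not show that the predicted affinities are accurate in absolute terms.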

Cited by 92 publications (150 citation statements)
References 48 publications (58 reference statements)
“…These models were trained with the 3DA descriptor equivalent of the PLC features. In an ideal dataset, ligand and protein pocket controls would have near-zero correlation to experimental results; however, consistent with the findings of Yang et al [46], the LB and pocket-based QSAR models each had correlation coefficients greater than 0.50 at 0.72 and 0.61, respectively (Figure S9).…”
Section: Results (supporting)
confidence: 84%
“…It is increasingly well documented that strong machine learning model performance on QSAR tasks can be the result of dataset bias [41, 46, 47]. Indeed, Yang et al found that atomic CNNs (ACNNs) trained solely on ligand or receptor pocket features performed just as well as ACNNs trained on protein–ligand complexes [46], suggesting that the model was unable to leverage features relating to the protein–ligand interactions in a meaningful way. Therefore, we sought to determine the extent to which dataset biases may be inflating BCL-AffinityNet performance.…”
Section: Results (mentioning)
confidence: 99%
“…However, there has been criticism against ML models trained on DUD-E dataset regarding overfitting to the dataset. One of the key criticism of the DUD-E test set is that models trained on DUD-E can easily distinguish active and inactive ligands based on physiochemical properties [70]. For example, Sieg et al [71] reported that the distributions of MW beyond 500 Da between actives and decoys in DUD-E were mismatched.…”
Section: Results (mentioning)
confidence: 99%
“…One of the key criticism of the DUD-E test set is that models trained on DUD-E can easily distinguish active and inactive ligands based on physiochemical properties [62]. For example, Sieg et al [63] reported that the distributions of MW beyond 500 Da between actives and decoys in DUD-E were mismatched. Further studies have shown that the actives and decoys against the same target can be easily differentiated based on fingerprint.…”
Section: Datasets (mentioning)
confidence: 99%
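The property-bias criticism quoted above can be checked with a simple control: rank compounds by a single physicochemical descriptor and measure how well that ranking alone separates actives from decoys. The sketch below computes the ROC AUC by pairwise comparison. It is an illustrative assumption of how such a check might look, not the procedure used in the cited studies, and the molecular weights in the usage lines are made up.

```python
import numpy as np


def single_descriptor_auc(active_values, decoy_values):
    """ROC AUC obtained by ranking compounds on one descriptor.

    Equals the fraction of (active, decoy) pairs in which the active
    has the larger descriptor value, with ties counted as one half.
    """
    actives = np.asarray(active_values, dtype=float)
    decoys = np.asarray(decoy_values, dtype=float)
    if actives.size == 0 or decoys.size == 0:
        raise ValueError("need at least one active and one decoy")
    # Compare every active against every decoy.
    greater = (actives[:, None] > decoys[None, :]).sum()
    ties = (actives[:, None] == decoys[None, :]).sum()
    return float((greater + 0.5 * ties) / (actives.size * decoys.size))


# Hypothetical molecular weights in Da (made-up numbers).
mw_actives = [512.0, 530.5, 498.2, 545.1]
mw_decoys = [430.3, 455.9, 470.0, 501.7, 440.2]
print(f"AUC from molecular weight alone: {single_descriptor_auc(mw_actives, mw_decoys):.2f}")
```

An AUC near 0.5 means the descriptor carries no information about activity labels, while a value far from 0.5 in either direction suggests that a model could separate actives from decoys without learning anything about protein-ligand interactions.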