Quality Issues with Public Domain Chemogenomics Data

Kalliokoski, Tuomo; Krämer, Christian; Vulpetti, Anna

doi:10.1002/minf.201300051

Cited by 19 publications

(17 citation statements)

References 62 publications

“…However, compared to the Morgan2 fingerprint, the QAFFP fingerprints were able to retrieve significantly higher number of new scaffolds. These findings are rather encouraging given that (i) the QAFFP fingerprints are much shorter, (ii) the QAFFP fingerprints are defined on a purely data-driven fashion, without selecting the targets following biological reasons, and (iii) the models from which the QAFFP fingerprints are derived are far from perfect as their quality is influenced by, for example, QSAR modeling errors [107,108], experimental errors in publicly available data [109–111], data curation errors [69,112] or data imputation noise. On [Table 4: The average number of ACSKs per an assay (and its standard error of the mean SEM) in 22 CLASS sets revealed by the Morgan2, rv-QAFFP and b-QAFFP fingerprints. Model AD was estimated by an ICP with the confidence level of 90%.]…”

Section: Discussion (mentioning)

confidence: 79%

QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping

et al. 2020

An affinity fingerprint is a vector consisting of a compound's affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint whose components are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented, and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~0.65 and ~0.70 for similarity searching depending on data sets, and ~0.85 for classification) and EF5 (~4.67 and ~5.82 for similarity searching depending on data sets, and ~2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~0.57 and ~0.66, and EF5 of ~4.09 and ~6.41, depending on data sets; classification AUC of ~0.87 and EF5 of ~2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping, as it is able to retrieve 1146 out of the 1749 existing scaffolds, while the Morgan2 fingerprint reveals only 864 scaffolds.

“…If these guidelines would be adopted in all public databases, the quality of datasets for the development and evaluation of scoring functions would increase substantially. Kalliokoski et al recently published a review where the topic of quality in bioactivity databases is discussed in more detail …”

Section: Results (mentioning)

confidence: 99%

“…Kalliokoski et al recently published a review where the topic of quality in bioactivity databases is discussed in more detail [65]. Concerning databases which can be used for the geometrical analysis of protein-ligand interactions, the format of the stored data and the possibility to generate user-specific queries are very important aspects. CREDO stores all interactions in form of an interaction fingerprint.…”

Section: Results (mentioning)

confidence: 99%

Protein–ligand interaction databases: advanced tools to mine activity data and interactions on a structural level

Inhester, Rarey 2014

WIREs Comput Mol Sci

The formation of molecular complexes between proteins and small organic substances is a fundamental concept of life. Biochemical experiments from X‐ray crystallography to isothermal titration calorimetry (ITC) are applied on a large scale, providing data for the analysis of the structural foundations of binding affinity. In recent years, several mostly publicly available databases have emerged containing affinity data and structural information. These databases are central to the construction of complex models describing interaction geometries and correlating structural features with the strength of binding. Binding affinity databases reflect the knowledge of affinity measurements from many sources, mostly scientific and patent literature. A critical aspect is the data quality, which is affected by transcription errors during database construction as well as experimental uncertainties. The Protein Data Bank (PDB) is the central resource for macromolecular biological structures, containing nearly 100,000 data entries today. Sophisticated geometric databases have been constructed based on this, allowing for complex queries about the spatial arrangement of functional groups and their interactions. For scientists working in molecular design, like medicinal chemists, access to this information can substantially support the process of creating new molecular entities specifically interacting with proteins of interest. WIREs Comput Mol Sci 2014, 4:562–575. doi: 10.1002/wcms.1192. This article is categorized under: Structure and Mechanism > Molecular Structures; Computer and Information Science > Chemoinformatics; Computer and Information Science > Databases and Expert Systems

“…It is known that some noise and various contradictions are stored in, and migrate from one source of bioactivity data to another, along with correct records (Kramer and Lewis, 2012; Kalliokoski et al, 2013; Tiikkainen et al, 2013; Papadatos et al, 2015). Thus, it is necessary to filter the data before using them in order to eliminate incorrect data and records that are inconsistent with the goal of the virtual screening study (Fourches et al, 2016).…”

Section: Methods (mentioning)

confidence: 99%

How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors

et al. 2018

Discovery of new pharmaceutical substances is currently boosted by the possibility of using the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
