Abstract: The key concept in chemogenomics is the similarity principle, which states that similar ligands should bind similar targets. Chemogenomic analysis requires large amounts of data as well as powerful computational algorithms and computers. Data used for chemogenomic analysis can either be compiled from open sources or produced in-house, as is often done in the pharmaceutical industry. The chemogenomic modeller often has to resort to mixing activity values from different laboratories and even assay types…
“…However, compared to the Morgan2 fingerprint, the QAFFP fingerprints were able to retrieve significantly higher number of new scaffolds. These findings are rather encouraging given that (i) the QAFFP fingerprints are much shorter, (ii) the QAFFP fingerprints are defined on a purely data-driven fashion, without selecting the targets following biological reasons, and (iii) the models from which the QAFFP fingerprints are derived are far from perfect as their quality is influenced by, for example, QSAR modeling errors [107,108], experimental errors in publicly available data [109][110][111], data curation errors [69,112] or data imputation noise. On Table 4 The average number of ACSKs per an assay (and its standard error of the mean SEM) in 22 CLASS sets revealed by the Morgan2, rv-QAFFP and b-QAFFP fingerprints Model AD was estimated by an ICP with the confidence level of 90%.…”
An affinity fingerprint is a vector consisting of a compound's affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint whose components are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented, and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~ 0.65 and ~ 0.70 for similarity searching depending on data sets, and ~ 0.85 for classification) and EF5 (~ 4.67 and ~ 5.82 for similarity searching depending on data sets, and ~ 2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~ 0.57 and ~ 0.66, and EF5 of ~ 4.09 and ~ 6.41, depending on data sets; classification AUC of ~ 0.87 and EF5 of ~ 2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping: it retrieves 1146 of the 1749 existing scaffolds, while the Morgan2 fingerprint reveals only 864.
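The EF5 values quoted above can be made concrete with a small calculation. The following is a minimal sketch, assuming that EF5 denotes the enrichment factor in the top 5% of a ranked list, which is the usual reading but is not spelled out in the abstract. The function name `enrichment_factor` and its arguments are hypothetical and are not taken from the QAFFP paper.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float],
                      labels: Sequence[int],
                      fraction: float = 0.05) -> float:
    """Enrichment factor in the top `fraction` of a ranked list.

    scores: similarity or model scores, higher means ranked earlier.
    labels: 1 for an active compound, 0 for an inactive one.
    Assumption: EF = (active rate in the top fraction) / (active rate overall).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")

    # Size of the top slice; at least one compound is always inspected.
    n_top = max(1, round(n_total * fraction))

    # Rank compounds by descending score and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])

    return (hits / n_top) / (n_actives / n_total)
```

Under this definition a value of 1 corresponds to random ranking, and the maximum attainable value at 5% is 20, so the reported EF5 values of roughly 4 to 6 indicate a several-fold enrichment of actives at the top of the list.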
“…If these guidelines would be adopted in all public databases, the quality of datasets for the development and evaluation of scoring functions would increase substantially. Kalliokoski et al recently published a review where the topic of quality in bioactivity databases is discussed in more detail …”
“…Kalliokoski et al recently published a review where the topic of quality in bioactivity databases is discussed in more detail. 65 Concerning databases which can be used for the geometrical analysis of protein-ligand interactions, the format of the stored data and the possibility to generate user-specific queries are very important aspects. CREDO stores all interactions in form of an interaction fingerprint.…”
The formation of molecular complexes between proteins and small organic substances is a fundamental concept of life. Biochemical experiments ranging from X-ray crystallography to isothermal titration calorimetry (ITC) are applied on a large scale, providing data for the analysis of the structural foundations of binding affinity. In recent years, several mostly publicly available databases containing affinity data and structural information have emerged. These databases are central to the construction of complex models that describe interaction geometries and correlate structural features with the strength of binding. Binding affinity databases reflect the knowledge of affinity measurements from many sources, mostly scientific and patent literature. A critical aspect is data quality, which is affected by transcription errors during database construction as well as by experimental uncertainties. The Protein Data Bank (PDB) is the central resource for macromolecular biological structures, containing nearly 100,000 entries today. Sophisticated geometric databases have been built on it, allowing complex queries about the spatial arrangement of functional groups and their interactions. For scientists working in molecular design, such as medicinal chemists, access to this information can substantially support the process of creating new molecular entities that interact specifically with proteins of interest. WIREs Comput Mol Sci 2014, 4:562–575. doi: 10.1002/wcms.1192
This article is categorized under:
Structure and Mechanism > Molecular Structures
Computer and Information Science > Chemoinformatics
Computer and Information Science > Databases and Expert Systems
“…It is known that some noise and various contradictions are stored in, and migrate from one source of bioactivity data to another, along with correct records (Kramer and Lewis, 2012 ; Kalliokoski et al, 2013 ; Tiikkainen et al, 2013 ; Papadatos et al, 2015 ). Thus, it is necessary to filter the data before using them in order to eliminate incorrect data and records that are inconsistent with the goal of the virtual screening study (Fourches et al, 2016 ).…”
Discovery of new pharmaceutical substances is currently boosted by the availability of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed one-step synthetic route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the “active” and “inactive” compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound, even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database, we performed the PASS training and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
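The difference between the two labelling strategies described above can be illustrated with a short sketch. It is not the PASS implementation and does not reproduce the study's data handling. The record layout (a mapping from `(compound, kinase)` pairs to an activity flag) and the function name `training_labels` are assumptions made purely for illustration.

```python
from typing import Dict, Tuple

# Hypothetical record layout: (compound_id, kinase_name) -> experimentally active?
ActivityRecords = Dict[Tuple[str, str], bool]


def training_labels(records: ActivityRecords,
                    kinase: str,
                    merged: bool = False) -> Dict[str, bool]:
    """Build per-compound training labels for one kinase.

    merged=False: keep only compounds actually tested against `kinase`
                  (experimentally confirmed actives and inactives).
    merged=True:  additionally label every compound in the data set that was
                  NOT tested against `kinase` as inactive (an assumed inactive).
    """
    # Compounds with a measured outcome for this kinase.
    labels = {cpd: active
              for (cpd, target), active in records.items()
              if target == kinase}

    if merged:
        # Untested compounds are designated inactive; measured labels are kept.
        for cpd, _target in records:
            labels.setdefault(cpd, False)

    return labels
```

With `merged=True` the training set grows, but some of the assumed inactives may in fact be untested actives, which is consistent with the abstract's observation that the confirmed-only sets gave the best accuracy.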