In both academia and the pharmaceutical industry, large-scale assays for drug discovery are expensive and often impractical, particularly for the increasingly important physiologically relevant model systems that require primary cells, organoids, whole organisms, or expensive or rare reagents. We hypothesized that data from a single high-throughput imaging assay can be repurposed to predict the biological activity of compounds in other assays, even those targeting alternate pathways or biological processes. Indeed, quantitative information extracted from a three-channel microscopy-based screen for glucocorticoid receptor translocation was able to predict assay-specific biological activity in two ongoing drug discovery projects. In these projects, repurposing increased hit rates by 50- to 250-fold over that of the initial project assays while increasing the chemical structure diversity of the hits. Our results suggest that data from high-content screens are a rich source of information that can be used to predict and replace customized biological assays.
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high quality data set is a vital step in realizing this goal and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL) including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0203-5) contains supplementary material, which is available to authorized users.
We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundreds of millions of observations, and sparse entity features with millions of dimensions. To achieve this scale-up, we specially designed a sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show the performance and advanced features of Macau in a set of experiments, including a challenging drug-protein activity prediction task. Adam Arany and Jaak Simm contributed equally as first authors. arXiv:1509.04610v2 [stat.ML]
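The abstract above says only that the sampler "relies primarily on noise injection in linear regressions". As a rough illustration of that general idea, and not necessarily Macau's exact procedure, the sketch below draws one posterior sample of regression weights by perturbing the targets and the prior and then solving a single linear system. The function name `sample_weights`, the unit noise variance, and the zero-mean Gaussian prior with precision `lam` are all assumptions made for this example.

```python
import numpy as np


def sample_weights(X, y, lam, rng):
    """Draw one sample from the posterior of Bayesian linear regression.

    Assumed model (illustrative, not taken from the paper):
        y = X @ beta + eps,  eps ~ N(0, I),  beta ~ N(0, I / lam)
    The posterior is N(A^{-1} X^T y, A^{-1}) with A = X^T X + lam * I.
    Injecting noise into the targets and the prior term and solving once
    gives an exact sample without factorizing the covariance matrix.
    """
    n, d = X.shape
    e1 = rng.standard_normal(n)   # noise injected into the targets
    e2 = rng.standard_normal(d)   # noise injected into the prior term
    A = X.T @ X + lam * np.eye(d)
    rhs = X.T @ (y + e1) + np.sqrt(lam) * e2
    return np.linalg.solve(A, rhs)


# Small synthetic check: the empirical mean and covariance of many draws
# should approach the analytic posterior mean and covariance.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)
lam = 2.0

draws = np.array([sample_weights(X, y, lam, rng) for _ in range(5000)])
A = X.T @ X + lam * np.eye(3)
posterior_mean = np.linalg.solve(A, X.T @ y)
posterior_cov = np.linalg.inv(A)
```

Because each draw reduces to one linear solve, the dense `np.linalg.solve` call could in principle be swapped for an iterative solver when the feature matrix is large and sparse; whether and how Macau does this is not stated in the abstract.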
We herewith present a novel approach to predict protein-ligand binding modes from the single two-dimensional structure of the ligand. Known protein-ligand X-ray structures were converted into binary bit strings encoding protein-ligand interactions. An artificial neural network was then set up to first learn and then predict protein-ligand interaction fingerprints from simple ligand descriptors. Specific models were constructed for three targets (CDK2, p38-α, HSP90-α) and 146 ligands for which protein-ligand X-ray structures are available. These models were able to predict protein-ligand interaction fingerprints and to discriminate important features from minor interactions. Predicted interaction fingerprints were successfully used as descriptors to discriminate true ligands from decoys by virtual screening. In some but not all cases, the predicted interaction fingerprints furthermore enable efficient reranking of cross-docking poses and prioritization of the best possible docking solutions.
A mini-HTS on 4000 compounds selected using 2D fragment-based similarity and 3D pharmacophoric and shape similarity to known selective tau aggregate binders identified N-(6-methylpyridin-2-yl)quinolin-2-amine 10 as a novel potent binder to human AD aggregated tau with modest selectivity versus aggregated β-amyloid (Aβ). Initial medicinal chemistry efforts identified key elements for potency and selectivity, as well as suitable positions for radiofluorination, leading to a first generation of fluoroalkyl-substituted quinoline tau binding ligands with suboptimal physicochemical properties. Further optimization toward an improved pharmacokinetic profile led to the discovery of 1,5-naphthyridine 75, a potent and selective tau aggregate binder with potential as a tau PET tracer.
Artificial intelligence (AI) is undergoing a revolution thanks to the breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent works on publicly available pharmaceutical data showed that AI methods are highly promising for drug target prediction. However, the quality of public data might differ from that of industry data due to different labs reporting measurements, different measurement techniques, fewer samples and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning models outperformed comparable models trained by other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study evaluating the potential of machine learning, and especially deep learning, directly at the level of industry-scale settings, and moreover investigating the transferability of publicly learned target prediction models towards industrial bioactivity prediction pipelines.
We describe SILIRID (Simple Ligand–Receptor Interaction Descriptor), a novel fixed-size descriptor characterizing protein–ligand interactions. SILIRID can be obtained from the binary interaction fingerprints (IFPs) by summing up the bits corresponding to identical amino acids. This results in a vector of 168 integers, the product of the number of entries (20 amino acids and one cofactor) and the 8 interaction types per amino acid (hydrophobic, aromatic face to face, aromatic edge to face, H-bond donated by the protein, H-bond donated by the ligand, ionic bond with protein cation, ionic bond with protein anion, and interaction with metal ion). The ability of SILIRID to distinguish different protein binding sites has been examined in similarity searches in the sc-PDB database, a druggable portion of the Protein Data Bank, using various protein–ligand complexes as queries. The performance in retrieving structurally and evolutionarily related classes of proteins was comparable to that of state-of-the-art approaches (ROC AUC ≈ 0.91). SILIRID can efficiently be used to visualize the chemogenomic space covered by sc-PDB using Generative Topographic Mapping (GTM): sc-PDB SILIRID data form clusters corresponding to different protein types.
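The summation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the entry ordering in `ENTRIES`, the `COF` label for the cofactor slot, the per-residue input layout, and the function name `silirid` are all hypothetical choices made for the example. The bit order within each residue follows the order in which the abstract lists the 8 interaction types.

```python
from typing import List, Sequence, Tuple

# 20 standard amino acids plus one cofactor slot (ordering is an assumption).
ENTRIES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "COF",
]
N_TYPES = 8  # interaction types per entry, in the order given in the abstract
ENTRY_INDEX = {name: i for i, name in enumerate(ENTRIES)}


def silirid(ifp: Sequence[Tuple[str, Sequence[int]]]) -> List[int]:
    """Collapse a per-residue binary IFP into a fixed 21 x 8 = 168 vector.

    `ifp` holds one (residue_type, 8 interaction bits) pair per binding-site
    residue; bits of residues with the same type are summed position-wise.
    """
    vector = [0] * (len(ENTRIES) * N_TYPES)
    for residue_type, bits in ifp:
        if len(bits) != N_TYPES:
            raise ValueError("each residue needs exactly 8 interaction bits")
        base = ENTRY_INDEX[residue_type] * N_TYPES
        for offset, bit in enumerate(bits):
            vector[base + offset] += bit
    return vector


# Toy binding site: two leucines and one aspartate (made-up bits).
example_ifp = [
    ("LEU", [1, 0, 0, 0, 0, 0, 0, 0]),  # hydrophobic contact
    ("ASP", [0, 0, 0, 0, 1, 0, 1, 0]),  # H-bond from ligand, protein anion
    ("LEU", [1, 0, 0, 1, 0, 0, 0, 0]),  # hydrophobic, H-bond from protein
]
descriptor = silirid(example_ifp)
```

The resulting vector has the same length for every complex regardless of how many residues line the binding site, which is what makes direct similarity comparison between different proteins possible.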