Fingerprint similarity search methods are especially useful in VS when only a few unrelated ligands are known for a given target, so that more complex and information-rich methods such as pharmacophore searches or structure-based design are not applicable. In addition, fingerprint methods are used to characterize properties of compound collections such as chemical diversity, density in chemical space, and content of biologically active molecules (biodiversity). Such assessments are important for deciding which compounds to screen experimentally, to purchase, or to assemble in a virtual compound deck for in silico screening or de novo design.
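A minimal sketch of such a similarity search is given here, assuming fingerprints are represented as sets of on-bit indices and compared with the Tanimoto coefficient. Scoring each library compound by its best match to any known ligand (max fusion) is an illustrative assumption, as are the compound names and bit sets.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_max_similarity(
    queries: List[Set[int]], library: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Score each library compound by its highest similarity to any known ligand."""
    scored = [
        (name, max(tanimoto(fp, q) for q in queries)) for name, fp in library.items()
    ]
    # Most similar compounds first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two known ligands and a three-compound library.
known_ligands = [{1, 4, 7, 9}, {2, 4, 8}]
library = {"cpd_A": {1, 4, 7}, "cpd_B": {3, 5, 6}, "cpd_C": {2, 4, 8, 9}}
ranking = rank_by_max_similarity(known_ligands, library)
```

In practice the bit sets would come from a cheminformatics toolkit rather than being typed by hand; the ranking logic stays the same.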
Profile-QSAR (pQSAR) is a massively multi-task, two-step machine learning method with unprecedented scope, accuracy and applicability domain. In step one, a "profile" of conventional single-assay random forest regression (RFR) models is trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 substructural fingerprints as compound descriptors. In step two, a panel of PLS models is built using the profile of pIC50 predictions from those RFR models as compound descriptors; hence the name. Having previously described the method for a panel of 728 biochemical and cellular kinase assays, we have now built an enormous pQSAR from 11,805 diverse Novartis IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, required reducing the profile to include only RFR models whose predictions correlate with the assay being modeled. The RFR and pQSAR models were evaluated with our "realistically novel" held-out test set, whose median average similarity to the nearest training set member across the 11,805 assays was only 0.34, thus testing a realistically large applicability domain. For the 11,805 single-assay RFR models, the median correlation of prediction with experiment was only R²ext = 0.05, virtually random, and only 8% of the models achieved our standard success threshold of R²ext = 0.30. For pQSAR, the median correlation was R²ext = 0.53, comparable to 4-concentration experimental IC50s, and 72% of the models met our R²ext > 0.30 standard, totaling 8558 successful models. The successful models included assays from all of the 51 annotated target sub-classes, as well as 4196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million Novartis compounds, totaling 50 billion predictions.
Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, mechanism-of-action prediction, and others.
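The profile-reduction step described above can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the squared Pearson correlation as the selection criterion, the `min_r2` cutoff, and the toy assay names and values are all hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def reduce_profile(
    profile: Dict[str, List[float]], measured: List[float], min_r2: float = 0.1
) -> List[str]:
    """Keep step-one models whose predictions correlate with the target assay.

    profile maps a step-one model name to its predicted pIC50s for the
    compounds measured in the target assay; min_r2 is an assumed cutoff.
    """
    return [
        name
        for name, preds in profile.items()
        if pearson_r(preds, measured) ** 2 >= min_r2
    ]


# Hypothetical toy data: four compounds measured in the assay being modeled.
measured = [5.0, 6.0, 7.0, 8.0]
profile = {
    "assay_1": [5.1, 5.9, 7.2, 7.8],  # tracks the target assay
    "assay_2": [6.1, 5.9, 5.9, 6.1],  # essentially uncorrelated
    "assay_3": [8.0, 7.1, 6.0, 5.2],  # anti-correlated, still informative
}
kept = reduce_profile(profile, measured)
```

The retained columns would then serve as the compound descriptors for the step-two PLS model. Using the squared correlation keeps anti-correlated assays, which a linear model can exploit through a negative coefficient.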
Profile-QSAR is a novel 2D predictive model building method for kinases. This "meta-QSAR" method models the activity of each compound against a new kinase target as a linear combination of its predicted activities against a large panel of 92 previously studied kinases (115 assays). Profile-QSAR starts with a sparse, incomplete kinase-by-compound (KxC) activity matrix, used to generate Bayesian QSAR models for the 92 "basis-set" kinases. These Bayesian QSARs generate a complete "synthetic" KxC activity matrix of predictions. These synthetic activities are used as "chemical descriptors" to train partial least squares (PLS) models, from modest amounts of medium-throughput screening data, for predicting activity against new kinases. The Profile-QSAR predictions for the 92 kinases (115 assays) gave a median external R²(ext) = 0.59 on 25% held-out test sets. The method has proven accurate enough to predict pairwise kinase selectivities with a median correlation of R²(ext) = 0.61 for 958 kinase pairs with at least 600 common compounds. It has been further expanded by adding a "C(k)XC" cellular activity matrix to the KxC matrix to predict cellular activity for 42 kinase-driven cellular assays, with median R²(ext) = 0.58 for 24 target modulation assays and R²(ext) = 0.41 for 18 cell proliferation assays. The 2D Profile-QSAR, along with the 3D Surrogate AutoShim, are the foundations of an internally developed iterative medium-throughput screening (IMTS) methodology for virtual screening (VS) of compound archives as an alternative to experimental high-throughput screening (HTS). The method has been applied to 20 actual prospective kinase projects. Biological results have so far been obtained in eight of them. Q² values ranged from 0.3 to 0.7.
Hit rates at 10 μM for experimentally tested compounds varied from 25% to 80%, except in K5, which was a special case aimed specifically at finding "type II" binders, where none of the compounds were predicted to be active at 10 μM. These overall results are particularly striking as chemical novelty was an important criterion in selecting compounds for testing. The method is completely automated. Predicted activities for nearly 4 million internal and commercial compounds across 115 kinase assays and 42 cellular assays are stored in the corporate database. Like computed physical properties, this predicted kinase activity profile can be computed and stored as each compound is registered.
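The "linear combination" at the core of Profile-QSAR can be written compactly. The notation below is an assumption introduced for illustration: $\hat{y}_k(c)$ is the Bayesian QSAR prediction for compound $c$ against basis-set assay $k$, $b_k$ are the coefficients obtained from the PLS fit, and the intercept $b_0$ is an assumed term.

```latex
\hat{y}_{\mathrm{new}}(c) \;=\; b_0 + \sum_{k=1}^{K} b_k \,\hat{y}_k(c),
\qquad K = 115 \text{ basis-set assays (92 kinases)}
```

Because every $\hat{y}_k(c)$ is itself a model prediction, the descriptor matrix is complete even though the measured KxC matrix is sparse.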
Malaria, in particular that caused by Plasmodium falciparum, is prevalent across the tropics, and its medicinal control is limited by widespread drug resistance. Cysteine proteases of P. falciparum, falcipain-2 (FP-2) and falcipain-3 (FP-3), are major hemoglobinases, validated as potential antimalarial drug targets. Structure-based virtual screening of a focused cysteine protease inhibitor library built with soft rather than hard electrophiles was performed against an X-ray crystal structure of FP-2 using the Glide docking program. An enrichment study was performed to select a suitable scoring function and to retrieve potential candidates against FP-2 from a large chemical database. Biological evaluation of 50 selected compounds identified 21 diverse nonpeptidic inhibitors of FP-2, a hit rate of 42%. Atomic Fukui indices were used to predict the most electrophilic center and its electrophilicity in the identified hits. Comparison of the predicted electrophilicity of electrophiles in the identified hits with those in known irreversible inhibitors suggested the soft nature of the electrophiles in the selected target compounds. The present study highlights the importance of focused libraries and enrichment studies in structure-based virtual screening. In addition, a few compounds were screened against homologous human cysteine proteases for selectivity analysis. Further evaluation of structure-activity relationships around these nonpeptidic scaffolds could help in the development of selective leads for antimalarial chemotherapy.
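An enrichment study of the kind mentioned above is commonly summarized with an enrichment factor. The sketch below uses the standard definition (active rate in the top-ranked fraction divided by the active rate in the whole set); it is not taken from the study itself. It assumes that lower docking scores are better, and the scores and activity labels are hypothetical.

```python
from typing import List


def enrichment_factor(
    scores: List[float], is_active: List[bool], top_fraction: float = 0.01
) -> float:
    """Active rate in the top-ranked fraction divided by the overall active rate.

    Assumes lower (more negative) docking scores rank better.
    """
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one known active")
    n_top = max(1, int(n * top_fraction))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    actives_in_top = sum(active for _, active in ranked[:n_top])
    return (actives_in_top / n_top) / (total_actives / n)


# Hypothetical decoy set: 10 docked compounds, 2 of them known actives.
scores = [-9.1, -8.7, -8.5, -7.9, -7.2, -6.8, -6.5, -6.1, -5.4, -5.0]
is_active = [True, False, False, True, False, False, False, False, False, False]
ef_20 = enrichment_factor(scores, is_active, top_fraction=0.2)

# Hit rate reported in the text: 21 inhibitors among 50 tested compounds.
hit_rate_percent = 100 * 21 / 50
```

Comparing this value across candidate scoring functions on the same set of known actives and decoys is one way to choose among them.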
Severe acute respiratory syndrome is a highly infectious upper respiratory tract disease caused by SARS-CoV, a previously unidentified human coronavirus. SARS-3CL(pro) is a viral cysteine protease critical to the pathogen's life cycle and hence a therapeutic target of importance. The recently elucidated crystal structures of this enzyme provide an opportunity for the discovery of inhibitors through rational drug design. In the current study, the GOLD docking program was used to conduct extensive docking studies against the target crystal structure in order to develop a robust and predictive docking protocol. The validated docking protocol was then used to conduct a structure-based virtual screening of the Asinex Platinum collection. Biological evaluation of a selection of the screened compounds was carried out to identify novel inhibitors of the viral protease.
Reliable in silico prediction methods promise many advantages over experimental high-throughput screening (HTS): vastly lower time and cost, affinity magnitude estimates, no requirement for a physical sample, and a knowledge-driven exploration of chemical space. For the specific case of kinases, given several hundred experimental IC50 training measurements, the empirically parametrized profile-quantitative structure-activity relationship (profile-QSAR) and surrogate AutoShim methods developed at Novartis can predict IC50 with a reliability approaching experimental HTS. However, in the absence of training data, prediction is much harder. The most common a priori prediction method is docking, which suffers from many limitations: it requires a protein structure, is slow, and cannot predict affinity (1). Highly accurate profile-QSAR (2) models have now been built for roughly 100 kinases covering most of the kinome. Analyzing correlations among neighboring kinases shows that near neighbors share a high degree of SAR similarity. The novel chemogenomic kinase-kernel method reported here predicts activity for new kinases as a weighted average of predicted activities from profile-QSAR models for nearby neighbor kinases. Three different factors for weighting the neighbors were evaluated: binding site sequence identity to the kinase neighbors, similarity of the training set for each neighbor model to the compound being predicted, and accuracy of each neighbor model. Binding site sequence identity was by far the most important, followed by chemical similarity. Model quality had almost no relevance. Kinase-kernel interpolations gave a median R² = 0.55 on the 25% of each data set held out from method optimization for 51 kinase assays, approaching the median R² = 0.61 of the trained profile-QSAR predictions on the same held-out data, while being far faster and far more accurate than docking.
Validation on the full data sets from 18 additional kinase assays not part of method optimization studies also showed strong performance with median R² = 0.48. Genetic algorithm optimization of the binding site residues used to compute binding site sequence identity identified 16 privileged residues from a larger set of 46. These 16 are consistent with the kinase selectivity literature and structural biology, further supporting the scientific validity of the approach. A priori kinase-kernel predictions for 4 million compounds were interpolated from 51 existing profile-QSAR models for the remaining >400 novel kinases, totaling 2 billion activity predictions covering the entire kinome. The method has been successfully applied in two therapeutic projects to generate predictions and select compounds for activity testing.
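The weighted-average idea behind the kinase-kernel method can be sketched as below. Only the dominant factor, binding site sequence identity, is modeled. Raising the identity to a `power` to sharpen the weighting is an assumption for illustration, and the 16-residue site strings and predicted pIC50 values are hypothetical.

```python
from typing import List, Tuple


def binding_site_identity(site_a: str, site_b: str) -> float:
    """Fraction of identical residues at aligned binding-site positions."""
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("binding-site strings must be aligned and non-empty")
    return sum(a == b for a, b in zip(site_a, site_b)) / len(site_a)


def kinase_kernel_predict(
    new_site: str, neighbors: List[Tuple[str, float]], power: float = 1.0
) -> float:
    """Weighted average of neighbor-model predictions for one compound.

    neighbors holds (binding-site string, predicted pIC50 from that
    neighbor's model); weights are sequence identity raised to `power`.
    """
    weights = [binding_site_identity(new_site, site) ** power for site, _ in neighbors]
    total = sum(weights)
    if total == 0:
        raise ValueError("no neighbor shares any binding-site residue")
    return sum(w * pred for w, (_, pred) in zip(weights, neighbors)) / total


# Hypothetical 16-residue binding sites and neighbor predictions.
new_kinase_site = "LGVAKELMEYGDNLDF"
neighbors = [
    ("LGVAKELMEYGDNLDF", 7.0),  # identical site, identity 1.0
    ("LGVAKELMAAAAAAAA", 5.0),  # half the positions differ, identity 0.5
]
prediction = kinase_kernel_predict(new_kinase_site, neighbors)
```

A higher `power` lets the closest neighbors dominate; with `power=4.0` the second neighbor's weight in this example drops from 0.5 to 0.0625.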