Profile-QSAR (pQSAR) is a massively multi-task, two-step machine learning method with unprecedented scope, accuracy and applicability domain. In step one, a "profile" of conventional single-assay random forest regression (RFR) models is trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 substructural fingerprints as compound descriptors. In step two, a panel of PLS models is built using the profile of pIC50 predictions from those RFR models as compound descriptors, hence the name. pQSAR was previously described for a panel of 728 biochemical and cellular kinase assays; we have now built an enormous pQSAR from 11,805 diverse Novartis IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, dictated reducing the profile by including only RFR models whose predictions correlate with the assay being modeled. The RFR and pQSAR models were evaluated with our "realistically novel" held-out test set, whose median average similarity to the nearest training set member across the 11,805 assays was only 0.34, thus testing a realistically large applicability domain. For the 11,805 single-assay RFR models, the median correlation of prediction with experiment was only R²ext = 0.05, virtually random, and only 8% of the models achieved our standard success threshold of R²ext = 0.30. For pQSAR, the median correlation was R²ext = 0.53, comparable to 4-concentration experimental IC50s, and 72% of the models met our R²ext > 0.30 standard, totaling 8558 successful models. The successful models included assays from all of the 51 annotated target sub-classes, as well as 4196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million Novartis compounds, totaling 50 billion predictions.
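The profile-reduction step described above (keeping only RFR models whose predictions correlate with the assay being modeled) can be illustrated with a minimal sketch. The selection criterion here is an assumption: a cutoff on the squared Pearson correlation, so that anti-correlated models are also retained. The function names, the `min_r2` value and the toy data are hypothetical and are not taken from the pQSAR implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_profile_columns(profile, measured, min_r2=0.3):
    """Return the names of profile columns whose predictions correlate with the assay.

    profile  : dict mapping model name -> predicted pIC50s, one per compound
    measured : measured pIC50s for the assay being modeled, same compound order
    min_r2   : hypothetical cutoff on the squared Pearson correlation
    """
    keep = []
    for name, preds in profile.items():
        if pearson_r(preds, measured) ** 2 >= min_r2:
            keep.append(name)
    return keep


# Toy data: 6 compounds and 3 hypothetical single-assay RFR models.
measured = [5.0, 5.5, 6.1, 6.8, 7.2, 8.0]
profile = {
    "rfr_assay_A": [5.2, 5.4, 6.0, 6.9, 7.0, 7.7],  # tracks the assay
    "rfr_assay_B": [7.9, 7.1, 6.9, 6.0, 5.6, 5.1],  # anti-correlated, still informative
    "rfr_assay_C": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9],  # essentially unrelated
}
selected = select_profile_columns(profile, measured)
print(selected)
```

In this toy case the first two columns pass the cutoff and the third is dropped; only the retained columns would then serve as descriptors for the PLS model of that assay.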
Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, mechanism-of-action prediction, and others. Furthermore, many cellular assays are purely phenotypic, not specific to any target or family. Thanks to improvements in the implementation, we were able to address this with a massive "all-assay" (AA) pQSAR combining all NVS IC50 assays. The success of the AA pQSAR demonstrated the effectiveness of transfer learning between protein families as well as phenotypic assays. However, using the much larger AA FP of RFR predictions as descriptors to train PLS models on assays with few pIC50s could lead to overfitting, so a variable selection method was required. Overzealous variable selection from such a large descriptor pool also invites chance correlations [12], which Y-scrambling showed to be minimal. This report describes building a massive multi-task AA pQSAR combining 11,805 diverse NVS dose-response assays crossing all protein families as well as phenotypic assays. To facilitate public comparisons to other methods, a second massive AA pQSAR was built for 4276 diverse, publicly available ChEMBL [13] assays.
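The idea behind the Y-scrambling check mentioned above can be sketched as follows. This is an illustration only, not the authors' protocol: picking the single best-correlated descriptor stands in for variable selection followed by PLS, and the data are synthetic. All names, sizes and seeds are hypothetical. If selection from a large pool were mostly finding chance correlations, models fitted to shuffled activities would score nearly as well as the model fitted to the real ones.

```python
import math
import random


def r_squared(xs, ys):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def best_r2(descriptors, y):
    """Simplest possible variable selection: keep the single best descriptor."""
    return max(r_squared(col, y) for col in descriptors)


def y_scramble(descriptors, y, n_rounds=20, seed=0):
    """Repeat the selection on randomly permuted activities; return each round's score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y)
        rng.shuffle(shuffled)  # breaks any real descriptor-activity relationship
        scores.append(best_r2(descriptors, shuffled))
    return scores


# Synthetic data: 40 compounds, 50 random descriptors, activity driven by descriptor 0.
rng = random.Random(42)
n_compounds, n_descriptors = 40, 50
descriptors = [[rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
               for _ in range(n_descriptors)]
y = [6.0 + x + rng.gauss(0.0, 0.1) for x in descriptors[0]]

true_r2 = best_r2(descriptors, y)
scrambled = y_scramble(descriptors, y)
print(f"real y: {true_r2:.2f}; scrambled y (max of {len(scrambled)}): {max(scrambled):.2f}")
```

A large gap between the score on the real activities and the scores on the scrambled ones suggests that the selected descriptors carry genuine signal instead of chance correlation.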
METHODS
Data ...