Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because of their higher speed and lower cost compared to experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application in predicting large compound libraries or developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherited its high predictivity but resolved its scalability and long computational time by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity datasets using a Bayesian optimization integrated nested 10-fold cross-validation scheme that performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while requiring significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and also in other areas of cheminformatics to fulfill the ever-growing demand for accurate and rapid prediction of various toxicity or activity related endpoints of large compound libraries present in the pharmaceutical and chemical industry.
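The nested cross-validation idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the inner loop uses a plain exhaustive search over a few candidates instead of Bayesian optimization, the model is a toy 1-D k-nearest-neighbour classifier instead of LightGBM, and all function names and the data are hypothetical. The point it shows is that the hyperparameter is chosen using only the outer-training data, so the outer test fold stays untouched for the generalization estimate.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points (toy model)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, held):
    """Return (train, test) lists given the held-out indices."""
    held_set = set(held)
    train = [d for i, d in enumerate(data) if i not in held_set]
    test = [data[i] for i in held]
    return train, test

def cv_score(data, k_neighbors, n_folds, seed):
    """Mean accuracy of one hyperparameter value over n_folds folds."""
    scores = []
    for held in kfold_indices(len(data), n_folds, seed):
        train, test = split(data, held)
        scores.append(accuracy(train, test, k_neighbors))
    return sum(scores) / len(scores)

def nested_cv(data, candidates, outer_folds=5, inner_folds=3, seed=0):
    """Outer loop estimates generalization; inner loop selects the hyperparameter."""
    results = []
    for held in kfold_indices(len(data), outer_folds, seed):
        train, test = split(data, held)
        # Inner search sees only the outer-training data (assumption: grid
        # search stands in for the Bayesian optimization used in the study).
        best = max(candidates,
                   key=lambda c: cv_score(train, c, inner_folds, seed + 1))
        results.append((best, accuracy(train, test, best)))
    return results

# Hypothetical toy data: label is 1 when x >= 20.
data = [(float(x), int(x >= 20)) for x in range(40)]
results = nested_cv(data, candidates=[1, 3, 5])
```

Each entry of `results` pairs the hyperparameter chosen for one outer fold with the accuracy on that fold's held-out data; averaging the accuracies gives the nested estimate.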
Virtual screening is widely applied in drug discovery, and significant effort has been put into improving current methods. In this study, we have evaluated the performance of compound ranking in virtual screening using five different data fusion algorithms on a total of 16 data sets. The data were generated by docking, pharmacophore search, shape similarity, and electrostatic similarity, spanning both structure- and ligand-based methods. The algorithms used for data fusion were sum rank, rank vote, sum score, Pareto ranking, and parallel selection. None of the fusion methods require any prior knowledge or input other than the results from the single methods and, thus, are readily applicable. The results show that compound ranking using data fusion improves the performance and consistency of virtual screening compared to the single methods alone. The best performing data fusion algorithm was parallel selection, but both rank voting and Pareto ranking also have good performance.
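Two of the fusion schemes named in this abstract, sum rank and parallel selection, can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: every method is assumed to score the same compounds, a higher score is assumed to be better (real docking scores often run the other way and would need their sign flipped), and the function names, method names, and scores are hypothetical.

```python
def ranks(scores):
    """Map compound -> rank (1 = best), assuming a higher score is better."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {compound: i + 1 for i, compound in enumerate(ordered)}

def sum_rank(methods):
    """Order compounds by the sum of their ranks across methods (lower is better)."""
    per_method = [ranks(s) for s in methods.values()]
    compounds = per_method[0].keys()
    total = {c: sum(r[c] for r in per_method) for c in compounds}
    return sorted(total, key=total.get)

def parallel_selection(methods, n_select):
    """Round-robin: each method in turn contributes its best unselected compound."""
    orders = [sorted(s, key=s.get, reverse=True) for s in methods.values()]
    positions = [0] * len(orders)
    selected = []
    while len(selected) < n_select:
        progressed = False
        for m, order in enumerate(orders):
            # Skip compounds that another method has already contributed.
            while positions[m] < len(order) and order[positions[m]] in selected:
                positions[m] += 1
            if positions[m] < len(order) and len(selected) < n_select:
                selected.append(order[positions[m]])
                progressed = True
        if not progressed:  # every list is exhausted
            break
    return selected

# Hypothetical scores from two single methods for four compounds.
methods = {
    "docking": {"A": 9.0, "B": 7.0, "C": 5.0, "D": 1.0},
    "shape":   {"A": 0.2, "B": 0.9, "C": 0.8, "D": 0.1},
}
fused_order = sum_rank(methods)
top_three = parallel_selection(methods, 3)
```

Neither function needs anything beyond the single-method results, which matches the abstract's remark that the fusion methods require no prior knowledge.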
Multiple sclerosis (MS) is a T-cell-mediated disease of the central nervous system, characterized by damage to myelin and axons, resulting in progressive neurological disability. Genes may influence susceptibility to MS, but results of association studies are inconsistent, aside from the identification of HLA class II haplotypes. Whole-genome linkage screens in MS have both confirmed the importance of the HLA region and uncovered non-HLA loci that may harbor susceptibility genes. In this two-stage analysis, we determined genotypes, in up to 672 MS patients and 672 controls, for 123 single-nucleotide polymorphisms (SNPs) in 66 genes. Genes were chosen based on their chromosomal positions or biological functions. In stage one, 22 genes contained at least one SNP for which the carriage rate for one allele differed significantly (P<0.08) between patients and controls. After additional genotyping in stage two, two genes, each containing at least three significantly (P<0.05) associated SNPs, conferred susceptibility to MS: LAG3 on chromosome 12p13, and IL7R on 5p13. LAG3 inhibits activated T cells, while IL7R is necessary for the maturation of T and B cells. These results imply that germline allelic variation in genes involved in immune homeostasis, and by extension derangement of immune homeostasis, influence the risk of MS.
Carboxylesterase Notum is a negative regulator of the Wnt signaling pathway. There is an emerging understanding of the role Notum plays in disease, supporting the need to discover new small molecule inhibitors. A crystallographic X-ray fragment screen was performed, which identified fragment hit 1,2,3-triazole 7 as an attractive starting point for a structure-based drug design hit-to-lead program. Optimization of 7 identified oxadiazol-2-one 23dd as a preferred example with properties consistent with drug-like chemical space. Screening 23dd in a cell-based TCF/LEF reporter gene assay restored activation of Wnt signaling in the presence of Notum. Mouse pharmacokinetic studies with oral administration of 23dd demonstrated good plasma exposure and partial blood-brain barrier penetration. Significant progress was made in developing fragment hit 7 into lead 23dd (>600-fold increase in activity), making it suitable as a new chemical tool for exploring the role of Notum-mediated regulation of Wnt signaling.
Making predictions with an associated confidence is highly desirable as it facilitates decision making and resource prioritization. Conformal regression is a machine learning framework that allows the user to define the required confidence and delivers predictions that are guaranteed to be correct to the selected extent. In this study, we apply conformal regression to model molecular properties and bioactivity values and investigate different ways to scale the resultant prediction intervals to create as efficient (i.e., narrow) regressors as possible. Different algorithms to estimate the prediction uncertainty were used to normalize the prediction ranges, and the different approaches were evaluated on 29 publicly available data sets. Our results show that the most efficient conformal regressors are obtained when using the natural exponential of the ensemble standard deviation from the underlying random forest to scale the prediction intervals, but other approaches were almost as efficient. This approach afforded an average prediction range of 1.65 pIC50 units at the 80% confidence level when applied to bioactivity modeling. The choice of nonconformity function has a pronounced impact on the average prediction range with a difference of close to one log unit in bioactivity between the tightest and widest prediction range. Overall, conformal regression is a robust approach to generate bioactivity predictions with associated confidence.
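The interval scaling described in this abstract can be sketched for the inductive (split) form of conformal regression. This is an illustration under assumptions, not the authors' code: the nonconformity score is taken to be the absolute error divided by the natural exponential of an uncertainty estimate `sigma` (for example an ensemble standard deviation), and the function names and calibration numbers are hypothetical.

```python
import math

def calibrate(y_true, y_pred, sigma, confidence):
    """Return the normalized nonconformity threshold for the given confidence."""
    scores = sorted(abs(y - p) / math.exp(s)
                    for y, p, s in zip(y_true, y_pred, sigma))
    n = len(scores)
    rank = math.ceil((n + 1) * confidence)  # 1-based rank of the threshold score
    if rank > n:
        return math.inf  # too few calibration points for this confidence level
    return scores[rank - 1]

def interval(pred, sigma, threshold):
    """Scale the threshold by exp(sigma) to obtain a per-compound interval."""
    half_width = threshold * math.exp(sigma)
    return pred - half_width, pred + half_width

# Hypothetical calibration set: nine compounds with errors 0.1 ... 0.9
# and identical uncertainty estimates (sigma = 0, so exp(sigma) = 1).
cal_pred = [5.0] * 9
cal_true = [5.0 + 0.1 * i for i in range(1, 10)]
cal_sigma = [0.0] * 9

threshold = calibrate(cal_true, cal_pred, cal_sigma, confidence=0.8)
confident_case = interval(6.0, 0.0, threshold)   # low estimated uncertainty
uncertain_case = interval(6.0, 1.0, threshold)   # interval widened by a factor of e
```

Compounds with a larger estimated uncertainty receive wider intervals and those with a smaller one receive narrower intervals, which is how normalization improves efficiency while the calibration step preserves the chosen confidence level.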
The assessment of compound cytotoxicity is an important part of the drug discovery process. Accurate predictions of cytotoxicity have the potential to expedite decision making and save considerable time and effort. In this work we apply class conditional conformal prediction to model the cytotoxicity of compounds based on 16 high throughput cytotoxicity assays from PubChem. The data span 16 cell lines and comprise more than 440 000 unique compounds. The data sets are heavily imbalanced with only 0.8% of the tested compounds being cytotoxic. We trained one classification model for each cell line and validated the performance with respect to validity and accuracy. The generated models deliver high quality predictions for both toxic and non-toxic compounds despite the imbalance between the two classes. On external data collected from the same assay provider as one of the investigated cell lines the model had a sensitivity of 74% and a specificity of 65% at the 80% confidence level among the compounds assigned to a single class. Compared to previous approaches for large scale cytotoxicity modelling, this represents a balanced performance in the prediction of the toxic and non-toxic classes. The conformal prediction framework also allows the modeller to control the error frequency of the predictions, allowing predictions of cytotoxicity outcomes with confidence.
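Class conditional (Mondrian) conformal prediction, as named in this abstract, can be sketched as below. This is a minimal illustration under assumptions, not the study's models: the nonconformity score is assumed to be one minus the model's predicted probability for the class in question, and the function names and calibration scores are hypothetical. The key point is that each class is calibrated only against its own examples, which is what keeps the error rate controlled for the rare toxic class as well as the common non-toxic one.

```python
def p_value(cal_scores, test_score):
    """Share of same-class calibration scores at least as nonconforming as the test score."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)

def predict_set(cal_by_class, test_scores, significance):
    """Return every label whose class-specific p-value exceeds the significance level."""
    return {label for label, score in test_scores.items()
            if p_value(cal_by_class[label], score) > significance}

# Hypothetical calibration nonconformity scores, kept separately per class.
cal_by_class = {
    "toxic": [0.1, 0.2, 0.3, 0.6],
    "nontoxic": [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9],
}

# A test compound whose model probability of being toxic is 0.75,
# giving nonconformity 0.25 for "toxic" and 0.75 for "nontoxic".
test_scores = {"toxic": 0.25, "nontoxic": 0.75}

at_80 = predict_set(cal_by_class, test_scores, significance=0.2)
at_90 = predict_set(cal_by_class, test_scores, significance=0.1)
```

At the 80% confidence level the compound is assigned to a single class, while at 90% both labels remain in the prediction set; only the single-class predictions would count toward figures such as the sensitivity and specificity quoted in the abstract.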
An increasing number of new drugs have their origin in small biotech or academia. In contrast to big pharma, these environments are often more limited in terms of resources and this necessitates different approaches to the drug discovery process. In this perspective, we outline how computational methods can help advance drug discovery in a setting with more limited resources and we share what, based on our experience, are the best practices for these methods.