Retrieving descriptor information (x information) from a value of an objective variable (y) is a fundamental problem in inverse quantitative structure-property relationship (inverse-QSPR) analysis, but it is challenging because of the complexity of the preimage function. Here, we propose using a cluster-wise multiple linear regression (cMLR) model as the QSPR model for inverse-QSPR analysis. x information is acquired as a probability density function by combining cMLR with a prior distribution modeled with Gaussian mixture models (GMMs). Three case studies were conducted to demonstrate various aspects of the potential of cMLR. The predictive power of cMLR was superior to that of MLR, especially for data with nonlinearity. Moreover, the applicability domain could be taken into account because the posterior distribution inherits the features of the prior distribution (i.e., of the training data) and represents the possibility of having the desired property. Finally, a series of inverse analyses with the GMMs/cMLR was demonstrated with the aim of generating de novo structures having specific aqueous solubility.
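As a rough illustration of how a GMM prior and cluster-wise linear models can combine into a posterior density over x, the sketch below assumes each mixture component carries its own linear model with Gaussian noise, so that the posterior is again a mixture of Gaussians with reweighted components. The function name `posterior_mixture`, its parameters, and the toy numbers are all hypothetical and are not taken from the abstract; the authors' actual formulation may differ.

```python
import numpy as np

def posterior_mixture(y, weights, means, covs, coefs, intercepts, noise_vars):
    """Return (weights, means, covariances) of p(x | y) as a Gaussian mixture.

    Assumption: component k has prior N(mean_k, cov_k) and a linear model
    y = coef_k . x + intercept_k + Gaussian noise with variance noise_var_k.
    """
    post_w, post_means, post_covs = [], [], []
    for pi_k, mu, cov, w, b, s2 in zip(weights, means, covs, coefs,
                                       intercepts, noise_vars):
        y_mean = w @ mu + b              # predicted y at the component mean
        y_var = w @ cov @ w + s2         # marginal variance of y in component k
        gain = cov @ w / y_var           # how strongly x shifts per unit of y residual
        post_means.append(mu + gain * (y - y_mean))
        post_covs.append(cov - np.outer(gain, w @ cov))
        # Component responsibility: prior weight times marginal likelihood of y
        lik = np.exp(-0.5 * (y - y_mean) ** 2 / y_var) / np.sqrt(2 * np.pi * y_var)
        post_w.append(pi_k * lik)
    post_w = np.array(post_w)
    return post_w / post_w.sum(), np.array(post_means), np.array(post_covs)

# Toy two-cluster example in two dimensions (made-up numbers)
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2), np.eye(2)])
coefs = np.array([[1.0, 0.0], [0.0, 2.0]])
intercepts = np.array([0.0, 1.0])
noise_vars = np.array([0.1, 0.1])

w_post, m_post, c_post = posterior_mixture(
    11.0, weights, means, covs, coefs, intercepts, noise_vars)
```

In this toy setting the target value y = 11 is far from what the first cluster predicts, so almost all posterior weight moves to the second cluster. That is one way to see how a posterior of this kind stays close to regions supported by the training data.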
Screening of compound libraries against panels of targets yields profiling matrices. Such matrices typically contain structurally diverse screening compounds, large numbers of inactives, and small numbers of hits per assay. As such, they represent interesting and challenging test cases for computational screening and activity predictions. In this work, large compound profiling matrices extracted from publicly available screening data were modeled. Different machine learning methods, including deep learning, were compared and different prediction strategies were explored. Prediction accuracy varied for assays with different numbers of active compounds, and alternative machine learning approaches often produced comparable results. Deep learning did not increase prediction accuracy beyond that of standard methods such as random forests or support vector machines. Target-based random forest models were prioritized and yielded successful predictions of active compounds for many assays.
Chemical structure generation based on quantitative structure-property relationship (QSPR) or quantitative structure-activity relationship (QSAR) models is one of the central themes in the field of computer-aided molecular design. The objective of structure generation is to find promising molecules that, according to statistical models, are considered to have desired properties. In this paper, a new method is proposed for the exhaustive generation of chemical structures based on inverse-QSPR/QSAR. In this method, QSPR/QSAR models are constructed by multiple linear regression, and then the conditional distribution of explanatory variables given the desired properties is estimated by inverse analysis of the models using the framework of a linear Gaussian model. Finally, chemical structures are exhaustively generated by a sophisticated algorithm that is based on a canonical construction path method. The usefulness of the proposed method is demonstrated using a dataset of the boiling points of acyclic hydrocarbons containing up to 12 carbon atoms. The QSPR model was constructed with 600 hydrocarbons and their boiling points. Using the proposed method, chemical structures that had boiling points of 100, 150, or 200 °C were exhaustively generated.
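For orientation, the standard linear Gaussian result behind this kind of inverse analysis can be written as follows. The symbols are assumptions introduced here and do not come from the abstract: a Gaussian prior with mean \mu and covariance \Sigma over the descriptors x, regression coefficients w, intercept b, and noise variance \sigma^2. The paper's own parameterization may differ.

```latex
\begin{align}
  p(x) &= \mathcal{N}(x \mid \mu, \Sigma), \\
  p(y \mid x) &= \mathcal{N}(y \mid w^{\top} x + b, \sigma^{2}), \\
  p(x \mid y) &= \mathcal{N}(x \mid m, S), \\
  S &= \left( \Sigma^{-1} + \sigma^{-2} w w^{\top} \right)^{-1}, \\
  m &= S \left( \Sigma^{-1} \mu + \sigma^{-2} w \, (y - b) \right).
\end{align}
```

Under these assumptions, fixing y to a desired property value (for example, a target boiling point) gives a Gaussian over descriptor space from which candidate descriptor vectors can be drawn or scored.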
A primary goal of ligand-based virtual screening is to identify active compounds containing a core scaffold that is not found in the current pool of active compounds. This task is termed scaffold hopping. In the present study, topological representations of pharmacophore features on chemical graphs were investigated for scaffold hopping. Pharmacophore graphs (PhGs), which consist of pharmacophore features as nodes and their topological distances as edges, were used to represent the information important for compound activity. We investigated ranking methods for prioritizing PhGs for scaffold hopping. The proposed method, NScaffold, which ranks PhGs based on the number of scaffolds covered by the PhGs, outperforms conventional methods. As a demonstrative case, using a thrombin inhibitor data set, we interpreted the PhGs ranked highest by NScaffold from the protein-ligand interaction point of view. The NScaffold method successfully retrieved three known important interactions, showing potential for identifying scaffold-hopped compounds with interpretable PhGs.
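The ranking idea described above, scoring each PhG by the number of distinct scaffolds it covers, can be sketched in a few lines. This is only an illustration under assumed inputs: the function name `rank_by_scaffold_coverage`, the dictionary layout, and the identifiers are hypothetical, and the published NScaffold method may involve further steps, such as how PhG matches and scaffolds are computed.

```python
def rank_by_scaffold_coverage(matches, scaffold_of):
    """Rank pharmacophore graphs by the number of distinct scaffolds they cover.

    matches:     PhG id -> iterable of active compound ids matched by that PhG
    scaffold_of: compound id -> scaffold id
    Returns a list of (PhG id, scaffold count), highest coverage first;
    ties are broken alphabetically by PhG id.
    """
    coverage = {
        phg: len({scaffold_of[cpd] for cpd in cpds})
        for phg, cpds in matches.items()
    }
    return sorted(coverage.items(), key=lambda item: (-item[1], item[0]))

# Made-up example: three PhGs, four active compounds, three scaffolds
matches = {
    "PhG_A": ["c1", "c2", "c3"],
    "PhG_B": ["c1", "c4"],
    "PhG_C": ["c2"],
}
scaffold_of = {"c1": "s1", "c2": "s1", "c3": "s2", "c4": "s3"}

ranking = rank_by_scaffold_coverage(matches, scaffold_of)
```

Counting scaffolds rather than compounds means that a PhG matched by many analogs of a single scaffold does not outrank one matched by fewer compounds spread over more scaffolds.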
Chemical space view of an analog series. Shown are inactive (red) and active (blue) analogs together with virtual candidate compounds (green) available to expand the series. Chemical neighborhoods of analogs are depicted in gray.