The goal of this paper is to estimate the number of realistic drug-like molecules that could ever be synthesized. Unlike previous studies based on exhaustive enumeration of molecular graphs or on combinatorial enumeration of preselected fragments, we used the results of constrained graph enumeration by Reymond to establish a correlation between the number of generated structures (M) and the number of heavy atoms (N): logM = 0.584 × N × logN + 0.356. The number of heavy atoms limiting the drug-like chemical space of molecules that follow Lipinski's rules (N = 36) was obtained from an analysis of the PubChem database. This results in M ≈ 10³³, which lies between the numbers estimated by Ertl (10²³) and by Bohacek (10⁶⁰).
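The correlation above can be evaluated directly; a minimal sketch, assuming logarithms base 10 as implied by the 10³³ result (the function name is ours, not the paper's):

```python
import math

def log_chemical_space(n_heavy_atoms: int) -> float:
    """Reymond-based correlation: log10(M) = 0.584 * N * log10(N) + 0.356."""
    n = n_heavy_atoms
    return 0.584 * n * math.log10(n) + 0.356

# Drug-like limit from the PubChem analysis: N = 36 heavy atoms
log_m = log_chemical_space(36)
print(f"log10(M) = {log_m:.2f}")  # close to 33, i.e. M is on the order of 10^33
```

Plugging in N = 36 reproduces the paper's order-of-magnitude estimate of M ≈ 10³³.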
The estimation of the accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, in silico predictions can be used to halve the cost of experimental measurements while providing similar prediction accuracy. The developed model is publicly available at http://ochem.eu/models/1.
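The standard-deviation-based DM described above can be illustrated with a minimal sketch; the per-model scores below are hypothetical and the function is an illustration, not OCHEM's implementation:

```python
from statistics import mean, stdev

def consensus_with_dm(ensemble_predictions):
    """Return the consensus (mean) prediction for one compound together with
    the STD-based distance to model (DM): the standard deviation of the
    predictions of the individual models in the ensemble."""
    return mean(ensemble_predictions), stdev(ensemble_predictions)

# Hypothetical per-model mutagenicity scores for two compounds:
p1, dm1 = consensus_with_dm([0.91, 0.88, 0.93, 0.90])  # models agree: low DM
p2, dm2 = consensus_with_dm([0.10, 0.85, 0.40, 0.95])  # models diverge: high DM
```

Compounds with low DM are the ones for which prediction accuracy approaches the interlaboratory accuracy of the Ames test; high-DM compounds would be flagged for experimental measurement.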
This paper is an overview of the most significant and impactful approaches to the interpretation of quantitative structure-activity relationship (QSAR) models, their development, and their application. The evolution of the interpretation paradigm from "model → descriptors → (structure)" to "model → structure" is traced. The latter makes all models interpretable regardless of the machine learning methods or descriptors used for modeling. This opens wide prospects for applying the corresponding interpretation approaches to retrieve structure-property relationships captured by any model. Issues of individual approaches are discussed, as well as general issues and prospects of QSAR model interpretation.
This review is devoted to the critical analysis of the advantages and disadvantages of existing mixture descriptors and their usage in various QSAR/QSPR tasks. We describe good practices for the QSAR modeling of mixtures, data sources for mixture properties, various mixture descriptors and their application, recommendations on proper external validation specific to mixture QSAR modeling, and future perspectives of this field. The biggest problem in QSAR of mixtures is the lack of reliable data on mixture properties. Various mixture descriptors are used for the modeling of different endpoints. However, these descriptors have certain disadvantages, such as applicability only to 1:1 binary mixtures and their additive nature. The field of QSAR of mixtures is still under development, and existing efforts could be considered a foundation for future approaches and studies. The usage of non-additive mixture descriptors, which are sensitive to interaction effects, in combination with best practices of QSAR model development (e.g., thorough data collection and curation, rigorous external validation, etc.) will significantly improve the quality of QSAR studies of mixtures.
This work is devoted to the application of the random forest (RF) approach to QSAR analysis of the aquatic toxicity of chemical compounds tested on Tetrahymena pyriformis. The simplex representation of molecular structure approach implemented in HiT QSAR software was used for descriptor generation at the two-dimensional level. Adequate models based on simplex descriptors and the RF statistical approach were obtained on a modeling set of 644 compounds. Model predictivity was validated on two external test sets of 339 and 110 compounds. A high impact of the lipophilicity and polarizability of the investigated compounds on toxicity was determined. It was shown that RF models were tolerant to the insertion of irrelevant descriptors as well as to the randomization of a portion of the toxicity values, which represented "noise". A fast procedure for optimizing the number of trees in the random forest has been proposed. The discussed RF model had comparable or better statistical characteristics than the corresponding PLS or kNN models.
In this paper we offer a novel approach for the structural interpretation of QSAR models. The major advantage of the developed methodology is its universality, i.e., it can be applied to any QSAR/QSPR model irrespective of the chemical descriptors and machine learning methods applied. This universality was achieved by using only the information obtained from substructures of the compounds of interest to interpret model outcomes. The reliability of the offered approach was confirmed by the results of three case studies, including endpoints of different types (continuous and binary classification) and nature (solubility, mutagenicity, and inhibition of Transglutaminase 2), various fragment and whole-molecule descriptors (Simplex and Dragon), and multiple modeling techniques (partial least squares, random forest, and support vector machines). We compared the global contributions of molecular fragments obtained using our methodology with known SAR rules derived experimentally. In all cases, high concordance between our interpretation and results published by others was observed. Although the proposed interpretation approach could easily be extended to any type of descriptors, we recommend using Simplex descriptors to achieve a larger variety of investigated molecular fragments. The developed approach is a good tool for the interpretation of "black box" models such as random forest and neural networks. Analysis of the global contributions of fragments and their variation across a dataset could be useful for the identification of key fragments and structural alerts. This information could help maximize the positive influence of the structural surroundings of a given fragment and decrease the negative effects.
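The universality claim can be illustrated schematically: treat the trained model as a black box and score a fragment by the change in prediction when its descriptors are masked. This is a minimal sketch of that idea, with a hypothetical toy model and fragment labels, not the paper's actual workflow:

```python
# Hypothetical black-box model: any callable mapping a descriptor dict to a
# predicted value stands in for PLS, RF, SVM, etc.
def black_box_model(d):
    return 0.7 * d.get("OH", 0) - 1.2 * d.get("NO2", 0) + 0.1 * d.get("CH3", 0)

def fragment_contribution(model, descriptors, fragment_keys):
    """Contribution of a fragment = prediction(molecule) - prediction(same
    molecule with the fragment's descriptors zeroed out). Only substructure
    information is used, so any model/descriptor combination works."""
    masked = {k: (0 if k in fragment_keys else v) for k, v in descriptors.items()}
    return model(descriptors) - model(masked)

molecule = {"OH": 1, "NO2": 1, "CH3": 2}
c_no2 = fragment_contribution(black_box_model, molecule, {"NO2"})  # negative
```

Averaging such contributions over a dataset yields the global fragment contributions compared against experimental SAR rules in the paper; consistently strong negative contributors would surface as structural alerts.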
Structure generators are widely used in de novo design studies, and their performance substantially influences the outcome. Approaches based on deep learning models and conventional atom-based approaches may produce invalid structures and fail to address synthetic feasibility. On the other hand, conventional reaction-based approaches yield synthetically feasible compounds, but the novelty and diversity of the generated compounds may be limited. Fragment-based approaches can provide both better novelty and diversity of generated compounds, but the synthetic complexity of the generated structures had not been explicitly addressed before. Here we developed a new framework for fragment-based structure generation that, by design, yields chemically valid structures and provides flexible control over the diversity, novelty, synthetic complexity, and chemotypes of generated compounds. The framework is implemented as an open-source Python module and can be used to create custom workflows for the exploration of chemical space.
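The core fragment-based idea can be sketched in a few lines: enumerate products by grafting substituent fragments onto cores at marked attachment points, with the fragment libraries themselves providing the control over chemotypes and novelty. The fragment strings below are illustrative SMILES-like placeholders, and this toy enumerator is our sketch, not the module's API:

```python
import itertools

# Toy fragment libraries; "*" marks the attachment point (hypothetical fragments).
cores = ["c1ccc(*)cc1", "c1ccnc(*)c1"]
substituents = ["O", "N", "Cl"]

def generate(cores, substituents):
    """Enumerate products by grafting each substituent onto each core's
    attachment point; restricting the libraries restricts the chemotypes."""
    for core, sub in itertools.product(cores, substituents):
        yield core.replace("*", sub)

products = list(generate(cores, substituents))  # 2 cores x 3 substituents = 6
```

In a real implementation the grafting is done on molecular graphs rather than strings, which is what guarantees chemical validity by construction.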
A new algorithm for the interpretation of random forest models has been developed. It allows one to calculate the contribution of each descriptor to the predicted property value. In the case of the simplex representation of a molecular structure, contributions of individual atoms can be calculated, making it possible to estimate the influence of separate molecular fragments on the investigated property. Such information can be used for the design of new compounds with a predefined property value. The proposed measure of descriptor contributions is not an alternative to Breiman's variable importance; rather, it characterizes the contribution of a particular explanatory variable to the calculated response value.
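One common way to decompose a tree prediction into per-descriptor contributions is to follow the prediction path and credit each split's change in node value to the descriptor tested at that split, so that prediction = root value (bias) + sum of contributions. The toy tree and descriptor names below are hypothetical, and this path-based scheme is a sketch of the general idea rather than the paper's exact algorithm:

```python
# Toy regression tree: each internal node stores the mean response of its
# training subsample ("value"), the splitting descriptor, and the threshold.
tree = {
    "value": 5.0, "feature": "logP", "threshold": 2.0,
    "left":  {"value": 3.0, "feature": "n_aromatic", "threshold": 1.0,
              "left":  {"value": 2.0},     # leaf
              "right": {"value": 4.0}},    # leaf
    "right": {"value": 8.0},               # leaf
}

def descriptor_contributions(node, x):
    """Decompose a single tree's prediction as bias + per-descriptor
    contributions: each split adds (child value - parent value) to the
    descriptor it tests."""
    bias, contrib = node["value"], {}
    while "feature" in node:
        child = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        contrib[node["feature"]] = contrib.get(node["feature"], 0.0) \
            + child["value"] - node["value"]
        node = child
    return bias, contrib, node["value"]  # leaf value is the prediction

bias, contrib, pred = descriptor_contributions(tree, {"logP": 1.5, "n_aromatic": 2.0})
```

For a forest, the contributions are averaged over all trees; summing the contributions of a fragment's descriptors then gives that fragment's influence on the predicted property.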