The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1 .
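The ensemble-based DM described above can be illustrated with a minimal sketch. It assumes that each compound has one predicted probability per ensemble member, that the DM is the standard deviation of those predictions, and that compounds are ranked by DM so accuracy can be checked at a chosen coverage. The function names, the 0.5 decision threshold and the toy numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def ensemble_dm(predictions):
    """Distance to model: spread of the ensemble predictions for one compound."""
    return pstdev(predictions)


def accuracy_at_coverage(records, coverage):
    """Accuracy on the `coverage` fraction of compounds with the lowest DM.

    records: list of (per_model_probabilities, true_label) pairs.
    The consensus prediction is the mean probability, thresholded at 0.5
    (an assumption made for this sketch).
    """
    ranked = sorted(records, key=lambda rec: ensemble_dm(rec[0]))
    n_keep = max(1, round(len(ranked) * coverage))
    kept = ranked[:n_keep]
    correct = sum(
        1 for probs, label in kept if (mean(probs) >= 0.5) == bool(label)
    )
    return correct / n_keep


# Toy data: two compounds on which the models agree, two on which they disagree.
toy_records = [
    ([0.90, 0.95, 0.92], 1),
    ([0.10, 0.05, 0.08], 0),
    ([0.40, 0.70, 0.60], 0),
    ([0.30, 0.60, 0.50], 1),
]
```

In this toy set the low-DM half is predicted correctly while the high-DM half is not, which is the kind of separation a useful DM is expected to provide.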
This article describes the hierarchical quantitative structure-activity relationship technology (HiT QSAR), based on the simplex representation of molecular structure (SiRMS), and its application to different QSAR and QSPR (quantitative structure-property relationship) tasks. The essence of this technology is a sequential solution of the QSAR problem, using the information obtained in the previous steps, through a series of progressively enhanced models of molecular structure description [from one dimensional (1D) to four dimensional (4D)]. It is a system of continually improved solutions. In the SiRMS approach, every molecule is represented as a system of different simplexes (tetratomic fragments with fixed composition, structure, chirality and symmetry). The level of detail of the simplex descriptors increases consecutively from the 1D to the 4D representation of the molecular structure. The advantages of the approach reported here are the absence of "molecular alignment" problems, consideration of different physical-chemical properties of atoms (e.g. charge, lipophilicity, etc.), the high adequacy and good interpretability of the obtained models, and clear routes to molecular design. The efficiency of the HiT QSAR approach is demonstrated by comparing it with the most popular modern QSAR approaches on two representative examination sets. Examples of successful application of HiT QSAR to various QSAR/QSPR investigations at different levels (1D-4D) of molecular structure description are also highlighted. The reliability of the developed QSAR models as predictive virtual screening tools, and their ability to serve as the basis of directed drug design, was validated by subsequent synthetic and biological experiments, among others. HiT QSAR is implemented as a suite of computer programs, the HiT QSAR software, which also includes a powerful statistical block and a number of useful utilities.
This review is devoted to the critical analysis of the advantages and disadvantages of existing mixture descriptors and their usage in various QSAR/QSPR tasks. We describe good practices for the QSAR modeling of mixtures, data sources for mixtures, the various mixture descriptors and their applications, recommendations for proper external validation specific to mixture QSAR modeling, and future perspectives of this field. The biggest problem in QSAR of mixtures is the lack of reliable data about the mixtures' properties. Various mixture descriptors are used for the modeling of different endpoints. However, these descriptors have certain disadvantages, such as applicability only to 1:1 binary mixtures and their additive nature. The field of QSAR of mixtures is still under development, and existing efforts could be considered as a foundation for future approaches and studies. The usage of non-additive mixture descriptors, which are sensitive to interaction effects, in combination with best practices of QSAR model development (e.g., thorough data collection and curation, rigorous external validation, etc.) will significantly improve the quality of QSAR studies of mixtures.
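To make the "additive nature" limitation concrete, here is a minimal sketch of one common form of additive mixture descriptor: a mole-fraction-weighted sum of the component descriptors of a binary mixture. The function name and the example values are hypothetical, and the review may cover other additive schemes.

```python
def additive_mixture_descriptors(comp_a, comp_b, fraction_a):
    """Mole-fraction-weighted descriptors of a binary mixture (a sketch).

    comp_a, comp_b: descriptor vectors of the two pure components.
    fraction_a: mole fraction of component A (0.5 for a 1:1 mixture).
    """
    if len(comp_a) != len(comp_b):
        raise ValueError("component descriptor vectors must have equal length")
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie between 0 and 1")
    fraction_b = 1.0 - fraction_a
    return [fraction_a * a + fraction_b * b for a, b in zip(comp_a, comp_b)]


# A 1:3 mixture of two hypothetical components with two descriptors each.
mixed = additive_mixture_descriptors([2.0, 10.0], [4.0, 20.0], 0.25)
```

Because every mixture descriptor is a linear blend of the pure-component values, nothing in this representation can encode an interaction between the components, which is the motivation for the non-additive descriptors discussed above.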
This work is devoted to the application of the random forest (RF) approach to QSAR analysis of the aquatic toxicity of chemical compounds tested on Tetrahymena pyriformis. The simplex representation of molecular structure approach, implemented in the HiT QSAR software, was used for descriptor generation at the two-dimensional level. Adequate models based on simplex descriptors and the RF statistical approach were obtained on a modeling set of 644 compounds. Model predictivity was validated on two external test sets of 339 and 110 compounds. A high impact of the lipophilicity and polarizability of the investigated compounds on toxicity was determined. It was shown that RF models were tolerant of the insertion of irrelevant descriptors as well as of the randomization of part of the toxicity values, which represented "noise". A fast procedure for optimizing the number of trees in the random forest has been proposed. The discussed RF model had comparable or better statistical characteristics than the corresponding PLS or KNN models.
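The abstract does not say how the number of trees was optimized, so the following is only one plausible sketch, not the authors' procedure. It assumes per-tree predictions are already available for a set of held-out compounds, tracks the error of the running ensemble average as trees are added, and picks the smallest forest whose error is within a tolerance of the best value. All names and the tolerance are hypothetical.

```python
from math import sqrt


def rmse_vs_n_trees(tree_preds, y_true):
    """RMSE of the averaged ensemble after 1, 2, ..., T trees.

    tree_preds[t][i] is the prediction of tree t for compound i.
    """
    n = len(y_true)
    running_sum = [0.0] * n
    curve = []
    for n_trees, preds in enumerate(tree_preds, start=1):
        running_sum = [s + p for s, p in zip(running_sum, preds)]
        squared = sum((s / n_trees - y) ** 2 for s, y in zip(running_sum, y_true))
        curve.append(sqrt(squared / n))
    return curve


def smallest_sufficient_forest(curve, tolerance=0.01):
    """Smallest number of trees whose RMSE is within `tolerance` of the best."""
    best = min(curve)
    for n_trees, err in enumerate(curve, start=1):
        if err <= best + tolerance:
            return n_trees


# Toy example: three trees, two compounds.
toy_curve = rmse_vs_n_trees([[2.0, 3.0], [0.0, 1.0], [1.0, 2.0]], [1.0, 2.0])
```

The running sum means each additional tree costs one pass over the compounds, so the whole error curve is obtained from a single trained forest rather than from retraining at every forest size.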
In this paper we offer a novel approach for the structural interpretation of QSAR models. The major advantage of our methodology is its universality, i.e., it can be applied to any QSAR/QSPR model irrespective of the chemical descriptors and machine learning methods applied. This universality was achieved by using only the information obtained from substructures of the compounds of interest to interpret model outcomes. The reliability of the approach was confirmed by the results of three case studies, including end-points of different types (continuous and binary classification) and nature (solubility, mutagenicity, and inhibition of Transglutaminase 2), various fragment and whole-molecule descriptors (Simplex and Dragon), and multiple modeling techniques (partial least squares, random forest, and support vector machines). We compared the global contributions of molecular fragments obtained using our methodology with known SAR rules derived experimentally. In all cases, high concordance between our interpretation and results published by others was observed. Although the proposed interpretation approach could be easily extended to any type of descriptors, we would recommend using Simplex descriptors to achieve a larger variety of investigated molecular fragments. The developed approach is a good tool for the interpretation of "black box" models such as random forest, neural networks, etc. Analysis of fragment global contributions and their deviation across a dataset could be useful for the identification of key fragments and structural alerts. This information could help to maximize the positive influence of structural surroundings on a given fragment and to decrease the negative effects.
In this work, a hierarchic system of QSAR models from 1D to 4D is considered on the basis of the simplex representation of molecular structure (SiRMS). The essence of this system is that the QSAR problem is solved sequentially in a series of improved models of the description of molecular structure. Thus, at each subsequent stage of the hierarchic system, the QSAR problem is not solved ab ovo, but rather the information obtained from the previous step is used. In effect, we deal with a system of progressively refined solutions. In the SiRMS approach, a molecule is represented as a system of different simplex descriptors (tetratomic fragments with fixed composition, structure, chirality and symmetry). The level of simplex-descriptor detail increases consecutively from 1D to 4D representations of molecular structure. This makes it easy to identify the structural fragments that promote or interfere with the given biological activity. Molecular design of compounds with a given level of activity is possible on the basis of SiRMS. The efficiency of the method is demonstrated by an analysis of the affinity of substituted piperazines for the 5-HT1A receptor.
A new algorithm for the interpretation of Random Forest models has been developed. It allows the contribution of each descriptor to the calculated property value to be determined. In the case of the simplex representation of molecular structure, contributions of individual atoms can be calculated, and thus it becomes possible to estimate the influence of separate molecular fragments on the investigated property. Such information can be used for the design of new compounds with a predefined property value. The proposed measure of descriptor contributions is not an alternative to Breiman's variable importance; rather, it characterizes the contribution of a particular explanatory variable to the calculated response value.
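One way to obtain per-descriptor contributions from a forest is sketched below. It assumes that each tree node stores the mean response of its training compounds, and that the change in that mean along a compound's decision path is credited to the descriptor used for the split. This is an illustrative decomposition and is not confirmed to be the exact algorithm of the paper; the dictionary-based tree format, the descriptor names and all numbers are hypothetical.

```python
def tree_contributions(tree, x):
    """Walk one tree and credit each change in node mean to the split descriptor.

    A node is a dict with "mean" and, for internal nodes, "feature",
    "threshold", "left" and "right". Returns (root_mean, contributions, leaf_mean).
    """
    contributions = {}
    node = tree
    while "feature" in node:
        name = node["feature"]
        child = node["left"] if x[name] <= node["threshold"] else node["right"]
        contributions[name] = contributions.get(name, 0.0) + child["mean"] - node["mean"]
        node = child
    return tree["mean"], contributions, node["mean"]


def forest_contributions(forest, x):
    """Average the root means and descriptor contributions over all trees."""
    n = len(forest)
    bias, totals = 0.0, {}
    for tree in forest:
        root_mean, contributions, _ = tree_contributions(tree, x)
        bias += root_mean / n
        for name, value in contributions.items():
            totals[name] = totals.get(name, 0.0) + value / n
    return bias, totals


# Toy forest of two hand-built regression trees.
toy_forest = [
    {"mean": 2.0, "feature": "logP", "threshold": 1.0,
     "left": {"mean": 1.0},
     "right": {"mean": 3.0, "feature": "polarizability", "threshold": 5.0,
               "left": {"mean": 2.5}, "right": {"mean": 3.5}}},
    {"mean": 2.0, "feature": "polarizability", "threshold": 4.0,
     "left": {"mean": 1.5}, "right": {"mean": 2.5}},
]
toy_compound = {"logP": 2.0, "polarizability": 6.0}
toy_bias, toy_totals = forest_contributions(toy_forest, toy_compound)
```

The forest prediction for the compound equals the bias plus the sum of the contributions, so the decomposition is specific to one compound. This differs from a variable importance score, which summarizes a descriptor's role across the whole data set.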
The influence of the molecular structure of macrocyclic pyridinophanes, their analogues and some other compounds on anticancer activity (leukemia, central nervous system (CNS) cancer, prostate cancer, breast cancer, melanoma, non-small cell lung cancer, colon cancer, ovarian cancer, renal cancer) was investigated by means of a new 4D-QSAR approach based on the simplex representation of molecular structure (SiRMS). For all the investigated molecules, 3D structural models were first created, and the set of conformers (the fourth dimension) was used. Each conformer was represented as a system of different simplexes (tetratomic fragments of fixed structure, chirality and symmetry). The statistical characteristics of the QSAR partial least squares (PLS) models were satisfactory (correlation coefficient r=0.990-0.861; cross-validation coefficient CVR=0.914-0.633). The molecular fragments that increase and decrease anticancer activity were identified. This information may be useful for the design and directed synthesis of novel anticancer agents.