Prior to using a quantitative structure-activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external data set, the best way to validate the predictive ability of a model is to perform its statistical external validation. In statistical external validation, the overall data set is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide data sets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models compared to random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall data set was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set), once using rational division methods and once using random division. The Kennard-Stone, minimal test set dissimilarity, and sphere exclusion algorithms were used as the rational division methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated, and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models is comparable.
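The two-level split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`kennard_stone`, `nested_split`), the Euclidean distance on raw descriptor vectors, and the fixed random seed are all assumptions made for the example. The Kennard-Stone step follows the commonly described form of the algorithm: start from the two most distant points, then repeatedly add the point whose distance to its nearest already-selected point is largest.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Seed the selection with the two most distant points.
    best = (-1.0, 0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(points[i], points[j])
            if d > best[0]:
                best = (d, i, j)
    selected = [best[1], best[2]]
    # For every unselected point, track the distance to its nearest selected point.
    min_dist = {
        k: min(euclidean(points[k], points[s]) for s in selected)
        for k in range(n)
        if k not in selected
    }
    while len(selected) < n_select:
        # Add the point that is farthest from the current selection.
        nxt = max(min_dist, key=min_dist.get)
        selected.append(nxt)
        del min_dist[nxt]
        for k in min_dist:
            d = euclidean(points[k], points[nxt])
            if d < min_dist[k]:
                min_dist[k] = d
    return selected


def nested_split(points, seed=0):
    """Random 80/20 modeling/external split, then Kennard-Stone 80/20 train/test."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    idx = list(range(len(points)))
    rng.shuffle(idx)
    n_model = round(0.8 * len(idx))
    modeling, external = idx[:n_model], idx[n_model:]
    n_train = round(0.8 * len(modeling))
    local = kennard_stone([points[i] for i in modeling], n_train)
    train = [modeling[i] for i in local]
    train_set = set(train)
    test = [i for i in modeling if i not in train_set]
    return train, test, external
```

Calling `nested_split(descriptors)` on a list of descriptor vectors returns three disjoint index lists. Replacing the `kennard_stone` call with a second random shuffle would give the random-division counterpart, with the external set held fixed for both.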
Quantitative structure-activity relationships (QSARs) are used to predict many different endpoints, utilize hundreds or even thousands of different parameters (or descriptors), and are created using a variety of approaches. The one thing they all have in common is the assumption that the chemical structures used are correct. This research investigates this assumption by examining six public and private databases that contain structural information for chemicals. Molecular fingerprinting techniques are used to determine the error rates for structures in each of the databases. It was observed that the databases had error rates ranging from 0.1 to 3.4%. A case study predicting the n-octanol/water partition coefficient was also conducted to highlight the effects of these errors on QSAR predictions. In this case study, QSARs were developed using both (i) all correct structures and (ii) structures from a database with an error rate of 3.4%. This case study showed how slight errors in chemical structures, such as misplacing a Cl atom or swapping hydroxy and methoxy functional groups on a multiple ring structure, can result in significant differences in the accuracy of the prediction for those chemicals.
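The abstract does not say how the fingerprints were compared, so the following is only one plausible way to turn fingerprints into an error rate. It assumes each structure has already been reduced to a set of "on" bit indices by some fingerprinting tool, that a trusted reference set keyed by the same chemical identifiers is available, and that any Tanimoto similarity below 1.0 counts as a mismatch. The names `tanimoto` and `structure_error_rate` and the threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def structure_error_rate(database, reference, threshold=1.0):
    """Fraction of shared identifiers whose fingerprints disagree.

    database, reference: dicts mapping a chemical identifier to a set of
    on-bit indices (assumed to be produced by the same fingerprint method).
    threshold: similarity below this value is flagged (an assumption).
    """
    shared = [cid for cid in database if cid in reference]
    if not shared:
        raise ValueError("no identifiers in common")
    mismatched = [
        cid for cid in shared
        if tanimoto(database[cid], reference[cid]) < threshold
    ]
    return len(mismatched) / len(shared), mismatched
```

Flagged identifiers would still need manual review, since a fingerprint mismatch can also arise from differences in tautomer or salt representation rather than a genuine structural error.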
A quantitative structure-activity relationship (QSAR) methodology based on hierarchical clustering was developed to predict toxicological endpoints. This methodology utilizes Ward's method to divide a training set into a series of structurally similar clusters. The structural similarity is defined in terms of 2-D physicochemical descriptors (such as connectivity and E-state indices). A genetic algorithm-based technique is used to generate statistically valid QSAR models for each cluster (using the pool of descriptors described above). The toxicity for a given query compound is estimated using the weighted average of the predictions from the closest cluster at each step in the hierarchical clustering, provided that the compound is within the domain of applicability of that cluster. The hierarchical clustering methodology was tested using a Tetrahymena pyriformis acute toxicity data set containing 644 chemicals in the training set and two prediction sets containing 339 and 110 chemicals. The results from the hierarchical clustering methodology were compared to the results from several different QSAR methodologies.
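The clustering step can be illustrated with a naive Ward agglomeration. This sketch is not the authors' code: the function name `ward_merge_history` is hypothetical, it works on plain descriptor vectors, and it uses the standard Ward merge cost, the increase in within-cluster sum of squares, n_a n_b / (n_a + n_b) times the squared distance between centroids. It records the cluster formed at each step, which is the series of clusters a per-cluster model would then be fitted to.

```python
def ward_merge_history(points):
    """Agglomerate points with Ward's criterion; return (members, cost) per merge."""
    # Each cluster id maps to (member indices, centroid).
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    history = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                members_a, centroid_a = clusters[a]
                members_b, centroid_b = clusters[b]
                sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
                n_a, n_b = len(members_a), len(members_b)
                # Ward cost: increase in within-cluster sum of squares.
                cost = n_a * n_b / (n_a + n_b) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        members_a, centroid_a = clusters.pop(a)
        members_b, centroid_b = clusters.pop(b)
        members = members_a + members_b
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [
            (len(members_a) * x + len(members_b) * y) / len(members)
            for x, y in zip(centroid_a, centroid_b)
        ]
        clusters[next_id] = (members, centroid)
        history.append((sorted(members), cost))
        next_id += 1
    return history
```

The pairwise search makes this cubic in the number of chemicals, which is acceptable for illustration but not for large training sets. The weighting used to average the per-cluster predictions is not given in the abstract, so it is left out here.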
Two important disadvantages of long‐term animal bioassays are that testing involves substantial amounts of time and money, and that high doses are usually used in the testing process. These disadvantages can be circumvented using (quantitative) structure‐activity relationships ((Q)SARs). In the field of computational toxicology, (Q)SARs are predictive models that provide a quantitative measure of the relationship between the chemical structure and a measure of a given health‐related end point. Such relationships can be expressed in terms of continuous dose‐response data (e.g. carcinogenic potency) based on some type of regression analysis for quantitative end points, or a dichotomous classification (e.g. yes/no‐type answers for carcinogenicity) based on discriminant analysis or other pattern recognition techniques for qualitative end points. There are a limited number of (Q)SAR models to predict the carcinogenicity of various chemicals, a majority of which relate the carcinogenic potency to measures of carcinogenicity such as mutagenicity, lethal dose (LD50) or the maximum tolerated dose (MTD). Other (Q)SAR models relate the carcinogenicity of a chemical to its structure, either in terms of its chemical fragments (groups of one or more atoms that make up the chemical structure) or in terms of its physical and chemical properties. In addition, a variety of commercial and noncommercial software packages that contain modules to predict the carcinogenicity of chemicals are also available.