The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric of the similarity between the training set molecules and a test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distance to model based on the standard deviation of predicted toxicity calculated from an ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces, not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in a wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org.
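The ensemble-based distance to model described in this abstract can be illustrated with a minimal sketch. It assumes that each compound has already been predicted by every model of an ensemble; the function name `ensemble_prediction` and the toxicity values are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev

def ensemble_prediction(predictions):
    """Return (consensus value, STD-based distance to model) for one compound.

    `predictions` holds the values predicted for the same compound by each
    model of the ensemble (at least two values are required).
    """
    return mean(predictions), stdev(predictions)

# Hypothetical toxicity predictions from a five-model ensemble.
agreeing = [1.10, 1.05, 1.12, 1.08, 1.10]      # models agree: small distance
disagreeing = [0.20, 1.90, 1.10, 2.60, 0.70]   # models disagree: large distance

value_a, dm_a = ensemble_prediction(agreeing)
value_d, dm_d = ensemble_prediction(disagreeing)
print(f"agreeing:    value={value_a:.2f}  DM={dm_a:.3f}")
print(f"disagreeing: value={value_d:.2f}  DM={dm_d:.3f}")
```

A compound on which the models disagree receives a larger distance, which the abstract associates with a larger expected prediction error.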
Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set including 644 compounds was selected randomly from the original set and distributed to all groups that used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient ($Q^2_{abs}$). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient $R^2_{absI}$) and from 0.38 to 0.83 (linear regression coefficient $R^2_{absII}$), respectively. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories.
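The consensus averaging described in this abstract, with and without applicability domains, can be sketched as follows. This is an illustration under stated assumptions: the helper names `consensus` and `coverage`, the three-model ensemble, and all numbers are hypothetical, and the collaboratory's actual domain definitions are not reproduced here.

```python
from statistics import mean

def consensus(predictions, in_domain=None):
    """Average the individual model predictions for one compound.

    If `in_domain` is given, only models whose applicability domain covers
    the compound contribute; None is returned when no model covers it.
    """
    if in_domain is None:
        return mean(predictions)
    covered = [p for p, ok in zip(predictions, in_domain) if ok]
    return mean(covered) if covered else None

def coverage(values):
    """Fraction of compounds that received a consensus prediction."""
    return sum(v is not None for v in values) / len(values)

# Hypothetical predictions of three models for three compounds.
preds = [[1.0, 1.2, 1.4], [0.5, 0.9, 0.7], [2.0, 2.4, 2.2]]
domains = [[True, True, False], [False, False, False], [True, True, True]]

without_ad = [consensus(p) for p in preds]
with_ad = [consensus(p, d) for p, d in zip(preds, domains)]
print(without_ad, coverage(without_ad))
print(with_ad, coverage(with_ad))
```

In this toy data the second compound falls outside every model's domain, so applying the domains lowers coverage from 1.0 to 2/3, mirroring the trade-off between accuracy and chemical space coverage noted in the abstract.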
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1.
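One way to read the claim that a DM identifies a subset of compounds predicted with high accuracy is to rank compounds by DM and measure classification accuracy on the most confident fraction. The sketch below assumes this reading; the function `accuracy_at_coverage` and the ten records are hypothetical and do not come from the Ames data set.

```python
def accuracy_at_coverage(records, fraction):
    """Accuracy over the `fraction` of compounds with the smallest DM.

    `records` is a list of (dm, correct) pairs, where `correct` is True when
    the predicted class matched the experimental one.
    """
    ranked = sorted(records, key=lambda r: r[0])
    n = max(1, round(len(ranked) * fraction))
    return sum(correct for _, correct in ranked[:n]) / n

# Hypothetical (DM, prediction-was-correct) pairs for ten compounds.
records = [
    (0.05, True), (0.10, True), (0.12, True), (0.20, True), (0.25, False),
    (0.30, True), (0.40, False), (0.45, True), (0.60, False), (0.70, False),
]

for fraction in (0.4, 0.7, 1.0):
    print(fraction, accuracy_at_coverage(records, fraction))
```

With these made-up values, accuracy falls as more compounds with large DM are included, which is the behaviour a useful DM is expected to show.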
Two new areas of application are identified in this review: environmental communication and expert elicitation. Conjoint analysis can thus be developed into a useful instrument for environmental risk analysis and communication, both of which are necessary for an efficient approach to risk governance.
Heterocystous filamentous cyanobacteria are regarded as the main N2-fixing organisms (diazotrophs) in the Baltic Sea. However, some studies indicate that picoplankton may also be important. The aim of this study was to examine the composition of putative diazotrophs in the picoplankton (< 3 µm) and to identify links to environmental factors. Nitrogenase (nifH) genes were amplified from community DNA by nested PCR, followed by cloning and sequencing. Clone libraries from nine environmental samples collected from the central Baltic Sea (April-October 2003, 3 m depth) and a negative control yielded a total of 433 sequences with an average clone library coverage of 92%. The sequences fell within nifH Clusters I, II and III and formed 15 distinct groups (> 96% amino acid similarity). Most of the sequences (77%) fell into nifH Cluster I (cyanobacteria and α-, β- and γ-Proteobacteria). However, only 26 sequences were related to cyanobacteria (e.g. Pseudanabaena) and among these no unicellular phylotypes were found. Sequences clustering with alternative nitrogenases (anfH) and Archaea were found in one sample while sequences related to anaerobic phylotypes were found in six samples distributed throughout the season. The identified phylogenetic groups showed covariance with several environmental factors but no strong links could be established. This suggests a variable and complex regulation of diazotrophic groups within Baltic Sea picoplankton.
The interest in modeling and application of structure-activity relationships has steadily increased in recent decades. It is generally acknowledged that these empirical relationships are valid only within the same domain for which they were developed. However, model validation is sometimes neglected, and the application domain is not always well-defined. The purpose of this paper is to outline how validation and domain definition can facilitate the modeling and prediction of baseline toxicity for a large database. A large number of theoretical descriptors (867) were generated from two-dimensional molecular structures for compounds present in the U.S. EPA's Fathead Minnow Database (611) and the Syracuse Research Corporation's PhysProp Database (25,000+). A quantitative structure-activity relationship model was developed for baseline toxicity (narcosis) toward the fathead minnow (Pimephales promelas) using a projection-based regression technique, PLSR (partial least squares regression). The PLSR model was subsequently validated with an external test set. The main factors of variation were related to size/shape and polar interactions. The prediction error was comparable to, or slightly better than, the ECOSAR procedures. A set of 16,805 compounds, drawn from the PhysProp Database, was projected onto the PLSR model. More than 90% (15,597) of the compounds fall within the valid model domain, defined by the residual standard deviation and the leverage. The predicted baseline toxicity indicates an acute hazard for two-thirds of these compounds, classes I-III in the OECD Globally Harmonized Classification System (LC50 ≤ 100 mg L⁻¹). Finally, the mode of action assigned in the U.S. EPA Fathead Minnow Database was investigated. Reclassification to narcosis as the mode of action is suggested for 92 compounds, mostly from the groups "unsure" and "mixed". The present classification into specific modes of action seems to be further strengthened by the findings in this investigation.
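The leverage part of such a domain check can be illustrated in its simplest form. The sketch below is not the paper's PLSR procedure: it assumes a single descriptor with an intercept, for which the leverage of a query value x is 1/n + (x - mean)^2 / sum((x_i - mean)^2), and it assumes a warning threshold of 3p/n with p = 2 fitted parameters, a common convention that the abstract does not state. The function name and data are hypothetical.

```python
def leverages(train_x, query_x):
    """Leverage of each query value for a one-descriptor model with intercept."""
    n = len(train_x)
    centre = sum(train_x) / n
    spread = sum((x - centre) ** 2 for x in train_x)
    return [1.0 / n + (x - centre) ** 2 / spread for x in query_x]

# Hypothetical descriptor values of the training compounds.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
threshold = 3 * 2 / len(train)  # assumed 3p/n rule with p = 2 parameters

inside, outside = leverages(train, [3.5, 9.0])
print(f"x=3.5: h={inside:.3f}  in domain: {inside <= threshold}")
print(f"x=9.0: h={outside:.3f}  in domain: {outside <= threshold}")
```

A query compound far from the centre of the training descriptor values gets a high leverage and would be flagged as outside the domain; the paper additionally uses the residual standard deviation, which is not shown here.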