An unsupervised learning method is proposed for variable selection and its performance assessed using three typical QSAR data sets. The aims of this procedure are to generate a subset of descriptors from any given data set in which the resultant variables are relevant, redundancy is eliminated, and multicollinearity is reduced. Continuum regression, an algorithm encompassing ordinary least squares regression, regression on principal components, and partial least squares regression, was used to construct models from the selected variables. The variable selection routine is shown to produce simple, robust, and easily interpreted models for the chosen data sets.
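As a rough illustration of what an unsupervised redundancy filter can look like, the sketch below drops near-constant descriptors and then greedily keeps a descriptor only if it is not highly correlated with any already kept. This is not the procedure from the abstract, whose details are not given here; the function names, the thresholds `min_var` and `max_abs_r`, and the toy descriptor values are all assumptions made for illustration.

```python
from math import sqrt


def variance(x):
    """Sample variance of a list of numbers."""
    n = len(x)
    m = sum(x) / n
    return sum((a - m) ** 2 for a in x) / (n - 1)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_descriptors(columns, min_var=1e-8, max_abs_r=0.9):
    """Hypothetical filter: columns maps descriptor name -> list of values.

    The response variable is never used, so the selection is unsupervised.
    """
    # Step 1: discard descriptors that carry (almost) no variance.
    candidates = [name for name, col in columns.items() if variance(col) > min_var]
    # Step 2: keep a descriptor only if it is not strongly correlated
    # with any descriptor that has already been kept.
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_r for k in kept):
            kept.append(name)
    return kept


# Toy data (invented): one descriptor is nearly collinear with another,
# and one is constant.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logP_scaled": [2.1, 4.0, 6.2, 7.9, 10.0],
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
    "dipole": [0.3, 1.9, 0.4, 2.2, 0.1],
}
selected = select_descriptors(descriptors)
print(selected)
```

The greedy pass depends on column order, so a real procedure would need a rule for deciding which member of a correlated pair to retain.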
The linear interaction energy (LIE) method has been applied to the calculation of the binding free energies of 15 inhibitors of the enzyme neuraminidase. This is a particularly challenging system for this methodology since the protein conformation and the number of tightly bound water molecules in the active site are known to change for different inhibitors. It is not clear that the basic LIE method will calculate the contributions to the binding free energies arising from these effects correctly. Application of the basic LIE equation yielded an rms error with respect to experiment of 1.51 kcal mol⁻¹ for the free energies of binding. To determine whether it is appropriate to include extra terms in the LIE equation, a detailed statistical analysis was undertaken. Multiple linear regression (MLR) is often used to determine the significance of terms in a fitting equation; this method is inappropriate for the current system owing to the highly correlated nature of the descriptor variables. Use of MLR in other applications of the LIE equation is therefore not recommended without a correlation analysis being performed first. Here factor analysis was used to determine the number of useful dimensions contained within the data and, hence, the maximum number of variables to be considered when specifying a model or equation. Biased fitting methods using orthogonalized components were then used to generate the most predictive model. The final model gave a q² of 0.74 and contained van der Waals and electrostatic energy terms. This result was obtained without recourse to prior knowledge and was based solely on the information content of the data.
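To make the cross-validated q² statistic concrete, the sketch below fits a two-term LIE-style model, ΔG ≈ α·ΔV_vdW + β·ΔV_el, by ordinary least squares and computes a leave-one-out q² as 1 − PRESS/SS. It is a minimal illustration under stated assumptions: the abstract reports biased fitting on orthogonalized components, not the plain least-squares fit used here, and the energy values and coefficients below are invented, noise-free numbers rather than data from the study.

```python
def fit_lie(vdw, ele, dg):
    """Least-squares alpha, beta for dG ~ alpha*vdw + beta*ele (no intercept).

    Solves the 2x2 normal equations in closed form.
    """
    s_vv = sum(v * v for v in vdw)
    s_ee = sum(e * e for e in ele)
    s_ve = sum(v * e for v, e in zip(vdw, ele))
    s_vy = sum(v * y for v, y in zip(vdw, dg))
    s_ey = sum(e * y for e, y in zip(ele, dg))
    det = s_vv * s_ee - s_ve ** 2
    alpha = (s_vy * s_ee - s_ve * s_ey) / det
    beta = (s_vv * s_ey - s_ve * s_vy) / det
    return alpha, beta


def loo_q2(vdw, ele, dg):
    """Leave-one-out q2 = 1 - PRESS / SS, with SS taken about the overall mean."""
    n = len(dg)
    mean_dg = sum(dg) / n
    press = 0.0
    for i in range(n):
        # Refit the model with compound i held out, then predict it.
        a, b = fit_lie(vdw[:i] + vdw[i + 1:], ele[:i] + ele[i + 1:], dg[:i] + dg[i + 1:])
        press += (dg[i] - (a * vdw[i] + b * ele[i])) ** 2
    ss = sum((y - mean_dg) ** 2 for y in dg)
    return 1.0 - press / ss


# Invented interaction-energy differences (kcal/mol) for six hypothetical ligands.
vdw = [-40.0, -35.0, -50.0, -45.0, -30.0, -55.0]
ele = [-10.0, -20.0, -5.0, -15.0, -25.0, -8.0]
# Noise-free responses generated with arbitrary coefficients 0.2 and 0.4.
dg = [0.2 * v + 0.4 * e for v, e in zip(vdw, ele)]

alpha, beta = fit_lie(vdw, ele, dg)
q2 = loo_q2(vdw, ele, dg)
print(alpha, beta, q2)
```

Because the toy responses contain no noise, the fit recovers the generating coefficients and q² is essentially 1; with real, correlated descriptors the held-out predictions degrade, which is the situation the abstract warns about.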
We describe the use of Bayesian regularized artificial neural networks (BRANNs) coupled with automatic relevance determination (ARD) in the development of quantitative structure-activity relationship (QSAR) models. These BRANN-ARD networks have the potential to solve a number of problems which arise in QSAR modeling such as the following: choice of model; robustness of model; choice of validation set; size of validation effort; and optimization of network architecture. The ARD method ensures that irrelevant or highly correlated indices used in the modeling are neglected as well as showing which are the most important variables in modeling the activity data. The application of the methods to QSAR of compounds active at the benzodiazepine and muscarinic receptors as well as some toxicological data of the effect of substituted benzenes on Tetrahymena pyriformis is illustrated.
BCUT [Burden, CAS, and University of Texas] descriptors, defined as eigenvalues of modified connectivity matrices, have traditionally been applied to drug design tasks such as defining receptor relevant subspaces to assist in compound selections. In this paper we present studies of consensus neural networks trained on BCUTs to discriminate compounds with poor aqueous solubility from those with reasonable solubility. This level was set at 0.1 mg/mL on advice from drug formulation and drug discovery scientists. By applying strict criteria to the insolubility predictions, approximately 95% of compounds are classified correctly. For compounds whose predictions have a lower level of confidence, further parameters are examined in order to flag those considered to possess unsuitable biopharmaceutical and physicochemical properties. This approach is not designed to be applied in isolation but is intended to be used as a filter in the selection of screening candidates, compound purchases, and the application of synthetic priorities to combinatorial libraries.
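One way to picture "strict criteria" applied to a consensus of networks is a unanimity rule with a confidence cutoff, sketched below. The abstract does not state the actual criteria, so the function name `consensus_label`, the cutoff value of 0.9, and the example probabilities are assumptions for illustration only.

```python
def consensus_label(probs, strict=0.9):
    """Combine per-network probabilities that a compound is insoluble.

    A compound is labelled only when every network agrees with high
    confidence; anything else is flagged for further examination.
    """
    if all(p >= strict for p in probs):
        return "insoluble"
    if all(p <= 1.0 - strict for p in probs):
        return "soluble"
    return "uncertain"


# Invented outputs from three hypothetical networks for three compounds.
print(consensus_label([0.95, 0.97, 0.92]))
print(consensus_label([0.95, 0.40, 0.90]))
print(consensus_label([0.05, 0.02, 0.08]))
```

Raising the cutoff trades coverage for accuracy: fewer compounds receive a firm label, but those that do are classified with greater confidence, and the "uncertain" group is where further parameters would be examined.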