Models that estimate and predict the normal boiling point (NBP) of alkanes from a molecular distance-edge (MDE) vector, λ, have been developed using multiple linear regression (MLR). The structures of the compounds examined are described by the MDE vector, a novel structure descriptor recently developed in our laboratory. MLR yielded a ten-variable linear model with a low root-mean-square error (RMS = 4.985 K) and a high correlation coefficient (R = 0.9948). In addition, a predictive model was developed using 125 alkanes as the training set, and its performance was verified on a test set of 25 alkanes chosen randomly from the total of 150 compounds; excellent predictions were obtained, with RMS = 4.486 K and R = 0.9945 between the calculated and observed NBP values.
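The fit-and-validate procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the descriptor matrix is synthetic placeholder data (real MDE vectors are not reproduced here), the function names are hypothetical, and RMS is assumed to be the plain root mean square of the residuals, which may differ from the degrees-of-freedom convention used in the paper.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Calculated property values for the rows of X."""
    return coef[0] + X @ coef[1:]

def rms_and_r(y_obs, y_calc):
    """Root-mean-square error and Pearson correlation coefficient."""
    rms = float(np.sqrt(np.mean((y_obs - y_calc) ** 2)))
    r = float(np.corrcoef(y_obs, y_calc)[0, 1])
    return rms, r

# Synthetic stand-in for 150 compounds x 10 descriptors (NOT real alkane data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = 350.0 + 20.0 * (X @ rng.normal(size=10)) + rng.normal(scale=5.0, size=150)

# Random split into 125 training and 25 test compounds.
idx = rng.permutation(150)
train, test = idx[:125], idx[125:]

coef = fit_mlr(X[train], y[train])
rms_train, r_train = rms_and_r(y[train], predict_mlr(coef, X[train]))
rms_test, r_test = rms_and_r(y[test], predict_mlr(coef, X[test]))
print(f"train: RMS={rms_train:.3f} R={r_train:.4f}")
print(f"test:  RMS={rms_test:.3f} R={r_test:.4f}")
```

Evaluating RMS and R on the held-out 25 compounds, rather than on the training set alone, is what distinguishes the predictive model from the estimation model.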
The use of numerous descriptors indicative of molecular structure and topology is becoming more common in quantitative structure-activity relationship (QSAR) studies. Choosing adequate descriptors for a QSAR study is important but difficult because no absolute rules govern the choice. A variety of variable selection techniques, including stepwise regression, partial least squares/principal component analysis (PLS/PCA), neural networks, and evolutionary algorithms such as the genetic algorithm, have been applied to this common problem. All-subsets regression (ASR) is capable of finding the best variable subset in a large pool. In this paper, a novel variable selection and modeling method based on prediction (VSMP) is developed. Two controllable parameters, the interrelation coefficient between pairs of independent variables (r_int) and the correlation coefficient obtained by leave-one-out (LOO) cross-validation (q^2), are introduced into ASR to improve its performance. The technique differs from other ASR-related variable selection procedures in two main features: (1) The search for optimal subsets is controlled by the statistic q^2 or the root-mean-square error of prediction (RMSEP) from the LOO cross-validation step rather than by the correlation coefficient obtained in the modeling step (r^2). (2) The search over all optimal subsets is accelerated by using r_int together with q^2. A comparison of VSMP results on the Selwood data set (n = 31 compounds, m = 53 descriptors) with those from alternative algorithms shows the good performance of the technique.
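One plausible reading of the two controls can be sketched as below. The abstract does not give the exact VSMP control flow, so this is an assumption-laden illustration: subsets of a fixed size k are enumerated, any subset containing a descriptor pair whose absolute correlation reaches a hypothetical threshold `r_int_max` is skipped, and the survivors are ranked by LOO q^2. The data are synthetic, the function names are invented for the example, and q^2 is assumed to be 1 − PRESS/SS_total.

```python
from itertools import combinations
import numpy as np

def loo_q2(X, y):
    """Leave-one-out q^2 for OLS, using the exact shortcut e_i / (1 - h_ii)."""
    A = np.column_stack([np.ones(len(X)), X])
    H = A @ np.linalg.pinv(A)              # hat (projection) matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    press = np.sum(loo_resid ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def select_subset(X, y, k, r_int_max=0.75):
    """Best size-k subset by LOO q^2 among subsets passing the r_int filter."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    best_q2, best_subset = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        # Skip subsets with a highly intercorrelated descriptor pair.
        if any(corr[i, j] >= r_int_max for i, j in combinations(subset, 2)):
            continue
        q2 = loo_q2(X[:, list(subset)], y)
        if q2 > best_q2:
            best_q2, best_subset = q2, subset
    return best_subset, best_q2

# Synthetic data: 31 samples, 8 descriptors; y depends on columns 1 and 4 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(31, 8))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + rng.normal(scale=0.1, size=31)

best_subset, best_q2 = select_subset(X, y, k=2)
print(best_subset, round(best_q2, 3))
```

The intercorrelation check is cheap compared with a regression, so rejecting a subset before fitting is where the speed-up would come from under this reading; ranking by q^2 rather than r^2 penalizes subsets that fit well but predict poorly.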
A molecular electronegativity distance vector based on 13 atomic types, called MEDV-13, is a descriptor for predicting the biological activities of molecules through quantitative structure-activity relationships (QSAR). MEDV-13 uses a modified electrotopological state (E-state) index in place of the relative electronegativity (q) of the non-hydrogen atoms in the MEDV, and a topological distance in place of the relative distance (d). For an organic molecule containing several chemical elements such as C, H, O, N, S, F, Cl, Br, I, and P, MEDV-13 includes at most 91 descriptors. It is therefore essential to employ principal component regression (PCR) to derive a QSAR model relating the biological activities to MEDV-13. MEDV-13 is used to study the QSAR of the corticosteroid-binding globulin (CBG) binding affinity of steroids and of the angiotensin-converting enzyme (ACE) inhibitory activity of dipeptides; the resulting models are comparable in quality to current three-dimensional (3D) methods such as CoMFA, even though MEDV-13 is based on two-dimensional topological information.
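The need for PCR follows from having up to 91 descriptors for a few dozen compounds. A minimal sketch of PCR is given below, under stated assumptions: the count of 91 is read here as one entry per unordered pair of the 13 atomic types (13 × 14 / 2), the descriptor matrix is random placeholder data rather than real MEDV-13 values, and the function names and the choice of five components are hypothetical.

```python
import numpy as np

N_ATOM_TYPES = 13
# Assumed reading: one descriptor per unordered pair of atomic types.
N_DESCRIPTORS = N_ATOM_TYPES * (N_ATOM_TYPES + 1) // 2

def fit_pcr(X, y, n_components):
    """Principal component regression; returns (intercept, beta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal axes of the centered descriptor matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    scores = Xc @ V                        # orthogonal score columns
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = V @ gamma                       # back to descriptor space
    return y_mean - x_mean @ beta, beta

def predict_pcr(model, X):
    intercept, beta = model
    return intercept + X @ beta

# Placeholder data: 30 hypothetical compounds x 91 descriptors.
rng = np.random.default_rng(0)
X = rng.random((30, N_DESCRIPTORS))
y = rng.normal(size=30)

model = fit_pcr(X, y, n_components=5)
y_calc = predict_pcr(model, X)
print(N_DESCRIPTORS, y_calc.shape)
```

Regressing on a few orthogonal scores avoids the singular normal equations that ordinary MLR would face when descriptors outnumber compounds; in practice the number of components would be chosen by cross-validation.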