y-Randomization is a tool used in the validation of QSPR/QSAR models, whereby the performance of the original model in data description (r²) is compared to that of models built for a permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using the original response, a permuted response, or a random number pseudoresponse, together with either the original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r² values. The lower one is produced by y-randomization or by a variant likewise based on the original descriptors, while the higher one is obtained from variants that use random number pseudodescriptors. The difference is due to the intercorrelation of real descriptors in the pool. We propose to compare an original model's r² to both of these values whenever possible. The meaning of the three possible outcomes of such a double test is discussed. Often y-randomization is not available to a potential user of a model, because the values of all descriptors in the pool for all compounds have not been published. In such cases random number experiments as proposed here are still possible. The test was applied to several recently published MLR QSAR equations, and cases of failure were identified. Some progress is also reported toward the aim of obtaining the mean highest r² of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
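The two kinds of random experiment described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: greedy forward selection stands in for the unspecified descriptor selection method, the descriptor pool is synthetic, and all function names (`r2_mlr`, `forward_select_r2`, `mean_random_r2`) are hypothetical.

```python
import numpy as np

def r2_mlr(X, y):
    """r2 of an ordinary least-squares fit of y on the columns of X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select_r2(pool, y, n_final):
    """Greedy forward selection of n_final descriptors from the pool.

    Assumption: this simple selector replaces whatever selection method
    the original model used. Returns the r2 of the selected model.
    """
    chosen, best = [], 0.0
    for _ in range(n_final):
        candidates = [j for j in range(pool.shape[1]) if j not in chosen]
        scores = [r2_mlr(pool[:, chosen + [j]], y) for j in candidates]
        k = int(np.argmax(scores))
        chosen.append(candidates[k])
        best = scores[k]
    return best

def mean_random_r2(pool, y, n_final, n_trials, rng, mode):
    """Mean highest r2 over n_trials random pseudomodels.

    mode="permute_y": y-randomization (original descriptors, shuffled response).
    mode="random_x":  original response, random number pseudodescriptors.
    """
    values = []
    for _ in range(n_trials):
        if mode == "permute_y":
            values.append(forward_select_r2(pool, rng.permutation(y), n_final))
        elif mode == "random_x":
            fake_pool = rng.standard_normal(pool.shape)
            values.append(forward_select_r2(fake_pool, y, n_final))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return float(np.mean(values))

# Synthetic, deliberately intercorrelated pool: 30 compounds, 20 descriptors
# built from 3 latent factors plus noise (illustrative data only).
rng = np.random.default_rng(0)
latent = rng.standard_normal((30, 3))
pool = latent @ rng.standard_normal((3, 20)) + 0.3 * rng.standard_normal((30, 20))
y = pool[:, 0] - 0.5 * pool[:, 1] + 0.5 * rng.standard_normal(30)

print("original model r2:     ", forward_select_r2(pool, y, 2))
print("mean r2, permuted y:   ", mean_random_r2(pool, y, 2, 50, rng, "permute_y"))
print("mean r2, random pool:  ", mean_random_r2(pool, y, 2, 50, rng, "random_x"))
```

In the double test proposed above, the original model's r² would be compared against both printed means. The `random_x` variant needs only the number of compounds, the pool size, and the response, which is why it remains available when the descriptor values are unpublished.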
The experimental boiling points (bp) of saturated hydrocarbons (acyclic through polycyclic) up to the decanes are systematically compiled. The bp values are classified into groups of lower or higher reliability according to the accuracy and frequency with which they were reproduced by independent researchers. For each hydrocarbon structure the values of several simple topological indices (TIs) of widely varying origin are given, including the molecular walk counts of various lengths and their sum. The sensitivity of the TIs to structural changes within comprehensive groups of cyclic saturated hydrocarbons is evaluated, and the total walk count is found to be the most sensitive. Structure-bp correlations are obtained by multilinear regression for various comprehensive compound samples. Both the detour index and the walk counts are found to play a major role in the best models. Comparison of the bp models obtained with those from the recent literature reveals significant improvements for both cyclic and acyclic alkanes, which are attributed in part to the higher quality of the experimental data, in part to the use of novel descriptors, and in part to the use of a more diverse pool of descriptors to select from. Despite this, a descriptor combination that accurately models cycloalkane bps has not yet been found.
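As an illustration of the walk-count descriptors mentioned in this abstract, the following Python sketch computes molecular walk counts from the hydrogen-suppressed molecular graph. It assumes the common definition in which the molecular walk count of length k is the sum of all entries of the k-th power of the adjacency matrix, and in which the total walk count sums lengths 1 through n-1 for n atoms; the abstract itself does not state these definitions, and the function names are hypothetical.

```python
import numpy as np

def adjacency_from_bonds(n_atoms, bonds):
    """Adjacency matrix of a hydrogen-suppressed molecular graph (0-based atom indices)."""
    A = np.zeros((n_atoms, n_atoms), dtype=np.int64)
    for i, j in bonds:
        A[i, j] = A[j, i] = 1
    return A

def walk_counts(adj, max_len=None):
    """Molecular walk counts of lengths 1..max_len.

    Assumption: max_len defaults to n - 1. Atomic walk counts are obtained
    by repeatedly multiplying the all-ones vector by the adjacency matrix,
    which equals the row sums of successive matrix powers.
    """
    A = np.asarray(adj, dtype=np.int64)
    n = A.shape[0]
    if max_len is None:
        max_len = n - 1
    counts = []
    atomic = np.ones(n, dtype=np.int64)
    for _ in range(max_len):
        atomic = A @ atomic          # walks one step longer from each atom
        counts.append(int(atomic.sum()))
    return counts

def total_walk_count(adj):
    """Sum of the molecular walk counts of lengths 1..n-1."""
    return sum(walk_counts(adj))

# n-butane (carbon chain 0-1-2-3) versus isobutane (atom 0 bonded to 1, 2, 3)
butane = adjacency_from_bonds(4, [(0, 1), (1, 2), (2, 3)])
isobutane = adjacency_from_bonds(4, [(0, 1), (0, 2), (0, 3)])

print(walk_counts(butane), total_walk_count(butane))        # [6, 10, 16] 32
print(walk_counts(isobutane), total_walk_count(isobutane))  # [6, 12, 18] 36
```

The two isomers share the walk count of length 1 (twice the number of bonds) but differ at greater lengths, which shows how these indices respond to branching.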