Published: 2014
DOI: 10.1002/cem.2603

Reshaped Sequential Replacement for variable selection in QSPR: comparison with other reference methods

Abstract: The objective of the present work was to compare the Reshaped Sequential Replacement (RSR) algorithm with other well‐known variable selection techniques in the field of Quantitative Structure–Property Relationship (QSPR) modelling. The RSR algorithm is based on a simple sequential replacement procedure with the addition of several ‘reshaping’ functions that aim to (i) ensure faster convergence upon optimal subsets of variables and (ii) reject models affected by chance correlation, overfitting and other patholo…
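
For intuition, here is a minimal sketch of the sequential replacement core that RSR builds on: starting from a subset of fixed size, each selected descriptor is swapped in turn with every excluded one, and a swap is kept whenever it improves a cross-validated fitness. The OLS model, the five-fold Q² fitness and the stopping rule are illustrative assumptions; the reshaping functions described in the abstract are omitted here.

```python
# Minimal sketch of the sequential replacement step underlying RSR.
# Assumptions (not from the paper): an OLS model and a five-fold
# cross-validated R^2 (Q^2) as the fitness; reshaping functions omitted.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def q2(subset, X, y):
    # Cross-validated R^2 of an OLS model built on the selected columns.
    return cross_val_score(LinearRegression(), X[:, subset], y,
                           cv=5, scoring="r2").mean()

def sequential_replacement(X, y, subset):
    subset = list(subset)
    best = q2(subset, X, y)
    improved = True
    while improved:          # repeat until no single swap helps
        improved = False
        for pos in range(len(subset)):
            for candidate in range(X.shape[1]):
                if candidate in subset:
                    continue
                trial = subset.copy()
                trial[pos] = candidate   # swap one variable in, one out
                score = q2(trial, X, y)
                if score > best:
                    subset, best, improved = trial, score, True
    return subset, best
```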

Cited by 18 publications (14 citation statements) | References 35 publications (50 reference statements)
“… Variable selection and modelling. Genetic Algorithms (GA) [23], a benchmark variable selection method characterized by an optimal trade-off between computational time and exploration/exploitation ability [24], were used to retain the most relevant subsets of variables. A refined two-step GA procedure (see Materials and Methods) was applied to the training-set descriptors in combination with six classification techniques: (a) Classification and Regression Trees (CART) [25]; (b) k-Nearest Neighbor (k-NN) [26]; (c) N-Nearest Neighbors (N3) [27]; (d) Binned Nearest Neighbors (BNN) [27]; (e) Linear Discriminant Analysis (LDA) [28]; and (f) Partial Least Squares Discriminant Analysis (PLSDA) [29].…”
Section: Results | Citation type: mentioning | Confidence: 99%
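
The statement above describes GA-driven descriptor selection wrapped around several classifiers. Below is a generic single-step GA sketch coupled with k-NN (one of the six classifiers listed); the population size, mutation rate, sparse initialization and CV-accuracy fitness are all assumptions for illustration, not the refined two-step procedure the citing authors used.

```python
# Hypothetical sketch of GA-based descriptor selection around a k-NN
# classifier. All parameter choices here are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Fitness = mean five-fold CV accuracy of k-NN on the selected descriptors.
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(knn, X[:, mask], y, cv=5).mean()

def ga_select(X, y, pop_size=30, n_gen=50, p_mut=0.02):
    n_vars = X.shape[1]
    # Sparse random initialization: each descriptor included with prob. 0.1.
    pop = rng.random((pop_size, n_vars)) < 0.1
    for _ in range(n_gen):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        pop = pop[np.argsort(scores)[::-1]]      # sort by decreasing fitness
        children = []
        while len(children) < pop_size // 2:
            # Parents drawn from the fitter half; one-point crossover.
            i, j = rng.integers(0, pop_size // 2, size=2)
            cut = rng.integers(1, n_vars)
            child = np.concatenate([pop[i][:cut], pop[j][cut:]])
            child ^= rng.random(n_vars) < p_mut  # bit-flip mutation
            children.append(child)
        pop = np.vstack([pop[: pop_size - len(children)], children])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return np.flatnonzero(pop[scores.argmax()])  # indices of kept descriptors
```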
“…Seven regression strategies were coupled with two variable selection methods (Table S2), namely (i) genetic algorithms (Holland), which are a benchmark selection method, and (ii) reshaped sequential replacement (Cassotti et al), which has good exploration ability of the descriptor space (Grisoni et al). The final models were selected to ensure the best compromise between simplicity, easy‐to‐interpret molecular descriptors and accuracy.…”
Section: Methods | Citation type: mentioning | Confidence: 99%
“…
- Tabu list: the Canonical Measure of Correlation (CMC) index, with a threshold value of 0.3, was used to screen the “tabu” descriptors.
- Roulette wheel: the initialization of the population was based on the CMC index calculated for each descriptor, whereby descriptors with a higher CMC index had a higher probability of being selected.
- Randomization test: the real model error is compared with the random classification error (random error rate, RER, test), that is, with the error rate obtained if cases are randomly assigned to the classes; the final model is accepted if its real error rate is smaller than the corresponding RER value.
- Nested models: the final population of models was checked in terms of complexity and performance; a model is rejected if its higher complexity is not balanced by higher performance.
- Model distance and correlation: the Canonical Measure of Distance (CMD) and CMC indices were used to determine whether final models with different variables are actually different in nature. …”
Section: Methods | Citation type: mentioning | Confidence: 99%
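
Two of the reshaping steps above lend themselves to a compact sketch. The exact CMC formula is not reproduced in the excerpt, so the absolute Pearson correlation with the class label is used below as a stand-in relevance score, and the RER value is estimated by permutation rather than by the closed-form rate the authors may have used; both substitutions are assumptions.

```python
# Sketch of two reshaping steps: roulette-wheel seeding and the RER check.
# |correlation| stands in for the CMC index; permutation stands in for the
# analytic random error rate. Both are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def roulette_seed(X, y, k):
    # Descriptors with a higher relevance score get a higher chance of
    # entering the initial population (y must be numeric class labels here).
    score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return rng.choice(X.shape[1], size=k, replace=False, p=score / score.sum())

def passes_rer_test(y_true, y_pred, n_perm=1000):
    # Accept the model only if its error rate beats random class assignment,
    # estimated here by repeatedly permuting the class vector.
    real_err = np.mean(y_pred != y_true)
    rand_err = np.mean([np.mean(rng.permutation(y_true) != y_true)
                        for _ in range(n_perm)])
    return real_err < rand_err
```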