2012
DOI: 10.1002/minf.201100142

Benchmarking Variable Selection in QSAR

Abstract: Some equations in the paper contain errors and some are unclear. This erratum is provided for clarification.

I. The equation in Section 2.3.2 on page 175 of the paper should read

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta_0 - \beta x_i)^2 + \lambda \|\beta\|_1 \qquad (1)$$

II. The first equation in the left column on page 177 should read

$$\hat{q}_k = \arg\max_{q_k} \frac{1}{5} \sum_{h=1}^{5} \mathrm{AUC}\big(y_{h,-k},\; \hat{y}_{h,-k}(q_k, X_{h,-k}, D_{-h,-k})\big) \qquad (2)$$

where $\hat{y}_{h,-k}(q_k, X_{h,-k}, D_{-h,-k})$ are the predictions for the left-out $y$ of the $h$th partition with the $k$th subset previou…
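The lasso objective in Equation (1) of the erratum can be illustrated with a minimal scikit-learn sketch. The data below are a hypothetical toy set, not the paper's benchmark data, and the penalty value is an arbitrary choice for illustration:

```python
# Sketch of the lasso objective from Equation (1): the coefficient
# vector minimizes squared error plus an L1 penalty, which shrinks
# some coefficients exactly to zero (i.e. performs variable selection).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two descriptors carry signal; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)       # alpha plays the role of lambda
selected = np.flatnonzero(model.coef_)   # indices of surviving variables
print(selected)
```

Note that scikit-learn scales the squared-error term by 1/(2n), so its `alpha` corresponds to the paper's λ only up to that factor.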

Cited by 25 publications
(30 citation statements)
References 30 publications
“…In Eklund et al.,2 we found MARS and lasso to be the feature selection methods that performed best among the methods included in the benchmarking experiments. Therefore, we use these feature selection methods here (or rather, we use a generalization of lasso: the elastic nets).…”
Section: Methods (mentioning)
confidence: 93%
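The "generalization of lasso" named in the excerpt, the elastic net, blends the L1 penalty with a ridge (L2) penalty. A minimal sketch on hypothetical toy data, with assumed parameter values:

```python
# Elastic net = lasso's L1 penalty blended with a ridge (L2) penalty.
# l1_ratio=1.0 recovers the lasso; l1_ratio=0.0 recovers ridge regression.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=80)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.flatnonzero(enet.coef_))  # indices of retained descriptors
```

The L2 component stabilizes selection when descriptors are correlated, which is the usual motivation for preferring elastic nets over plain lasso in QSAR descriptor pools.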
“…27 Our study showed that every combination model can be improved with a tremendous reduction of the number of descriptors. For example, the transparency of optimal RF models for the eight data sets ranges from 0.04 (MRP2) to 0.53 (BCPR), which means that as much as 96% of variables could be removed. (Table footnote: for one specific data set, bold italic type marks the variable number and associated transparency with the best performance among the four modeling methods.)…”
Section: Results (mentioning)
confidence: 96%
“…27 Transparency represents the ability of a variable selection algorithm to extract the key variables from a pool containing noisy information. Usually, transparency was calculated for the variable set that maximizes the predictive performance of a model.…”
Section: Materials and Methods (mentioning)
confidence: 99%
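Reading transparency as the excerpt uses it, i.e. the fraction of the original descriptor pool retained by the selection step (an assumed definition on our part), the metric is a one-liner:

```python
# Transparency as the excerpt uses it: selected variables / total pool.
# A value of 0.04 means 96% of the descriptors were discarded.
def transparency(n_selected: int, n_total: int) -> float:
    if n_total <= 0:
        raise ValueError("descriptor pool must be non-empty")
    return n_selected / n_total

print(transparency(8, 200))  # 0.04, matching the MRP2 figure in the excerpt
```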
“…Features are then removed one by one until a certain criterion is satisfied. [34]

1n-m Algorithm. The 1n-m algorithm combines FS with BE.…”
Section: Feature Selection Methods (mentioning)
confidence: 99%
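The backward-elimination (BE) loop the excerpt describes can be sketched generically. The scoring model and the stopping criterion (a fixed target subset size here) are illustrative assumptions; the cited work's own criterion is not reproduced:

```python
# Backward elimination: start from the full feature set and repeatedly
# drop the feature whose removal hurts a cross-validated score least,
# until a stopping criterion (here: a target subset size) is met.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, target_size):
    features = list(range(X.shape[1]))
    while len(features) > target_size:
        scores = []
        for f in features:
            trial = [g for g in features if g != f]
            s = cross_val_score(LinearRegression(), X[:, trial], y, cv=5).mean()
            scores.append((s, f))
        best_score, worst_feature = max(scores)  # dropping this one costs least
        features.remove(worst_feature)
    return features

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = 4.0 * X[:, 3] + rng.normal(scale=0.1, size=60)
print(backward_eliminate(X, y, target_size=1))
```

Each pass refits once per remaining feature, so the loop is quadratic in the pool size; for the large descriptor pools typical of QSAR this is the main practical cost of BE.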
“…Forward selection (FS) begins with no independent features in the model; the independent features are subsequently added one by one according to the cross-validated predictive squared correlation coefficient on the training set ($Q^2_{cv}$) until the criteria are satisfied. In FS, as in the following two methods, the best feature subset is established when the value of $Q^2_{cv}$ has reached its maximum.…”
Section: Methods (mentioning)
confidence: 99%
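The forward-selection loop in the last excerpt can be sketched as follows. Approximating $Q^2_{cv}$ by scikit-learn's cross-validated R² is our assumption, and the data are a toy example:

```python
# Forward selection: start with no independent features and greedily add
# the feature that most improves cross-validated Q2 (estimated here as
# CV R2), stopping once no addition improves the score.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y):
    remaining = list(range(X.shape[1]))
    chosen, best_q2 = [], -np.inf
    while remaining:
        scored = []
        for f in remaining:
            q2 = cross_val_score(LinearRegression(), X[:, chosen + [f]], y, cv=5).mean()
            scored.append((q2, f))
        q2, f = max(scored)
        if q2 <= best_q2:   # Q2 has reached its maximum -> stop
            break
        chosen.append(f)
        remaining.remove(f)
        best_q2 = q2
    return chosen

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=60)
print(sorted(forward_select(X, y)))
```

Unlike backward elimination, FS never revisits a choice, so a feature useful only in combination with others can be missed; this greediness is what hybrid schemes such as the 1n-m algorithm above try to mitigate.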