2015
DOI: 10.1002/minf.201400122

Greedy and Linear Ensembles of Machine Learning Methods Outperform Single Approaches for QSPR Regression Problems

Abstract: The application of Machine Learning to cheminformatics is a large and active field of research, but there exist few papers which discuss whether ensembles of different Machine Learning methods can improve upon the performance of their component methodologies. Here we investigated a variety of methods, including kernel-based, tree, linear, neural networks, and both greedy and linear ensemble methods. These were all tested against a standardised methodology for regression with data relevant to the pharmaceutical…

Cited by 10 publications (14 citation statements)
References 61 publications
“…These methods have been previously discussed in a number of publications 10,19,27,106 and are only briefly outlined below. The workflow is outlined in Scheme 1.…”
Section: QSPR Models (mentioning)
confidence: 99%
“…The fitted equation approach has shown good accuracy, but is limited by requirements for additional experimental input, although promising attempts have been made to predict some of these quantities. 20,27-30 There are examples in the literature which utilize predicted melting points and logP (octanol-water partition coefficient) values for solubility predictions via the general solubility equation. 20,30,31 The first principles calculation methods have generally been less accurate and more time consuming, but can provide more fundamental understanding of the process via physically meaningful decomposition of the predicted solution free energy.…”
Section: Introduction (mentioning)
confidence: 99%
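The general solubility equation mentioned in the statement above can be illustrated with a short sketch. The form used here is the commonly quoted one, log S = 0.5 - 0.01 (MP - 25) - logP, with the melting point in degrees Celsius and S in mol/L; the citing paper does not spell the equation out, so the function name, the argument names, and the treatment of liquids (no melting-point penalty below 25 °C) are assumptions made for illustration.

```python
def general_solubility_equation(melting_point_c: float, log_p: float) -> float:
    """Estimate log10 of aqueous solubility (mol/L).

    Assumed form: log S = 0.5 - 0.01 * (MP - 25) - logP,
    where MP is the melting point in degrees Celsius and logP is the
    octanol-water partition coefficient.
    """
    # Assumption: compounds that are liquid at 25 C get no melting-point penalty.
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


if __name__ == "__main__":
    # Hypothetical inputs; either value may itself be a model prediction.
    print(general_solubility_equation(melting_point_c=125.0, log_p=2.0))
```

Because both inputs can be predicted rather than measured, any error in the predicted melting point or logP carries straight through to the solubility estimate.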
“…Cao et al (2010) found a better RMSE, 0.731 over 45 compounds, with SVM than with two other machine learning methods. Kew et al (2015) observed SVM to get an RMSE of 1.01 in a 10-fold cross-validation over 262 compounds taken from Hughes et al (2008), and thus to be essentially joint best with RF of 15 methods for solubility prediction. Boobier et al (2017), however, found SVM to be only the eighth best out of 10 methods for a 75-25 training-test split of the DLS-100 dataset (Mitchell et al, 2017) with an RMSE of 1.280 for 25 test compounds.…”
Section: Support Vector Machine (mentioning)
confidence: 95%
“…RF methods have been applied by Hughes et al (2008), Kovdienko et al (2010), McDonagh et al (2014), and also by Boobier et al (2017) who found that it was the joint second best amongst 10 machine learning predictors tested and of similar quality to the second best of a panel of 22 human predictors. Kew et al (2015) observed RF to generate an RMSE of 1.02 in a 10-fold cross-validation using 262 molecules from Hughes et al (2008), and thus to be essentially joint best alongside Support Vector Machine of 15 methods for solubility prediction.…”
Section: Random Forest (mentioning)
confidence: 98%
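The RMSE figures quoted in the two statements above come from 10-fold cross-validation. The sketch below shows one way such a number can be computed. It is not the protocol of any of the cited papers: the helper names, the pooled (rather than per-fold averaged) RMSE, and the placeholder `fit_mean` model are all assumptions for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(X, y, fit, k=10, seed=0):
    """Predict each sample from a model trained on the other folds, then pool.

    `fit(X_train, y_train)` must return a function mapping one sample to a
    prediction. Pooling all held-out predictions into one RMSE is an assumption;
    averaging per-fold RMSE values is an equally common choice.
    """
    y_pred = [None] * len(y)
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_pred[i] = predict(X[i])
    return rmse(y, y_pred)


def fit_mean(X_train, y_train):
    """Hypothetical placeholder model: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean
```

In practice `fit_mean` would be replaced by a wrapper around a real learner such as an SVM or Random Forest regressor; the cross-validation loop itself stays the same.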
“…MultiDK uses all training molecules as support vector molecules for kernel processing similar to support vector machines. We use the Tanimoto kernel which has been used in a wide range of machine learning applications, such as exploiting binary feature information to recognize white images on a black background 76 as well as a kernel for support vector and Gaussian process regression in molecular property prediction 8,55. In the MultiDK approach, ensemble learning is employed based on multiple combinational descriptors according to the principle of the 'wisdom of the crowds' 77. The set of descriptors in MultiDK includes the Morgan circular fingerprints 53, MACCS Keys 46 fingerprints and three non-binary molecular properties.…”
mentioning
confidence: 99%
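The Tanimoto kernel named in the statement above can be sketched for binary fingerprints as the ratio of shared "on" bits to the total number of distinct "on" bits. This is a generic illustration, not the MultiDK implementation: fingerprints are represented as Python sets of on-bit indices, and the value returned for two empty fingerprints is an assumed convention.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity |a & b| / |a | b| for two sets of on-bit indices."""
    if not a and not b:
        # Assumed convention: two all-zero fingerprints count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def tanimoto_kernel_matrix(fingerprints):
    """Pairwise kernel matrix, using every molecule as a reference point."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


if __name__ == "__main__":
    # Hypothetical fingerprints given as on-bit indices.
    fps = [{1, 2, 3}, {2, 3, 4}, {7}]
    for row in tanimoto_kernel_matrix(fps):
        print(row)
```

In a real workflow the on-bit sets would come from a cheminformatics toolkit (for example Morgan or MACCS fingerprints), and the resulting matrix could be passed to a kernel method that accepts precomputed kernels.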