2007
DOI: 10.1186/1471-2105-8-25
Bias in random forest variable importance measures: Illustrations, sources and a solution

Abstract: Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in th…





Cited by 2,735 publications (2,197 citation statements)
References 29 publications
“…Classification and regression trees examine the degree to which factors predict a dependent variable, and determine the relative importance of individual factors (Olden et al 2008;Strobl et al 2009). Specifically, conditional inference trees utilize an iterative, binary recursive data-partitioning algorithm to examine each variable, searching for the best predictor, splitting the data for the dependent variable into two distinct groups, and then repeating the variable selection until no more significant predictors are found (Hothorn et al 2006).…”
Section: Results (mentioning)
confidence: 99%
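The recursive partitioning procedure described in this citation statement can be sketched in a few lines. The following is an illustrative Python stand-in, not the `party`/`ctree` implementation the quote refers to: it uses a simple correlation p-value as the association test (conditional inference trees use permutation-based tests), splits on the most significant predictor, and stops when no predictor passes the significance threshold. All function names and the median-split rule are assumptions for illustration.

```python
# Sketch of binary recursive partitioning with significance-based stopping,
# in the spirit of conditional inference trees (Hothorn et al. 2006).
# NOT the party/ctree algorithm: the association test here is a plain
# Pearson correlation p-value, and splits are taken at the median.
import numpy as np
from scipy import stats

def best_split(X, y, alpha=0.05):
    """Return (column, threshold) for the most significant predictor, or None."""
    best, best_p = None, alpha
    for j in range(X.shape[1]):
        _, p = stats.pearsonr(X[:, j], y)   # stand-in association test
        if p < best_p:
            best_p = p
            best = (j, float(np.median(X[:, j])))  # split at the median
    return best

def grow(X, y, depth=0, max_depth=3, alpha=0.05):
    """Recursively partition until no predictor is significant (or depth cap)."""
    split = best_split(X, y, alpha) if depth < max_depth and len(set(y)) > 1 else None
    if split is None:
        return {"leaf": True, "pred": round(float(np.mean(y)))}
    j, t = split
    left = X[:, j] <= t
    return {"leaf": False, "var": j, "thr": t,
            "lo": grow(X[left], y[left], depth + 1, max_depth, alpha),
            "hi": grow(X[~left], y[~left], depth + 1, max_depth, alpha)}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)   # only predictor 1 is informative
tree = grow(X, y)
print(tree["var"])              # the root split should land on column 1
```

Because only column 1 is associated with the response, its p-value dominates and the root node splits on it; the noise columns rarely pass the threshold, so partitioning stops, which is the stopping behavior the quote describes.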
“…Discussion of the algorithm, associated metrics and uses of RF in ecology is provided by Cutler et al [31]. We used the 'randomForests' package implemented in R [34] to run models and the 'cforest' function in package 'party' to obtain unbiased variable importance estimates to corroborate variable selection [35].…”
Section: Methods (mentioning)
confidence: 99%
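The quoted workflow corroborates R's `randomForest` importances with unbiased estimates from `cforest`. A rough Python analogue of that cross-check (an assumption for illustration, not the authors' R code) contrasts impurity-based importances, which the paper shows are biased toward predictors with many distinct values, with permutation importance on held-out data; scikit-learn has no direct equivalent of `cforest`'s conditional importance.

```python
# Illustrative Python analogue of corroborating variable selection:
# compare impurity-based (Gini) importances against permutation
# importance computed on a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
informative = rng.normal(size=n)                  # truly predictive
noise_many_levels = rng.integers(0, 100, size=n)  # uninformative, many distinct values
X = np.column_stack([informative, noise_many_levels]).astype(float)
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

gini = rf.feature_importances_                 # impurity-based, subject to bias
perm = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
print(gini, perm.importances_mean)
```

On this synthetic data the permutation importance of the informative feature clearly exceeds that of the many-level noise feature, which is the kind of corroboration the quoted authors sought from `cforest`.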
“…The method injects randomness to guarantee that trees in the forest are different. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers (Strobl et al 2007). Before we classified the Bochanski2007b M dwarf template using RF, we divided each spectrum from 6000 Å to 9000 Å into 600 regions, with each region covering 5 Å.…”
Section: Spectral Types (mentioning)
confidence: 99%
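The preprocessing step in this quote, dividing each spectrum from 6000 Å to 9000 Å into 600 regions of 5 Å, can be sketched as follows. The wavelength range and bin width come from the quote; the synthetic flux values, grid spacing, and the choice of mean flux per bin as the feature are assumptions for illustration.

```python
# Sketch of binning a spectrum into 600 five-Angstrom regions, each
# summarized by its mean flux, to serve as features for a random forest.
import numpy as np

wavelengths = np.linspace(6000, 9000, 3000, endpoint=False)  # synthetic 1 Å grid
flux = np.sin(wavelengths / 50.0)                            # synthetic spectrum

edges = np.arange(6000, 9005, 5)           # 601 edges -> 600 bins of 5 Å
idx = np.digitize(wavelengths, edges) - 1  # bin index for each sample
features = np.array([flux[idx == i].mean() for i in range(600)])
print(features.shape)                      # one feature per 5 Å region
```

Each spectrum is thereby reduced to a fixed-length vector of 600 features, the form a random forest classifier expects.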