2010
DOI: 10.1093/bioinformatics/btq257

On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

Abstract: The RJ software package is freely available at http://www.randomjungle.org

Cited by 214 publications (154 citation statements)
References: 51 publications
“…First, the R packages randomForest (Liaw and Wiener 2002), randomForestSRC (Ishwaran and Kogalur 2015) and Rborist (Seligman 2015), the C++ application Random Jungle (Schwarz et al 2010; Kruppa et al 2014b), and the R version of the new implementation ranger were run with small simulated datasets, a varying number of features p, sample size n, number of features tried for splitting (mtry) and a varying number of trees grown in the RF. In each case, the other three parameters were kept fixed to 500 trees, 1,000 samples, 1,000 features and mtry = √p. The datasets mimic genetic data, consisting of p single nucleotide polymorphisms (SNPs) measured on n subjects.…”
Section: Runtime and Memory Usage
Citation type: mentioning
confidence: 99%
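
The benchmark design quoted above (vary one parameter at a time, hold the other three at fixed defaults) can be sketched as a small parameter grid. This is only an illustration, not the citing authors' code: the function name `benchmark_grid`, the dictionary keys, and the sweep values are hypothetical, and taking the floor of √p for mtry is an assumption, since the quote gives only mtry = √p.

```python
import math

# Fixed defaults as stated in the quoted citation statement.
DEFAULTS = {"num_trees": 500, "n_samples": 1000, "n_features": 1000}


def benchmark_grid(sweeps):
    """Build one configuration per swept value.

    `sweeps` maps a parameter name to the values to try for it. All
    parameters other than the swept one stay at their defaults, and
    mtry is floor(sqrt(p)) unless mtry itself is being swept
    (the floor is an assumption made for this sketch).
    """
    configs = []
    for name, values in sweeps.items():
        for value in values:
            cfg = dict(DEFAULTS)
            if name != "mtry":
                cfg[name] = value
            # Default mtry follows the (possibly swept) number of features.
            default_mtry = math.isqrt(cfg["n_features"])
            cfg["mtry"] = value if name == "mtry" else default_mtry
            cfg["varied"] = name
            configs.append(cfg)
    return configs


# Hypothetical sweep values, chosen only for illustration.
example = benchmark_grid({"n_features": [100, 10000], "mtry": [10]})
for cfg in example:
    print(cfg)
```

Each resulting dictionary would then be passed to whichever Random Forests implementation is being timed; that step is left out here because it depends on the package under test.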
“…The R implementation randomForest by Liaw and Wiener (2002) is feature-rich and widely used. However, it has not been optimized for the use with high dimensional data (Schwarz, König, and Ziegler 2010). This also applies to other implementations, such as Willows (Zhang, Wang, and Chen 2009) which has been optimized for large sample size but not for a large number of features, also termed independent variables.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…This will greatly facilitate making honest comparisons between methods and/or identifying the true context-dependent benefits of each method. Combining multiple classification or regression models typically gives improved results compared to using only a single such model (Schwarz et al 2010). Along the same line, each analytic epistasis detection tool can be envisaged to partition the (SNP-SNP) interaction space into "interesting" regions, according to some prespecified criteria or variables (which could include power to detect the interaction with the tool, biological interaction evidence, etc.).…”
Section: Results
Citation type: mentioning
confidence: 99%
“…However, we do not think that this is the main explanation for their limited use in large epistasis screening. Indeed, the heavily used Random Forests as a data mining approach (Schwarz et al 2010) also does not assess significance for sets of variables, but provides individual variable importance scores and threshold above which to retain variables (r2vIM-recurrent relative variable importance scores; personal communication with Silke Szymczak). Notably, patterns obtained via Random Forests may be the result of random variations or of the recursive nature of the tree building algorithms, or of true interactions.…”
Section: Statistical Methods To Screen For Epistasis In Large-scale G
Citation type: mentioning
confidence: 99%