On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

Schwarz, Dániel; König, Inke R.; Ziegler, Andreas

doi:10.1093/bioinformatics/btq257

Cited by 214 publications

(154 citation statements)

References 51 publications

Mentioning: 154

“…First, the R packages randomForest (Liaw and Wiener 2002), randomForestSRC (Ishwaran and Kogalur 2015) and Rborist (Seligman 2015), the C++ application Random Jungle (Schwarz et al 2010; Kruppa et al 2014b), and the R version of the new implementation ranger were run with small simulated datasets, a varying number of features p, sample size n, number of features tried for splitting (mtry) and a varying number of trees grown in the RF. In each case, the other three parameters were kept fixed to 500 trees, 1,000 samples, 1,000 features and mtry = √p. The datasets mimic genetic data, consisting of p single nucleotide polymorphisms (SNPs) measured on n subjects.…”

Section: Runtime and Memory Usage (mentioning)

confidence: 99%
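
The benchmark design quoted above (vary one of p, n, mtry or the number of trees while the others stay at 500 trees, 1,000 samples, 1,000 features and mtry = √p) can be sketched roughly as follows. This is an illustrative sketch only: the function names, the uniform 0/1/2 genotype coding and the choice to round √p down are assumptions, not details taken from the cited papers.

```python
import math
import random

# Fixed defaults as quoted in the citation statement.
DEFAULTS = {"num_trees": 500, "n_samples": 1000, "n_features": 1000}


def default_mtry(p):
    """Square-root rule mtry = sqrt(p); rounding down is an assumption."""
    return max(1, math.isqrt(p))


def simulate_snps(n, p, seed=0):
    """Hypothetical generator: n subjects x p SNPs, genotypes coded 0/1/2.

    Genotypes are drawn uniformly, which is a simplification and not the
    simulation model used in the cited benchmark.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 2) for _ in range(p)] for _ in range(n)]


def one_at_a_time_grid(varied, values):
    """Vary one parameter and keep the remaining ones at their defaults."""
    configs = []
    for value in values:
        cfg = dict(DEFAULTS)
        if varied != "mtry":
            cfg[varied] = value
        # mtry follows the square-root rule unless it is the varied parameter.
        cfg["mtry"] = value if varied == "mtry" else default_mtry(cfg["n_features"])
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    for cfg in one_at_a_time_grid("n_features", [10, 100, 1000]):
        data = simulate_snps(5, cfg["n_features"])  # tiny n, just to show the shape
        print(cfg, len(data), len(data[0]))
```

Each configuration would then be passed to the implementation under test; the timing and memory measurement themselves are left out here because they depend on the software being compared.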

“…The R implementation randomForest by Liaw and Wiener (2002) is feature-rich and widely used. However, it has not been optimized for the use with high dimensional data (Schwarz, König, and Ziegler 2010). This also applies to other implementations, such as Willows (Zhang, Wang, and Chen 2009) which has been optimized for large sample size but not for a large number of features, also termed independent variables.…”

Section: Introduction (mentioning)

confidence: 99%

“…This package is studied in greater detail in Section 5. Finally, an RF implementation optimized for analyzing high dimensional data is Random Jungle (Schwarz et al 2010; Kruppa et al 2014b). This package is only available as C++ application with library dependencies, and it is not portable to R or another statistical programming language.…”

Section: Introduction (mentioning)

confidence: 99%

ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R

Wright, Ziegler (2017). J. Stat. Soft. Self cite.

We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory efficient implementation of random forests to analyze data on the scale of a genome-wide association study.

“…This will greatly facilitate making honest comparisons between methods and/or identifying the true context-dependent benefits of each method. Combining multiple classification or regression models typically gives improved results compared to using only a single such model (Schwarz et al 2010). Along the same line, each analytic epistasis detection tool can be envisaged to partition the (SNP-SNP) interaction space into "interesting" regions, according to some prespecified criteria or variables (which could include power to detect the interaction with the tool, biological interaction evidence, etc.).…”

Section: Results (mentioning)

confidence: 99%

“…However, we do not think that this is the main explanation for their limited use in large epistasis screening. Indeed, the heavily used Random Forests as a data mining approach (Schwarz et al 2010) also does not assess significance for sets of variables, but provides individual variable importance scores and a threshold above which to retain variables (r2vIM, recurrent relative variable importance scores; personal communication with Silke Szymczak). Notably, patterns obtained via Random Forests may be the result of random variations or of the recursive nature of the tree building algorithms, or of true interactions.…”

Section: Statistical Methods To Screen For Epistasis In Large-scale G… (mentioning)

confidence: 99%

Practical aspects of genome-wide association interaction analysis

Gusareva, Steen (2014). Hum Genet.

Large-scale epistasis studies can give new clues to system-level genetic mechanisms and a better understanding of the underlying biology of human complex disease traits. Though many novel methods have been proposed to carry out such studies, so far only a few of them have demonstrated replicable results. Here, we propose a minimal protocol for genome-wide association interaction (GWAI) analysis to identify gene-gene interactions from large-scale genomic data. The different steps of the developed protocol are discussed and motivated, and encompass interaction screening in a hypothesis-free and hypothesis-driven manner. In particular, we examine a wide range of aspects related to epistasis discovery in the context of complex traits in humans, thereby giving practical recommendations for data quality control, variant selection or prioritization strategies and analytic tools, replication and meta-analysis, biological validation of statistical findings and other related aspects. The minimal protocol provides guidelines and attention points for anyone involved in GWAI analysis and aims to enhance the biological relevance of GWAI findings. At the same time, the protocol enables a better assessment of strengths and weaknesses of published GWAI methodologies.

Analysis of Gene–Gene Interactions Underlying Human Disease

Wen (2014). Encyclopedia of Life Sciences.

Following the identification of disease‐susceptibility variants in genome‐wide association studies by using the standard single‐locus analyses, the discovery process is shifting towards gene–gene interactions of functional importance in the pathophysiology and aetiology of complex diseases. The results from these gene–gene interaction analyses could lead to new genetic findings that account for the heritability of human diseases as well as novel insights about underlying genetic aetiology through later bench science research and clinical applications. To facilitate gene–gene interaction analyses, various statistical methods have been proposed, each of which is applicable for certain study designs and has its own advantages under certain conditions. In this article, the authors provide a survey of the statistical methods and software packages that are currently available for population‐based and family‐based gene–gene interaction studies. The strength of each method is discussed and the difficulties in determining the relationship between biological and statistical interactions are laid out.

Key Concepts:

A biological interaction describes a scenario in which two or more genes jointly affect a disease.

A statistical interaction describes the nonadditive effect in generalised linear models.

The heritability of a phenotype is defined as the proportion of phenotypic variations between individuals due to their genetic differences.

Population based case‐control study recruits individuals with a disease of interest along with the unrelated healthy individuals, and compares the allele/genotype distributions between cases and controls to determine whether a statistical interaction exists.

Family based study design avoids the potential confounding effect due to population stratification and admixture by recruiting the parents and/or siblings of the cases.

On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

Abstract: The RJ software package is freely available at http://www.randomjungle.org
