2019
DOI: 10.1101/702902
Preprint

VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data

Abstract: The demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for "wide" datasets, current implementations such as Google's PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Fore…

Cited by 2 publications (3 citation statements)
References 15 publications

“…PLANET uses horizontal partitioning, which is a parallelization along the wrong dimension because it does not allow high-dimensional data to be loaded into memory as required for random access by the RF algorithm. PLANET is faster than the randomForest R package, with comparisons to other implementations provided in Bayat et al. [33]. ReForeSt [34] is, to the best of our knowledge, the fastest distributed implementation of RF and is up to 3 times faster than MLlib (PLANET).…”
Section: Results (mentioning)
confidence: 99%
“…Future improvements will cover the use of epistatic genomic relationship matrix (EGRM) to control for the effect of diversity [29], as well as more advanced visualization approaches using either d3 or Cytoscape JavaScript library for dynamic web-based visualization. We also plan to add an end-to-end integration with cloud-based Random Forest implementation VariantSpark [15], to enable epistasis search within the ultra-high dimensional data of whole-genome sequencing cohorts.…”
Section: Discussion (mentioning)
confidence: 99%
“…Random Forest [13] is an efficient method for this filter as it preserves higher-order interactions [14]. Particularly, a new cloud-based implementation of Random Forest called VariantSpark [15] is able to process whole-genome data with 100,000,000 SNVs. It is capable of fitting tens of thousands of trees, which enables the interrogation of the search space more deeply, thereby reducing the chance of missing important interactions.…”
Section: Introduction (mentioning)
confidence: 99%