VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data

Bayat, Arash; Szul, Piotr; O’Brien, Aidan; Dunne, Robert; Luo, Oscar Junhong; Jain, Yatish; Hosking, Brendan; Bauer, Denis C.

doi:10.1101/702902

Cited by 2 publications

(3 citation statements)

References 15 publications

“…PLANET uses horizontal partitioning, which is a parallelization along the wrong dimension because it does not allow high-dimensional data to be loaded into memory as required for random access by the RF algorithm. PLANET is faster than the randomForest R package, with comparisons to other implementations provided in Bayat et al [33]. ReForeSt [34] is, to the best of our knowledge, the fastest distributed implementation of RF and is up to 3 times faster than MLlib (PLANET).…”

Section: Results (mentioning)

confidence: 99%

VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data

et al. 2020

Self Cite


Background Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes but not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. Findings We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. Conclusions Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.

“…Future improvements will cover the use of epistatic genomic relationship matrix (EGRM) to control for the effect of diversity [29], as well as more advanced visualization approaches using either d3 or Cytoscape JavaScript library for dynamic web-based visualization. We also plan to add an end-to-end integration with cloud-based Random Forest implementation VariantSpark [15], to enable epistasis search within the ultra-high dimensional data of whole-genome sequencing cohorts.…”

Section: Discussion (mentioning)

confidence: 99%

“…Random Forest [13] is an efficient method for this filter as it preserves higher-order interactions [14]. Particularly, a new cloud-based implementation of Random Forest called VariantSpark [15] is able to process whole-genome data with 100,000,000 SNVs. It is capable of fitting tens of thousands of trees, which enables the interrogation of the search space more deeply, thereby reducing the chance of missing important interactions.…”

Section: Introduction (mentioning)

confidence: 99%

Fast and Accurate Exhaustive Higher-Order Epistasis Search with BitEpi

Bayat, Hosking, Jain et al. 2019

Preprint

Self Cite


Motivation: Higher-order epistatic interactions can be the driver for complex genetic diseases. An exhaustive search is the most accurate method for identifying interactive SNPs. While there is a fast bitwise algorithm for pairwise exhaustive searching (BOOST), higher-order exhaustive searching has yet to be efficiently optimized. Results: In this paper, we introduce BitEpi, a program to detect and visualize higher-order epistatic interactions using an exhaustive search. BitEpi introduces a novel bitwise algorithm that can perform higher-order analysis more quickly and is the first bitwise algorithm to search for 4-SNP interactions. Furthermore, BitEpi increases detection accuracy by using a novel entropy-based power analysis. BitEpi visualizes significant interactions in a publication-ready interactive graph. BitEpi is 56 times faster than MDR for 4-SNP searching and is up to 1.33 and 2.09 times more accurate than BOOST and MPI3SNP respectively.

