2019
DOI: 10.1371/journal.pone.0224784

SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

Abstract: Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the av…

Cited by 10 publications (11 citation statements)
References 7 publications
“…With the development of the big data processing technology, some Hadoop based genome compression method are proposed [22]. But generally speaking, the research based on big data processing technologies still has a lot of work to be done. So far, to our best knowledge, there is no published research on Spark based genome compression, but only some Spark based genome analysis achievements [23].…”
Section: Related Work (mentioning)
confidence: 99%
“…The introduction of Spark solved these shortcomings and led to the introduction of a new generation of sequence analysis pipelines. SparkBWA [22] and StreamBWA [23] leverage Spark for the task of read mapping, whereas SparkGA [24, 25] implements a more comprehensive pipeline for germline variant calling according to the GATK best practices recommendations. A Spark-based adaption of an RNA-seq variant calling pipeline was provided by SparkRA [26].…”
Section: Positioning With Respect To State Of the Art (mentioning)
confidence: 99%
“…However, SparkGA creates too many files in the Hadoop distributed file system after the mapping phase, affecting overall performance. To address this drawback, SparkGA2 [22] aims at reducing the copied data in memory and reducing the memory footprint. SparKGA adapts the amount of generated files from the mapping phase in the cluster by analyzing the available memory.…”
Section: Related Work (mentioning)
confidence: 99%