SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

Mushtaq, Hamid; Ahmed, Nauman; Al-Ars, Zaid

doi:10.1371/journal.pone.0224784

Cited by 10 publications

(11 citation statements)

References 7 publications

“…With the development of the big data processing technology, some Hadoop based genome compression method are proposed [22]. But generally speaking, the research based on big data processing technologies still has a lot of work to be done. So far, to our best knowledge, there is no published research on Spark based genome compression, but only some Spark based genome analysis achievements [23].…”

Section: Related Work (mentioning)

confidence: 99%

SparkGC: Spark based genome compression for large collections of genomes

Yao, Liu, et al. 2022

BMC Bioinformatics

Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One of the consequences is that it becomes extremely difficult to store, backup, and migrate enormous amount of genomic datasets, not to mention they continue to expand as the cost of sequencing decreases. Herein, a much more efficient and scalable program to perform genome compression is required urgently. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is better than the best state-of-the-art methods, at least better by 30%. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.

“…The introduction of Spark solved these shortcomings and led to the introduction of a new generation of sequence analysis pipelines. SparkBWA [22] and StreamBWA [23] leverage Spark for the task of read mapping, whereas SparkGA [24, 25] implements a more comprehensive pipeline for germline variant calling according to the GATK best practices recommendations. A Spark-based adaption of an RNA-seq variant calling pipeline was provided by SparkRA [26].…”

Section: Positioning With Respect To State Of the Art (mentioning)

confidence: 99%

Halvade somatic: Somatic variant calling with Apache Spark

et al. 2022

Background: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample.

Findings: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud.

Conclusions: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.

“…However, SparkGA creates too many files in the Hadoop distributed file system after the mapping phase, affecting overall performance. To address this drawback, SparkGA2 [22] aims at reducing the copied data in memory and reducing the memory footprint. SparkGA adapts the amount of generated files from the mapping phase in the cluster by analyzing the available memory.…”

Section: Related Work (mentioning)

confidence: 99%

SparkFlow: Towards High-Performance Data Analytics for Spark-based Genome Analysis

Filgueira, Awaysheh, Carter, et al. 2022

2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

The recent advances in DNA sequencing technology triggered next-generation sequencing (NGS) research in full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisions on demand). Our result demonstrates a trade-off inevitably between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
