VC@Scale: Scalable and high-performance variant calling on cluster environments

Ahmad, Tanveer; Al-Ars, Zaid; Hofstee, Peter

doi:10.1093/gigascience/giab057

Cited by 4 publications

(6 citation statements)

References 23 publications

(26 reference statements)

“…The table shows that apart from some tools that reports tests only on a multi-core workstation ([16], [17], [18], [19]), Spark has been widely used to implement tools aimed at parallelizing the computation on a distributed computing environment. Most of these tools have been specifically devised for, or tested on, a cloud environment ([20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]). Being the increasing availability of IaaS (Infrastructure as a Service) cloud computing services, it is desirable that the released tools are commonly designed to be supported also by such infrastructures.…”

Section: Apache Spark In Life Sciences (mentioning)

confidence: 99%

“…CMAN Ext. tools/frameworks Genomics genome assembly SORA [20] de novo genome assembly GraphX ✓ - ✓ - - - variant calling DECA [21] copy number variantion discovery MLlib ✓ - ✓ - - ADAM ADS-HCSpark [48] SNPs and indels calling - ✓ - - - - - SparkGA2 [22] variant calling - ✓ - ✓ ✓ - - SparkRA [49] GATK best-practices pipeline - ✓ - - - - - DeepVariant on Spark [23] SNPs and indels calling - ✓ ✓ ✓ ✓ - Apache Parquet VC@Scale [24] SNPs and indels calling - ✓ ✓ ✓ - - Apache Arrow Halvade Somatic [25] somatic variant calling - ✓ - ✓ - - - …”

Section: Apache Spark In Life Sciences (mentioning)

confidence: 99%

“…It should be pointed out that some applications in the fields of genomics ([23], [24]) and biomedicine ([35], [36]) have also been designed to distribute the computation on multiple GPUs on multiple nodes. GPU technology is widely used in life science.…”

Section: Apache Spark In Life Sciences (mentioning)

confidence: 99%

Framing Apache Spark in life sciences

Manconi, Gnocchi, Milanesi et al. 2023

Heliyon

“…Data formats like Apache Parquet, Apache Arrow, Apache Avro have been explored extensively in conjunction with these frameworks to store and process genomic data efficiently. These frameworks include ADAM (Massie et al, 2013), SparkGA2 (Mushtaq et al, 2019), VC@Scale (Ahmad et al, 2021) and Halvade (Decap et al, 2015). Due to many underlying dependencies, inefficient memory usage, issues related to scalability, cluster deployment challenges as well as incompatible data formats, solutions based on these frameworks are still not widely used in the mainstream Bioinformatics community.…”

Section: Introduction (mentioning)

confidence: 99%

GenMPI: Cluster Scalable Variant Calling for Short/Long Reads Sequencing Data

Ahmad

2022

Preprint

Self Cite

Rapid technological advancements in sequencing technologies allow producing cost-effective and high-volume sequencing data. Processing this data for real-time clinical diagnosis is potentially time-consuming if done on a single computing node. This work presents a complete variant calling workflow, implemented using the Message Passing Interface (MPI) to leverage the benefits of high bandwidth interconnects. This solution (GenMPI) is portable and flexible, meaning it can be deployed to any private or public cluster/cloud infrastructure. Any alignment or variant calling application can be used with minimal adaptation. To achieve high performance, compressed input data can be streamed in parallel to alignment applications while uncompressed data can use internal file seek functionality to eliminate the bottleneck of streaming input data from a single node. Alignment output can be directly stored in multiple chromosome-specific SAM files or a single SAM file. After alignment, a distributed queue using MPI RMA (Remote Memory Access) atomic operations is created for sorting, indexing, marking of duplicates (if necessary) and variant calling applications. We ensure the accuracy of variants as compared to the original single node methods. We also show that for 300x coverage data, alignment scales almost linearly up to 64 nodes (8192 CPU cores). Overall, this work outperforms existing big data based workflows by a factor of two and is almost 20% faster than other MPI-based implementations for alignment without any extra memory overheads. Sorting, indexing, duplicate removal and variant calling are also scalable up to an 8-node cluster. For paired-end short-reads (Illumina) data, we integrated the BWA-MEM aligner and three variant callers (GATK HaplotypeCaller, DeepVariant and Octopus), while for long-reads data, we integrated the Minimap2 aligner and three different variant callers (DeepVariant, DeepVariant with WhatsHap for phasing (PacBio) and Clair3 (ONT)).

“…In the recent Genome Analysis Toolkit (GATK, McKenna et al., 2010) version, several programs (including pileup calculations) have been implemented in a distributed manner ready to be run on the Apache Spark cluster. Other research studies confirm that big data programming paradigms can be successfully applied to many genomics analyses (Guo et al., 2018, Capuccini et al., 2020, Wiewiórka et al., 2018, Wiewiórka et al., 2017) including variant calling (Ahmad et al., 2021). The analysis of the ever-increasing genomic data sets involves significant financial investments and administrative efforts to maintain secure and fault-tolerant storage solutions as well as fast and scalable processing units.…”

Section: Introduction (mentioning)

confidence: 99%

Cloud-native distributed genomic pileup operations

Wiewiórka

Szmurło

Stankiewicz

et al. 2022

Preprint

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computation nodes. Results: Here, we present a scalable, distributed, and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5x faster) and memory usage (up to 2x less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range joins and coverage calculations, our package provides end-users with a unified SQL interface for convenient, interactive analysis of population-scale genomic data. See https://biodatageeks.github.io/sequila/ for details.
