PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

Zhang, Lingqi; Liu, Cheng; Dong, Shoubin

doi:10.3390/genes10110886

Cited by 10 publications

(9 citation statements)

References 26 publications

“…Almost all BWA-MEM cluster-scaled implementations (SparkBWA [ 8 ], BWASpark [ 9 ], PipeMEM [ 10 ], ADAM [ 7 ], and SparkGA2 [ 6 ]) run multiple BWA-MEM instances on each Spark worker node as Spark tasks, which degrades the underlying efficient single-node multi-threaded scalability of this tool. Instead we use 1 BWA-MEM instance on each Spark worker node, storing output SAM files on storage and merging these SAM files to generate a single output SAM file.…”

Section: Methods (mentioning)

confidence: 99%

“…pBWA [ 30 ] and mpiBLAST [ 31 ] use MPI, and CUSHAW3 [ 32 ] uses UPC++. Similarly ADAM’s Cannoli [ 7 ], SparkBWA [ 8 ], and PipeMEM [ 10 ] are a few Apache Spark–based BWA implementations that use BWA as loosely integrated underneath these implementations while GATK BWASpark modifies the original BWA to exploit the Spark scheduling and shuffling functionality to run BWA instances in parallel on clusters.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…Spark commonly uses HDFS to read/write data but also supports other storage systems such as Network File System (NFS), HBase, and Amazon’s S3. Many variant-calling workflows and tools have been developed over the past decade since its first release, including SparkGA2 [ 6 ], ADAM [ 7 ], SparkBWA [ 8 ], BWASpark [ 9 ], PipeBWA [ 10 ], and others.…”

Section: Introduction (mentioning)

confidence: 99%

VC@Scale: Scalable and high-performance variant calling on cluster environments

2021

Background: Recently, many new deep learning–based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark only to distribute and schedule data among loosely coupled applications, or using I/O-based storage to hold the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To achieve this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant for both CPU-only and CPU + GPU clusters.

Conclusions: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations.

All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.

“…Recently, Big Data technologies such as Apache Hadoop [4] and Apache Spark [5,6] are being employed. They allow the usage of high-level programming languages, such as Java, Python, or Scala, while providing ease of use and performance [7][8][9][10][11].…”

Section: Introduction (mentioning)

confidence: 99%

“…Big Data technologies, on the other hand, have become increasingly popular, and their usage is not longer restricted to data analytics, but has been successfully used in fields like bioinformatics [7][8][9][10][11]15], chemistry [29,30], or medicine [31,32]. Technologies like Apache Hadoop [4] or Apache Spark [5] offer a scalable way to process enormous amounts of data in large clusters of "cheap" computers or virtual machines in the cloud, using simple programming models.…”

Section: Introduction (mentioning)

confidence: 99%

Big Data in metagenomics: Apache Spark vs MPI

et al. 2020

The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed-memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the Message Passing Interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based tool for detecting and quantifying species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters that is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version introduced here. Results show that, for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster while using 55.56% of the memory used by Spark.
We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
