Rethinking Data-Intensive Science Using Scalable Analytics Systems

Nothaft, Frank Austin; Massie, Matt; Danford, Timothy; Zhang, Zhao; Laserson, Uri; Yeksigian, Carl; Kottalam, Jey; Ahuja, Arun; Hammerbacher, Jeff; Linderman, Michael D.; Franklin, Michael J.; Joseph, Anthony D.; Patterson, David A.

doi:10.1145/2723372.2742787

Cited by 80 publications

(45 citation statements)

References 46 publications

(69 reference statements)


“…In addition to enabling and evaluating horizontal scalability, the cost of an analysis and the choice of virtual machine flavors are becoming increasingly important for efficient execution of bioinformatics analysis, since pipelines are increasingly deployed and evaluated on commercial clouds [6,21,22]. However, even on dedicated clusters it is important to understand how to scale a pipeline up and out on the available resources to improve the utilization of the resources.…”

Section: Summary and Discussion (mentioning)

confidence: 99%

“…ADAM [6] is a genomics pipeline that is built on top of the Apache Spark big data processing engine [15], Avro (https://avro.apache.org/) data serialization system, and Parquet (https://parquet.apache.org/) columnar storage system to improve the performance and reduce the cost of variant calling. It takes as input next-generation sequencing (NGS) short reads and outputs sites in the input genome where an individual differs from the reference genome.…”

Section: ADAM Variant Calling Pipeline (mentioning)

confidence: 99%

“…An example pipeline implemented with GESALL is their implementation of the GATK variant calling reference pipeline that was used as an example in the ADAM paper [6]. GESALL is evaluated on fewer but more powerful nodes (15, each with 24 cores, 64 GB RAM, and 3 TB disk) than the ADAM pipeline.…”

Section: GESALL Variant Calling Pipeline (mentioning)

confidence: 99%

“…For example, the widely used BLAST [5] is computationally intensive but scales linearly with respect to the number of CPU cores. Finally, to efficiently support many users it is important that the analyses scale with respect to cost-performance [6].…”

Section: Introduction (mentioning)

confidence: 99%


A Review of Scalable Bioinformatics Pipelines

Fjukstad

Bongo

2017

Data Sci. Eng.


Scalability is increasingly important for bioinformatics analysis services, since these must handle larger datasets, more jobs, and more users. The pipelines used to implement analyses must therefore scale with respect to the resources on a single compute node, the number of nodes on a cluster, and also to cost-performance. Here, we survey several scalable bioinformatics pipelines and compare their design and their use of underlying frameworks and infrastructures. We also discuss current trends for bioinformatics pipeline development.



“…A variety of scientific applications have been parallelized using Hadoop or Spark [15,24,9,1,17]. These tools demonstrate that good performance can be achieved without having to trade it for ease-of-use, expressive API, or fault tolerance.…”

Section: Context and Background (mentioning)

confidence: 99%

Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms

Moise

2016

Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing


The growing interest in being able to apply Big Data techniques to scientific data generated using HPC simulations has led to the question of whether this is achievable on the same HPC platform, and if so, what performance can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large, and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as a case study a Hadoop-based application that analyzes Molecular Dynamics simulation data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that one has to deal with when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data into Hadoop, and optimizing the deployment to overcome the limitations of the HPC environment. Our work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance by efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.
