Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
DOI: 10.1145/2723372.2742787

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Abstract: "Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache …

Citations: cited by 80 publications (45 citation statements)
References: 46 publications (69 reference statements)
“…In addition to enabling and evaluating horizontal scalability, the cost of an analysis and the choice of virtual machine flavors are becoming increasingly important for efficient execution of bioinformatics analysis, since pipelines are increasingly deployed and evaluated on commercial clouds [6,21,22]. However, even on dedicated clusters it is important to understand how to scale a pipeline up and out on the available resources to improve the utilization of the resources.…”
Section: Summary and Discussion (mentioning)
confidence: 99%
“…ADAM [6] is a genomics pipeline that is built on top of the Apache Spark big data processing engine [15], Avro (https://avro.apache.org/) data serialization system, and Parquet (https://parquet.apache.org/) columnar storage system to improve the performance and reduce the cost of variant calling. It takes as input next-generation sequencing (NGS) short reads and outputs sites in the input genome where an individual differs from the reference genome.…”
Section: Adam Variant Calling Pipeline (mentioning)
confidence: 99%
“…A variety of scientific applications have been parallelized using Hadoop or Spark [15,24,9,1,17]. These tools demonstrate that good performance can be achieved without having to trade it for ease-of-use, expressive API, or fault tolerance.…”
Section: Context and Background (mentioning)
confidence: 99%