2017
DOI: 10.1007/978-3-319-60131-1_34

Evaluating Genomic Big Data Operations on SciDB and Spark

Abstract: We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages data management engines for low-level data access. Such a design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). Trade-offs are not obvious; scientific databases are expected to outperform generic…

Cited by 8 publications (9 citation statements)
References 13 publications (11 reference statements)
“…While the adoption of an array engine for GDMS was discarded [8] (due to lack of performance in many operations over classic row-based engines), in this paper we demonstrate that an array-based approach implemented in Spark is commended for executing chains of region-preserving operations. In such conditions, the benefits of the optimization pay off the cost of transforming some specific datasets from a row-based model to an array-based model and back.…”
Section: Introduction (mentioning)
confidence: 83%
“…After an initial evaluation of these engines, we focused on one of them: the current GDMS implementation, described in [11], uses Spark. Our choice was influenced by our domain-specific comparative analysis of Flink and Spark [7] and of Spark and SciDB [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…Comparative analysis, published in [9] and [10], shows that the performance of Flink and Spark is remarkably similar, while the performance of Spark and SciDB is very different, with SciDB faster than Spark when operations involve selections and aggregates (as they are facilitated by an array organization), whereas Spark is faster than SciDB in JOIN and MAP operations (thanks to the general power of the Spark execution engine).…”
Section: Discussion (mentioning)
confidence: 99%
“…Arrays are divided into chunks; a hash function uses the dimension values associated with each chunk in order to assign it to a specific node of the cluster; by using this method, called Multidimensional Array Clustering, every query processing operation is mapped to specific chunks and executed in parallel at the nodes where such chunks are allocated. A comparison of SciDB with our Spark-based implementation is provided in [13]; our binning-based join implementation has better performance, whereas SciDB appears superior for selection and aggregation operations which take advantage of the array-based architecture.…”
Section: High-level Query Languages (mentioning)
confidence: 99%
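
The chunk-to-node assignment described in the last statement can be illustrated with a minimal Python sketch. It is not SciDB's actual hash function or API: the function names, the fixed chunk shape, and the use of CRC32 as the hash are all assumptions chosen only to show the idea that cells are grouped into chunks by their dimension values and each chunk is hashed to one node.

```python
import zlib
from collections import defaultdict


def chunk_coords(cell, chunk_shape):
    """Map a cell's dimension values to the coordinates of the chunk containing it."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def node_for_chunk(coords, num_nodes):
    """Pick a node for a chunk by hashing its coordinates.

    CRC32 is an illustrative, deterministic stand-in (assumption),
    not the hash used by any real array database.
    """
    key = ",".join(map(str, coords)).encode("utf-8")
    return zlib.crc32(key) % num_nodes


def partition(cells, chunk_shape, num_nodes):
    """Group {cell: value} into node -> chunk -> [(cell, value), ...]."""
    placement = defaultdict(lambda: defaultdict(list))
    for cell, value in cells.items():
        coords = chunk_coords(cell, chunk_shape)
        node = node_for_chunk(coords, num_nodes)
        placement[node][coords].append((cell, value))
    return placement


# Hypothetical 2-D array with 4 x 10 chunks spread over 4 nodes.
cells = {(0, 0): 1.0, (1, 1): 2.0, (5, 12): 3.0, (7, 19): 4.0}
placement = partition(cells, chunk_shape=(4, 10), num_nodes=4)
```

Because every cell of a chunk lands on the same node, an operation restricted to a range of dimension values only has to touch the nodes holding the matching chunks, which is the property the quoted statement credits for SciDB's advantage in selections and aggregations.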