Evaluating Genomic Big Data Operations on SciDB and Spark

Cattani, Simone; Ceri, Stefano; Kaitoua, Abdulrahman; Pinoli, Pietro

doi:10.1007/978-3-319-60131-1_34

Cited by 8 publications

(9 citation statements)

References 13 publications

(11 reference statements)

“…While the adoption of an array engine for GDMS was discarded [8] (due to lack of performance in many operations over classic row-based engines) in this paper we demonstrate that an array-based approach implemented in Spark is commended for executing chains of region-preserving operations. In such condition, the benefits of the optimization pays off the cost of transforming some specific datasets from a row-based model to an array-based model and back.…”

Section: Introduction (mentioning)

confidence: 83%

Multi-Dimensional Genomic Data Management for Region-Preserving Operations

Horlova, Kaitoua, Markl, et al. 2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)

In previous work, we presented GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e. intervals of the genome, included within a start and stop position) with an associated value, representing the signal associated to that region (the most typical signals represent gene expressions, peaks of expressions, and variants relative to a reference genome). GMQL can process queries over billions of regions, organized within distinct datasets. In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving. Chains of region-preserving operations can be efficiently executed by taking advantage of an array-based data organization, where region management can be separated from value management. We discuss this optimization in the context of the current GDMS system which has a row-based (relational) organization, and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization. Index Terms: Big data processing, data management, cloud computing, genomic computing.

“…After an initial evaluation of these engines, we focused on one of them: the current GDMS implementation, described in [11], uses Spark. Our choice was influenced by our domain-specific comparative analysis of Flink and Spark [7] and of Spark and SciDB [8].…”

Section: Introduction (mentioning)

confidence: 99%

“…Comparative analysis, published in [9] and [10], shows that the performance of Flink and Spark are remarkably similar, while the performance of Spark and SciDB are very different, with SciDB faster than Spark when operations involve selections and aggregates (as they are facilitated by an array organization); whereas, Spark is faster than SciDB in JOIN and MAP operations (thanks to the general power of the Spark execution engine.)…”

Section: Discussion (mentioning)

confidence: 99%

Experiences in the Development of a Data Management System for Genomics

Ceri, Canakoglu, Kaitoua, et al. 2018

Communications in Computer and Information Science

GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today's genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available. In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today's big data are raw reads of the sequencing machines, tomorrow's big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions. Coherently, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, by going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

“…Arrays are divided into chunks; a hash function uses the dimension values associated to each chunk in order to assign it to a specific node of the cluster; by using this method, called Multidimensional Array Clustering, every query processing operation is mapped to specific chunks and executed in parallel at the nodes where such chunks are allocated. A comparison of SciDB with our Spark-based implementation is provided in [13]; our binning-based join implementation has better performances, whereas SciDB appears superior for selection and aggregation operations which take advantage of the array-based architecture.…”

Section: High-level Query Languages (mentioning)

confidence: 99%
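
The chunk-to-node assignment described in the statement above can be illustrated with a minimal Python sketch. It is not SciDB's implementation: the function names, the chunk sizes, and the use of CRC32 as the hash are assumptions chosen only to show how dimension values can determine a chunk and how a chunk can be mapped deterministically to a node.

```python
import zlib


def chunk_coords(cell, chunk_sizes):
    """Map a cell's dimension values to the coordinates of the chunk containing it."""
    return tuple(value // size for value, size in zip(cell, chunk_sizes))


def node_for_chunk(coords, num_nodes):
    """Hash chunk coordinates to a node index (CRC32 is an illustrative stand-in)."""
    key = ",".join(str(c) for c in coords).encode("utf-8")
    return zlib.crc32(key) % num_nodes


def node_for_cell(cell, chunk_sizes, num_nodes):
    """Locate the node that would hold a given cell under this scheme."""
    return node_for_chunk(chunk_coords(cell, chunk_sizes), num_nodes)
```

Because the node depends only on the chunk coordinates, all cells of one chunk land on the same node, so an operation restricted to a range of dimension values only needs to touch the nodes that hold the matching chunks.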

Optimal Binning for Genomics

Gulino, Kaitoua, Ceri 2019

IEEE Trans. Comput.

Genome sequencing is expected to be the most prolific source of big data in the next decade; millions of whole genome datasets will open new opportunities for biological research and personalized medicine. Genome sequences are abstracted in the form of interesting regions, describing abnormalities of the genome. The parallel execution on the cloud of complex operations for joining and mapping billions of genomic regions is increasingly important. Genome binning, i.e. partitioning of the genome into small-size segments, adapts classic data partitioning methods to genomics; region distributions to bins must reflect operation-specific correctness rules. As a consequence, determining the optimal bin size for such operations is a complex mathematical problem, whose solution requires careful modeling. The main result of this paper is the mathematical formulation and solution of the optimal binning problem for join and map operations in the context of GMQL, a query language over genomic regions; the model is validated by experiments showing its accuracy and sensitivity to the variation of operations' parameters. We also optimize sequences of operations by inheriting the binning between two consecutive operations and we show the deployment of GMQL and the tuning of the proposed model on different cloud computing systems.
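
As a rough illustration of genome binning applied to a join, the following Python sketch partitions each chromosome into fixed-size bins, copies every region into each bin it overlaps, and compares only regions that share a bin. It is a simplified sketch, not the GMQL implementation or the optimal-binning model of the paper: the region layout `(chrom, start, stop)` with half-open coordinates, the function names, and the plain overlap predicate are all assumptions.

```python
from collections import defaultdict


def bins_for_region(start, stop, bin_size):
    """Indices of the fixed-size bins overlapped by the half-open interval [start, stop)."""
    return range(start // bin_size, (stop - 1) // bin_size + 1)


def binned_overlap_join(left, right, bin_size):
    """Return the set of (l, r) pairs of overlapping regions, matched bin by bin."""
    # Index the right operand by (chromosome, bin); a region is replicated
    # into every bin it spans.
    index = defaultdict(list)
    for r in right:
        for b in bins_for_region(r[1], r[2], bin_size):
            index[(r[0], b)].append(r)

    # Probe with the left operand; the set removes pairs found in several bins.
    result = set()
    for l in left:
        for b in bins_for_region(l[1], l[2], bin_size):
            for r in index.get((l[0], b), ()):
                if l[1] < r[2] and r[1] < l[2]:
                    result.add((l, r))
    return result
```

The bin size is the trade-off the abstract refers to: small bins reduce the number of candidate comparisons per bin but replicate long regions across many bins, while large bins do the opposite.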
