2017
DOI: 10.1109/tcbb.2016.2576447
Data Management for Heterogeneous Genomic Datasets

Abstract: Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of…

Cited by 14 publications (14 citation statements) · References 28 publications

“…A line sweep algorithm is implemented in BEDOPS and BEDTools [29], [34] to compare two files (corresponding to our reference and experiment files) by sorting them on the start of the interval and then sweeping the two files sequentially, comparing the intervals and finding the intersections. BEDTools incorporates the genome-binning algorithm used by the UCSC Genome Browser in the search for overlapping regions; in [15] we show that region intersection between two samples in GMQL has slightly better performance than BEDTools. GMQL's use of binning is much more complex from a systems perspective, as it is implemented in the cloud environment to support implicit iteration over thousands of sample pairs.…”
Section: Technologies for Region Processing (mentioning)
confidence: 89%
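As a minimal illustration of the line-sweep idea described in the excerpt above, the following Scala sketch intersects two interval lists that are assumed to be sorted by start coordinate and internally non-overlapping (e.g. merged BED tracks). The Interval type and sweepIntersect function are hypothetical names for this example; they do not reproduce the BEDOPS, BEDTools, or GMQL implementations.

// Minimal line-sweep intersection of two interval lists, both sorted by start
// and assumed internally non-overlapping (e.g. merged BED tracks).
// Illustrative sketch only; not the BEDOPS/BEDTools or GMQL code.
object SweepSketch {
  case class Interval(start: Int, end: Int) // half-open [start, end)

  def sweepIntersect(ref: IndexedSeq[Interval], exp: IndexedSeq[Interval]): Seq[Interval] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Interval]
    var i = 0
    var j = 0
    while (i < ref.length && j < exp.length) {
      val a = ref(i)
      val b = exp(j)
      val lo = math.max(a.start, b.start)
      val hi = math.min(a.end, b.end)
      if (lo < hi) out += Interval(lo, hi)   // intervals overlap: emit their intersection
      if (a.end <= b.end) i += 1 else j += 1 // advance the interval that ends first
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val ref = Vector(Interval(100, 200), Interval(300, 400))
    val exp = Vector(Interval(150, 350))
    // prints Interval(150,200) and Interval(300,350)
    sweepIntersect(ref, exp).foreach(println)
  }
}

Advancing whichever interval ends first is what keeps the sweep linear in the total number of intervals once both inputs are sorted.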
“…The big data challenge has been observed and addressed in various works devoted to intelligent transport and smart cities [11,19,42,43,74,75,84], water monitoring [12,22,90], social network analysis [13,14,77], multimedia processing [72,82], the Internet of Things (IoT) [9], social media monitoring [50], life sciences [3,31,32,44,58,69] and disease data analysis [6,45,81], telecommunications [27], and finance [2], to mention just a few. Many hot issues in various sub-fields of bioinformatics have also been solved with the use of Big Data ecosystems and cloud computing, e.g., mapping next-generation sequencing data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics [65], sequence analysis and assembly [17,30,34,35,47,62], multiple alignments of DNA and RNA sequences [86,91], codon analysis with local MapReduce aggregations [63], NGS data analysis [8], phylogeny [24,48], proteomics [37], analysis of protein-ligand binding sites…”
Section: Related Work (mentioning)
confidence: 99%
“…From bottom to top, it includes the repository layer, the engine layer and the GMQL layer, which in turn consists of an orchestrator and a compiler, and is accessible through a web service API. We next briefly explain query execution; a detailed description can be found in [11]. Execution flow is controlled by the orchestrator, written in the Java programming language; the processing flow includes compilation, data selection from the repository, scheduling of the Pig code execution over the Apache Pig engine [6], and storing of the resulting datasets in the repository in a standard format.…”
Section: GMQL Implementation V1 (mentioning)
confidence: 99%
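The excerpt above describes a strictly sequential execution flow (compile, select data from the repository, run on the engine, store the result). The Scala sketch below only mirrors those four stages as plain functions; every name in it (CompiledQuery, selectFromRepository, and so on) is a hypothetical placeholder, not the actual Java orchestrator of GMQL v1, and the engine step is stubbed out rather than submitting real Pig Latin.

// Hypothetical sketch of the four orchestration stages described above:
// compile, select data, execute on the engine, store the result.
// None of these names come from the actual GMQL v1 Java orchestrator.
object OrchestratorSketch {
  case class CompiledQuery(pigScript: String)
  case class SelectedData(inputPaths: Seq[String])
  case class EngineResult(outputPath: String)

  def compile(gmqlQuery: String): CompiledQuery =
    CompiledQuery(s"-- Pig Latin translation of: $gmqlQuery")

  def selectFromRepository(metadataPredicate: String): SelectedData =
    SelectedData(Seq(s"/repository/datasets?where=$metadataPredicate"))

  def runOnEngine(query: CompiledQuery, data: SelectedData): EngineResult = {
    // A real orchestrator would submit query.pigScript to the Apache Pig engine here.
    EngineResult(outputPath = "/repository/results/run-0001")
  }

  def storeInRepository(result: EngineResult): Unit =
    println(s"result dataset stored in standard format at ${result.outputPath}")

  // Sequential control flow, as in the description above.
  def orchestrate(queryText: String, predicate: String): Unit = {
    val compiled = compile(queryText)
    val selected = selectFromRepository(predicate)
    val result   = runOnEngine(compiled, selected)
    storeInRepository(result)
  }

  def main(args: Array[String]): Unit =
    orchestrate("example GMQL query text", "cell == 'K562'")
}

The point of the sketch is only the sequential control flow; in the system described, each stage is a substantial component (compiler, repository access, Pig job scheduling, result storage).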
“…In all cases, GMQL queries were translated to queries for cloud-based database engines. Version 1, described in [11], was developed between spring 2014 and spring 2015 and was based on Apache Pig [6] and Hadoop 1 [28]. Version 2 is described in [18]; its development started in the summer of 2015 and is still ongoing. Version 2 is based on Hadoop 2 [16] and uses Apache Spark [7]; project branches were developed for the engines Apache Flink [4] and SciDB [3].…”
Section: Introduction (mentioning)
confidence: 99%
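The excerpts above note that GMQL queries are translated into jobs for cloud-based engines and that binning is used to parallelize region intersection across many sample pairs. Purely as an illustration of that general idea, the following Apache Spark (Scala) sketch keys regions by (chromosome, bin) and joins on those keys; the bin size, the Region schema, and all names are assumptions for this example and do not reproduce GMQL's actual operators.

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: one common way to express a binned region
// intersection on Apache Spark. Bin size, schema and names are assumptions,
// not GMQL's actual operator implementation.
object BinnedIntersection {
  case class Region(sample: String, chrom: String, start: Long, stop: Long)

  val binSize = 100000L // hypothetical bin width in base pairs

  // All bins touched by a half-open region [start, stop).
  def bins(r: Region): Seq[Long] = (r.start / binSize) to ((r.stop - 1) / binSize)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("binned-intersection").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val ref = sc.parallelize(Seq(Region("refSample", "chr1", 100000L, 250000L)))
    val exp = sc.parallelize(Seq(Region("expSample", "chr1", 200000L, 300000L)))

    // Key every region by (chromosome, bin) so only co-binned regions are joined,
    // then check real overlap and de-duplicate pairs that share several bins.
    val keyedRef = ref.flatMap(r => bins(r).map(b => ((r.chrom, b), r)))
    val keyedExp = exp.flatMap(r => bins(r).map(b => ((r.chrom, b), r)))

    val overlaps = keyedRef.join(keyedExp)
      .values
      .filter { case (a, b) => a.start < b.stop && b.start < a.stop }
      .distinct()

    overlaps.collect().foreach(println)
    spark.stop()
  }
}

The join on (chromosome, bin) keys is what lets the engine distribute the candidate comparisons; the final filter discards pairs that share a bin but do not actually overlap, and distinct() removes duplicates produced when a pair shares several bins.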