2017
DOI: 10.1109/tcbb.2016.2576447
Data Management for Heterogeneous Genomic Datasets

Abstract: Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of…

Cited by 14 publications (14 citation statements) · References 28 publications

“…A line sweep algorithm is implemented in BEDOPS and BEDTools [29], [34] to compare two files (corresponding to our reference and experiment files) by sorting them on the start of the interval and then sweeping the two files sequentially, comparing the intervals and finding the intersections. BEDTools incorporates the genome-binning algorithm used by the UCSC Genome Browser in the search for overlapping regions; in [15] we show that region intersection between two samples in GMQL has slightly better performance than BEDTools. GMQL's use of binning is much more complex from a systems perspective, as it is implemented in the cloud environment to support implicit iteration over thousands of sample pairs.…”
Section: Technologies for Region Processing (mentioning)
confidence: 89%
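As a minimal illustration of the line-sweep idea described in the excerpt above, the following Scala sketch intersects two interval lists that are assumed to be sorted by start coordinate and internally non-overlapping (e.g. merged BED tracks). The Interval type and sweepIntersect function are hypothetical names for this example; they do not reproduce the BEDOPS, BEDTools, or GMQL implementations.

// Minimal line-sweep intersection of two interval lists, both sorted by start
// and assumed internally non-overlapping (e.g. merged BED tracks).
// Illustrative sketch only; not the BEDOPS/BEDTools or GMQL code.
object SweepSketch {
  case class Interval(start: Int, end: Int) // half-open [start, end)

  def sweepIntersect(ref: IndexedSeq[Interval], exp: IndexedSeq[Interval]): Seq[Interval] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Interval]
    var i = 0
    var j = 0
    while (i < ref.length && j < exp.length) {
      val a = ref(i)
      val b = exp(j)
      val lo = math.max(a.start, b.start)
      val hi = math.min(a.end, b.end)
      if (lo < hi) out += Interval(lo, hi)   // intervals overlap: emit their intersection
      if (a.end <= b.end) i += 1 else j += 1 // advance the interval that ends first
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val ref = Vector(Interval(100, 200), Interval(300, 400))
    val exp = Vector(Interval(150, 350))
    // prints Interval(150,200) and Interval(300,350)
    sweepIntersect(ref, exp).foreach(println)
  }
}

Advancing whichever interval ends first is what keeps the sweep linear in the total number of intervals once both inputs are sorted.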
“…The big data challenge has been observed and addressed in various works devoted to intelligent transport and smart cities [11,19,42,43,74,75,84], water monitoring [12,22,90], social network analysis [13,14,77], multimedia processing [72,82], the Internet of Things (IoT) [9], social media monitoring [50], life sciences [3,31,32,44,58,69] and disease data analysis [6,45,81], telecommunications [27], and finance [2], to mention just a few. Many hot issues in various sub-fields of bioinformatics have also been solved with the use of Big Data ecosystems and cloud computing, e.g., mapping next-generation sequencing data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics [65], sequence analysis and assembly [17,30,34,35,47,62], multiple alignments of DNA and RNA sequences [86,91], codon analysis with local MapReduce aggregations [63], NGS data analysis [8], phylogeny [24,48], proteomics [37], analysis of protein-ligand binding sites…”
Section: Related Work (mentioning)
confidence: 99%
“…From bottom to top, it includes the repository layer, the engine layer and the GMQL layer, which in turn consists of an orchestrator and a compiler, and is accessible through a web service API. We next briefly explain query execution; a detailed description can be found in [11]. Execution flow is controlled by the orchestrator, written in the Java programming language; the processing flow includes compilation, data selection from the repository, scheduling of the Pig code execution over the Apache Pig engine [6], and storing of the resulting datasets in the repository in a standard format.…”
Section: GMQL Implementation V1 (mentioning)
confidence: 99%
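The excerpt above describes a strictly sequential execution flow (compile, select data from the repository, run on the engine, store the result). The Scala sketch below only mirrors those four stages as plain functions; every name in it (CompiledQuery, selectFromRepository, and so on) is a hypothetical placeholder, not the actual Java orchestrator of GMQL v1, and the engine step is stubbed out rather than submitting real Pig Latin.

// Hypothetical sketch of the four orchestration stages described above:
// compile, select data, execute on the engine, store the result.
// None of these names come from the actual GMQL v1 Java orchestrator.
object OrchestratorSketch {
  case class CompiledQuery(pigScript: String)
  case class SelectedData(inputPaths: Seq[String])
  case class EngineResult(outputPath: String)

  def compile(gmqlQuery: String): CompiledQuery =
    CompiledQuery(s"-- Pig Latin translation of: $gmqlQuery")

  def selectFromRepository(metadataPredicate: String): SelectedData =
    SelectedData(Seq(s"/repository/datasets?where=$metadataPredicate"))

  def runOnEngine(query: CompiledQuery, data: SelectedData): EngineResult = {
    // A real orchestrator would submit query.pigScript to the Apache Pig engine here.
    EngineResult(outputPath = "/repository/results/run-0001")
  }

  def storeInRepository(result: EngineResult): Unit =
    println(s"result dataset stored in standard format at ${result.outputPath}")

  // Sequential control flow, as in the description above.
  def orchestrate(queryText: String, predicate: String): Unit = {
    val compiled = compile(queryText)
    val selected = selectFromRepository(predicate)
    val result   = runOnEngine(compiled, selected)
    storeInRepository(result)
  }

  def main(args: Array[String]): Unit =
    orchestrate("example GMQL query text", "cell == 'K562'")
}

The point of the sketch is only the sequential control flow; in the system described, each stage is a substantial component (compiler, repository access, Pig job scheduling, result storage).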
“…In all cases, GMQL queries were translated to queries for cloud-based database engines. Version 1, described in [11], was developed between spring 2014 and spring 2015 and was based on Apache Pig [6] and Hadoop 1 [28]. Version 2 is described in [18]; its development started in the summer of 2015 and is still ongoing. Version 2 is based on Hadoop 2 [16] and uses Apache Spark [7]; project branches were developed for the engines Apache Flink [4] and SciDB [3].…”
Section: Introduction (mentioning)
confidence: 99%
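The excerpts above note that GMQL queries are translated into jobs for cloud-based engines and that binning is used to parallelize region intersection across many sample pairs. Purely as an illustration of that general idea, the following Apache Spark (Scala) sketch keys regions by (chromosome, bin) and joins on those keys; the bin size, the Region schema, and all names are assumptions for this example and do not reproduce GMQL's actual operators.

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: one common way to express a binned region
// intersection on Apache Spark. Bin size, schema and names are assumptions,
// not GMQL's actual operator implementation.
object BinnedIntersection {
  case class Region(sample: String, chrom: String, start: Long, stop: Long)

  val binSize = 100000L // hypothetical bin width in base pairs

  // All bins touched by a half-open region [start, stop).
  def bins(r: Region): Seq[Long] = (r.start / binSize) to ((r.stop - 1) / binSize)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("binned-intersection").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val ref = sc.parallelize(Seq(Region("refSample", "chr1", 100000L, 250000L)))
    val exp = sc.parallelize(Seq(Region("expSample", "chr1", 200000L, 300000L)))

    // Key every region by (chromosome, bin) so only co-binned regions are joined,
    // then check real overlap and de-duplicate pairs that share several bins.
    val keyedRef = ref.flatMap(r => bins(r).map(b => ((r.chrom, b), r)))
    val keyedExp = exp.flatMap(r => bins(r).map(b => ((r.chrom, b), r)))

    val overlaps = keyedRef.join(keyedExp)
      .values
      .filter { case (a, b) => a.start < b.stop && b.start < a.stop }
      .distinct()

    overlaps.collect().foreach(println)
    spark.stop()
  }
}

The join on (chromosome, bin) keys is what lets the engine distribute the candidate comparisons; the final filter discards pairs that share a bin but do not actually overlap, and distinct() removes duplicates produced when a pair shares several bins.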