Abstract. For many years Apache Hadoop has been synonymous with processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of these issues, we previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines from predefined code blocks while the framework automatically translates them into the MapReduce execution model. With the introduction of the YARN resource management layer, new computational processing models such as Apache Spark are now pluggable into the Hadoop ecosystem. In this paper we describe the extension of Cloudflow to support Apache Spark without any adaptations to already implemented pipelines. The presented performance evaluation demonstrates that Spark can bring an additional boost for analysing next-generation sequencing (NGS) data to the field of genetics. The Cloudflow framework is open source and freely available at https://github.com/genepi/cloudflow.

Key words: Apache YARN, Pipeline Framework, Spark, Cloud Computing

AMS subject classifications. 68M14

1. Introduction. Since the advent of high-throughput technologies in the field of molecular biology (i.e. Next Generation Sequencing (NGS)), a growing amount of data is produced and needs to be analysed. Thus, molecular biology has evolved into a big data science, where the bottleneck is no longer the production of raw data in the laboratory, but its subsequent analysis and interpretation. Due to the variety of data, users need to carefully select the processing framework that best fits their data structure and processing task [8]; furthermore, the size of the data makes parallelization of the analysis desirable. Fortunately, a large number of conceptual approaches exist for dealing with this data boost [14].
One promising approach for efficient data parallelization is Apache Hadoop with its YARN (Yet Another Resource Negotiator) architecture [15]. Within Hadoop, users can focus on the functional parallelization of their problem while benefiting from the scalable Hadoop architecture stack in the background. However, writing Apache Hadoop applications requires custom code development and domain expertise, which has led to poor adoption of this parallelization model in biomedical research. More specifically, the MapReduce model is quite restrictive, requiring users to break an existing workflow into a number of map and reduce steps, which is often a challenging task. Additionally, the reusability of map and reduce functions is limited, resulting in use-case-specific implementations and therefore time-intensive solutions for every problem. To alleviate some of these pressing issues, we developed Cloudflow [6], a framework that simplifies pipeline creation in biomedical research, especially in the field of genetics. Cloudflow supports a variety of NGS data formats and contains a rich collection of built-in operations for analyzing such kinds of datasets (e.g. qua...