Scalable genomics: From raw data to aligned reads on Apache YARN

Versaci, Francesco; Pireddu, Luca; Zanetti, Gianluigi

doi:10.1109/bigdata.2016.7840727

Cited by 5 publications

(4 citation statements)

References 37 publications


“…The first module (preprocessor) reads as input raw Illumina data and performs BCL conversion, filtering and demultiplexing. The second module implements the alignment step in Flink, using the Read Aligner API (RAPI [5]) which provides Java bindings for the BWA-MEM aligner. The two modules are connected by a Kafka broker.…”

Section: Methods (mentioning)

confidence: 99%

Distributed stream processing for genomics pipelines

Versaci, Pireddu, Zanetti

2017

Preprint

Self Cite


Personalized medicine is in great part enabled by the progress in data acquisition technologies for modern biology, such as next-generation sequencing (NGS). Conventional NGS processing workflows are composed of independent tools implementing shared-memory parallelism, which communicate by means of intermediate files. With increasing data sizes this approach is showing its limited scalability and robustness, problems that make it unsuitable for large-scale, population-wide personalized medicine applications. In this work we propose the adoption of the stream computing architecture to make the genomics pipeline more scalable and fault-tolerant. We implemented the first processing phases for Illumina sequencing data, from raw data to alignment, using the Apache Flink distributed stream processing framework and Apache Kafka. The new pipeline has been tested by processing the raw output of an Illumina HiSeq3000 sequencer and producing aligned reads in CRAM format. The results show near-optimal scalability in experiments on 1 to 12 computing nodes, with a speed-up of 9.5x over the conventional solution (which cannot automatically run on multiple nodes). This result is particularly positive considering that the very short runtime of the experiment, less than 15 minutes, makes the constant time costs imposed by the framework overheads significant.



“…Note: This expository section is reproduced almost verbatim from [15], for the reader's convenience.…”

Section: The NGS Process (mentioning)

confidence: 99%

“…The first module in our pipeline takes care of preprocessing the raw Illumina data, which are available in the proprietary BCL format. To construct the preprocessor we extended our BCL to FASTQ converter [15], by enabling its output to be sent to a Kafka broker, using the built-in Flink-Kafka connector.…”

Section: A. Data Preprocessing (mentioning)

confidence: 99%

“…The alignment module, implemented from scratch in Flink for this work, exploits our Read Aligner API (RAPI [15]), which in turn relies on a modified version of the standard BWA-MEM aligner [26]. The module consumes the reads via TCP from the Kafka broker (which could thus be located far from the computation nodes used in this step) and it produces as output the aligned reads.…”

Section: B. Alignment (mentioning)

confidence: 99%


Kafka interfaces for composable streaming genomics pipelines

Versaci, Pireddu, Zanetti

2017

Preprint

Self Cite


Abstract: Modern sequencing machines produce on the order of a terabyte of data per day, which subsequently needs to go through a complex processing pipeline. The standard workflow begins with a few independent, shared-memory tools, which communicate by means of intermediate files. Given the constant increase in the amount of data produced, this approach is proving more and more unmanageable, due to its lack of robustness and scalability. In this work we propose the adoption of stream computing to simplify the genomic pipeline, boost its performance and improve its fault-tolerance. We decompose the first steps of the genomic processing into two distinct and specialized modules (preprocessing and alignment) and we loosely compose them via communication through Kafka streams, in order to allow for easy composability and integration into already existing Hadoop-based pipelines. The proposed solution is then experimentally validated on real data and shown to scale almost linearly.
