CloudEC: A MapReduce-based algorithm for correcting errors in next-generation sequencing big data

Chung, Wei-Chun; Ho, Jan-Ming; Lin, Chung‐Yen; Lee, D. T.

doi:10.1109/bigdata.2017.8258251

Cited by 10 publications

(8 citation statements)

References 37 publications


“…Thanks to this change, CloudRS is able to process the sequences using multiple worker nodes, effectively allowing it to handle larger datasets than ALLPATHS-LG in less time. Finally, CloudEC [7] is another Hadoop-based MSA corrector that was presented as an enhanced version of CloudRS. The major improvement of CloudEC over its counterpart was the introduction of the spread corrector, a new MSA-based algorithm which increases the reliability of the reads at the cost of reducing its performance, as this algorithm is much more computationally intensive than the one provided by CloudRS (i.e., the pinch corrector).…”

Section: Big Data and Parallel Correctors

mentioning

confidence: 99%

“…However, most of the previous solutions usually lack either accuracy in correction, performance when processing large datasets, or the capability to scale out on a computing cluster. Among them, CloudEC [4] has been proved to perform precise corrections together with a scalable approach by relying on Big Data technologies, since its correction algorithms have been designed upon the MapReduce paradigm [5] using its most popular open-source implementation Apache Hadoop [6] (more details about Big Data and MapReduce are provided in Section 2 of Additional file 1). However, the usage of this tool comes at the cost of poor performance in terms of computational time when managing the huge amounts of data usually generated by NGS platforms.…”

mentioning

confidence: 99%

“…However, the usage of this tool comes at the cost of poor performance in terms of computational time when managing the huge amounts of data usually generated by NGS platforms. According to their own published results [7], the fastest experiment takes more than 18 h when correcting a dataset with 200 million reads on an 80-node computing cluster, showing a limited speedup of 5 × (5 times faster execution time using 8 times the number of nodes). In order to overcome this problem, in this work we are introducing SparkEC as a new tool based on this previous approach that can tackle these scalability limitations without giving up either of its advantages in terms of correction accuracy.…”

mentioning

confidence: 99%
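The scaling figures quoted in the statement above (a 5× speedup obtained with 8 times the number of nodes) can be turned into a parallel efficiency, which makes the "limited speedup" remark concrete. The following is a minimal sketch; the function name `parallel_efficiency` is hypothetical and only the two numbers passed to it come from the quoted text.

```python
def parallel_efficiency(speedup: float, node_ratio: float) -> float:
    """Return the fraction of ideal (linear) scaling that was achieved.

    speedup    -- how many times faster the larger run was
    node_ratio -- how many times more nodes the larger run used
    """
    if node_ratio <= 0:
        raise ValueError("node_ratio must be positive")
    return speedup / node_ratio


# Figures from the quoted statement: 5x faster using 8x the nodes.
efficiency = parallel_efficiency(5.0, 8.0)
print(f"{efficiency:.1%}")  # 62.5%
```

Under linear scaling the efficiency would be 100%, so a value of 62.5% means that more than a third of the added capacity did not translate into shorter run time.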


SparkEC: speeding up alignment-based DNA error correction tools

2022


Background: In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. Results: In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been shown to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework together with the introduction of other enhancements such as the usage of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results have shown significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. Conclusion: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC speed can increase with respect to the number of nodes in a cluster. Furthermore, the software is freely available under GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).


“…In fact, the exploitation of Big Data clusters to accelerate the storage, processing and visualization of large NGS datasets has been recently explored in multiple previous works. For instance, many bioinformatics tools implemented on top of Big Data processing frameworks such as Hadoop [25] and Spark [9] have emerged in recent years, from error correction [26], [27], duplicate read removal [13] and sequence alignment [28]- [31], to variant calling [32], de novo genome assembly [33], [34] and protein structure prediction [35]- [37], among many others. Most of these tools are executed within a bioinformatics pipeline (or scientific workflow engines such as SAASFEE [38] or Pegasus [39]) that usually starts with a quality control of the input FASTA/FASTQ datasets.…”

Section: Related Work

mentioning

confidence: 99%

SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

2020


This paper presents SeQual, a scalable tool to efficiently perform quality control of large genomic datasets. Our tool currently supports more than 30 different operations (e.g., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve subsequent downstream analyses, while providing a simple and user-friendly graphical interface for non-expert users. Furthermore, SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory systems such as clusters by relying on the open-source Apache Spark cluster computing framework. Our scalable Spark-based implementation reduces the runtime from more than three hours to less than 20 minutes when processing a paired-end dataset with 251 million reads per input file on an 8-node multi-core cluster.


“…Hence, multiple algorithms have been proposed in the literature to correct these mistakes in the samples and make up higher quality reads. Among them, CloudEC [3] is a Big Data tool built upon the Apache Hadoop framework [4] that is able to perform corrections to genetic datasets by running multiple steps of alignments of the input samples, and replacing the bases with the lowest qualities of all those aligned samples with another representations of higher quality.…”

Section: Introduction

mentioning

confidence: 99%
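The statement above describes correction as aligning reads and replacing their lowest-quality bases with better-supported ones. The sketch below illustrates that general idea only; it is not CloudEC's actual algorithm. It assumes the reads are already aligned at the same offset and have equal length, and the function name `correct_by_column_consensus` and the `min_quality` threshold are hypothetical.

```python
from collections import defaultdict


def correct_by_column_consensus(reads, min_quality=20):
    """Replace low-quality bases with the quality-weighted column consensus.

    reads       -- list of (sequence, qualities) pairs; all sequences are
                   assumed to be pre-aligned and of equal length
    min_quality -- bases below this quality are candidates for replacement
    """
    if not reads:
        return []
    length = len(reads[0][0])
    corrected = [list(seq) for seq, _ in reads]

    for col in range(length):
        # Each read votes for its base, weighted by that base's quality.
        votes = defaultdict(int)
        for seq, quals in reads:
            votes[seq[col]] += quals[col]
        consensus = max(votes, key=votes.get)

        # Only low-quality bases that disagree with the consensus change.
        for i, (seq, quals) in enumerate(reads):
            if quals[col] < min_quality and seq[col] != consensus:
                corrected[i][col] = consensus

    return ["".join(bases) for bases in corrected]


reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACTT", [30, 30, 5, 30]),  # third base is low quality and disagrees
]
print(correct_by_column_consensus(reads))  # ['ACGT', 'ACGT', 'ACGT']
```

High-quality bases are left untouched even when they disagree with the consensus, so genuine variation supported by confident base calls is not overwritten.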

Performance Optimization of a Parallel Error Correction Tool

Martínez-Sánchez; Expósito; Touriño

2021

The 4th XoveTIC Conference


Due to the continuous development in the field of Next Generation Sequencing (NGS) technologies that have allowed researchers to take advantage of greater genetic samples in less time, it is a matter of relevance to improve the existing algorithms aimed at the enhancement of the quality of those generated reads. In this work, we present a Big Data tool implemented upon the open-source Apache Spark framework that is able to execute validated error-correction algorithms at an improved performance. The experimental evaluation conducted on a multi-core cluster has shown significant improvements in execution times, providing a maximum speedup of 9.5 over existing error correction tools when processing an NGS dataset with 25 million reads.
