SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

Expósito, Roberto R.; Galego-Torreiro, Roi; González-Domínguez, Jorge

doi:10.1109/access.2020.3015016

Cited by 6 publications

(7 citation statements)

References 42 publications

“…2). As previously mentioned, the functionality supported by our tool is inspired on those operations provided by SeQual [27], but adapted to the stream processing model. The first group of quality control operations consists of 12 single filters that were implemented using the Spark’s filter method.…”

Section: Methods (mentioning)

confidence: 99%
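
The citation statement above describes single filters applied through a filter method. As a rough illustration only, the following plain-Python sketch shows what record-level filter predicates of that kind might look like. It does not use Spark and does not reproduce SeQual's or SeQual-Stream's code: `FastqRecord`, `min_length_filter`, `no_ambiguous_bases`, `min_mean_quality_filter`, and `apply_filters` are hypothetical names, the three filters are invented examples rather than the tool's actual 12, and Phred+33 quality encoding is assumed.

```python
from typing import Callable, Iterable, List, NamedTuple


class FastqRecord(NamedTuple):
    """Hypothetical minimal FASTQ record (not a SeQual type)."""
    name: str
    sequence: str
    quality: str  # assumed to be Phred+33 encoded


Predicate = Callable[[FastqRecord], bool]


def mean_quality(record: FastqRecord) -> float:
    """Average Phred score of a read, assuming Phred+33 encoding."""
    scores = [ord(ch) - 33 for ch in record.quality]
    return sum(scores) / len(scores) if scores else 0.0


def min_length_filter(min_len: int) -> Predicate:
    """Keep reads with at least min_len bases."""
    return lambda record: len(record.sequence) >= min_len


def min_mean_quality_filter(threshold: float) -> Predicate:
    """Keep reads whose mean quality is at least threshold."""
    return lambda record: mean_quality(record) >= threshold


def no_ambiguous_bases(record: FastqRecord) -> bool:
    """Keep reads that contain no 'N' base."""
    return "N" not in record.sequence.upper()


def apply_filters(records: Iterable[FastqRecord],
                  predicates: List[Predicate]) -> List[FastqRecord]:
    """Chain the predicates; a read survives only if every one accepts it."""
    for predicate in predicates:
        records = filter(predicate, records)
    return list(records)


reads = [
    FastqRecord("r1", "ACGTACGT", "IIIIIIII"),  # long, clean, high quality
    FastqRecord("r2", "ACGN", "IIII"),          # short and contains N
    FastqRecord("r3", "ACGTACGT", "!!!!!!!!"),  # quality score 0 throughout
]
kept = apply_filters(
    reads,
    [min_length_filter(5), no_ambiguous_bases, min_mean_quality_filter(20)],
)
```

In a distributed setting, predicates written this way could in principle be passed one by one to a collection's filter operation, since each predicate looks at a single record and needs no shared state.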

“…QC-Chain [22] and PRINSEQ++ [23] does provide such parallel support through multithreading, and so their scalability is limited to a single node, whereas FastQC [24] and Falco [25], which is an emulation of the former, only support parallelism at the file level. SOAPnuke [26] is able to distribute the data processing to a cluster of nodes through Hadoop, whereas SeQual [27] is also capable of scaling out across a cluster by relying on the more efficient Spark RDDs, greatly enhancing performance compared to previous solutions. Nevertheless, both SOAPnuke and SeQual are still limited by the batch processing operation mode they are based on.…”

Section: Related Work (mentioning)

confidence: 99%

SeQual-Stream: approaching stream processing to quality control of NGS datasets

Castellanos-Rodríguez, Expósito, Touriño. 2023. BMC Bioinformatics

Background: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, needing to have the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in those scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for its parallel processing.

Results: In this paper we present SeQual-Stream, a streaming tool that allows performing multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results have shown significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability features.

Conclusion: Our solution provides a more scalable and higher performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.

“…With this large volume of data in mind, the processing and downstream analysis of the data are important to achieve meaningful results and interpretations. The quality of NGS data is also important for various downstream analyses, such as gene expression studies, genome sequence assembly, and microbiome analysis [15,16]. Prior to analysis, the sequencing data must first be checked and processed.…”

Section: Introduction (mentioning)

confidence: 99%

Review of the Current State of Freely Accessible Web Tools for the Analysis of 16S rRNA Sequencing of the Gut Microbiome

Ibal, Park, et al. 2022. IJMS

Owing to the emergence and improvement of high-throughput technology and the associated reduction in costs, next-generation sequencing (NGS) technology has made large-scale sampling and sequencing possible. With the large volume of data produced, the processing and downstream analysis of data are important for ensuring meaningful results and interpretation. Problems in data analysis may be encountered if researchers have little experience in using programming languages, especially if they are clinicians and beginners in the field. A strategy for solving this problem involves ensuring easy access to commercial software and tools. Here, we observed the current status of free web-based tools for microbiome analysis that can help users analyze and handle microbiome data effortlessly. We limited our search to freely available web-based tools and identified MicrobiomeAnalyst, Mian, gcMeta, VAMPS, and Microbiome Toolbox. We also highlighted the various analyses that each web tool offers, how users can analyze their data using each web tool, and noted some of their limitations. From the abovementioned list, gcMeta, VAMPS, and Microbiome Toolbox had several issues that made the analysis more difficult. Over time, as more data are generated and accessed, more users will analyze microbiome data. Thus, the availability of free and easily accessible web tools can enable the easy use and analysis of microbiome data, especially for those users with less experience in using command-line interfaces.

“…Since the birth of big data, people have gradually realized the social and economic value of big data and paid great attention to it [1][2][3]. Many developed countries have successively issued relevant policies to promote the development of big data, and have promoted big data as a national strategy [4][5][6].…”

Section: Introduction (mentioning)

confidence: 99%

Research on Comprehensive Evaluation of Data Source Quality in Big Data Environment

Li, Xu, Peng. 2021. IJCIS

Data quality is the prerequisite of big data research and the basis of all data analysis, mining, and decision support. Therefore, a comprehensive fuzzy evaluation method for big data quality evaluation is proposed. Through the analysis of big data quality characteristics, a big data quality evaluation system for the whole process of data processing is constructed. The subjective weight and objective weight of each indicator are calculated through the analytic hierarchy process and entropy method. In order to overcome the subjective and one-sided shortcomings of the single weight determination method, the subjective weight and the objective weight are organically integrated through the distance function method to determine the combined weight of each indicator. The quantified result of big data quality is obtained through fuzzy calculation of membership degree. Finally the ranking results of the proposed method are compared with those of some existing multi-attribute decision-making (MADM) methods. The obtained results indicate that the proposed method is reasonable and efficient to deal with MADM problems. It can comprehensively measure the level of big data quality, and provide users with accurate and efficient quality evaluation results.
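
The abstract above names the entropy method as the source of the objective indicator weights. The sketch below shows the common textbook form of entropy weighting, which may differ in detail from the formulation used in that paper. It covers only the objective half: the analytic hierarchy process and the distance-function combination are not shown. The function name `entropy_weights`, the input layout (rows are evaluated objects, columns are indicators), and the requirement of strictly positive column sums are assumptions made for this example.

```python
import math
from typing import List


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Objective indicator weights from the textbook entropy method.

    matrix[i][j] is the non-negative score of object i on indicator j.
    Indicators whose scores vary more across objects carry more
    information (lower entropy) and therefore receive a larger weight.
    """
    m = len(matrix)
    if m < 2:
        raise ValueError("need at least two objects to compute entropy")
    n = len(matrix[0])
    k = 1.0 / math.log(m)  # scales each column entropy into [0, 1]

    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total <= 0:
            raise ValueError("each indicator needs a positive column sum")
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:  # the limit of p * ln(p) as p -> 0 is 0
                entropy -= k * p * math.log(p)
        # Clamp to guard against tiny negative values from rounding.
        divergences.append(max(0.0, 1.0 - entropy))

    spread = sum(divergences)
    if spread == 0:
        # Every indicator is constant: fall back to equal weights.
        return [1.0 / n] * n
    return [d / spread for d in divergences]


# Indicator 0 is identical for all objects; indicator 1 varies.
weights = entropy_weights([[1, 1], [1, 2], [1, 3]])
```

In this example the constant indicator ends up with a weight of essentially zero, because it cannot distinguish the objects from one another.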
