2020
DOI: 10.1109/access.2020.3015016

SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

Abstract: This paper presents SeQual, a scalable tool to efficiently perform quality control of large genomic datasets. Our tool currently supports more than 30 different operations (e.g., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve subsequent downstream analyses, while providing a simple and user-friendly graphical interface for non-expert users. Furthermore, SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory…
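To make the kinds of operations named in the abstract concrete, the following is a minimal sketch of one trimming step and one filtering step applied to FASTQ reads. It is not SeQual's implementation: SeQual runs on Big Data technologies, whereas this sketch is plain single-process Python. The names `Read`, `parse_fastq`, `trim_tail`, and `quality_control`, the Phred+33 quality encoding, and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Read:
    name: str
    seq: str
    qual: str  # assumed Phred+33 quality string, same length as seq


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # separator line starting with '+'
        qual = next(it).strip()
        yield Read(header[1:], seq, qual)


def mean_quality(read: Read) -> float:
    """Average Phred score of a read (0.0 for an empty read)."""
    if not read.qual:
        return 0.0
    return sum(ord(c) - 33 for c in read.qual) / len(read.qual)


def trim_tail(read: Read, min_q: int) -> Read:
    """Trimming: drop trailing bases whose quality is below min_q."""
    end = len(read.qual)
    while end > 0 and ord(read.qual[end - 1]) - 33 < min_q:
        end -= 1
    return Read(read.name, read.seq[:end], read.qual[:end])


def quality_control(reads: Iterable[Read], min_q: int = 20,
                    min_len: int = 4) -> Iterator[Read]:
    """Trim each read, then keep it only if it is long and clean enough."""
    for read in reads:
        trimmed = trim_tail(read, min_q)
        if len(trimmed.seq) >= min_len and mean_quality(trimmed) >= min_q:
            yield trimmed


# Hypothetical two-read input: 'I' encodes Q40 and '#' encodes Q2.
FASTQ = """@r1
ACGTACGT
+
IIIIII##
@r2
ACGT
+
####
"""

kept = list(quality_control(parse_fastq(FASTQ.splitlines())))
for read in kept:
    print(read.name, read.seq)
```

In this example the low-quality tail of `r1` is trimmed and the read is kept, while `r2` is trimmed to nothing and filtered out. In a distributed setting the same per-read functions could in principle be applied across partitions of a dataset, but how SeQual itself structures this is not described in the text above.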

Cited by 6 publications (7 citation statements)
References: 42 publications
“…2). As previously mentioned, the functionality supported by our tool is inspired on those operations provided by SeQual [27], but adapted to the stream processing model. The first group of quality control operations consists of 12 single filters that were implemented using the Spark’s filter method.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…QC-Chain [22] and PRINSEQ++ [23] does provide such parallel support through multithreading, and so their scalability is limited to a single node, whereas FastQC [24] and Falco [25], which is an emulation of the former, only support parallelism at the file level. SOAPnuke [26] is able to distribute the data processing to a cluster of nodes through Hadoop, whereas SeQual [27] is also capable of scaling out across a cluster by relying on the more efficient Spark RDDs, greatly enhancing performance compared to previous solutions. Nevertheless, both SOAPnuke and SeQual are still limited by the batch processing operation mode they are based on.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…With this large volume of data in mind, the processing and downstream analysis of the data are important to achieve meaningful results and interpretations. The quality of NGS data is also important for various downstream analyses, such as gene expression studies, genome sequence assembly, and microbiome analysis [15,16]. Prior to analysis, the sequencing data must first be checked and processed.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Since the birth of big data, people have gradually realized the social and economic value of big data and paid great attention to it [1][2][3]. Many developed countries have successively issued relevant policies to promote the development of big data, and have promoted big data as a national strategy [4][5][6].…”
Section: Introduction
Citation type: mentioning
confidence: 99%