FQSqueezer: k-mer-based compression of sequencing data

Deorowicz, Sebastian

doi:10.1038/s41598-020-57452-6

Cited by 23 publications

(15 citation statements)

References 29 publications

“…Since sequence headers contribute marginally to the sizes of FASTA/FASTQ files, they are compressed with well-established token-based method analogously as in FQSqueezer [23] or ENANO.…”

Section: Colord Overview (mentioning)

confidence: 99%

CoLoRd: Compressing long reads

Kokot

Gudyś

et al. 2021

Preprint

Self Cite

The costs of maintaining exabytes of data produced by sequencing experiments every year have become a major issue in today's genomics. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.
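
The citation statement above refers to token-based compression of read headers. As a rough illustration of that general idea only (not the actual scheme used by CoLoRd, FQSqueezer, or ENANO), the sketch below splits each header into numeric and non-numeric tokens and describes it relative to the previous header. The function names, the operation labels ("same", "delta", "lit"), and the example headers are all assumptions made for this sketch; a real compressor would feed such operations to an entropy coder.

```python
import re
from typing import List, Optional, Tuple, Union

# One operation per token: ("same", None), ("delta", int) or ("lit", str).
Op = Tuple[str, Optional[Union[int, str]]]

_TOKEN_RE = re.compile(r"[0-9]+|[^0-9]+")
_PLAIN_INT_RE = re.compile(r"0|[1-9][0-9]*")


def tokenize(header: str) -> List[str]:
    """Split a header into alternating runs of digits and non-digits."""
    return _TOKEN_RE.findall(header)


def _is_plain_int(tok: str) -> bool:
    # Leading zeros are excluded so that str(int(tok)) == tok round-trips.
    return _PLAIN_INT_RE.fullmatch(tok) is not None


def encode(prev: List[str], cur: List[str]) -> List[Op]:
    """Describe the tokens of `cur` relative to the tokens of `prev`."""
    ops: List[Op] = []
    for i, tok in enumerate(cur):
        old = prev[i] if i < len(prev) else None
        if tok == old:
            ops.append(("same", None))
        elif old is not None and _is_plain_int(tok) and _is_plain_int(old):
            ops.append(("delta", int(tok) - int(old)))
        else:
            ops.append(("lit", tok))
    return ops


def decode(prev: List[str], ops: List[Op]) -> List[str]:
    """Invert `encode`, given the same previous token list."""
    out: List[str] = []
    for i, (kind, val) in enumerate(ops):
        if kind == "same":
            out.append(prev[i])
        elif kind == "delta":
            out.append(str(int(prev[i]) + val))
        else:
            out.append(val)
    return out


# Hypothetical headers, invented for this example.
h1 = tokenize("@read.7 lane=3 x=1001")
h2 = tokenize("@read.8 lane=3 x=1017")
ops = encode(h1, h2)
restored = "".join(decode(h1, ops))
```

Because consecutive headers usually differ in only a few fields, most operations are "same" or small deltas, which is what makes this kind of representation cheap to encode.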

“…Because compressors designed for FASTQ data can be trivially adopted for FASTA-formatted inputs, we also included a comprehensive array of compressors designed primarily or specifically for FASTQ data: BEETL [ 28 ], Quip [ 29 ], fastqz [ 10 ], fqzcomp [ 10 ], DSRC 2 [ 30 ], Leon [ 31 ], LFQC [ 32 ], KIC [ 33 ], ALAPY [ 34 ], GTX.Zip [ 35 ], HARC [ 36 ], LFastqC [ 37 ], SPRING [ 38 ], Minicom [ 39 ], and FQSqueezer [ 40 ]. We also included AC—a compressor designed exclusively for protein sequences [ 41 ].…”

Section: Results (mentioning)

confidence: 99%

Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

et al. 2020

Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.

Findings: We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results.

Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.

“…Minimizers are used to face the two challenges of processing k-mers: the high volume of data due to redundancy and the impossibility or difficulty of partitioning treatment [7]. Data structure to reduce redundancy: Minimizers are used to define data structures where not all the k-mers of a read are stored, but those that are contiguous and have the same minimizer are merged [14]. The product of this fusion is subsequences called super k-mers [8].…”

Section: What Do the Minimizers Contribute To The Processing Of K-mers? (mentioning)

confidence: 99%

Heterogeneous Computing to Accelerate the Search of Super K-Mers Based on Minimizers

Vera-Parra

López-Sarmiento

Rojas-Quintero

2020

IJC

The k-mer processing techniques based on partitioning of the data set on the disk using minimizer-type seeds have led to a significant reduction in memory requirements; however, they have added processes (search and distribution of super k-mers) that can be intensive given the large volume of data. This paper presents a massive parallel processing model in order to enable the efficient use of heterogeneous computation to accelerate the search of super k-mers based on seeds (minimizers or signatures). The model includes three main contributions: a new data structure called CISK for representing the super k-mers, their minimizers and two massive parallelization patterns in an indexed and compact way: one for obtaining the canonical m-mers of a set of reads and another for searching for super k-mers based on minimizers. The model was implemented through two OpenCL kernels. The evaluation of the kernels shows favorable results in terms of execution times and memory requirements to use the model for constructing heterogeneous solutions with simultaneous execution (workload distribution), which perform co-processing using the current search methods of super k-mers on the CPU and the methods presented herein on GPU. The model implementation code is available in the repository: https://github.com/BioinfUD/K-mersCL.
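
The merging of contiguous k-mers that share a minimizer, as described in the citation statement and abstract above, can be sketched in a few lines. This is a minimal sequential illustration under simplifying assumptions, not the CISK structure or the OpenCL kernels from the paper: the minimizer is taken as the lexicographically smallest m-mer of each k-mer (no canonical form, no special ordering), and two k-mers are considered to share a minimizer when the minimizer strings are equal. The function names and the example read are invented for this sketch.

```python
from typing import List, Tuple


def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer inside a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(read: str, k: int, m: int) -> List[Tuple[str, str]]:
    """Merge consecutive k-mers with equal minimizers into super k-mers.

    Returns a list of (minimizer, super_kmer) pairs in read order.
    """
    result: List[Tuple[str, str]] = []
    current = None  # minimizer of the run being built
    start = 0       # index of the first k-mer in the current run
    for i in range(len(read) - k + 1):
        mz = minimizer(read[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The run covers k-mers start..i-1, i.e. read[start : i-1+k].
            result.append((current, read[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        # The last run extends to the end of the read.
        result.append((current, read[start:]))
    return result


# Hypothetical read, invented for this example.
pairs = super_kmers("GTACGTTT", k=5, m=2)
```

Here the first three 5-mers (GTACG, TACGT, ACGTT) all have minimizer AC and collapse into the single super k-mer GTACGTT, while the last 5-mer, CGTTT, has minimizer CG and starts a new one. Storing 7 + 5 characters instead of 4 × 5 is the redundancy reduction the statement refers to.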
