2020
DOI: 10.1093/gigascience/giaa072

Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

Abstract: Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was av…

Citations: cited by 27 publications (18 citation statements)
References: 47 publications (40 reference statements)
“…DS4: highly repetitive DNA with the human Y-chromosome (HoSaY) and a human mitogenome collection (Mito) (proposed in [90]);…”
Section: Results (mentioning)
confidence: 99%
“…The code is not highly optimized in terms of speed and memory usage in both compression and decompression, but is sufficient as the main goal of this version of implementation is to proof the concept of the OST algorithm and conduct a trial testing and investigation. A set of genomes as listed in Table 2 and 3, which is the same dataset used in 110, were used for testing and comparing the results of OST-DNA with the results of the tools in Table 1.…”
Section: Results (mentioning)
confidence: 99%
“…The DENV nr dataset (Section 2.4) was used for performance comparison analysis between ITERmin, a re-implementation herein of the earlier Khan algorithm (KA) [20, 21], and UNIQmin. Additionally, a literature search showed that several protein compressors have been developed [42, 43], which, however, mainly focus on the reduction of sequence file size for storage purposes. Nonetheless, a comparison was performed against the existing compressors to evaluate the compression ability of UNIQmin.…”
Section: Methods (mentioning)
confidence: 99%
“…An HS dataset was used because it was the only one that allowed both direct and an approximate indirect (with 2019 HS dataset) comparison between the compressors. Direct comparison of UNIQmin was only carried out with Gzip ( , accessed on 3 March 2020), a widely used compressor that balances speed and compression [43], and AC, which typically showcases the best compression [42]. The direct comparison between UNIQmin, Gzip, and AC was made by use of the 2020 HS dataset (HS 2020) and 2018 all reported viral sequence dataset (“All viruses”), retrieved from the NCBI Entrez Protein Database by use of the txid “10239” (as of November 2018), while indirect comparisons were made with other tools using earlier HS datasets (HS 2019) and viral datasets of Acanthamoeba polyphaga (AP 2019) and Enterococcus phage (EP 2019) [42].…”
Section: Methods (mentioning)
confidence: 99%