2021
DOI: 10.1371/journal.pcbi.1009229

Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression

Abstract: Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pai…

Cited by 8 publications (3 citation statements)
References 27 publications
“…Among the abundant choices, we employ the reordering-based compression tools as they have been proven to achieve superior compression rates. Reordering-based compressors including SPRING ( Chandak et al 2019 ), Minicom ( Liu et al 2019 ), PgRC ( Kowalski and Grabowski 2020 ), Mstcom ( Liu and Li 2021 ), and CURC ( Xie et al 2022 ) adopt various strategies to identify overlaps between reads and approximately rearrange them based on their derived positions in the genome. Usually, the original order of reads is discarded because maintaining it significantly degrades the compression ratio and is not necessary for downstream analysis.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The development of short reads and long reads sequencing technologies has greatly reduced the cost of obtaining genomics sequencing data, the price dropping from $5292.390/MB in 2002 to $0.006/MB in 2022 ( Wetterstrand 2023 ). This has propelled rapid advancements in virus tracing, precision diagnosis treatment, and new drug development ( Hernaez et al 2019 , Kredens et al 2020 , Liu and Li 2021 , Sun et al 2023b ). As a result, the growth rate of sequencing data surpasses the Moore’s law ( Schaller 1997 , Hernaez et al 2019 ).…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Such a huge amount of genomic data has posed great challenges to genomic data centers and genomic research institutions, such as in data storage, backup, migration, sharing, etc. [ 3 ]. The compression of genomic data naturally becomes the best choice to resolve the challenge.…”
Section: Introduction (citation type: mentioning)
confidence: 99%