Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy

Chandak, Shubham; Tatwawadi, Kedar; Sridhar, Srivatsan; Weissman, Tsachy

doi:10.1101/2020.04.19.049262

Cited by 2 publications

(3 citation statements)

References 34 publications

(52 reference statements)

“…Besides improvements in accuracy, improvements in data handling will become central as the raw data are multifold larger than those obtained from short read data. Methods that improve compression of fast5 files and more space-efficient alternative file types for storing raw nanopore data are currently being developed [78,79] xi , and graphics processing unit acceleration is used routinely [80]. However, further improvements to reduce file sizes, standardizing file formats, and compute and memory-efficient algorithms will greatly reduce the barrier for larger-scale applications and adaptation.…”

Section: Discussion (mentioning)

confidence: 99%

Beyond sequencing: machine learning algorithms extract biology hidden in Nanopore signal data

et al. 2022

“…Given the high sequencing depth, there is much redundancy to be exploited in the reads, and several specialized compressors like SPRING (Chandak et al, 2019) and PgRC (Kowalski and Grabowski, 2019) have been developed for this data. The typical approach used by these compressors is to efficiently build an…”

Section: Introduction (mentioning)

confidence: 99%

“…On the other hand, nanopore reads are much longer (often over hundreds of thousands of bases long), and have a much higher error rate, including substitution, insertion, and deletion errors from the basecalling process that converts the raw current signal to the read sequences (Wick et al., 2019). However, the error rate has fallen dramatically in the recent years with the advent of deep learning based basecallers which achieve median error rate close to 5% or better (Chandak et al., 2020), suggesting that a similar approximate assembly approach with some adaptations can be applied to nanopore sequencing reads.…”

Section: Introduction (mentioning)

confidence: 99%

NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

Meng, Chandak, Zhu, et al. 2021

Preprint

Self Cite

Motivation: The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files. Previous work ENANO focuses mostly on quality score compression and does not achieve significant gains for the compression of read sequences over general-purpose compressors. RENANO achieves significantly better compression for read sequences but is limited to aligned data with a reference available.

Results: We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring achieves close to 3x improvement in compression over state-of-the-art reference-free compressors. The computational requirements of NanoSpring are practical, although it uses more time and memory during compression than previous tools to achieve the compression gains.

Availability: NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
