Coding Over Sets for DNA Storage

Lenz, Andreas; Siegel, Paul H.; Wachter-Zeh, Antonia; Yaakobi, Eitan

doi:10.1109/tit.2019.2961265

Cited by 81 publications

(62 citation statements)

References 39 publications


“…The improved insertion and deletion correction can extend the applicability of the framework to sequencing platforms such as nanopore sequencing [28] which have higher insertion and deletion error rates. Another interesting direction is to incorporate ideas from [18] and [29] to reduce the inefficiency of index error correction.…”

Section: Discussion (mentioning)

confidence: 99%

“…In this section, we consider a simplified model for DNA-based storage to develop a better understanding of the coding theoretic tradeoffs. While several previous works such as [15], [16], [18] theoretically analyze various aspects of the DNA-based storage problem (such as the information-theoretic capacity in the asymptotic setting and the optimality of various techniques to recover the order of the oligonucleotides), our main focus is to understand the tradeoff between the writing and reading cost associated with DNA-based storage and to motivate the scheme described in Section 3.…”

Section: Theoretical Analysis (mentioning)

confidence: 99%


Improved read/write cost tradeoff in DNA-based data storage using LDPC codes

Chandak, Tatwawadi, Lau, et al. 2019

Preprint

With the amount of data being stored increasing rapidly, there is significant interest in exploring alternative storage technologies. In this context, DNA-based storage systems can offer significantly higher storage densities (petabytes/gram) and durability (thousands of years) than current technologies. Specifically, DNA has been found to be stable over extended periods of time which has been demonstrated in the analysis of organisms long since extinct. Recent advances in DNA sequencing and synthesis pipelines have made DNA-based storage a promising candidate for the storage technology of the future.

Recently, there have been multiple efforts in this direction, focusing on aspects such as error correction for synthesis/sequencing errors and erasure correction for handling missing sequences. The typical approach is to use separate codes for handling errors and erasures, but there is limited understanding of the efficiency of this framework. Furthermore, the existing techniques use short block-length codes and heavily rely on read consensus, both of which are known to be suboptimal in coding theory.

In this work, we study the tradeoff between the writing and reading costs involved in DNA-based storage and propose a practical scheme to achieve an improved tradeoff between these quantities. Our scheme breaks with the traditional separation framework and instead uses a single large block-length LDPC code for both erasure and error correction. We also introduce novel techniques to handle insertion and deletion errors introduced by the synthesis process. For a range of writing costs, the proposed scheme achieves 30-40% lower reading costs than state-of-the-art techniques on experimental data obtained using array synthesis and Illumina sequencing.


“…3(c)). Reed-Solomon outer code: We use a Reed-Solomon (RS) code with field size 2^16 as the outer code to recover lost sequences and to correct any errors left undetected by the CRC. The amount of additional RS redundancy can be chosen to tradeoff the writing and reading costs [12], and is set to 30% by default.…”

Section: Methods (mentioning)

confidence: 99%
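
The Reed-Solomon parameters mentioned in the Methods statement above can be illustrated with a little arithmetic. The sketch below is not taken from the cited paper: the function names are hypothetical, and it assumes that "30% redundancy" means parity symbols equal to 30% of the data symbols, which the statement does not specify. It relies only on two standard facts: a conventional RS code over a field of size q has block length at most q - 1, and a code with n - k parity symbols can correct s erasures and e errors whenever s + 2e <= n - k.

```python
def rs_parameters(num_data_symbols, redundancy_percent=30, field_size=2**16):
    """Return (n, k, parity) for an RS code over a field of the given size.

    Assumption: redundancy is measured as parity symbols relative to data
    symbols. The quoted statement does not define the ratio.
    """
    k = num_data_symbols
    # Integer ceiling of k * redundancy_percent / 100, avoiding float rounding.
    parity = (k * redundancy_percent + 99) // 100
    n = k + parity
    if n > field_size - 1:
        raise ValueError("block length exceeds field_size - 1")
    return n, k, parity


def can_decode(parity, erasures, errors):
    """Standard RS decoding condition: erasures + 2 * errors <= parity."""
    return erasures + 2 * errors <= parity
```

Under these assumptions, 1000 data symbols yield a (1300, 1000) code, which tolerates up to 300 lost symbols, or for example 100 lost symbols plus 100 symbol errors.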

“…Recent works have examined various aspects of DNA storage, including error correction [1,5,10,11,12], random access [4,5,13], novel synthesis techniques [14,15] and analysis of the fundamental limits [16,17,18]. While initial works used Illumina sequencing which provides highly accurate short reads, there is growing interest in the use of nanopore sequencing [19] because it is a portable, real-time and low-cost platform that also supports long reads.…”

Section: Introduction (mentioning)

confidence: 99%

Overcoming High Nanopore Basecaller Error Rates for DNA Storage Via Basecaller-Decoder Integration and Convolutional Codes

Chandak, Neu, Tatwawadi, et al. 2019

Preprint

As magnetization and semiconductor based storage technologies approach their limits, bio-molecules, such as DNA, have been identified as promising media for future storage systems, due to their high storage density (petabytes/gram) and long-term durability (thousands of years). Furthermore, nanopore DNA sequencing enables high-throughput sequencing using devices as small as a USB thumb drive and thus is ideally suited for DNA storage applications. Due to the high insertion/deletion error rates associated with basecalled nanopore reads, current approaches rely heavily on consensus among multiple reads and thus incur very high reading costs. We propose a novel approach which overcomes the high error rates in basecalled sequences by integrating a Viterbi error correction decoder with the basecaller, enabling the decoder to exploit the soft information available in the deep learning based basecaller pipeline. Using convolutional codes for error correction, we experimentally observed 3x lower reading costs than the state-of-the-art techniques at comparable writing costs.

The code, data and Supplementary Material are available at https://github.com/shubhamchandak94/nanopore_dna_storage.
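
For readers unfamiliar with the building blocks named in this abstract, the following is a minimal sketch of a convolutional encoder and a Viterbi decoder. It is a textbook hard-decision decoder for the common rate-1/2, constraint-length-3 code with generators (7, 5) in octal, chosen here purely for illustration. It is not the basecaller-integrated, soft-information decoder the abstract describes, and it handles substitution errors only, not insertions or deletions.

```python
G = (0b111, 0b101)  # illustrative generators (7, 5) octal, constraint length 3


def parity(x):
    """Return the XOR of the bits of x."""
    return bin(x).count("1") % 2


def conv_encode(bits):
    """Encode a list of bits; two tail zeros return the encoder to state 0."""
    state = 0  # the two most recent input bits
    out = []
    for b in list(bits) + [0, 0]:
        reg = (b << 2) | state  # newest bit sits in the top position
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1
    return out


def viterbi_decode(received, num_bits):
    """Hard-decision Viterbi decoding with Hamming distance as branch metric."""
    inf = float("inf")
    metrics = [0, inf, inf, inf]  # the encoder starts in state 0
    paths = [[], [], [], []]
    for t in range(0, len(received), 2):
        pair = received[t:t + 2]
        new_metrics = [inf] * 4
        new_paths = [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for b in (0, 1):
                reg = (b << 2) | state
                expected = [parity(reg & g) for g in G]
                nxt = reg >> 1
                m = metrics[state] + sum(e != r for e, r in zip(expected, pair))
                if m < new_metrics[nxt]:
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    # The tail bits force the true path to end in state 0.
    return paths[0][:num_bits]
```

Encoding a message with `conv_encode`, flipping a single coded bit, and passing the result to `viterbi_decode` recovers the original message. A soft-decision variant would replace the Hamming distance with a likelihood-based branch metric.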


“…Related literature: Motivated by DNA-based storage, a few recent works have considered the problem of coding across an unordered set of strings [15][16][17][18]. The setting studied in all these works bears similarities with the one in this paper, but they focus on providing explicit code constructions, as opposed to characterizing the channel capacity, as we do here.…”

Section: Introduction (mentioning)

confidence: 99%

Capacity Results for the Noisy Shuffling Channel

Shomorony, Heckel 2019

2019 IEEE International Symposium on Information Theory (ISIT)

Motivated by DNA-based storage, we study the noisy shuffling channel, which can be seen as the concatenation of a standard noisy channel (such as the BSC) and a shuffling channel, which breaks the data block into small pieces and shuffles them. This channel models a DNA storage system, by capturing two of its key aspects: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the molecules are corrupted by noise at synthesis, sequencing, and during storage. For the BSC-shuffling channel we characterize the capacity exactly (for a large set of parameters), and show that a simple index-based coding scheme is optimal.
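
The channel model and the index-based scheme described in this abstract can be sketched in a few lines. The code below is a toy simulation under stated assumptions, not the construction analyzed in the paper: the function names, the fixed-width binary index, and the piece length are all hypothetical choices, and no error-correcting code is applied, so a bit flip inside an index would misplace a piece.

```python
import random


def encode_with_index(bits, piece_len, index_len):
    """Split a bit string into pieces and prefix each with a binary index."""
    pieces = [bits[i:i + piece_len] for i in range(0, len(bits), piece_len)]
    if len(pieces) > 2 ** index_len:
        raise ValueError("index too short for this many pieces")
    return [format(i, f"0{index_len}b") + p for i, p in enumerate(pieces)]


def noisy_shuffling_channel(strands, flip_prob, rng):
    """Apply a BSC to every bit of every strand, then shuffle the strands."""
    noisy = []
    for s in strands:
        noisy.append("".join(
            ("1" if b == "0" else "0") if rng.random() < flip_prob else b
            for b in s))
    rng.shuffle(noisy)
    return noisy


def decode_by_index(strands, index_len):
    """Sort the received strands by their index prefix and strip the index."""
    ordered = sorted(strands, key=lambda s: int(s[:index_len], 2))
    return "".join(s[index_len:] for s in ordered)
```

With `flip_prob` set to 0, `decode_by_index` recovers the original bit string regardless of the shuffle, which shows how the index restores the order lost in storage. The index costs `index_len` bits per piece, so shorter pieces spend a larger share of each strand on ordering information.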
