DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
This paper studies codes that correct bursts of deletions. Namely, a code will be called a b-burst-deletion-correcting code if it can correct a deletion of any b consecutive bits. While the lower bound on the redundancy of such codes was shown by Levenshtein to be asymptotically log(n) + b − 1, the redundancy of the best code construction by Cheng et al. is b(log(n/b + 1)). In this paper we close this gap and provide codes with redundancy at most log(n) + (b − 1) log(log(n)) + b − log(b). We also derive a non-asymptotic upper bound on the size of b-burst-deletion-correcting codes and extend the burst deletion model to two more cases: 1) a deletion burst of at most b consecutive bits and 2) a deletion burst of size at most b (not necessarily consecutive). We extend our code construction for the first case and study the second case for b = 3, 4. The equivalent models for insertions are also studied and are shown to be equivalent to correcting the corresponding burst of deletions.
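To make the gap between the three redundancy expressions above concrete, the following minimal sketch evaluates them numerically. It assumes all logarithms are base 2 and that n ≥ 2; the function names are hypothetical and chosen only for this illustration, and the Levenshtein expression is an asymptotic bound, so the numbers are indicative rather than exact.

```python
import math


def levenshtein_lower_bound(n: int, b: int) -> float:
    """Asymptotic lower bound on redundancy: log(n) + b - 1 (base-2 logs assumed)."""
    return math.log2(n) + b - 1


def cheng_redundancy(n: int, b: int) -> float:
    """Redundancy of the construction by Cheng et al.: b * log(n/b + 1)."""
    return b * math.log2(n / b + 1)


def new_redundancy_bound(n: int, b: int) -> float:
    """Upper bound stated in the abstract: log(n) + (b - 1) log(log(n)) + b - log(b)."""
    return (
        math.log2(n)
        + (b - 1) * math.log2(math.log2(n))
        + b
        - math.log2(b)
    )


if __name__ == "__main__":
    # Example parameters picked for illustration only.
    n, b = 1024, 4
    print("lower bound :", levenshtein_lower_bound(n, b))
    print("new bound   :", new_redundancy_bound(n, b))
    print("Cheng et al.:", cheng_redundancy(n, b))
```

For n = 1024 and b = 4 the new bound falls between the lower bound and the Cheng et al. redundancy. This ordering is not guaranteed for very small n, since the log(log(n)) term only becomes negligible relative to log(n) as n grows.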
Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of coded trace reconstruction, the design and analysis of high-rate efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called traces) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics, and have no provable robustness or performance guarantees even for an error model with i.i.d. deletions and constant deletion probability. Our work is a first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d. deletions, and perform an analysis of marker-based code constructions. This gives rise to codes with redundancy O(n/log n) (resp. O(n/log log n)) that can be efficiently reconstructed from exp(O(log^{2/3} n)) (resp. exp(O(log log n)^{2/3})) traces, where n is the message length. Then, we give a construction of a code with O(log n) bits of redundancy that can be efficiently reconstructed from poly(n) traces if the deletion probability is small enough. Finally, we show how to combine both approaches, giving rise to an efficient code with O(n/log n) bits of redundancy which can be reconstructed from poly(log n) traces for a small constant deletion probability.
This point of view naturally leads to the problem of coded trace reconstruction: The goal is to design high rate, efficiently encodable codes whose codewords can be efficiently reconstructed with high probability from very few traces with constant deletion probability. Here, "high rate" refers to a rate approaching 1 as the block length increases. We remark that in such a case, the number of traces must grow with the block length of the code.
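The error model described above, in which each symbol of a codeword is deleted independently with a constant probability, can be sketched as follows. This is only an illustrative simulation of how traces arise, not a reconstruction algorithm and not the construction from the paper; the names `deletion_channel`, `sample_traces`, and `is_subsequence` are hypothetical.

```python
import random


def deletion_channel(codeword: str, p: float, rng: random.Random) -> str:
    """Delete each symbol independently with probability p (i.i.d. deletions)."""
    return "".join(ch for ch in codeword if rng.random() >= p)


def sample_traces(codeword: str, p: float, t: int, seed: int = 0) -> list[str]:
    """Produce t independent traces of the same codeword."""
    rng = random.Random(seed)
    return [deletion_channel(codeword, p, rng) for _ in range(t)]


def is_subsequence(trace: str, codeword: str) -> bool:
    """Every trace of a deletion channel is a subsequence of the codeword."""
    it = iter(codeword)
    return all(ch in it for ch in trace)


if __name__ == "__main__":
    codeword = "ACGTTGCAACGT"  # example string chosen for illustration
    for trace in sample_traces(codeword, p=0.2, t=5):
        print(trace, is_subsequence(trace, codeword))
```

Each trace is a subsequence of the transmitted codeword, and the decoder's task is to recover the codeword from a small number of such traces.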
Coded trace reconstruction is also closely related to and motivated by the read process in portable DNA-based data storage systems, which we discuss below.
Motivation. A practical motivation for coded trace reconstruction comes from portable DNA-based data storage systems using DNA nanopores, first introduced in [13]. In DNA-based storage, a block of user-defined data is first encoded over the nucleotide alphabet {A, C, G, T}, and then transformed into moderately long strands of DNA through a DNA synthesis process. For ease of synthesis, the DNA strands are usually encoded to have balanced GC-content, so that the fraction of {A, T} and {G, C} bases is roughly the same. To recover the block of data, the associated strand of DNA is sequenced with nanopores, resulting in multiple corrupted reads of its encoding. Although the errors encountered during nanopore sequencing include both deletions/insertions as well as substitution errors, careful read preprocessing alignment [13] allows the processed reads to be viewed as traces of the data block's encoding. As a result, recovering the data block in question can be cast in the setting of trace reconstruction. Due to sequencing delay constraints, it is of great ...