Clustering-Correcting Codes

Shinkar, Tal; Yaakobi, Eitan; Lenz, Andreas; Wachter-Zeh, Antonia

doi:10.1109/isit.2019.8849737

Cited by 16 publications

(10 citation statements)

References 24 publications

“…The problem of clustering output strings was studied from a coding-theoretic standpoint in [15]. It was shown that "code-aware" clustering, i.e., a clustering algorithm that exploits…”

Section: A Related Literature (mentioning)

confidence: 99%

Achieving the Capacity of a DNA Storage Channel with Linear Coding Schemes

Levick, Heckel, Shomorony

2021

Preprint

Due to the redundant nature of DNA synthesis and sequencing technologies, a basic model for a DNA storage system is a multi-draw "shuffling-sampling" channel. In this model, a random number of noisy copies of each sequence is observed at the channel output. Recent works have characterized the capacity of such a DNA storage channel under different noise and sequencing models, relying on sophisticated typicality-based approaches for the achievability. Here, we consider a multi-draw DNA storage channel in the setting of noise corruption by a binary erasure channel. We show that, in this setting, the capacity is achieved by linear coding schemes. This leads to a considerably simpler derivation of the capacity expression of a multi-draw DNA storage channel than existing results in the literature.
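The channel model in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that reads are drawn uniformly with replacement from the stored sequences (so the number of copies of each sequence is random) and that each bit is then erased independently. The sampling distribution, the erasure symbol `?`, and the names `shuffling_sampling_bec` and `is_compatible` are assumptions, not taken from the paper.

```python
import random

ERASURE = "?"  # hypothetical erasure symbol, chosen for this sketch


def shuffling_sampling_bec(sequences, num_draws, erasure_prob, rng):
    """Return `num_draws` noisy reads: each read is a uniformly drawn
    stored sequence whose bits are erased independently."""
    outputs = []
    for _ in range(num_draws):
        seq = rng.choice(sequences)  # the receiver does not learn which one
        noisy = "".join(
            ERASURE if rng.random() < erasure_prob else bit for bit in seq
        )
        outputs.append(noisy)
    return outputs


def is_compatible(read, seq):
    """True if `read` could be an erased copy of `seq`."""
    return len(read) == len(seq) and all(
        r == ERASURE or r == s for r, s in zip(read, seq)
    )


rng = random.Random(0)
stored = ["0000", "0110", "1011"]
reads = shuffling_sampling_bec(stored, num_draws=6, erasure_prob=0.25, rng=rng)
```

Every read is compatible with at least one stored sequence, but the receiver does not know which one, and that ambiguity is the loss of ordering the channel model captures.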

“…is w-subsequence-unique and simultaneously satisfies Properties 23 and 39 is found. Such a string z is guaranteed to exist because all such properties hold for x′ + g(U_t) with probability 1 − o(1) (see Lemmas 22, 34, and 38). Moreover, whether x′ + g(z) satisfies all three properties can be checked in time poly(m);…”

Section: Using the Code Within A Marker-based Construction (mentioning)

confidence: 99%

“…An information-theoretic treatment of related but abstracted models of DNA-based data storage may be found in [32,33]. Very recently, a model for clustering sequencing outputs according to the relevant DNA strand and codes that allow for correct clustering have been studied in [34].…”

Section: Introduction (mentioning)

confidence: 99%

Coded Trace Reconstruction

Cheraghchi, Ribeiro, Gabrys, et al. 2019

2019 IEEE Information Theory Workshop (ITW)

Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of coded trace reconstruction, the design and analysis of high-rate efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called traces) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics, and have no provable robustness or performance guarantees even for an error model with i.i.d. deletions and constant deletion probability. Our work is a first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d. deletions, and perform an analysis of marker-based code-constructions. This gives rise to codes with redundancy O(n/log n) (resp. O(n/log log n)) that can be efficiently reconstructed from exp(O(log^{2/3} n)) (resp. exp(O(log log n)^{2/3})) traces, where n is the message length. Then, we give a construction of a code with O(log n) bits of redundancy that can be efficiently reconstructed from poly(n) traces if the deletion probability is small enough. Finally, we show how to combine both approaches, giving rise to an efficient code with O(n/log n) bits of redundancy which can be reconstructed from poly(log n) traces for a small constant deletion probability.

This point of view naturally leads to the problem of coded trace reconstruction: The goal is to design high rate, efficiently encodable codes whose codewords can be efficiently reconstructed with high probability from very few traces with constant deletion probability. Here, "high rate" refers to a rate approaching 1 as the block length increases. We remark that in such a case, the number of traces must grow with the block length of the code.

Coded trace reconstruction is also closely related to and motivated by the read process in portable DNA-based data storage systems, which we discuss below.

Motivation. A practical motivation for coded trace reconstruction comes from portable DNA-based data storage systems using DNA nanopores, first introduced in [13]. In DNA-based storage, a block of user-defined data is first encoded over the nucleotide alphabet {A, C, G, T}, and then transformed into moderately long strands of DNA through a DNA synthesis process. For ease of synthesis, the DNA strands are usually encoded to have balanced GC-content, so that the fraction of {A, T} and {G, C} bases is roughly the same. To recover the block of data, the associated strand of DNA is sequenced with nanopores, resulting in multiple corrupted reads of its encoding. Although the errors encountered during nanopore sequencing include both deletions/insertions as well as substitution errors, careful read preprocessing alignment [13] allows the processed reads to be viewed as traces of the data block's encoding. As a result, recovering the data block in question can be cast in the setting of trace reconstruction. Due to sequencing delay constraints, it is of great ...
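The notion of a trace under i.i.d. deletions can be made concrete with a small sketch. It only simulates the channel and checks that each trace is a subsequence of the codeword; it does not implement any of the paper's code constructions or reconstruction algorithms. The function names and the example strand are hypothetical.

```python
import random


def deletion_trace(codeword, deletion_prob, rng):
    """Pass `codeword` through an i.i.d. deletion channel: every symbol
    is dropped independently with probability `deletion_prob`."""
    return "".join(sym for sym in codeword if rng.random() >= deletion_prob)


def is_subsequence(trace, codeword):
    """True if `trace` can be obtained from `codeword` by deletions only."""
    remaining = iter(codeword)
    # `sym in remaining` advances the iterator past the first match.
    return all(sym in remaining for sym in trace)


rng = random.Random(1)
codeword = "ACGTTGCAAC"  # illustrative strand over {A, C, G, T}
traces = [deletion_trace(codeword, 0.2, rng) for _ in range(5)]
```

A decoder for coded trace reconstruction would receive only `traces` and would have to recover `codeword` from them with high probability.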

“…The redundancy required to force such a constraint on a collection of vectors will be calculated later. For the case of t = 0, the set A(l, 0, ε_1, ε_2) is called clustering-correcting code, and explicit constructions which require only one bit of redundancy and can be encoded and decoded efficiently can be found in [14]. The anchoring property will be used to reconstruct the ordering of the sequences.…”

Section: Construction (mentioning)

confidence: 99%

Anchor-Based Correction of Substitutions in Indexed Sets

Lenz, Siegel, Wachter-Zeh, et al. 2019

2019 IEEE International Symposium on Information Theory (ISIT)

Self Cite

Motivated by DNA-based data storage, we investigate a system where digital information is stored in an unordered set of several vectors over a finite alphabet. Each vector begins with a unique index that represents its position in the whole data set and does not contain data. This paper deals with the design of error-correcting codes for such indexed sets in the presence of substitution errors. We propose a construction that efficiently deals with the challenges that arise when designing codes for unordered sets. Using a novel mechanism, called anchoring, we show that it is possible to combat the ordering loss of sequences with only a small amount of redundancy, which allows the use of standard coding techniques, such as tensor-product codes, to correct errors within the sequences. We finally derive upper and lower bounds on the achievable redundancy of codes within the considered channel model and verify that our construction yields a redundancy that is close to the best achievable one. Our results surprisingly indicate that it requires less redundancy to correct errors in the indices than in the data part of vectors.
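The indexed-set model can be sketched for the error-free case as follows. This is only an illustration of why a unique index prefix restores the order of an unordered set; it does not implement anchoring or any error correction, and the binary alphabet, the fixed index width, and the function names are assumptions made for the example.

```python
import random


def to_indexed_set(data_blocks, index_bits):
    """Prefix each data block with its position, written as a
    fixed-width binary index."""
    return [
        format(i, f"0{index_bits}b") + block
        for i, block in enumerate(data_blocks)
    ]


def restore_order(vectors, index_bits):
    """Recover the data blocks in their original order by sorting on the
    index prefix (valid only when the indices are error-free)."""
    ordered = sorted(vectors, key=lambda v: int(v[:index_bits], 2))
    return [v[index_bits:] for v in ordered]


blocks = ["1100", "0011", "1010", "0101"]
stored = to_indexed_set(blocks, index_bits=2)
random.Random(2).shuffle(stored)  # storage loses the ordering
recovered = restore_order(stored, index_bits=2)
```

A single substitution in an index prefix would already break `restore_order`, which is the failure mode the abstract's anchoring mechanism is designed to handle.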
