A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size q, read length ℓ, and word length n. Consequently, we demonstrate that for q ≥ 2 and n ≤ q^(ℓ/2−1), the number of profile vectors is at least q^(κn) with κ very close to 1. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors.
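To make the counted quantity concrete, the following is a minimal brute-force sketch. It assumes the usual ℓ-gram definition of a profile vector, which the abstract itself does not spell out: the vector records, for every word of length ℓ over the alphabet, how many times it occurs as a contiguous substring of the given word. The function names `profile_vector` and `count_profile_vectors` are illustrative only, and exhaustive enumeration is feasible only for very small q, ℓ, and n; it is not one of the enumeration or encoding methods referred to above.

```python
from collections import Counter
from itertools import product


def profile_vector(word, q, ell):
    """Return the ell-gram profile of `word` (a tuple of ints in range(q)).

    Assumed definition: entry g counts the occurrences of the ell-gram g
    as a contiguous substring of `word`; entries are listed in the
    lexicographic order of the ell-grams.
    """
    counts = Counter(
        tuple(word[i:i + ell]) for i in range(len(word) - ell + 1)
    )
    return tuple(counts[g] for g in product(range(q), repeat=ell))


def count_profile_vectors(q, ell, n):
    """Count distinct profile vectors over all q**n words of length n."""
    return len(
        {profile_vector(w, q, ell) for w in product(range(q), repeat=n)}
    )
```

For q = 2, ℓ = 2, and n = 3, the words 010 and 101 share the same profile, so the eight binary words give only seven distinct profile vectors.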
Tandem duplication in DNA is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2016) proposed the study of codes that correct tandem duplications to improve the reliability of data storage. We investigate algorithms associated with the study of these codes.
Two words are said to be k-confusable if there exist two sequences of tandem duplications of lengths at most k such that the resulting words are equal. We demonstrate that the problem of deciding whether two words are k-confusable is linear-time solvable through a characterisation that can be checked efficiently for k = 3. Combining with previous results, the decision problem is linear-time solvable for k ≤ 3. We conjecture that this problem is undecidable for k > 3.
Using insights gained from the algorithm, we study the size of tandem-duplication codes. We improve the previously known upper bound and then construct codes that are larger than those from previous constructions. We determine the sizes of optimal tandem-duplication codes for lengths up to twenty, develop recursive methods to construct tandem-duplication codes for all word lengths, and compute explicit lower bounds for the size of optimal tandem-duplication codes for lengths from 21 to 30.
arXiv:1707.03956v2 [math.CO] 17 Nov 2017
[...] ⇒*_k y. Therefore, to determine if a set of words is a tandem-duplication code, we need to verify that all pairs of distinct words are not confusable. Hence, we state our problem of interest.
CONFUSABILITY PROBLEM
Instance: Two words x and y over Σ_q, and an integer k.
Question: Are x and y k-confusable?
While the confusability problem is a natural question, efficient algorithms are only known for the case where k ∈ {1, 2}. We review these results in the next subsection.
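As a point of reference for the problem statement, here is a naive bounded search, not one of the efficient algorithms discussed in the text. It takes k-confusability to mean that x and y have a common descendant under tandem duplications of length at most k. The function names and the `max_len` cut-off are assumptions introduced for illustration. Because duplications only lengthen a word, the search is exact for descendants up to `max_len`, but a negative answer shows only that no common descendant of that length or shorter exists.

```python
def duplications(word, k):
    """All words obtained from `word` by one tandem duplication of length <= k."""
    out = set()
    for length in range(1, k + 1):
        for i in range(len(word) - length + 1):
            segment = word[i:i + length]
            # Insert a copy of the segment immediately after itself.
            out.add(word[:i + length] + segment + word[i + length:])
    return out


def descendants(word, k, max_len):
    """Words reachable from `word` (itself included) with length <= max_len."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for nxt in duplications(current, k):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def confusable_within(x, y, k, max_len):
    """True if x and y share a descendant of length at most max_len."""
    return not descendants(x, k, max_len).isdisjoint(descendants(y, k, max_len))
```

For example, `confusable_within("001", "011", 1, 4)` finds the common descendant 0011, while the same call with `max_len=3` returns `False` because the cut-off is too small. The search space grows quickly with `max_len`, which is why a characterisation that can be checked in linear time matters.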