Constructing Antidictionaries in Output-Sensitive Space

Ayad, Lorraine A. K.; Badkobeh, Golnaz; Fici, Gabriele; Héliou, Alice; Pissis, Solon P.

doi:10.1109/dcc.2019.00062

Cited by 5 publications

(18 citation statements)

References 24 publications


“…A preliminary version of this paper appeared as [1]. Compared to the preliminary version, we have extended the work by adding a simplified space-efficient version of the algorithm (see Section 4).…”

Section: N=1 | mentioning

confidence: 99%

“…The set of all words over Σ of length at most ℓ is denoted by Σ^{≤ℓ}. We fix a constant-sized alphabet Σ, i.e., |Σ| = O(1). Given a word y = uxv over Σ, we say that u is a prefix of y, x is a factor (or subword) of y, and v is a suffix of y.…”

Section: Preliminaries | mentioning

confidence: 99%

“…) Given a collection Q of weighted ancestor queries on a weighted tree T on n nodes with integer weights up to n^{O(1)}, all the queries in Q can be answered off-line in O(n + |Q|) time.…”

Section: Theorem 3 ([25] | mentioning

confidence: 99%

“…The time-efficient algorithm discussed in Section 5 (with the exception of storing and searching the reduced sets of words explicitly rather than in the constant-space form previously described) has been implemented in the C++ programming language. The correctness of our implementation has been confirmed against that of [7].…”

Section: Proof-of-concept Experiments | mentioning

confidence: 99%


Constructing Antidictionaries of Long Texts in Output-Sensitive Space

et al. 2020

Self Cite


A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1,\ldots,y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ of minimal absent words of length at most $\ell$ of the collection $\{y_1,\ldots,y_k\}$. The set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ contains all the words $x$ such that $x$ is absent from all the words of the collection while there exist $i,j$ such that the maximal proper suffix of $x$ is a factor of $y_i$ and the maximal proper prefix of $x$ is a factor of $y_j$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. Indeed, the set $\mathrm{M}^{\ell}_{y}$ of minimal absent words of a word $y$ is equal to $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ for any decomposition of $y$ into a collection of words $y_1,\ldots,y_k$ such that there is an overlap of length at least $\ell-1$ between any two consecutive words in the collection. This computation generally requires $\Omega(n)$ space for $n=|y|$ using any of the many available $\mathcal{O}(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\|=o(n)$ for all $N\in[1,k]$, where $\|S\|$ denotes the sum of the lengths of words in set $S$. For instance, in the human genome, $n\approx 3\times 10^{9}$ but $\|\mathrm{M}^{12}_{\{y_1,\ldots,y_k\}}\|\approx 10^{6}$. We consider a constant-sized alphabet for stating our results.
We show that all $\mathrm{M}^{\ell}_{y_1},\ldots,\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ can be computed in $\mathcal{O}(kn+\sum_{N=1}^{k}\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\|)$ total time using $\mathcal{O}(\textsc{MaxIn}+\textsc{MaxOut})$ space, where $\textsc{MaxIn}$ is the length of the longest word in $\{y_1,\ldots,y_k\}$ and $\textsc{MaxOut}=\max\{\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\| : N\in[1,k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
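The definition in this abstract can be illustrated with a brute-force sketch. This is not the output-sensitive algorithm of the paper: it simply enumerates every factor of length at most ℓ, so it uses far more space than the bounds stated above. The function names `factors_up_to` and `minimal_absent_words` and the parameter `max_len` (standing for ℓ) are hypothetical and chosen only for this illustration.

```python
def factors_up_to(words, max_len):
    """Return all factors of length <= max_len of any word in `words`.

    The empty word is included so that single letters can be tested too.
    """
    facs = {""}
    for y in words:
        for i in range(len(y)):
            for j in range(i + 1, min(len(y), i + max_len) + 1):
                facs.add(y[i:j])
    return facs


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force set of minimal absent words of length <= max_len.

    A word x is reported when x is absent from every word of the
    collection, while its maximal proper prefix and its maximal proper
    suffix each occur in some word of the collection.
    """
    facs = factors_up_to(words, max_len)
    maws = set()
    for p in facs:                      # p plays the maximal proper prefix
        if len(p) >= max_len:
            continue
        for b in alphabet:
            x = p + b
            # x[1:] is the maximal proper suffix of x
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


# Illustrative input: y = "abaab" over the alphabet {a, b}.
print(sorted(minimal_absent_words(["abaab"], "ab", 5)))
# -> ['aaa', 'aaba', 'bab', 'bb']

# Decomposition with overlap of length l - 1 = 2: "aba" and "baab".
print(minimal_absent_words(["aba", "baab"], "ab", 3)
      == minimal_absent_words(["abaab"], "ab", 3))
# -> True
```

The second call mirrors the decomposition property stated in the abstract: splitting `"abaab"` into `"aba"` and `"baab"`, which overlap in 2 = ℓ − 1 letters, leaves the set of minimal absent words of length at most ℓ = 3 unchanged.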


“…Motivated by these two concepts and the role they play, we study in this paper the set of absent subsequences of a string w, i.e., the set of strings which are not subsequences of w. As such, our investigation is also strongly related to the study of missing factors (or missing words, MAWs) in strings, where the focus is on the set of strings which are not substrings (or factors) of w. The literature on the respective topic ranges from many very practical applications of this concept [5,13,14,20,47,53] to deep theoretical results of combinatorial [10,19,22,23,24,44,43] or algorithmic nature [1,2,5,6,15,17,27]. Absent subsequences are also related to the well-studied notion of patterns avoided by permutations, see for instance [35], with the main difference being that a permutation is essentially a word whose letters are pairwise distinct.…”

Section: Introduction | mentioning

confidence: 99%

Absent Subsequences in Words

Kosche, Koß, Manea et al. 2021

Preprint


An absent factor of a string w is a string u which does not occur as a contiguous substring (a.k.a. factor) inside w. We extend this well-studied notion and define absent subsequences: a string u is an absent subsequence of a string w if u does not occur as a subsequence (a.k.a. scattered factor) inside w. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences whose every proper subsequence is not absent, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence queries for the factors of a given string can be efficiently computed.
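The two notions defined in this abstract can be made concrete with a small sketch. It is only an illustration of the definitions, not the algorithms of the cited paper. The function names are hypothetical. The length computation relies on the standard greedy "arch" argument (split w into consecutive shortest blocks that each contain every letter; a shortest absent subsequence is one letter longer than the number of complete blocks), which is not stated in the abstract and is assumed here, as is the assumption that every letter of w belongs to the given alphabet.

```python
def is_subsequence(u, w):
    """True if u occurs in w as a (scattered) subsequence."""
    it = iter(w)
    # `c in it` advances the iterator past the first match of c
    return all(c in it for c in u)


def is_minimal_absent_subsequence(u, w):
    """True if u is absent from w but every proper subsequence occurs.

    Checking the words obtained by deleting one letter suffices, since
    any shorter subsequence of u is a subsequence of one of those.
    """
    if is_subsequence(u, w):
        return False
    return all(is_subsequence(u[:i] + u[i + 1:], w) for i in range(len(u)))


def shortest_absent_subsequence_length(w, alphabet):
    """Length of a shortest absent subsequence of w over `alphabet`.

    Assumes every letter of w belongs to `alphabet`.
    """
    sigma = set(alphabet)
    seen, arches = set(), 0
    for c in w:
        seen.add(c)
        if seen == sigma:       # a complete arch has been read
            arches += 1
            seen = set()
    return arches + 1


w = "abaab"
print(is_subsequence("bab", w))                      # -> True
print(is_minimal_absent_subsequence("bba", w))       # -> True
print(is_minimal_absent_subsequence("aaaa", w))      # -> True
print(shortest_absent_subsequence_length(w, "ab"))   # -> 3
```

In this example `"bba"` is both a minimal and a shortest absent subsequence of `"abaab"`, while `"aaaa"` is minimal but not shortest, which shows that the two notions differ.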
