Search citation statements
Paper Sections
Citation Types
Year Published
Publication Types
Relationship
Authors
Journals
A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ contains all the words x such that x is absent from all the words of the collection while there exist i,j, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm {M}^{\ell }_{y}$ M y ℓ of minimal absent words of a word y is equal to $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available $\mathcal {O}(n)$ O ( n ) -time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$ ∥ M { y 1 , … , y N } ℓ ∥ = o ( n ) , for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 109 but $\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$ ∥ M { y 1 , … , y k } 12 ∥ ≈ 1 0 6 . We consider a constant-sized alphabet for stating our results. We show that all$\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M y 1 ℓ , … , M { y 1 , … , y k } ℓ can be computed in $\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$ O ( k n + ∑ N = 1 k ∥ M { y 1 , … , y N } ℓ ∥ ) total time using $\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$ O ( MaxIn + MaxOut ) space, where MaxIn is the length of the longest word in {y1, … , yk} and $\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$ MaxOut = max { ∥ M { y 1 , … , y N } ℓ ∥ : N ∈ [ 1 , k ] } . Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ contains all the words x such that x is absent from all the words of the collection while there exist i,j, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm {M}^{\ell }_{y}$ M y ℓ of minimal absent words of a word y is equal to $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M { y 1 , … , y k } ℓ for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available $\mathcal {O}(n)$ O ( n ) -time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$ ∥ M { y 1 , … , y N } ℓ ∥ = o ( n ) , for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 109 but $\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$ ∥ M { y 1 , … , y k } 12 ∥ ≈ 1 0 6 . We consider a constant-sized alphabet for stating our results. We show that all$\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$ M y 1 ℓ , … , M { y 1 , … , y k } ℓ can be computed in $\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$ O ( k n + ∑ N = 1 k ∥ M { y 1 , … , y N } ℓ ∥ ) total time using $\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$ O ( MaxIn + MaxOut ) space, where MaxIn is the length of the longest word in {y1, … , yk} and $\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$ MaxOut = max { ∥ M { y 1 , … , y N } ℓ ∥ : N ∈ [ 1 , k ] } . Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
An absent factor of a string w is a string u which does not occur as a contiguous substring (a.k.a. factor) inside w. We extend this well-studied notion and define absent subsequences: a string u is an absent subsequence of a string w if u does not occur as subsequence (a.k.a. scattered factor) inside w. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences whose every subsequence is not absent, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequencequeries for the factors of a given string can be efficiently computed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.