Alignment-free sequence comparison using absent words

Charalampopoulos, Panagiotis; Fici, Gabriele; Mercaş, Robert; Pissis, Solon P.

doi:10.1016/j.ic.2018.06.002

Cited by 27 publications

(22 citation statements)

References 37 publications

(57 reference statements)


“…A tight upper bound on the number of MAWs of a word y of length n over an alphabet of size σ is known to be O(σ n) [13,22,7]. It was also shown that the set of all MAWs of y is sufficient to uniquely reconstruct y [13,15].…”

Section: Introduction (mentioning)

confidence: 99%

“…This problem can be viewed as a variant of the classic approximate pattern-matching problem in which the distance of the pattern of length m to a factor of length m of the text is the LWI distance. Note that LWI verifies metric conditions [7]. The problem of approximate pattern matching admits many different formulations and has been the subject of many works (see [18,11,24]).…”

Section: Introduction (mentioning)

confidence: 99%


Absent words in a sliding window with applications

Crochemore, Héliou, Kucherov et al. 2020

Information and Computation

Self Cite


An absent word of a word y is a word that does not occur in y. It is then called minimal if all its proper factors occur in y. In fact, minimal absent words (MAWs) provide useful information about y and thus have several applications. In this paper, we propose an algorithm that maintains the set of MAWs of a fixed-length window sliding over y online. Our algorithm represents MAWs through nodes of the suffix tree. Specifically, the suffix tree of the sliding window is maintained using modified Senft's algorithm (Senft, 2005), itself generalizing Ukkonen's online algorithm (Ukkonen, 1995). We then apply this algorithm to the approximate pattern-matching problem under the Length Weighted Index distance (Chairungsee and Crochemore, 2012). This results in an online O(σ |y|)-time algorithm for finding approximate occurrences of a word x in y, |x| ≤ |y|, where σ is the alphabet size.
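The problem statement in this abstract can be made concrete with a naive baseline. The sketch below is not the suffix-tree algorithm of the paper: it recomputes the MAWs of every window from scratch, so it is far slower than the O(σ |y|) bound quoted above. It assumes the Length Weighted Index is the sum of 1/|w|² over the words w in the symmetric difference of the two MAW sets, which is how the definition of Chairungsee and Crochemore (2012) is usually stated; the function names `maws` and `naive_lwi_scan` are hypothetical.

```python
def maws(y, alphabet):
    """Brute-force set of minimal absent words of y over the given alphabet."""
    # All factors (substrings) of y, including the empty word.
    occ = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    result = set()
    for u in occ:                  # u is the maximal proper prefix of the candidate
        for a in alphabet:
            w = u + a
            # w is minimal absent if it is absent and its maximal proper suffix occurs.
            if w not in occ and w[1:] in occ:
                result.add(w)
    return result


def naive_lwi_scan(x, y, alphabet=None):
    """LWI distance (assumed definition) between x and every length-|x| window of y."""
    if alphabet is None:
        alphabet = sorted(set(x) | set(y))
    m = len(x)
    maws_x = maws(x, alphabet)
    distances = []
    for i in range(len(y) - m + 1):
        # Naive step: the window's MAWs are recomputed instead of maintained online.
        diff = maws_x ^ maws(y[i:i + m], alphabet)
        distances.append(sum(1.0 / len(w) ** 2 for w in diff))
    return distances
```

A window would then be reported as an approximate occurrence of x when its distance falls below a chosen threshold; the threshold is left to the caller here.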


“…The set of all minimal absent words of length at most ℓ of a word y is denoted by $\mathrm{M}^{\ell}_{y}$. For example, if y = abaab, then $\mathrm{M}_{y}$ = {aaa, aaba, bab, bb} and $\mathrm{M}^{3}_{y}$ = {aaa, bab, bb}. The upper bound on the number of minimal absent words is O(σn) [2], where σ is the size of the alphabet and n is the length of y, and this bound is tight for integer alphabets [3]; in fact, for large alphabets, such as when σ ≥ √n, this bound is tight even for minimal absent words having the same length [4,5].…”

Section: Introduction (mentioning)

confidence: 99%
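The example in the statement above can be checked with a short brute-force sketch. It enumerates every factor u of y and every letter a, and keeps u + a when that word is absent while its maximal proper suffix still occurs. This is only an illustration of the definition, not one of the linear-time algorithms the cited papers refer to; the names `factors` and `minimal_absent_words` are hypothetical, and the alphabet defaults to the letters of y.

```python
def factors(y):
    """Return all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet=None, max_len=None):
    """Brute-force MAWs of y; optionally keep only those of length <= max_len."""
    if alphabet is None:
        alphabet = sorted(set(y))   # assumption: the alphabet is the letters of y
    occ = factors(y)
    result = set()
    for u in occ:                   # maximal proper prefix occurs by construction
        for a in alphabet:
            w = u + a
            if w in occ:
                continue            # w occurs, so it is not absent
            if w[1:] not in occ:
                continue            # maximal proper suffix is absent, so w is not minimal
            if max_len is None or len(w) <= max_len:
                result.add(w)
    return result


print(sorted(minimal_absent_words("abaab")))             # ['aaa', 'aaba', 'bab', 'bb']
print(sorted(minimal_absent_words("abaab", max_len=3)))  # ['aaa', 'bab', 'bb']
```

Checking only the two maximal proper factors is enough, because every other proper factor of w is a factor of one of them.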

“…There also exist space-efficient data structures based on the Burrows-Wheeler transform of y that can be applied for this computation [10,11]. In many real-world applications of minimal absent words, such as in data compression [12][13][14][15], in sequence comparison [3,9], in on-line pattern matching [16], or in identifying pathogen-specific signatures [17], only a subset of minimal absent words may be considered, and, in particular, the minimal absent words of length (at most) ℓ. Since, in the worst case, the number of minimal absent words of y is Θ(σn), Ω(σn) space is required to represent them explicitly.…”

Section: Introduction (mentioning)

confidence: 99%

Constructing Antidictionaries of Long Texts in Output-Sensitive Space

et al. 2020

Self Cite


A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ contains all the words x such that x is absent from all the words of the collection while there exist i, j such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm{M}^{\ell}_{y}$ of minimal absent words of a word y is equal to $\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available $\mathcal{O}(n)$-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\| = o(n)$, for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|\mathrm{M}^{12}_{\{y_1,\ldots,y_k\}}\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_{1}},\ldots,\mathrm{M}^{\ell}_{\{y_1,\ldots,y_k\}}$ can be computed in $\mathcal{O}(kn+\sum^{k}_{N=1}\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\|)$ total time using $\mathcal{O}(\textsc{MaxIn}+\textsc{MaxOut})$ space, where MaxIn is the length of the longest word in {y1, … , yk} and $\textsc{MaxOut}=\max\{\|\mathrm{M}^{\ell}_{\{y_1,\ldots,y_N\}}\| : N\in[1,k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
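The decomposition property stated in this abstract can be illustrated with a small brute-force check. The sketch below follows the definition given above for a collection (absent from every word, with the maximal proper prefix and suffix each occurring in some word) and compares the result for y with the result for overlapping chunks of y. It does not reproduce the output-sensitive algorithm of the paper; `maws_of_collection` and `overlapping_chunks` are hypothetical names, and the chunk length is an arbitrary choice.

```python
def factors_up_to(y, k):
    """All factors of y of length at most k, including the empty word."""
    n = len(y)
    return {y[i:j] for i in range(n + 1) for j in range(i, min(n, i + k) + 1)}


def maws_of_collection(words, ell, alphabet):
    """Brute-force minimal absent words of length <= ell of a collection."""
    occ = set()
    for y in words:
        occ |= factors_up_to(y, ell)
    result = set()
    for u in occ:                      # maximal proper prefix occurs in some word
        if len(u) >= ell:
            continue                   # u + a would be longer than ell
        for a in alphabet:
            w = u + a
            # absent from every word, and maximal proper suffix occurs in some word
            if w not in occ and w[1:] in occ:
                result.add(w)
    return result


def overlapping_chunks(y, chunk_len, ell):
    """Split y into chunks whose consecutive members overlap by ell - 1 letters."""
    if chunk_len < ell:
        raise ValueError("chunk_len must be at least ell")
    step = chunk_len - (ell - 1)
    chunks, start = [], 0
    while True:
        chunks.append(y[start:start + chunk_len])
        if start + chunk_len >= len(y):
            return chunks
        start += step


y, ell = "abaabbabaab", 3
chunks = overlapping_chunks(y, chunk_len=5, ell=ell)
same = maws_of_collection([y], ell, "ab") == maws_of_collection(chunks, ell, "ab")
print(chunks, same)
```

The equality holds because an overlap of ℓ − 1 letters guarantees that every factor of y of length at most ℓ lies entirely inside some chunk, so both computations see the same set of short factors.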


“…For example, if y = abaab, then $\mathrm{M}_{y}$ = {aaa, aaba, bab, bb} and $\mathrm{M}^{3}_{y}$ = {aaa, bab, bb}. The upper bound on the number of minimal absent words is O(σn) [10], where σ is the size of the alphabet and n is the length of y, and this is tight for integer alphabets [6]; in fact, for large alphabets, such as when σ ≥ √n, this bound is also tight even for minimal absent words having the same length [1].…”

Section: Introduction (mentioning)

confidence: 99%

Constructing Antidictionaries in Output-Sensitive Space

Ayad, Badkobeh, Fici et al. 2019

2019 Data Compression Conference (DCC)

Self Cite


A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, y2, … , yk over an alphabet Σ, we are asked to compute the set $\mathrm{M}^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most ℓ of word y = y1#y2# … #yk, # ∉ Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n = |y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\| = o(n)$, for all N ∈ [1, k]. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|\mathrm{M}^{12}_{y_1\#\cdots\#y_k}\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_1},\ldots,\mathrm{M}^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $\mathcal{O}(kn+\sum^{k}_{N=1}\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\|)$ total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y1, … , yk} and $\textsc{MaxOut}=\max\{\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\| : N\in[1,k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
