2021
DOI: 10.1101/2021.01.15.426881
Preprint

The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches

Abstract: K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g. a genome or a read) undergoes a simple mutation process whereby each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the n…
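The mutation model in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes that a k-mer counts as mutated when at least one of its k positions is mutated, which under the no-spurious-match assumption gives each k-mer a mutation probability of 1 - (1 - r)^k and an expected count of L(1 - (1 - r)^k) over L k-mers. The function names and parameters are hypothetical.

```python
import random


def expected_mutated_kmers(L, k, r):
    """Expected number of mutated k-mers among L k-mers.

    Assumes a k-mer is mutated iff at least one of its k positions is
    mutated, each independently with probability r.
    """
    return L * (1.0 - (1.0 - r) ** k)


def simulate_mutated_kmers(L, k, r, rng):
    """Simulate one run of the mutation process and count mutated k-mers.

    A sequence with L k-mers has L + k - 1 nucleotides. Only the mutation
    indicator per position is tracked, not the nucleotides themselves.
    """
    n = L + k - 1
    mutated = [rng.random() < r for _ in range(n)]
    # Sliding window: number of mutated positions inside the current k-mer.
    in_window = sum(mutated[:k])
    count = 1 if in_window > 0 else 0
    for i in range(1, L):
        in_window += mutated[i + k - 1] - mutated[i - 1]
        if in_window > 0:
            count += 1
    return count


def mean_simulated(L, k, r, trials, seed=0):
    """Average the simulated count over several independent trials."""
    rng = random.Random(seed)
    total = sum(simulate_mutated_kmers(L, k, r, rng) for _ in range(trials))
    return total / trials
```

Comparing `mean_simulated(1000, 21, 0.01, 200)` with `expected_mutated_kmers(1000, 21, 0.01)` gives a quick sanity check of the expectation. Neighbouring k-mers overlap and are therefore dependent, so the variance of the count is larger than a binomial model would suggest.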

citations
Cited by 9 publications
(30 citation statements)
references
References 42 publications
“…We say that a match between two sequences s 1 and s 2 occurs at position i and i′ in the two strings respectively, if the k-mer (strobemer) extracted from position i in s and i′ in t produce the same k-mer (strobemer). Furthermore, we say that this match covers positions [i, i + k] for k-mers, and [i, i + k 1 ] for strobemers in s. We adapt similar terminology as in (14) and denote a maximal interval of consecutive positions without matches between s and t as an island. To evaluate the ability to preserve matches under different error rates, we compare (i) the number of matches, (ii) the total fraction of covered positions across the strings, and (iii) the distribution of islands.…”
Section: Results (mentioning)
confidence: 99%
“…The total sequence coverage and match coverage of a string s is calculated as the union of all positions covered under the definitions of sequence coverage and match coverage, respectively. We adopt similar terminology as in (18) and denote a maximal interval of consecutive positions without matches as an island.…”
Section: Results (mentioning)
confidence: 99%
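
The coverage and island definitions quoted in these statements can be sketched in a few lines. This is an illustrative reading, not code from the citing papers: it assumes 0-based positions, that a match starting at position i covers the half-open interval [i, i + k), and that an island is a maximal run of uncovered positions. The function names are hypothetical.

```python
def covered_mask(n, match_starts, k):
    """Mark which of the n positions are covered by at least one match.

    Assumption: a match starting at i covers positions i .. i + k - 1,
    clipped to the end of the string.
    """
    covered = [False] * n
    for i in match_starts:
        for p in range(i, min(i + k, n)):
            covered[p] = True
    return covered


def islands(covered):
    """Return maximal uncovered intervals as half-open (start, end) pairs."""
    result = []
    start = None
    for p, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = p  # an island begins here
        elif is_covered and start is not None:
            result.append((start, p))  # the island ends just before p
            start = None
    if start is not None:
        result.append((start, len(covered)))
    return result


def covered_fraction(covered):
    """Fraction of positions covered by the union of all matches."""
    return sum(covered) / len(covered) if covered else 0.0
```

For a string of length 10 with matches starting at positions 0 and 6 and k = 3, positions 0 to 2 and 6 to 8 are covered, leaving islands at (3, 6) and (9, 10).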