The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches

Blanca, Antonio; Harris, Robert S.; Koslicki, David; Medvedev, Paul

doi:10.1101/2021.01.15.426881

Cited by 9 publications

(30 citation statements)

References 42 publications


“…We say that a match between two sequences s_1 and s_2 occurs at positions i and i′ in the two strings, respectively, if the k-mer (strobemer) extracted from position i in s and i′ in t produce the same k-mer (strobemer). Furthermore, we say that this match covers positions [i, i + k] for k-mers, and [i, i + k_1] for strobemers in s. We adapt similar terminology as in (14) and denote a maximal interval of consecutive positions without matches between s and t as an island. To evaluate the ability to preserve matches under different error rates, we compare (i) the number of matches, (ii) the total fraction of covered positions across the strings, and (iii) the distribution of islands.…”

Section: Results (mentioning)

confidence: 99%
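The match, coverage, and island terminology in the statement above can be made concrete with a small sketch. It covers plain k-mers only and rests on several assumptions that are not taken from the cited papers: the two strings have equal length (substitutions only, so a match is tested at the same position in both), a match covers the half-open range [i, i + k), and an island is read as a maximal run of positions not covered by any match. The function names are hypothetical.

```python
def kmer_match_positions(s, t, k):
    """Positions i where the k-mer starting at i is identical in s and t.

    Assumes len(s) == len(t) (substitutions only), so positions align.
    """
    return [i for i in range(len(s) - k + 1) if s[i:i + k] == t[i:i + k]]


def covered_positions(matches, k):
    """Union of positions covered by the matches, each covering [i, i + k)."""
    covered = set()
    for i in matches:
        covered.update(range(i, i + k))
    return covered


def islands(n, covered):
    """Maximal runs of uncovered positions, as half-open (start, end) pairs."""
    runs, start = [], None
    for p in range(n):
        if p not in covered:
            if start is None:
                start = p          # an island opens here
        elif start is not None:
            runs.append((start, p))  # the island closes before p
            start = None
    if start is not None:
        runs.append((start, n))
    return runs


s = "ACGTACGTAC"
t = "ACGTCCGTAC"  # one substitution at position 4
matches = kmer_match_positions(s, t, 3)
cov = covered_positions(matches, 3)
print(matches, len(cov) / len(s), islands(len(s), cov))
```

In this toy input the single substitution removes the three k-mers that overlap position 4, which is the effect the statements on this page describe for k-mers.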

“…The total sequence coverage and match coverage of a string s is calculated as the union of all positions covered under the definitions of sequence coverage and match coverage, respectively. We adopt similar terminology as in (18) and denote a maximal interval of consecutive positions without matches as an island.…”

Section: Results (mentioning)

confidence: 99%

“…While our study provides an experimental evaluation of strobemers under some commonly used values of k and mutation rates, the statistics of strobemers remain to be explored. In (18), the authors derived the mean and variance of islands for k-mers and the number of mutated k-mers under a given mutation rate. If we can derive analytic expressions for strobemers, it may suggest how to optimize parameters of the strobemer protocols under various mutation rates, which will be useful for similarity comparison algorithms.…”

Section: Discussion (mentioning)

confidence: 99%

“…We used Python v3.8 for the experiments. We simulated 1000 strings of length 100,000 nt and computed the runtime to extract k-mers and strobemers under different subsequence sizes (18, 36, 54, 60, 72) and window sizes (1, 10, 20, 30, 40, 50, 100). The time to construct the strobemers is normalized with the time to construct k-mers.…”

Section: R a F T (mentioning)

confidence: 99%

“…As the mutation rate increases, the number of matching k-mers between two sequences quickly reduces. In (18), the distribution of mutated k-mers was studied in detail. The authors provided closed-form expressions for the mean and variance estimates on the number of mutated k-mers under a random mutation model.…”

Section: Introduction (mentioning)

confidence: 99%
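The mean mentioned in the statement above can be illustrated under one explicit assumption: every position is substituted independently with probability r. A k-mer then survives only if all k of its positions survive, so by linearity of expectation the expected number of mutated k-mers among L k-mers is L(1 − (1 − r)^k). The sketch below compares that value with a simulation. The function names are hypothetical, and the variance is left out because it depends on the overlap between neighbouring k-mers.

```python
import random


def expected_mutated_kmers(L, k, r):
    """Expected number of mutated k-mers among L k-mers.

    Assumes each position is substituted independently with probability r.
    """
    return L * (1 - (1 - r) ** k)


def simulate_mutated_kmers(L, k, r, rng):
    """Count k-mers that overlap at least one mutated position in one trial."""
    n = L + k - 1  # a sequence of this length has exactly L k-mers
    mutated = [rng.random() < r for _ in range(n)]
    window = sum(mutated[:k])  # mutated positions inside the first k-mer
    count = 0
    for i in range(L):
        if window > 0:
            count += 1
        if i + k < n:
            # slide the k-mer window one position to the right
            window += mutated[i + k] - mutated[i]
    return count


rng = random.Random(0)
trials = [simulate_mutated_kmers(1000, 21, 0.01, rng) for _ in range(300)]
print(sum(trials) / len(trials), expected_mutated_kmers(1000, 21, 0.01))
```

The simulated average should land close to the closed-form value, while individual trials vary widely because a single mutated position affects up to k overlapping k-mers at once.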


Strobemers: an alternative to k-mers for sequence comparison

Sahlin

2021

Preprint


K-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation will mutate k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step is a pairing or grouping of k-mers performed. Such techniques produce many redundant k-mer matches due to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of linked minimizers. We show that under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions. Strobemers also produce a higher total match coverage across sequences. Strobemers are a useful alternative to k-mers for performing sequence comparisons as commonly used in sequence alignment, clustering, classification, and error-correction. A reference implementation with code for analyses is available at https://github.com/ksahlin/strobemers.


The minimizer Jaccard estimator is biased and inconsistent

et al. 2022

Self Cite


Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. Results: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. Availability and implementation: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. Supplementary information: Supplementary data are available at Bioinformatics online.
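To make the two quantities in this abstract concrete, the sketch below computes the Jaccard similarity of two full k-mer sets and the same ratio computed on their minimizer sketches. It is illustrative only. The k-mer ordering is plain lexicographic order, which is an assumption made here for simplicity, and the function names are hypothetical rather than taken from the authors' scripts.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minimizer_sketch(seq, k, w):
    """Set of smallest k-mers, one per window of w consecutive k-mers.

    Lexicographic order is an assumption of this sketch.
    """
    km = kmers(seq, k)
    return {min(km[i:i + w]) for i in range(len(km) - w + 1)}


a = "ACGTACGTTGCAACGT"
b = "ACGTACCTTGCAACGT"  # one substitution relative to a
true_j = jaccard(kmers(a, 4), kmers(b, 4))
mini_j = jaccard(minimizer_sketch(a, 4, 3), minimizer_sketch(b, 4, 3))
print(true_j, mini_j)
```

The two printed values generally differ. A toy example like this cannot show the bias the paper analyses, since that is a statement about the expected difference, but it shows which two numbers are being compared.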


Theory of local k-mer selection with applications to long-read alignment

Shaw

2021

Preprint


Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the "lowest-ordered" k-mer in a sliding window. Recently, it has been shown that minimizers are a sub-optimal method for selecting subsets of k-mers when mutations are present. There is however a lack of understanding behind the theory of why certain methods perform well. Results: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more optimal k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. Availability and supplementary information: Simulations and supplementary methods available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2.
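The difference between a window-based and a context-free local selection rule can be sketched as follows. The minimizer rule follows the description in the abstract (the lowest-ordered k-mer in each sliding window). The closed-syncmer rule assumes the usual definition, in which a k-mer is selected when its smallest s-mer sits at its first or last s-mer position. Lexicographic ordering and the function names are assumptions of this sketch, not the paper's implementation.

```python
def minimizer_positions(seq, k, w):
    """Start positions of the leftmost smallest k-mer in each window
    of w consecutive k-mers (lexicographic order assumed)."""
    n = len(seq) - k + 1
    selected = set()
    for start in range(n - w + 1):
        window = range(start, start + w)
        # ties are broken by taking the leftmost position
        selected.add(min(window, key=lambda i: (seq[i:i + k], i)))
    return sorted(selected)


def closed_syncmer_positions(seq, k, s):
    """Start positions of k-mers whose smallest s-mer is the first or
    the last s-mer inside the k-mer (assumed closed-syncmer rule)."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        smallest = min(smers)
        if smers[0] == smallest or smers[-1] == smallest:
            selected.append(i)
    return selected


print(minimizer_positions("ACGTACGT", 3, 2))
print(closed_syncmer_positions("ACGTACGT", 4, 2))
```

Whether a k-mer is a minimizer depends on its neighbours in the window, whereas the syncmer test looks only at the k-mer itself. That distinction is one reason such methods behave differently when mutations are present.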
