2019
DOI: 10.1101/527515
Preprint

The number of spaced-word matches between two DNA sequences as a function of the underlying pattern weight

Abstract: We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, …
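
As an illustration of the quantity the abstract studies, the following minimal sketch counts length-k word matches between two sequences for several values of k. It assumes that N_k means the number of position pairs (i, j) at which the two sequences carry identical length-k words; the function name `count_word_matches` and the toy sequences are hypothetical. The function F and the bounds k_min and k_max are not specified in the truncated abstract, so they are not reproduced here.

```python
from collections import Counter


def count_word_matches(seq1: str, seq2: str, k: int) -> int:
    """Count position pairs (i, j) with seq1[i:i+k] == seq2[j:j+k].

    Assumption: this pair count is what the abstract calls N_k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Tally every length-k word in each sequence.
    words1 = Counter(seq1[i:i + k] for i in range(len(seq1) - k + 1))
    words2 = Counter(seq2[j:j + k] for j in range(len(seq2) - k + 1))
    # A word occurring a times in seq1 and b times in seq2 yields a * b matches.
    return sum(count * words2[word] for word, count in words1.items())


# Toy input: two short, related sequences differing at one position.
seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"
match_counts = {k: count_word_matches(seq_a, seq_b, k) for k in (3, 4)}
print(match_counts)
```

On real genomes the counts would be collected over a range of k, which is the input the slope-based estimate described above works from.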

Cited by 6 publications (3 citation statements)
References 51 publications
“…The tool can estimate distances between samples with high accuracy from low-coverage and mixed-coverage genome skims with no prior knowledge of the coverage or the sequencing error. Slope-SpaM [83] estimates the phylogenetic distance between two DNA sequences by calculating the number N_k of k-mer matches for a range of values of k. The distance between the sequences can then be accurately estimated from the slope of a certain function that depends on N_k. Instead of exact word matches, the program can also use SpaMs w.r.t.…”
Section: Multi-spam (mentioning)
confidence: 99%
“…For example, if a match-pattern "11011" is used, then "CTGAC" versus "CTTAC" constitutes a match. Spaced k-mers have been shown to improve mapping sensitivity, the accuracy of phylogenies, and the performance of sequence classification [5,7,13,15]. Analogously, substring-based methods can be relaxed to allow for some mismatches [6].…”
Section: Introduction (mentioning)
confidence: 99%
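
The spaced-word example in the statement above can be checked with a short sketch. It assumes the usual convention that '1' marks a position that must match and '0' a don't-care position; the function name `spaced_match` is hypothetical and not taken from any of the cited tools.

```python
def spaced_match(word1: str, word2: str, pattern: str) -> bool:
    """Return True if word1 and word2 agree at every '1' position of pattern.

    Assumption: '1' = match position, '0' = don't-care position.
    """
    if not (len(word1) == len(word2) == len(pattern)):
        raise ValueError("words and pattern must have the same length")
    # Compare only the positions the pattern marks as match positions.
    return all(a == b for a, b, p in zip(word1, word2, pattern) if p == "1")


# The example from the quoted statement: the middle position is ignored.
print(spaced_match("CTGAC", "CTTAC", "11011"))
# With a contiguous pattern the same pair is no longer a match.
print(spaced_match("CTGAC", "CTTAC", "11111"))
```

The first call reports a match because the only differing position falls on the '0' of the pattern; the second does not, since every position must agree.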
“…In recent years, a large number of alignment-free approaches to phylogeny reconstruction have been developed and applied, since these methods are much faster than traditional, alignment-based phylogenetic methods, see [51,39,3,25] for recent review papers and [50] for a systematic evaluation of alignment-free software tools. Most alignment-free approaches are based on k-mer statistics [21,44,7,48,17], but there are also approaches based on the length of common substrings [47,8,27,37,32,46], on word or spaced-word matches [38,33,35,34,1,41] or on so-called micro-alignments [49,20,29,28]. As has been mentioned by various authors, an additional advantage of many alignment-free methods is that they can be applied not only to complete genome sequences, but also to unassembled reads.…”
Section: Introduction (mentioning)
confidence: 99%