2014
DOI: 10.1093/bioinformatics/btu177

Fast alignment-free sequence comparison using spaced-word frequencies

Abstract: Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neigh…

Cited by 127 publications (140 citation statements)
References 44 publications
“…However, as focussing on selected longer sequence motifs can still be beneficial for classification, we also recorded the frequencies of the 100 most abundant 8-mers in an independent set of bacterial genomes, scanning both strands and allowing for one mismatch. Spaced words were introduced for the alignment of dissimilar sequences [41,42]. Thus, their incorporation is useful in the context of novel species discovery.…”
Section: Methods (mentioning)
confidence: 99%
“…Below is a spaced-word match between two DNA sequences S1 and S2 at (5, 2) with respect to the pattern P = 1100101:

S1: GCTGTATACGTC
S2:    GTACACTTAT
P:      1100101

By definition, nucleotides in S1 and S2 corresponding to a match position of P are identical, while at the don’t-care positions mismatches are possible. Throughout this paper, we use a single pattern P if two sequences are compared, as opposed to the multiple-pattern approach that we previously used (Leimeister et al., 2014).…”
Section: Algorithm (mentioning)
confidence: 99%
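
The spaced-word match quoted above can be checked mechanically. The following is a minimal Python sketch, not code from either paper: the function name `is_spaced_word_match` is hypothetical, and it assumes that the pair (5, 2) gives 1-based start positions in S1 and S2 and that "1" marks a match position and "0" a don't-care position.

```python
def is_spaced_word_match(s1, s2, i, j, pattern):
    """Check for a spaced-word match at 1-based positions (i, j).

    Assumption: '1' in `pattern` is a match position (characters must be
    identical), '0' is a don't-care position (mismatches are allowed).
    """
    w = len(pattern)
    if i < 1 or j < 1:
        return False
    a = s1[i - 1 : i - 1 + w]  # window of s1 covered by the pattern
    b = s2[j - 1 : j - 1 + w]  # window of s2 covered by the pattern
    if len(a) < w or len(b) < w:
        return False  # the pattern would run past the end of a sequence
    # Compare only the characters that sit under a match position.
    return all(x == y for x, y, p in zip(a, b, pattern) if p == "1")


S1 = "GCTGTATACGTC"
S2 = "GTACACTTAT"
P = "1100101"

print(is_spaced_word_match(S1, S2, 5, 2, P))  # True
```

Under these assumptions the windows are TATACGT (S1) and TACACTT (S2); they agree at pattern positions 1, 2, 5 and 7 and differ only at don't-care positions.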
“…a predefined binary pattern of "match positions" and "don't care positions". spaced [84][85][86] is similar to previous methods that compare the k-mer composition of DNA or protein sequences. However, the program uses so-called "spaced words" instead of k-mers.…”
Section: Multi-spam (mentioning)
confidence: 99%
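
To illustrate how counting spaced words differs from counting contiguous k-mers, here is a rough Python sketch. It is not the implementation of the cited program: the function name `spaced_word_counts` is hypothetical, and representing a spaced word by only the characters at its match positions is a simplifying assumption.

```python
from collections import Counter


def spaced_word_counts(seq, pattern):
    """Count spaced-word occurrences in `seq` for a binary `pattern`.

    Assumption: a spaced word is stored as the characters found at the
    match positions ('1') of the pattern; don't-care positions ('0')
    are skipped entirely.
    """
    w = len(pattern)
    keep = [k for k, p in enumerate(pattern) if p == "1"]
    counts = Counter()
    # Slide the pattern over every window of length w.
    for start in range(len(seq) - w + 1):
        window = seq[start : start + w]
        counts["".join(window[k] for k in keep)] += 1
    return counts


print(spaced_word_counts("ACGTACGT", "101"))
```

With a pattern made only of match positions (for example "111"), the same function reduces to ordinary k-mer counting, which is the sense in which the approach is similar to k-mer composition methods.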