2017
DOI: 10.1186/s12859-017-1658-0

A greedy alignment-free distance estimator for phylogenetic inference

Abstract: Background: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mis…

Citations: cited by 28 publications (31 citation statements)
References: 29 publications (30 reference statements)
“…For such data, standard two-phase methods that first compute an alignment and then compute a tree do not have acceptable accuracy, while PASTA [32], BAli-Phy [33], and other co-estimation methods are not fast. It is possible that alignment-free methods (see [34,35,36] for an entry into this topic) might provide good starting trees, but these have not been tested on ultra-large datasets (with thousands of species), and have instead mainly been focused on genome-scale analyses of tens of genomes. However, for any large dataset on which the starting trees cannot be reasonably accurately estimated quickly, blended DTM divide-and-conquer strategies may provide the best accuracy.…”
Section: Discussion (mentioning)
confidence: 99%
“…In this study, the results based on 14 dissimilarity measures are evaluated. ALFRED-G [57] uses an efficient algorithm to calculate the length of maximal k-mismatch common substrings between two sequences. Specifically, to measure the degree of dissimilarity between two nucleic acid or protein sequences, the program calculates the length of maximal word pairs (one word from each of the sequences) with up to k mismatches.…”
Section: Alignment-free Tools (mentioning)
confidence: 99%
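
The quantity described in the statement above, the length of a maximal word pair with up to k mismatches, can be illustrated with a brute-force sketch. This is not ALFRED-G's algorithm, which the statement describes only as "efficient"; it is a quadratic-time illustration of the definition, and the function names `kmismatch_extension` and `max_kmismatch_lengths` are hypothetical.

```python
def kmismatch_extension(x, y, i, j, k):
    """Length of the longest common prefix of x[i:] and y[j:]
    when at most k mismatching positions are tolerated."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break  # a further mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def max_kmismatch_lengths(x, y, k):
    """For every start position i in x, the longest k-mismatch match
    against any start position in y (brute force, for illustration only)."""
    return [
        max((kmismatch_extension(x, y, i, j, k) for j in range(len(y))), default=0)
        for i in range(len(x))
    ]


if __name__ == "__main__":
    # Toy sequences chosen for illustration; they differ at one position.
    print(max_kmismatch_lengths("ACGTA", "ACCTA", 0))
    print(max_kmismatch_lengths("ACGTA", "ACCTA", 1))
```

With k = 0 this reduces to the exact longest-match lengths that the average common substring approach averages over; raising k lets a match continue across substitutions, so the per-position lengths can only stay the same or grow.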
“…These approaches are also very efficient, since common substrings can be rapidly found using suffix trees or related data structures [55,27]. As a generalization of this approach, some methods use longest common substrings with a certain number of mismatches [30,54,53,33,3].…”
Section: Introduction (mentioning)
confidence: 99%