2020
DOI: 10.1093/sysbio/syaa058

Phylogeny Estimation Given Sequence Length Heterogeneity

Abstract: Phylogeny estimation is a major step in many biological studies, and has many well-known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given large datasets with a combination of full-length sequences and fragmentary sequences, which can arise for a variety of reasons, including sample collection, sequencing technologies, and analytical pipelines. We c…

Citations: cited by 34 publications (37 citation statements)
References: 46 publications (54 reference statements)

“…Percent gapped is the percentage of the alignment matrix occupied by dashes. 1000M1 has replicate 16 removed due to being identified as an outlier in Smirnov and Warnow (2020b). Statistics regarding fragments are computed after introducing fragments to the datasets, while the rest are computed prior to fragmentation.…”
Section: Experimental Study Design (mentioning)
confidence: 99%
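
The "percent gapped" statistic quoted above can be illustrated with a minimal sketch. It assumes the alignment is given as a list of equal-length strings in which a dash (`-`) marks a gap; the function name `percent_gapped` and this input encoding are illustrative assumptions, not taken from the cited study.

```python
from typing import Sequence


def percent_gapped(alignment: Sequence[str]) -> float:
    """Return the percentage of alignment cells that are dashes.

    `alignment` is a hypothetical encoding: one string per aligned
    sequence, all of the same length, with '-' denoting a gap.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    width = len(alignment[0])
    if width == 0:
        raise ValueError("alignment must have at least one column")
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    total_cells = width * len(alignment)           # size of the alignment matrix
    gap_cells = sum(row.count("-") for row in alignment)
    return 100.0 * gap_cells / total_cells


# Three sequences of length 4 give 12 cells, 3 of which are gaps.
example = ["AC-T", "A--T", "ACGT"]
print(percent_gapped(example))  # 25.0
```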
“…Each condition has 20 replicates and each replicate contains 1000 sequences with ∼1000 nucleotides. We explore fragmentary versions of the 1000M1 and 1000M2 conditions (also studied by Smirnov and Warnow, 2020b), which have high and moderately high rates of evolution, respectively, and medium indel lengths. The RNASim (Mirarab et al., 2015) datasets evolve under a complex evolutionary process that reflects selective pressures needed to conserve rRNA structure; hence, this simulation condition is more complex than the standard GTR+indel simulations (such as for the ROSE datasets).…”
Section: Experimental Study Design (mentioning)
confidence: 99%
“…Let $N_{FP}$ be the number of splits on $\hat{T}$ but not $T$ (false positives), and let $N_{FN}$ be the number of splits on $T$ but not $\hat{T}$ (false negatives). When both trees are binary, $N_{FP} = N_{FN}$ (Berry and Gascuel, 1996; Smirnov and Warnow, 2021); otherwise they can contribute differentially to error. The Robinson-Foulds (RF) distance (Robinson and Foulds, 1981), $d_{RF} = N_{FP} + N_{FN}$, combines both errors in one measure of overall accuracy.…”
Section: Definitions Of Accuracy (mentioning)
confidence: 99%
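
The split-based error counts in the statement above can be sketched as follows. This is only an illustration under stated assumptions: trees are encoded as nested tuples of leaf names (a hypothetical encoding chosen for self-containment), the first tree passed is treated as the reference tree and the second as the estimated tree, and the helper names `leaf_set`, `nontrivial_splits`, and `split_errors` are invented for this example. Only the raw counts and their sum are computed, since the rate definitions are not included in the quoted text.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a string, an internal node is a tuple.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Collect the leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the non-trivial splits of the tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same bipartition always has the same key.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(below if anchor not in below else all_leaves - below)
        return below

    visit(tree)
    return found


def split_errors(reference: Tree, estimate: Tree) -> Tuple[int, int, int]:
    """Return (N_FP, N_FN, d_RF) for an estimate against a reference tree."""
    ref_splits = nontrivial_splits(reference)
    est_splits = nontrivial_splits(estimate)
    n_fp = len(est_splits - ref_splits)  # splits on the estimate only
    n_fn = len(ref_splits - est_splits)  # splits on the reference only
    return n_fp, n_fn, n_fp + n_fn


reference = (("A", "B"), ("C", ("D", "E")))
binary_estimate = (("A", "C"), ("B", ("D", "E")))
unresolved_estimate = ("A", "B", "C", ("D", "E"))

print(split_errors(reference, binary_estimate))      # (1, 1, 2)
print(split_errors(reference, unresolved_estimate))  # (0, 1, 1)
```

In the first comparison both trees are binary and the two counts agree; in the second the estimate contains a polytomy, so the false negative count exceeds the false positive count, which is the case where the two errors contribute differentially.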
“…The Robinson-Foulds (RF) distance (Robinson and Foulds, 1981), $d_{RF} = N_{FP} + N_{FN}$, combines both errors in one measure of overall accuracy. Here we distinguish between these errors explicitly by defining false positive and negative rates (Smirnov and Warnow, 2021):…”
Section: Definitions Of Accuracy (mentioning)
confidence: 99%
“…Most commonly used methods further assume orthology, and errors in orthology detection are common (Laurin-Lemay et al., 2012; Salichos & Rokas, 2011). Alignment errors are also ubiquitous and can impact tree accuracy (Fletcher & Yang, 2010; Liu et al., 2009; Ogden & Rosenberg, 2006; Smirnov & Warnow, 2020; Wang et al., 2011). The prevalence of these errors in phylogenomic datasets has been appreciated (Hosner et al., 2016; Laurin-Lemay et al., 2012; Philippe et al., 2017; Sayyari et al., 2017; Springer & Gatesy, 2016), and several phylogenomics studies have now been criticized (Gatesy & Springer, 2014; Jeffroy et al., 2006; Salichos & Rokas, 2013; Shen et al., 2017; Springer & Gatesy, 2016).…”
Section: Introduction (mentioning)
confidence: 99%