Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used. Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment cleaning. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with lower bootstrap support. This indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported although, in fact, more biased topologies.
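To make the idea of block selection concrete, the following is a minimal sketch of gap-based alignment cleaning. It is not the Gblocks algorithm, which also uses conservation criteria; the function name `clean_alignment` and the parameters `max_gap_fraction` and `min_block_length` are hypothetical and chosen only for illustration. Lowering `min_block_length` or raising `max_gap_fraction` loosely corresponds to a more relaxed selection, and the opposite to a more stringent one.

```python
def clean_alignment(seqs, max_gap_fraction=0.5, min_block_length=2):
    """Drop gap-rich columns, then drop retained blocks that are too short.

    seqs: dict mapping sequence name -> aligned sequence (equal lengths).
    This is an illustrative filter, not the Gblocks procedure.
    """
    n_seqs = len(seqs)
    n_cols = len(next(iter(seqs.values())))

    # Step 1: flag columns whose fraction of gap characters is acceptable.
    keep = []
    for i in range(n_cols):
        gaps = sum(1 for s in seqs.values() if s[i] == "-")
        keep.append(gaps / n_seqs <= max_gap_fraction)

    # Step 2: discard runs of kept columns shorter than min_block_length.
    i = 0
    while i < n_cols:
        if keep[i]:
            j = i
            while j < n_cols and keep[j]:
                j += 1
            if j - i < min_block_length:
                for k in range(i, j):
                    keep[k] = False
            i = j
        else:
            i += 1

    # Step 3: rebuild each sequence from the surviving columns.
    return {
        name: "".join(c for c, k in zip(s, keep) if k)
        for name, s in seqs.items()
    }


toy = {"a": "MK-LV", "b": "MK-LV", "c": "MKALV"}
relaxed = clean_alignment(toy, min_block_length=2)
stringent = clean_alignment(toy, min_block_length=3)
```

In this toy case the relaxed setting keeps the two short blocks flanking the gapped column, while the stringent setting rejects both, which mirrors why a stringent selection can leave too little data when the alignment is short.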
Butterflies (Papilionoidea), with over 18,000 described species [1], have captivated naturalists and scientists for centuries. They play a central role in the study of speciation, community ecology, biogeography, climate change, and plant-insect interactions and include many model organisms and pest species [2, 3]. However, a robust higher-level phylogenetic framework is lacking. To fill this gap, we inferred a dated phylogeny by analyzing the first phylogenomic dataset, including 352 loci (> 150,000 bp) from 207 species representing 98% of tribes, a 35-fold increase in gene sampling and 3-fold increase in taxon sampling over previous studies [4]. Most data were generated with a new anchored hybrid enrichment (AHE) [5] gene kit (BUTTERFLY1.0) that includes both new and frequently used (e.g., [6]) informative loci, enabling direct comparison and future dataset merging with previous studies. Butterflies originated around 119 million years ago (mya) in the late Cretaceous, but most extant lineages diverged after the Cretaceous-Paleogene (K-Pg) mass-extinction 65 mya. Our analyses support swallowtails (Papilionidae) as sister to all other butterflies, followed by skippers (Hesperiidae) + the nocturnal butterflies (Hedylidae) as sister to the remainder, indicating a secondary reversal from diurnality to nocturnality. The whites (Pieridae) were strongly supported as sister to brush-footed butterflies (Nymphalidae) and blues + metalmarks (Lycaenidae and Riodinidae). Ant association independently evolved once in Lycaenidae and twice in Riodinidae. This study overturns prior notions of the taxon's evolutionary history, as many long-recognized subfamilies and tribes are para- or polyphyletic. It also provides a much-needed backbone for a revised classification of butterflies and for future comparative studies including genome evolution and ecology.
Summary
1. The generalized mixed Yule-coalescent (GMYC) model has become one of the most popular approaches for species delimitation based on single-locus data, and it is widely used in biodiversity assessments and phylogenetic community ecology. We here examine an array of factors affecting GMYC resolution (tree reconstruction method, taxon sampling coverage/taxon richness and geographic sampling intensity/geographic scale).
2. We test GMYC performance based on empirical data (DNA barcoding of the Romanian butterflies) on a solid taxonomic framework (i.e. all species are thought to be described and can be determined with independent sources of evidence). The data set is comprehensive (176 species), and intensely and homogeneously sampled (1303 samples representing the main populations of butterflies in this country). Taxonomy was assessed based on morphology, including linear and geometric morphometry when needed.
3. The number of GMYC entities obtained constantly exceeds the total number of morphospecies in the data set. We show that c. 80% of the species studied are recognized as entities by GMYC. Interestingly, we show that this percentage is practically the maximum that a single-threshold method can provide for this data set. Thus, the c. 20% of failures are attributable to intrinsic properties of the COI polymorphism: overlap in inter- and intraspecific divergences and non-monophyly of the species likely because of introgression or lack of independent lineage sorting.
4. Our results demonstrate that this method is remarkably stable under a wide array of circumstances, including most phylogenetic reconstruction methods, high singleton presence (up to 95%), taxon richness (above five species) and the presence of gaps in intraspecific sampling coverage (removal of intermediate haplotypes). Hence, the method is useful to designate an optimal divergence threshold in an objective manner and to pinpoint potential cryptic species that are worth being studied in detail. However, the existence of a substantial percentage of species wrongly delimited indicates that GMYC cannot be used as sufficient evidence for evaluating the specific status of particular cases without additional data.
5. Finally, we provide a set of guidelines to maximize efficiency in GMYC analyses and discuss the range of studies that can take advantage of the method.
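The limit of any single-threshold method can be illustrated with a small sketch. This is not GMYC, which fits a likelihood model to an ultrametric tree; it is a simplified distance-based check, and the function names `barcode_gap` and `separable` as well as the toy distances are hypothetical. A species is treated as recoverable at a given threshold only if all of its intraspecific distances fall below the threshold and all of its distances to other species fall at or above it.

```python
def barcode_gap(distances, species):
    """Return {species: (max intraspecific, min interspecific distance)}.

    distances: dict mapping (sample_a, sample_b) -> pairwise divergence.
    species:   dict mapping sample -> species label.
    """
    labels = set(species.values())
    max_intra = {sp: 0.0 for sp in labels}
    min_inter = {sp: float("inf") for sp in labels}
    for (a, b), d in distances.items():
        sa, sb = species[a], species[b]
        if sa == sb:
            max_intra[sa] = max(max_intra[sa], d)
        else:
            min_inter[sa] = min(min_inter[sa], d)
            min_inter[sb] = min(min_inter[sb], d)
    return {sp: (max_intra[sp], min_inter[sp]) for sp in labels}


def separable(gaps, threshold):
    """Flag species that one global threshold would delimit correctly."""
    return {
        sp: intra < threshold <= inter
        for sp, (intra, inter) in gaps.items()
    }


# Toy data (invented): species Y varies more internally (0.04)
# than it differs from its closest relative Z (0.03).
species = {"x1": "X", "x2": "X", "y1": "Y", "y2": "Y", "z1": "Z"}
distances = {
    ("x1", "x2"): 0.01, ("y1", "y2"): 0.04,
    ("x1", "y1"): 0.08, ("x1", "y2"): 0.08,
    ("x2", "y1"): 0.08, ("x2", "y2"): 0.08,
    ("x1", "z1"): 0.09, ("x2", "z1"): 0.09,
    ("y1", "z1"): 0.03, ("y2", "z1"): 0.03,
}
gaps = barcode_gap(distances, species)
result = separable(gaps, threshold=0.02)
```

In this invented example no threshold can recover species Y, because its intraspecific divergence overlaps its divergence from Z. This is the kind of overlap the summary identifies as an intrinsic cause of failures.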