PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.
We revisit statistical tests for branches of evolutionary trees reconstructed upon molecular data. A new, fast, approximate likelihood-ratio test (aLRT) for branches is presented here as a competitive alternative to nonparametric bootstrap and Bayesian estimation of branch support. The aLRT is based on the idea of the conventional LRT, with the null hypothesis corresponding to the assumption that the inferred branch has length 0. We show that the LRT statistic is asymptotically distributed as a maximum of three random variables drawn from the chi(0)2 + chi(1)2 distribution. The new aLRT of interior branch uses this distribution for significance testing, but the test statistic is approximated in a slightly conservative but practical way as 2(l1- l2), i.e., double the difference between the maximum log-likelihood values corresponding to the best tree and the second best topological arrangement around the branch of interest. Such a test is fast because the log-likelihood value l2 is computed by optimizing only over the branch of interest and the four adjacent branches, whereas other parameters are fixed at their optimal values corresponding to the best ML tree. The performance of the new test was studied on simulated 4-, 12-, and 100-taxon data sets with sequences of different lengths. The aLRT is shown to be accurate, powerful, and robust to certain violations of model assumptions. The aLRT is implemented within the algorithm used by the recent fast maximum likelihood tree estimation program PHYML (Guindon and Gascuel, 2003).
Phylogenetic inference and evaluating support for inferred relationships is at the core of many studies testing evolutionary hypotheses. Despite the popularity of nonparametric bootstrap frequencies and Bayesian posterior probabilities, the interpretation of these measures of tree branch support remains a source of discussion. Furthermore, both methods are computationally expensive and become prohibitive for large data sets. Recent fast approximate likelihood-based measures of branch supports (approximate likelihood ratio test [aLRT] and Shimodaira–Hasegawa [SH]-aLRT) provide a compelling alternative to these slower conventional methods, offering not only speed advantages but also excellent levels of accuracy and power. Here we propose an additional method: a Bayesian-like transformation of aLRT (aBayes). Considering both probabilistic and frequentist frameworks, we compare the performance of the three fast likelihood-based methods with the standard bootstrap (SBS), the Bayesian approach, and the recently introduced rapid bootstrap. Our simulations and real data analyses show that with moderate model violations, all tests are sufficiently accurate, but aLRT and aBayes offer the highest statistical power and are very fast. With severe model violations aLRT, aBayes and Bayesian posteriors can produce elevated false-positive rates. With data sets for which such violation can be detected, we recommend using SH-aLRT, the nonparametric version of aLRT based on a procedure similar to the Shimodaira–Hasegawa tree selection. In general, the SBS seems to be excessively conservative and is much slower than our approximate likelihood-based methods.
The selective pressure at the protein level is usually measured by the nonsynonymous/synonymous rate ratio (omega = dN/dS), with omega < 1, omega = 1, and omega > 1 indicating purifying (or negative) selection, neutral evolution, and diversifying (or positive) selection, respectively. The omega ratio is commonly calculated as an average over sites. As every functional protein has some amino acid sites under selective constraints, averaging rates across sites leads to low power to detect positive selection. Recently developed models of codon substitution allow the omega ratio to vary among sites and appear to be powerful in detecting positive selection in empirical data analysis. In this study, we used computer simulation to investigate the accuracy and power of the likelihood ratio test (LRT) in detecting positive selection at amino acid sites. The test compares two nested models: one that allows for sites under positive selection (with omega > 1), and another that does not, with the chi2 distribution used for significance testing. We found that use of the chi(2) distribution makes the test conservative, especially when the data contain very short and highly similar sequences. Nevertheless, the LRT is powerful. Although the power can be low with only 5 or 6 sequences in the data, it was nearly 100% in data sets of 17 sequences. Sequence length, sequence divergence, and the strength of positive selection also were found to affect the power of the LRT. The exact distribution assumed for the omega ratio over sites was found not to affect the effectiveness of the LRT.
BackgroundRalstonia solanacearum is a soil-borne beta-proteobacterium that causes bacterial wilt disease in many food crops and is a major problem for agriculture in intertropical regions. R. solanacearum is a heterogeneous species, both phenotypically and genetically, and is considered as a species complex. Pathogenicity of R. solanacearum relies on the Type III secretion system that injects Type III effector (T3E) proteins into plant cells. T3E collectively perturb host cell processes and modulate plant immunity to enable bacterial infection.ResultsWe provide the catalogue of T3E in the R. solanacearum species complex, as well as candidates in newly sequenced strains. 94 T3E orthologous groups were defined on phylogenetic bases and ordered using a uniform nomenclature. This curated T3E catalog is available on a public website and a bioinformatic pipeline has been designed to rapidly predict T3E genes in newly sequenced strains. Systematical analyses were performed to detect lateral T3E gene transfer events and identify T3E genes under positive selection. Our analyses also pinpoint the RipF translocon proteins as major discriminating determinants among the phylogenetic lineages.ConclusionsEstablishment of T3E repertoires in strains representatives of the R. solanacearum biodiversity allowed determining a set of 22 T3E present in all the strains but provided no clues on host specificity determinants. The definition of a standardized nomenclature and the optimization of predictive tools will pave the way to understanding how variation of these repertoires is correlated to the diversification of this species complex and how they contribute to the different strain pathotypes.
Bayes prediction quantifies uncertainty by assigning posterior probabilities. It was used to identify amino acids in a protein under recurrent diversifying selection indicated by higher nonsynonymous (d(N)) than synonymous (d(S)) substitution rates or by omega = d(N)/d(S) > 1. Parameters were estimated by maximum likelihood under a codon substitution model that assumed several classes of sites with different omega ratios. The Bayes theorem was used to calculate the posterior probabilities of each site falling into these site classes. Here, we evaluate the performance of Bayes prediction of amino acids under positive selection by computer simulation. We measured the accuracy by the proportion of predicted sites that were truly under selection and the power by the proportion of true positively selected sites that were predicted by the method. The accuracy was slightly better for longer sequences, whereas the power was largely unaffected by the increase in sequence length. Both accuracy and power were higher for medium or highly diverged sequences than for similar sequences. We found that accuracy and power were unacceptably low when data contained only a few highly similar sequences. However, sampling a large number of lineages improved the performance substantially. Even for very similar sequences, accuracy and power can be high if over 100 taxa are used in the analysis. We make the following recommendations: (1) prediction of positive selection sites is not feasible for a few closely related sequences; (2) using a large number of lineages is the best way to improve the accuracy and power of the prediction; and (3) multiple models of heterogeneous selective pressures among sites should be applied in real data analysis.
Six extant species of non-human great apes are currently recognized: Sumatran and Bornean orangutans, eastern and western gorillas, and chimpanzees and bonobos [1]. However, large gaps remain in our knowledge of fine-scale variation in hominoid morphology, behavior, and genetics, and aspects of great ape taxonomy remain in flux. This is particularly true for orangutans (genus: Pongo), the only Asian great apes and phylogenetically our most distant relatives among extant hominids [1]. Designation of Bornean and Sumatran orangutans, P. pygmaeus (Linnaeus 1760) and P. abelii (Lesson 1827), as distinct species occurred in 2001 [1, 2]. Here, we show that an isolated population from Batang Toru, at the southernmost range limit of extant Sumatran orangutans south of Lake Toba, is distinct from other northern Sumatran and Bornean populations. By comparing cranio-mandibular and dental characters of an orangutan killed in a human-animal conflict to those of 33 adult male orangutans of a similar developmental stage, we found consistent differences between the Batang Toru individual and other extant Ponginae. Our analyses of 37 orangutan genomes provided a second line of evidence. Model-based approaches revealed that the deepest split in the evolutionary history of extant orangutans occurred ∼3.38 mya between the Batang Toru population and those to the north of Lake Toba, whereas both currently recognized species separated much later, about 674 kya. Our combined analyses support a new classification of orangutans into three extant species. The new species, Pongo tapanuliensis, encompasses the Batang Toru population, of which fewer than 800 individuals survive. VIDEO ABSTRACT.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.