Making automated multiple alignments of very large numbers of protein sequences

Sievers, Fabian; Dineen, David; Wilm, Andreas; Higgins, Desmond G.

doi:10.1093/bioinformatics/btt093

Cited by 52 publications

(49 citation statements)

References 37 publications


“…In all cases, the quality scores for the default guide trees fall off as the number of sequences increases, as was found in ref. 20. For chained trees, however, the quality scores fall off much more slowly than for either default or balanced trees.…”

Section: Results (mentioning)

confidence: 96%

Simple chained guide trees give high-quality protein multiple sequence alignments

Boyce

Sievers

Higgins

2014

Proc. Natl. Acad. Sci. U.S.A.

Self Cite


Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random.

The generation of a multiple sequence alignment (MSA) is standard practice during most comparative analyses of homologous genes or proteins. Since the mid-1980s, most automated MSAs have been made using a heuristic approach that Feng and Doolittle (1) called "progressive alignment." This involves clustering the sequences into a tree or dendrogram-like structure, called a "guide tree" in Higgins et al. (2). This guide tree is then used to align the sequences into progressively larger and larger alignments, following the branching order in the tree. Variations on the method were described by various groups in the 1980s [e.g., Taylor (3) and Barton and Sternberg (4)], but the earliest clear description of the approach is from Hogeweg and Hesper (5). Progressive alignment is a heuristic approach and is not guaranteed to find the best possible alignment for any given scoring scheme. It does, however, allow alignments of many sequences to be made quickly, even on personal computers (6). The quality of the alignments is good enough for the alignments to be used automatically in many analysis pipelines.

The computational complexity of the alignment process, once a guide tree is created, is approximately O(N) for N sequences of the same length. The creation of the guide tree involves comparing all N sequences to each other to generate a distance matrix, which is clearly going to require O(N^2) time and computer memory. Once the distance matrix is made, it will require a further clustering step that is usually O(N^2) but can be more expensive. For large N, the construction of the guide tree becomes limiting and prevents the routine alignment of more than a few thousand sequences. Over the years, various attempts have been made to get around this problem. One solution is to quickly make a crude guide tree initially and to iterate that from an initial MSA. This approach is adopted in the widely used Muscle (7) and Mafft (8) packages. Barton and Sternberg were the first authors to use iteration, but they used a simple "chained" guide tree topology, effectively aligning the sequences one at a time to a growing...
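The chained topology described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chained_guide_tree` and `pairwise_comparisons` are hypothetical, and the output is assumed to be a plain Newick string without branch lengths. The sketch shows why a chained tree is cheap to build (one pass over the labels, no distance matrix), in contrast to the N(N-1)/2 comparisons needed for a full distance matrix.

```python
from typing import Sequence


def chained_guide_tree(labels: Sequence[str]) -> str:
    """Return a Newick string for a fully imbalanced ("chained") tree.

    Hypothetical helper: each sequence is joined, one at a time, to the
    subtree holding all earlier sequences. No distances are computed.
    """
    if not labels:
        raise ValueError("need at least one sequence label")
    tree = labels[0]
    for label in labels[1:]:
        # Attach the next sequence to the growing chain.
        tree = f"({tree},{label})"
    return tree + ";"


def pairwise_comparisons(n: int) -> int:
    """Number of sequence pairs in a full distance matrix for n sequences."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    print(chained_guide_tree(["seqA", "seqB", "seqC", "seqD"]))
    print(pairwise_comparisons(1000))
```

For four labels the chain nests as `(((seqA,seqB),seqC),seqD);`. The order of the labels is left to the caller, so a random order can be obtained by shuffling the list first.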


“…Performance studies have shown that some MSA methods can produce highly accurate alignments for large slowly evolving datasets (e.g., Sievers et al., 2013). However, studies focusing on phylogeny estimation with up to 28,000 sequences have shown that only SATé-I (Liu et al, 2009) and SATé-II (Liu et al, 2011) produced sufficiently accurate analyses of sequence datasets that are large and evolve under high rates of evolution.…”

Section: Introduction (mentioning)

confidence: 99%

PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences

Mirarab

Nguyen

Guo

et al. 2015

Journal of Computational Biology


We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate: slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.


“…We adapted the SeqTrack algorithm (17) to perform graph construction. Sequences were aligned using Clustal Omega 1.2.1 (34), and the resultant distance matrix was converted into a similarity matrix by taking 1 − distance. Affinity propagation (35) clustering was performed on each segment's similarity matrix to determine a threshold cutoff similarity value, defined as the minimum (across all clusters for that segment) of minimum in-cluster pairwise identities, below which we deemed it implausible for an evolutionary descent (clonal or reassortment) to have occurred (Fig.…”

Section: Methods (mentioning)

confidence: 99%
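The thresholding step in the statement above (similarity as 1 − distance, then the minimum across clusters of the minimum in-cluster pairwise value) can be sketched as follows. This is an illustration under stated assumptions, not the citing authors' code: the function names are hypothetical, the similarity entries are assumed to stand in for pairwise identities, and cluster memberships are taken as given rather than computed by affinity propagation.

```python
from typing import List, Sequence


def similarity_from_distance(dist: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert a distance matrix to a similarity matrix via 1 - distance."""
    return [[1.0 - d for d in row] for row in dist]


def cutoff_similarity(sim: Sequence[Sequence[float]],
                      clusters: Sequence[Sequence[int]]) -> float:
    """Minimum, across clusters, of the minimum in-cluster pairwise similarity.

    `clusters` lists the row/column indices belonging to each cluster
    (assumed input). Single-member clusters have no pairs and are skipped.
    """
    per_cluster_minima = []
    for members in clusters:
        pairs = [sim[i][j]
                 for pos, i in enumerate(members)
                 for j in members[pos + 1:]]
        if pairs:
            per_cluster_minima.append(min(pairs))
    if not per_cluster_minima:
        raise ValueError("no cluster contains at least two members")
    return min(per_cluster_minima)


if __name__ == "__main__":
    # Toy distance matrix for three sequences (illustrative values only).
    dist = [[0.0, 0.25, 0.5],
            [0.25, 0.0, 0.75],
            [0.5, 0.75, 0.0]]
    sim = similarity_from_distance(dist)
    print(cutoff_similarity(sim, [[0, 1], [2]]))
```

Pairs whose similarity falls below the returned cutoff would then be treated as implausible descent relationships, as the statement describes.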

“…Phylogenetic reconstruction was done for a subset of H3N8 viruses isolated from Minto Flats, Alaska, between 2009 and 2010 as part of a separate study. Briefly, each segment of the viral genomes was individually aligned, using Clustal Omega (34), and their genealogies were reconstructed, using BEAST 1.8.0 (36). A minimum of three Markov chain Monte Carlo runs that converged on a single optimal tree were chosen to compute the maximum clade credibility tree.…”

Section: Methods (mentioning)

confidence: 99%

Reticulate evolution is favored in influenza niche switching

Hill

Zabilansky

et al. 2016

Proc. Natl. Acad. Sci. U.S.A.


Reticulate evolution is thought to accelerate the process of evolution beyond simple genetic drift and selection, helping to rapidly generate novel hybrids with combinations of adaptive traits. However, the long-standing dogma that reticulate evolutionary processes are likewise advantageous for switching ecological niches, as in microbial pathogen host switch events, has not been explicitly tested. We use data from the influenza genome sequencing project and a phylogenetic heuristic approach to show that reassortment, a reticulate evolutionary mechanism, predominates over mutational drift in transmission between different host species. Moreover, as host evolutionary distance increases, reassortment is increasingly favored. We conclude that the greater the quantitative difference between ecological niches, the greater the importance of reticulate evolutionary processes in overcoming niche barriers.

Keywords: ecology | reticulate evolution | influenza | host switch | reassortment


Making automated multiple alignments of very large numbers of protein sequences

Abstract: Supplementary data are available at Bioinformatics online.
