Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making them quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time once the number of sequences goes much above a few hundred. This is true even if the order of sequences in the guide tree is random.

The generation of a multiple sequence alignment (MSA) is standard practice during most comparative analyses of homologous genes or proteins. Since the mid-1980s, most automated MSAs have been made using a heuristic approach that Feng and Doolittle (1) called "progressive alignment." This involves clustering the sequences into a tree or dendrogram-like structure, called a "guide tree" in Higgins et al. (2). The guide tree is then used to align the sequences into progressively larger alignments, following the branching order in the tree. Variations on the method were described by various groups in the 1980s [e.g., Taylor (3) and Barton and Sternberg (4)], but the earliest clear description of the approach is from Hogeweg and Hesper (5). Progressive alignment is a heuristic approach and is not guaranteed to find the best possible alignment for any given scoring scheme. It does, however, allow alignments of many sequences to be made quickly, even on personal computers (6).
The quality of the alignments is good enough for the alignments to be used automatically in many analysis pipelines.

The computational complexity of the alignment process, once a guide tree is created, is approximately O(N) for N sequences of the same length. The creation of the guide tree involves comparing all N sequences to each other to generate a distance matrix, which requires O(N^2) time and computer memory. Once the distance matrix is made, it requires a further clustering step that is usually O(N^2) but can be more expensive. For large N, the construction of the guide tree becomes limiting and prevents the routine alignment of more than a few thousand sequences. Over the years, various attempts have been made to get around this problem. One solution is to quickly make a crude guide tree initially and to iterate that from an initial MSA. This approach is adopted in the widely used Muscle (7) and Mafft (8) packages. Barton and Sternberg were the first authors to use iteration, but they used a simple "chained" guide tree topology, effectively aligning the sequences one at a time to a growing...
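The chained topology described above can be illustrated with a minimal Python sketch. It is not taken from the article or from any alignment package: the function names `chained_guide_tree`, `random_chained_guide_tree`, and `pairwise_comparisons` are hypothetical, and Newick is assumed as the tree format only because it is a common way to write guide trees. The sketch shows why a chained tree is cheap to build (one pass over the sequence names, no distance matrix) compared with the N(N-1)/2 pairwise comparisons needed for a full distance matrix.

```python
import random


def chained_guide_tree(names):
    """Return a Newick string for a fully chained (caterpillar) tree.

    Each sequence after the first two is joined, one at a time, to the
    growing cluster. No distances are computed, so this takes a single
    pass over the names.
    """
    if len(names) < 2:
        raise ValueError("need at least two sequence names")
    tree = f"({names[0]},{names[1]})"
    for name in names[2:]:
        tree = f"({tree},{name})"
    return tree + ";"


def random_chained_guide_tree(names, seed=None):
    """Chained tree over a random ordering of the names (seed is optional)."""
    shuffled = list(names)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    return chained_guide_tree(shuffled)


def pairwise_comparisons(n):
    """Number of sequence pairs a full distance matrix must compare."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    names = ["seq1", "seq2", "seq3", "seq4"]
    print(chained_guide_tree(names))      # (((seq1,seq2),seq3),seq4);
    print(pairwise_comparisons(10_000))   # 49995000
```

Whether a given aligner accepts such a tree as input, and through which option, depends on the package and is not covered here.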
Background: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time.

Results: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced.

Conclusions: This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.
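The order-reversal test described in the Results can be set up with a short Python sketch. This is an illustration under stated assumptions, not the authors' procedure: the helper names `parse_fasta`, `to_fasta`, `reversed_input`, and `same_alignment` are hypothetical, FASTA is assumed as the input and output format, and running the aligner itself is left to the reader. The comparison used here is deliberately strict: two alignments count as the same only if every sequence has an identical gapped row, regardless of row order.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs, in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_fasta(records):
    """Serialise (name, sequence) pairs back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def reversed_input(text):
    """Return the same FASTA records with the input order reversed."""
    return to_fasta(list(reversed(parse_fasta(text))))


def same_alignment(aln_a, aln_b):
    """True if both alignments give every sequence the same gapped row.

    Row order is ignored, so only differences in gap placement count.
    """
    return dict(parse_fasta(aln_a)) == dict(parse_fasta(aln_b))
```

A typical use would be to align the original file and the output of `reversed_input` with the same program and settings, then pass the two resulting alignments to `same_alignment`; a `False` result indicates the kind of order dependence the abstract reports.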