We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
Dynamic programming algorithms guarantee to find the optimal alignment between two sequences. For more than a few sequences, exact algorithms become computationally impractical, and progressive algorithms iterating pairwise alignments are widely used. These heuristic methods have a serious drawback because pairwise algorithms do not differentiate insertions from deletions and end up penalizing single insertion events multiple times. Such an unrealistically high penalty for insertions typically results in overmatching of sequences and an underestimation of the number of insertion events. We describe a modification of the traditional alignment algorithm that can distinguish insertion from deletion and avoid repeated penalization of insertions and illustrate this method with a pair hidden Markov model that uses an evolutionary scoring function. In comparison with a traditional progressive alignment method, our algorithm infers a greater number of insertion events and creates gaps that are phylogenetically consistent but spatially less concentrated. Our results suggest that some insertion/deletion "hot spots" may actually be artifacts of traditional alignment algorithms.
Keywords: insertion/deletion | progressive algorithm | sequence alignment
Sequence alignment is a central tool in molecular biology. High sequence similarity between a pair of molecules usually implies significant structural and functional similarities, such that information on a known molecule can often be assigned to an unknown molecule that shows high sequence conservation in a pairwise alignment. Also, related molecules represent semi-independent realizations of an evolutionary process and possess information regarding the structural constraints enabling the element to maintain its function. The reconstruction of the evolutionary history of a set of molecules requires an assessment of homology among their characters, i.e., a multiple alignment.
Pairwise and Multiple Alignment
The comparison of two biological sequences closely resembles the edit transcript problem in computer science (1), although biologists traditionally focus more on the product than the process and call the result an alignment. The first dynamic programming algorithm for pairwise alignment of biological sequences was described by Needleman and Wunsch (2), and modifications reducing its time complexity from O(L^3) to O(L^2) (where L is the sequence length) soon followed (see ref. 3 for a review). In real life, insertion/deletion (indel) events affect sequence regions of very different lengths, and the early methods' gap costs (proportional to gap length) were unsatisfactory: the gap cost is either so high that long gaps never appear or so low that the alignment gets fragmented by numerous small-length gaps. An elegant O(3L^2)-complexity solution was proposed by Gotoh (4) by the separation of the gap opening and gap extension costs (leading to so-called affine gap scores). Importantly, by using Hirschberg's divide-and-conquer recursion (5, 6), these algorithms can all be implemented in m...
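The affine gap scheme mentioned in the excerpt above can be sketched in a few lines. The following is a minimal illustration, not the cited authors' code: it computes only the optimal global alignment score using three dynamic programming tables, which is where the factor of 3 in O(3L^2) comes from. The function name `affine_gap_score` and the default scores (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for the example; a gap of length k is assumed to cost `gap_open + (k - 1) * gap_extend`.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Optimal global alignment score of a and b with affine gap costs.

    Illustrative parameters (assumptions): a gap of length k costs
    gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] is aligned to b[j-1]
    # X: a[i-1] is aligned to a gap (gap in b)
    # Y: b[j-1] is aligned to a gap (gap in a)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend either state with a (mis)match.
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Open a new gap (from M or the other gap state) or extend one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One long gap is cheaper than several short ones under affine costs.
    print(affine_gap_score("ACGTACGT", "ACGT"))
    # Setting gap_open == gap_extend recovers a length-proportional gap cost.
    print(affine_gap_score("AAAA", "AA", gap_open=-1, gap_extend=-1))
```

Setting `gap_open` equal to `gap_extend` reduces the scheme to the length-proportional gap cost of the early methods, which makes the two behaviours easy to compare on the same input.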
Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion-event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.
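A toy calculation can make the bias against insertions concrete. The sketch below rests on a deliberately simplified model that is an assumption of this example, not the published algorithm: a traditional progressive aligner is taken to charge a gap penalty for a single insertion at every merge step between the sequence that carries it and the root of the guide tree, whereas a method that treats the insertion as one event charges it once. The guide tree is written as nested tuples, and the function name `merge_steps_to_root` is hypothetical.

```python
def merge_steps_to_root(tree, leaf, depth=0):
    """Count the merge steps (internal nodes) between a leaf and the root.

    tree is a nested tuple of leaf names, e.g. ((("A", "B"), "C"), "D").
    Returns None if the leaf is not in the tree.
    """
    if isinstance(tree, str):
        return depth if tree == leaf else None
    for child in tree:
        found = merge_steps_to_root(child, leaf, depth + 1)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    guide_tree = ((("A", "B"), "C"), "D")
    for name in "ABCD":
        naive = merge_steps_to_root(guide_tree, name)
        # Under the simplified model, an insertion on this terminal branch is
        # charged `naive` times by a traditional progressive aligner and once
        # by a method that recognizes it as a single event.
        print(name, "traditional:", naive, "single-event:", 1)
```

Under this model the overcharge grows with the depth of the sequence in the guide tree, which is consistent with the abstract's point that traditional methods infer too few insertions.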
Evolutionary analyses require sequence alignments that correctly represent evolutionary homology. Evolutionary and structural homology are not the same, and sequence alignments generated with methods designed for structural matching can be seriously misleading in comparative and phylogenetic analyses. The phylogeny-aware alignment algorithm implemented in the program PRANK has been shown to produce good alignments for evolutionary inferences. Unlike other alignment programs, PRANK makes use of phylogenetic information to distinguish alignment gaps caused by insertions from those caused by deletions and, thereafter, handles the two types of events differently. As a by-product of the correct handling of insertions and deletions, PRANK can provide the inferred ancestral sequences as part of the output and mark the alignment gaps differently depending on whether they originate in insertion or deletion events. Because the algorithm infers the evolutionary history of the sequences, PRANK can be sensitive to errors in the guide phylogeny and to violations of the underlying assumptions about the origin and patterns of gaps. These issues are discussed in detail, and practical advice for the use of PRANK in evolutionary analysis is provided. The PRANK software and other methods discussed here can be found on the program home page at http://code.google.com/p/prank-msa/.