We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.
The ability to determine important features within DNA sequences from the sequences alone is becoming essential as large-scale sequencing projects are being undertaken. We present a method that can be applied to the problem of identifying the recognition pattern for a DNA-binding protein given only a collection of sequenced DNA fragments, each known to contain somewhere within it a binding site for that protein. Information about the position or orientation of the binding sites within those fragments is not needed. The method compares the "information content" of a large number of possible binding site alignments to arrive at a matrix representation of the binding site pattern. The specificity of the protein is represented as a matrix, rather than a consensus sequence, allowing patterns that are typical of regulatory protein-binding sites to be identified. The reliability of the method improves as the number of sequences increases, but the time required increases only linearly with the number of sequences. An example, using known cAMP receptor protein-binding sites, illustrates the method.
Gene expression is often controlled by protein factors that interact with DNA regions to affect transcription. Understanding the regulation of the expression requires knowing both the protein factors and the DNA sites at which they act. The sites have traditionally been determined by isolating cis-acting mutations that affect expression and then determining the changes in the DNA that accompany the mutant phenotypes (1). More recently, regulatory proteins have been used to affinity purify the DNA regions to which they bind, and "footprinting" techniques have further delimited the binding sites (2). Each of these methods is time-consuming and gives only partial information about the binding site. The final determination of the binding site pattern usually includes a comparison of many example sites.
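The idea of scoring an alignment of candidate sites by its "information content" can be illustrated with a minimal sketch. The excerpt does not give the formula, so the version below assumes a common formulation: for each column, sum f_b * log2(f_b / p_b) over the observed bases, where f_b is the column frequency and p_b a background frequency (uniform 0.25 by default). The function name and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Information content (bits) of equal-length aligned sites.

    Assumed formula: sum over columns and bases of f * log2(f / p),
    where f is the base frequency in the column and p the background.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform background
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base, count in counts.items():
            f = count / n
            total += f * math.log2(f / background[base])
    return total

# Toy data (hypothetical): a perfectly conserved alignment and a poorly aligned one.
aligned = ["TGTGA", "TGTGA", "TGTGA", "TGTGA"]
misaligned = ["TGTGA", "GATGT", "ATGTG", "GTGAT"]
print(information_content(aligned), information_content(misaligned))
```

A well-chosen alignment concentrates each column on a few bases and so scores higher than a poorly chosen one, which is what makes the quantity usable for comparing candidate alignments.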
A method to determine the binding sites from the DNA sequences alone could greatly facilitate the process. Current sequencing technology is rapid enough that the most efficient means of determining the binding specificity of a protein may be to sequence a collection of regions known to contain binding sites. These may be a set of restriction fragments that are shown to bind the protein or a collection of DNA segments to which binding sites have been mapped. Since each of the fragments contains a binding site, the pattern of bases recognized by the protein should be discernible as the most significant pattern in the collection.
Regulatory Patterns
The difficulty arises that binding site patterns are not usually simple strings of bases. For example, Escherichia coli promoter sequences have two highly conserved parts, called the -35 and -10 regions (3,4). The consensus sequences for those are TTGACA and TATAAT, respectively. The consensus spacing between those regions is 17 bases, but other spacings are also allowed. An individual promoter may match the consensus at only a few positions and, while some positions are more conserved t...
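To make the promoter example concrete, here is a small sketch that counts how many positions of a sequence agree with the two consensus hexamers, TTGACA and TATAAT, while letting the spacing vary around 17 bases. The allowed spacing range (15 to 19), the function names, and the plain match-count score are assumptions made for illustration. This is a naive consensus scan, not the matrix method the paper describes.

```python
MINUS_35 = "TTGACA"  # consensus of the -35 region
MINUS_10 = "TATAAT"  # consensus of the -10 region

def count_matches(window, consensus):
    """Number of positions where the window agrees with the consensus."""
    return sum(a == b for a, b in zip(window, consensus))

def best_promoter_match(seq, spacings=range(15, 20)):
    """Return (score, start, spacing) of the best two-part match, or None.

    score is the number of matching bases out of 12; spacings is the
    assumed set of allowed gaps between the two hexamers.
    """
    best = None
    for start in range(len(seq)):
        for gap in spacings:
            m10 = start + len(MINUS_35) + gap
            if m10 + len(MINUS_10) > len(seq):
                continue  # the -10 window would run past the sequence end
            score = (count_matches(seq[start:start + 6], MINUS_35)
                     + count_matches(seq[m10:m10 + 6], MINUS_10))
            if best is None or score > best[0]:
                best = (score, start, gap)
    return best

# Hypothetical test sequence: both hexamers, 17 filler bases between them.
example = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GG"
print(best_promoter_match(example))
```

Because every position counts equally here, a scan like this cannot express that some positions are more conserved than others, which is the limitation a matrix representation is meant to address.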
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
Genome annotation is a rapidly evolving field in genomics made possible by the large-scale generation of genomic sequences and driven predominantly by computational tools. The goal of the annotation process is to assign as much information as possible to the raw sequence of complete genomes with an emphasis on the location and structure of the genes. This can be accomplished by ab initio gene finding, by identifying homologies to known genes from other organisms, by the alignment of full-length or partial mRNA sequences to the genomic DNA, or through combinations of such methods. Related techniques can also be used to identify other features, ...
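The abstract above reports the fraction of coding nucleotides that the gene finders identified correctly. The sketch below shows one way such a nucleotide-level comparison could be computed. It assumes exons are given as half-open (start, end) coordinate pairs, and it uses a definition common in the gene-finding literature in which "specificity" means the fraction of predicted coding bases that are annotated as coding. Both the definitions and the function names are assumptions, since the excerpt does not state how the measure was computed.

```python
def coding_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions

def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity: fraction of annotated coding bases that were predicted.
    specificity: fraction of predicted coding bases that are annotated
                 (assumed definition, equivalent to precision).
    """
    truth = coding_positions(annotated)
    pred = coding_positions(predicted)
    true_positives = len(truth & pred)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity

# Hypothetical coordinates: one exon predicted exactly, one shifted by 50 bases.
annotated_exons = [(100, 200), (300, 400)]
predicted_exons = [(100, 200), (350, 450)]
print(nucleotide_accuracy(annotated_exons, predicted_exons))
```

A nucleotide-level score can be high even when exon boundaries are wrong, which is consistent with the abstract reporting a lower rate for fully correct intron/exon structures than for coding nucleotides.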