The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool that allows biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome, and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects.

Rationale

Unadorned genomic sequence data is simply a string of As, Ts, Gs, and Cs, with perhaps an associated confidence value for each base. In this raw state, sequence data provides very little biological insight. To be useful, any sequence must be interpreted in the context of other biological knowledge. This is the process of annotation, the task of adding explanatory notations to the sequence text. We define an annotation as the biological evaluation and explanation of a specific region on a nucleic acid sequence that includes, but is not limited to, gene transcripts. Any feature that can be anchored to the sequence (for example, an exon, a promoter, a transposable element, a regulatory region, or a CpG island) is an annotation. The genomic sequence will stabilize and reach a finite endpoint, but the annotations will continue to evolve indefinitely as biological knowledge increases. To understand the genetic legacy of an organism we must interpret its genomic sequence, translating the information it contains in molecular form into human-readable annotations.

Part of this process is purely computational, and in its simplest terms can be described as a process of recognition: can anything be located that is somehow already familiar?
The first obvious tactic is to collect sequences that may represent interesting biological features and to search the genomic sequence in order to discover the presence or absence of similar sequences. The principle is the same whether the sequences used in this comparison are expressed sequence tags (ESTs), full-length cDNAs, repeated elements or highly conserved sequences, and whether the sequences come from the same species, a closely related species or a distantly
We describe here our experience in annotating the Drosophila melanogaster genome sequence, in the course of which we developed several new open-source software tools and a database schema to support large-scale genome annotation. We have developed these into an integrated and reusable software system for whole-genome annotation. The key contributions to overall annotation quality are the marshalling of high-quality sequences for alignments and the design of a system with an adaptable and expandable architecture.

Rationale

The information held in genomic sequence is encoded and highly compressed; to extract biologically interesting data we must decrypt this primary data computationally. This generates results that provide a measure of biologically relevant characteristics, such as coding potential or sequence similarity, present in the sequence. Because of the amount of sequence to be examined and the volume of data generated, these results must be automatically processed and carefully filtered.

There are essentially three different strategies for whole-genome analysis. The first is a purely automatic synthesis from a combination of analyses to predict gene models. The second aggregates analyses contributed by the research community that the user is then required to integrate visually on a public website. The third is curation by experts using a full trail of evidence to support an integrated assessment. Several groups charged with rapidly providing a dispersed community with genome annotations have chosen the purely computational route; examples are Ensembl [1] and the National Center for Biotechnology Information (NCBI) [2]. Approaches using aggregation adapt well to the dynamics of collaborative groups that are focused on sharing results as they accrue; examples are the University of California Santa Cruz (UCSC) genome browser [3] and the Distributed Annotation System (DAS) [4].
For organisms with well-established and cohesive communities the demand is for carefully reviewed and qualified annotations; this approach was adopted by three of the oldest genome-community databases, SGD for Saccharomyces cerevisiae [5], ACeDB for Caenorhabditis elegans (documentation, code and data available from anonymous FTP servers at [6]) and FlyBase for Drosophila melanogaster [7].

We decided to examine every gene and feature of the Drosophila genome and manually improve the quality of the
In order to take full advantage of next-generation genomics data, I need informatics methods to be based on agreed-upon, formally specified standards that can be implemented easily in a uniform fashion without ambiguity. These standards should be encoded as logical formulae, so that provably correct and efficient decision procedures can be used for query answering and validation.

In this paper I present the core of such a standard for sequence data: a collection of definitions of relations that hold between genomic intervals, and an algebra for performing operations upon these intervals. I show how these relations can be used to extend and formalize concepts in the Sequence Ontology (SO).
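To make the idea of relations and operations on genomic intervals concrete, here is a minimal Python sketch. It is not the formalization presented in the paper: the `Interval` class, the relation names (`overlaps`, `contains`, `adjacent`), the operations (`intersection`, `span`), and the choice of zero-based, half-open coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: half-open [start, end) on one sequence."""
    seq_id: str
    start: int
    end: int

    def __post_init__(self) -> None:
        # Assumption: intervals are non-empty and use zero-based coordinates.
        if self.start < 0 or self.end <= self.start:
            raise ValueError("require 0 <= start < end")


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b share at least one base on the same sequence."""
    return a.seq_id == b.seq_id and a.start < b.end and b.start < a.end


def contains(a: Interval, b: Interval) -> bool:
    """True if every base of b also lies within a."""
    return a.seq_id == b.seq_id and a.start <= b.start and b.end <= a.end


def adjacent(a: Interval, b: Interval) -> bool:
    """True if a and b abut with no gap and no shared base."""
    return a.seq_id == b.seq_id and (a.end == b.start or b.end == a.start)


def intersection(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the shared region of a and b, or None if they do not overlap."""
    if not overlaps(a, b):
        return None
    return Interval(a.seq_id, max(a.start, b.start), min(a.end, b.end))


def span(a: Interval, b: Interval) -> Interval:
    """Return the smallest interval covering both a and b."""
    if a.seq_id != b.seq_id:
        raise ValueError("intervals lie on different sequences")
    return Interval(a.seq_id, min(a.start, b.start), max(a.end, b.end))
```

Under these assumptions, an exon at `Interval("chr2L", 100, 200)` contains a coding region at `Interval("chr2L", 150, 200)` and is adjacent to, but does not overlap, an intron at `Interval("chr2L", 200, 300)`. Writing each relation as a pure predicate over coordinates is what lets such definitions be checked mechanically, which is the property the text asks of a formally specified standard.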