The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project applies a combination of computation, manual curation, and collaboration to the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI’s eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI’s eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
[Supplemental material is available online at www.genome.org. Data sets and documentation are available in the CCDS database at http://www.ncbi.nlm.nih.gov/CCDS.]
One key goal of genome projects is to identify and accurately annotate all protein-coding genes. The resulting annotations add functional context to the sequence data and make it easier to traverse to other rich sources of gene and protein information. Accurately annotating known genes, identifying novel genes, and tracking annotations over time are complex processes that are best achieved through a combination of large-scale computational analyses and expert curation. 
These methods must (1) process repetitive sequences in multiple categories including retrotransposons, segmental duplications, and paralogs; (2) process variation including copy number variation (CNV) (Feuk et al. 2006) and microsatellites; (3) distinguish functional genes and alleles from pseudogenes; (4) define alternate splice products; and (5) avoid erroneous interpretation based on experimental error.
Pheromones are water-soluble chemicals released and sensed by individuals of the same species to elicit social and reproductive behaviors or physiological changes; they are perceived primarily by the vomeronasal organ (VNO) in terrestrial vertebrates. Humans and some related primates possess only vestigial VNOs and have no or significantly reduced ability to detect pheromones, a phenomenon not well understood at the molecular level. Here we show that genes encoding the TRP2 ion channel and V1R pheromone receptors, two components of the vomeronasal pheromone signal transduction pathway, have been impaired and removed from functional constraints since shortly before the separation of hominoids and Old World monkeys approximately 23 million years ago, and that the random inactivation of pheromone receptor genes is an ongoing process even in present-day humans. The phylogenetic distribution of vomeronasal pheromone insensitivity is concordant with those of conspicuous female sexual swelling and male trichromatic color vision, suggesting that a vision-based signaling-sensory mechanism may have in part replaced the VNO-mediated chemical-based system in the social/reproductive activities of hominoids and Old World monkeys (catarrhines).
Pheromones are water-soluble chemicals used in intraspecific communications to elicit social and reproductive behaviors or physiological changes such as male-male aggression, onset of puberty, estrus, and induction of mating. Pheromones are perceived primarily by the vomeronasal organ (VNO), which is at the base of the nasal cavity and separated from the main olfactory epithelium that senses thousands of volatile odorants (1). It has been known for decades that some primate species, including humans, do not possess functional VNOs, and these organisms lack vomeronasal chemoreception to pheromones (1-3). This insensitivity has likely had important impacts on the sexual and social behaviors of many primates. 
On the other hand, behavioral changes may have also altered natural selection on vomeronasal chemoreception. It is therefore of interest to find when the vomeronasal pheromone insensitivity occurred in evolution, how it occurred, and why it occurred.
Vomeronasal pheromone perception begins by the binding of pheromones to pheromone receptors located on the cell membrane of sensory neurons of the VNO, which triggers a signal transduction pathway that ultimately leads to the activation of the hypothalamus. Several components in the pathway have been identified, including GTP-binding proteins, phospholipase C, inositol 1,4,5-trisphosphate (IP3), and an ion channel of the transient receptor potential family named TRP2 (also known as TRPC2) (4). However, among members of this pathway, only TRP2 (5) and pheromone receptors of the V1R and V2R families (6-9) are unique to pheromone transduction and are not known to be used in other physiological processes. Disruption of either of these two components in mice hampers pheromone perception and causes dramatic changes in sexual and social behaviors (10-12). Th...
Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE (ref. 1) and RefSeq (ref. 2) launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
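The idea of tracking protein-coding regions that are "identically annotated" by two pipelines can be illustrated with a minimal sketch. It assumes each annotation set is a mapping from a transcript ID to a CDS structure (chromosome, strand, coding exon intervals); the function names, IDs and coordinates are invented for illustration, and the quality assurance tests and curator review described above are not modeled.

```python
from collections import defaultdict

def cds_key(chrom, strand, exons):
    """Normalize a CDS structure so identical annotations compare equal."""
    return (chrom, strand, tuple(sorted(exons)))

def shared_cds(set_a, set_b):
    """Map each CDS structure present in both sets to the IDs that carry it."""
    index_a = defaultdict(list)
    index_b = defaultdict(list)
    for tid, key in set_a.items():
        index_a[key].append(tid)
    for tid, key in set_b.items():
        index_b[key].append(tid)
    # Only structures annotated identically in both sets are kept.
    common = index_a.keys() & index_b.keys()
    return {key: (index_a[key], index_b[key]) for key in common}

# Hypothetical records; IDs and coordinates are placeholders, not real data.
set_a = {
    "A_tx1": cds_key("chr1", "+", [(100, 200), (300, 450)]),
    "A_tx2": cds_key("chr1", "+", [(100, 200), (300, 480)]),
}
set_b = {
    "B_tx1": cds_key("chr1", "+", [(300, 450), (100, 200)]),
    "B_tx2": cds_key("chr2", "-", [(10, 90)]),
}

matches = shared_cds(set_a, set_b)
for key, (ids_a, ids_b) in matches.items():
    print(key, ids_a, ids_b)
```

In this toy input only one structure is shared; a single differing exon boundary (as in `A_tx2`) is enough to exclude an annotation from the consensus.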
Human cytidine deaminase APOBEC3G and the virion infectivity factor (vif) of the human immunodeficiency virus (HIV) are a pair of antagonistic molecules. In the absence of vif, APOBEC3G induces a high rate of dC to dU mutations in the nascent reverse transcripts of HIV that leads to the degradation of the HIV genome. HIV vif, on the other hand, can suppress the translation and trigger the degradation of human APOBEC3G. Here, we studied the rate of APOBEC3G gene evolution from five hominoids and two Old World monkeys. Averaged across the entire coding region, the rate of non-synonymous nucleotide substitutions is approximately 1.4 times the rate of synonymous substitutions, strongly suggesting that APOBEC3G has been under positive Darwinian selection. A comparison between the nucleotide polymorphisms within humans and the substitutions among the seven primates reveals a significant excess of non-synonymous substitutions. Furthermore, the rate of charge-altering non-synonymous substitution is approximately 1.8 times that of charge-conserving substitution, indicating that the selection is promoting the diversity of the protein charge profile. However, no difference in selective pressure on APOBEC3G is detected between hosts and non-hosts of HIV or simian immunodeficiency virus (SIV). These results, together with recent findings that the antiviral activity of APOBEC3G is not limited to HIV/SIV, suggest that the selective pressure on APOBEC3G is not solely from HIV/SIV and that APOBEC3G is a broad antiviral enzyme. The identification of pervasive positive selection for charge-altering amino acid substitutions supports the hypothesis of electrostatic interactions between APOBEC3G and vif or its functional equivalents.
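The selection argument above rests on the ratio of the non-synonymous substitution rate (dN) to the synonymous rate (dS). The following sketch shows how that ratio is conventionally read: above 1 is suggestive of positive selection, near 1 of neutrality, below 1 of purifying selection. The function names, the tolerance and the input rates are illustrative assumptions, and a real analysis would rely on a statistical test rather than a fixed cutoff.

```python
def omega(dn: float, ds: float) -> float:
    """Ratio of non-synonymous to synonymous substitution rates (dN/dS)."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    return dn / ds

def interpret(w: float, tol: float = 0.05) -> str:
    """Conventional reading of dN/dS; `tol` is an arbitrary illustrative band."""
    if w > 1 + tol:
        return "positive selection"
    if w < 1 - tol:
        return "purifying selection"
    return "approximately neutral"

# Hypothetical per-site rates chosen so the ratio matches the ~1.4 quoted above.
w = omega(0.014, 0.010)
print(round(w, 2), interpret(w))
```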
We describe a genome reference of the African green monkey or vervet (Chlorocebus aethiops). This member of the Old World monkey (OWM) superfamily is uniquely valuable for genetic investigations of simian immunodeficiency virus (SIV), for which it is the most abundant natural host species, and of a wide range of health-related phenotypes assessed in Caribbean vervets (C. a. sabaeus), whose numbers have expanded dramatically since Europeans introduced small numbers of their ancestors from West Africa during the colonial era. We use the reference to characterize the genomic relationship between vervets and other primates, the intra-generic phylogeny of vervet subspecies, and genome-wide structural variations of a pedigreed C. a. sabaeus population. Through comparative analyses with human and rhesus macaque, we characterize at high resolution the unique chromosomal fission events that differentiate the vervets and their close relatives from most other catarrhine primates, in whom karyotype is highly conserved. We also provide a summary of transposable elements and contrast these with the rhesus macaque and human. Analysis of sequenced genomes representing each of the main vervet subspecies supports previously hypothesized relationships between these populations, which range across most of sub-Saharan Africa, while uncovering high levels of genetic diversity within each. Sequence-based analyses of major histocompatibility complex (MHC) polymorphisms reveal extremely low diversity in Caribbean C. a. sabaeus vervets, compared to vervets from putatively ancestral West African regions. In the C. a. sabaeus research population, we discover the first structural variations that are, in some cases, predicted to have a deleterious effect; future studies will determine the phenotypic impact of these variations.