The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource.
Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that for all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All existing species contain genes that are all present in members of the species Cowpox virus and that cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche.
FDA proactively invests in tools to support innovation of emerging technologies, such as infectious disease next generation sequencing (ID-NGS). Here, we introduce FDA-ARGOS quality-controlled reference genomes as a public database for diagnostic purposes and demonstrate its utility on the example of two use cases. We provide quality control metrics for the FDA-ARGOS genomic database resource and outline the need for genome quality gap filling in the public domain. In the first use case, we show more accurate microbial identification of Enterococcus avium from metagenomic samples with FDA-ARGOS reference genomes compared to non-curated GenBank genomes. In the second use case, we demonstrate the utility of FDA-ARGOS reference genomes for Ebola virus target sequence comparison as part of a composite validation strategy for ID-NGS diagnostic tests. The use of FDA-ARGOS as an in silico target sequence comparator tool combined with representative clinical testing could reduce the burden for completing ID-NGS clinical trials.
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
Poxviruses are composed of large double-stranded DNA (dsDNA) genomes coding for several hundred genes whose variation has supported virus adaptation to a wide variety of hosts over their long evolutionary history. Comparative genomics has suggested that the Orthopoxvirus genus in particular has undergone reductive evolution, with the most recent common ancestor likely possessing a gene complement consisting of all genes present in any existing modern-day orthopoxvirus species, similar to the current Cowpox virus species. As orthopoxviruses adapt to new environments, the selection pressure on individual genes may be altered, driving sequence divergence and possible loss of function. This is evidenced by accumulation of mutations and loss of protein-coding open reading frames (ORFs) that progress from individual missense mutations to gene truncation through the introduction of early stop mutations (ESMs), gene fragmentation, and in some cases, a total loss of the ORF. In this study, we have constructed a whole-genome alignment for representative isolates from each Orthopoxvirus species and used it to identify the nucleotide-level changes that have led to gene content variation. By identifying the changes that have led to ESMs, we were able to determine that short indels were the major cause of gene truncations and that the genome length is inversely proportional to the number of ESMs present. We also identified the number and types of protein functional motifs still present in truncated genes to assess their functional significance. IMPORTANCEThis work contributes to our understanding of reductive evolution in poxviruses by identifying genomic remnants such as single nucleotide polymorphisms (SNPs) and indels left behind by evolutionary processes. Our comprehensive analysis of the genomic changes leading to gene truncation and fragmentation was able to detect some of the remnants of these evolutionary processes still present in orthopoxvirus genomes and suggests that these viruses are under continual adaptation due to changes in their environment. These results further our understanding of the evolutionary mechanisms that drive virus variation, allowing orthopoxviruses to adapt to particular environmental niches. Understanding the evolutionary history of these virus pathogens may help predict their future evolutionary potential.
Background: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. Results: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Other viruses with high numbers of submissions will be added incrementally. Conclusion: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allows researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.
RNA viruses that contain single-stranded RNA genomes of positive sense make up the largest group of pathogens infecting honey bees. Sacbrood virus (SBV) is one of the most widely distributed honey bee viruses and infects the larvae of honey bees, resulting in failure to pupate and death. Among all of the viruses infecting honey bees, SBV has the greatest number of complete genomes isolated from both European honey bees Apis mellifera and Asian honey bees A. cerana worldwide. To enhance our understanding of the evolution and pathogenicity of SBV, in this study, we present the first report of whole genome sequences of two U.S. strains of SBV. The complete genome sequences of the two U.S. SBV strains were deposited in GenBank under accession numbers: MG545286.1 and MG545287.1. Both SBV strains show the typical genomic features of the Iflaviridae family. The phylogenetic analysis of the single polyprotein coding region of the U.S. strains, and other GenBank SBV submissions revealed that SBV strains split into two distinct lineages, possibly reflecting host affiliation. The phylogenetic analysis based on the 5′UTR revealed a monophyletic clade with the deep parts of the tree occupied by SBV strains from both A. cerane and A. mellifera, and the tips of branches of the tree occupied by SBV strains from A. mellifera. The study of the cold stress on the pathogenesis of the SBV infection showed that cold stress could have profound effects on sacbrood disease severity manifested by increased mortality of infected larvae. This result suggests that the high prevalence of sacbrood disease in early spring may be due to the fluctuating temperatures during the season. This study will contribute to a better understanding of the evolution and pathogenesis of SBV infection in honey bees, and have important epidemiological relevance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.