NCBI Taxonomy: a comprehensive update on curation, resources and tools

Schoch, Conrad L.; Ciufo, Stacy; Domrachev, Michael; Hotton, Carol L.; Kannan, Sivakumar; Khovanskaya, Rogneda; Leipe, Detlef D.; McVeigh, Richard; O’Neill, Kathleen; Robbertse, Barbara; Sharma, Shobha; Soussov, Vladimir; Sullivan, John P.; Sun, Lu; Turner, Seán; Karsch-Mizrachi, Ilene

doi:10.1093/database/baaa062

Cited by 1,239 publications

(1,050 citation statements)

References 66 publications

Citation statements by type: Supporting, Mentioning (884), Contrasting, Unclassified


“…FamDB files contain family consensi/HMMs and the NCBI Taxonomy data related to these families in a format that allows for fast offline access from the command line. The current release of FamDB includes all Dfam consensus sequences, HMMs, metadata, and 61,003 taxa from NCBI's taxonomy database [44] related to these families. Lookups for information on a single taxon or family complete in about a second; extraction of consensus sequences (FASTA, EMBL) or HMMs for all TE families found in Human (including ancestral repeats) complete in about 3 to 4 seconds.…”

Section: Software/tool Distribution Improvements (mentioning)

confidence: 99%

The Dfam Community Resource of Transposable Element Families, Sequence Models, and Genome Annotations

Storer, Hubley, Rosen, et al. 2020

Preprint


The 3.0-3.2 releases of Dfam (https://dfam.org) represent an evolution from a proof-of-principle collection of transposable element families in model organisms into a community resource for a broad range of species and for both curated and uncurated datasets. In addition, releases since Dfam 3.0 provide auxiliary consensus sequence models, transposable element protein alignments, and a formalized classification system to support the growing diversity of organisms represented in the resource. The latest release includes 266,740 new de novo generated transposable element families from 336 species contributed by the EBI. This expansion demonstrates the utility of many of Dfam’s new features and provides insight into the long term challenges ahead for improving de novo generated transposable element datasets.


“…We have already implemented functions in RESCRIPt to format the popular SILVA rRNA gene and NCBI GenBank databases, and are planning future support for parsing and editing other taxonomy formats, as well as mapping between these formats [71]. There are 4 codes of nomenclature as reviewed in [88]. In recent years, the explosion of high throughput sequencing technologies has allowed researchers to generate genomic data on many as yet uncultured microbial taxa. In fact, the rate at which novel genomic data can be acquired [94], and rapidly placed within a phylogenetic context [23], has surpassed our ability to appropriately resolve any conflicts with traditional Linnaean taxonomy.…”

Section: The Curation Problem (mentioning)

confidence: 99%

“…We conclude that the size and taxonomic comprehensiveness of SILVA are major assets, though GTDB and NCBI-RefSeq may be more suitable for various applications that respectively require greater taxonomic and phylogenetic rigor. The use of genomes sequenced from type material provides these two databases with a robust taxonomic and phylogenetic backbone that enables users to link natural history and experimental science [88,99]. NCBI-RefSeq's species records are extracted from data submissions to the International Nucleotide Sequence Database Collaboration (INSDC), i.e., NCBI-GenBank, the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ).…”

Section: The Evaluation Problem (mentioning)

confidence: 99%

“…Furthermore, NCBI-Taxonomy continually runs taxonomic consistency checks on assembled genomes with average nucleotide identity (ANI) [100]. These curational efforts result in a well integrated suite of biological information that can be interrogated through a variety of means and data types [25,88]. The GTDB extracts and curates data from both NCBI-RefSeq and NCBI-GenBank to generate a phylogeny of archaea and bacteria from roughly 120 ubiquitous single-copy proteins [23,87].…”

Section: The Evaluation Problem (mentioning)

confidence: 99%


RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

Robeson, O’Rourke, Kaehler, et al. 2020

Preprint



Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardizations limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases.

Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA, and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes.

Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


“…It is, therefore, not surprising that no unified, joint classification underpins the many online resources that house and curate mycological data. Indeed, a user is likely to find many differences when comparing the classifications used in, e.g., GenBank [6], MycoBank [7], UNITE [8], CoL/GBIF [9], and BOLD [10]. The classifications of each of these many resources evolve more or less independently over time; some resources seek to offer the latest developments and thus incorporate the results of all recent studies in systematics, whereas others prefer to adopt only the most well-vetted aspects of the new classifications.…”

Section: Introduction (mentioning)

confidence: 99%

The Taxon Hypothesis Paradigm—On the Unambiguous Detection and Communication of Taxa

et al. 2020


Here, we describe the taxon hypothesis (TH) paradigm, which covers the construction, identification, and communication of taxa as datasets. Defining taxa as datasets of individuals and their traits will make taxon identification and most importantly communication of taxa precise and reproducible. This will allow datasets with standardized and atomized traits to be used digitally in identification pipelines and communicated through persistent identifiers. Such datasets are particularly useful in the context of formally undescribed or even physically undiscovered species if data such as sequences from samples of environmental DNA (eDNA) are available. Implementing the TH paradigm will to some extent remove the impediment to hastily discover and formally describe all extant species in that the TH paradigm allows discovery and communication of new species and other taxa also in the absence of formal descriptions. The TH datasets can be connected to a taxonomic backbone providing access to the vast information associated with the tree of life. In parallel to the description of the TH paradigm, we demonstrate how it is implemented in the UNITE digital taxon communication system. UNITE TH datasets include rich data on individuals and their rDNA ITS sequences. These datasets are equipped with digital object identifiers (DOI) that serve to fix their identity in our communication. All datasets are also connected to a GBIF taxonomic backbone. Researchers processing their eDNA samples using UNITE datasets will, thus, be able to publish their findings as taxon occurrences in the GBIF data portal. UNITE species hypothesis (species level THs) datasets are increasingly utilized in taxon identification pipelines and even formally undescribed species can be identified and communicated by using UNITE. The TH paradigm seeks to achieve unambiguous, unique, and traceable communication of taxa and their properties at any level of the tree of life. 
It offers a rapid way to discover and communicate undescribed species in identification pipelines and data portals before they are lost to the sixth mass extinction.

