Mei‐Jie Yang scite author profile

As NLM’s Conserved Domain Database (CDD) enters its 20th year of operations as a publicly available resource, CDD curation staff continues to develop hierarchical classifications of widely distributed protein domain families, and to record conserved sites associated with molecular function, so that they can be mapped onto user queries in support of hypothesis-driven biomolecular research. CDD offers both an archive of pre-computed domain annotations as well as live search services for both single protein or nucleotide queries and larger sets of protein query sequences. CDD staff has continued to characterize protein families via conserved domain architectures and has built up a significant corpus of curated domain architectures in support of naming bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

show abstract

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

O’Neill

Haft

et al. 2020

633

493

View full text Add to dashboard Cite

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

show abstract

NCBI's Conserved Domain Database and Tools for Protein Domain Analysis

Yang

Derbyshire

Yamashita

et al. 2019

CP in Bioinformatics

166

107

View full text Add to dashboard Cite

The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer‐grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre‐computed domain annotations for a selected subset of sequences tracked by the NCBI's Entrez protein database. These can be retrieved or computed for a single sequence using CD‐Search or in bulk using Batch CD‐Search, or computed via standalone RPS‐BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD‐Search (Basic Protocol 1), a Batch CD‐Search (Basic Protocol 2), and a Standalone RPS‐BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors. Basic Protocol 1: CD‐search Basic Protocol 2: Batch CD‐search Basic Protocol 3: Standalone RPS‐BLAST and rpsbproc

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Mei‐Jie Yang

CDD/SPARCLE: the conserved domain database in 2020

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

NCBI's Conserved Domain Database and Tools for Protein Domain Analysis

Contact Info

Product

Resources

About