Motivation: Normalizing sequence variants on a reference, projecting them across congruent sequences, and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies that complicate analysis. NCBI’s genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants.

Results: The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF, and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm (VOCA), returns a unique, normalized representation termed the “Contextual Allele”. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset (ADS) of non-assembly NCBI RefSeq sequences (prefixed NM, NR, NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs, and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique “Canonical Allele”, and is used directly to aggregate variants across congruent sequences.

Availability: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0

Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Normalizing diverse representations of sequence variants is critical to the elucidation of the genetic basis of disease and biological function. NCBI has long wrestled with integrating data from multiple submitters to build databases such as dbSNP and ClinVar. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies and duplications that complicate analysis. Current tools are not robust enough to manage variants in different formats and different reference sequence coordinates.

Results: The SPDI (pronounced "speedy") data model defines variants as a sequence of four operations: start at the boundary before the first position in the sequence S, advance P positions, delete D positions, then insert the sequence in the string I, giving the data model its name, SPDI. The SPDI model can thus be applied to both nucleotide and protein variants, but the services discussed here are limited to nucleotide variants. Current services convert representations between HGVS, VCF, and SPDI and provide two forms of normalization. The first, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the "Contextual Allele" for any input. The SPDI name, with its four operations, defines exactly the reference subsequence potentially affected by the variant, even in low-complexity regions such as homopolymer and dinucleotide sequence repeats. The second level of normalization depends on an alignment dataset (ADS). SPDI services perform remapping (also known as lift-over) of variants from the input reference sequence to return a list of all equivalent Contextual Alleles based on the transcript or genomic sequences that were aligned. One of these Contextual Alleles, usually the one based on the latest genomic assembly such as GRCh38, is selected to represent them all and is designated the unique "Canonical Allele". ADS includes alignments between non-assembly RefSeq sequences (prefixed NM, NR, NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs, and NWs), and this allows for robust remapping and normalization of variants across sequences and assembly versions.

Availability and implementation: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0/
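To make the four SPDI operations concrete, the following is a minimal Python sketch, not NCBI's implementation. It assumes a colon-separated `sequence:position:deletion:insertion` string with the deletion spelled out as bases, and it works on a toy in-memory reference string. The names `Spdi`, `parse_spdi`, `apply_spdi`, and `expand_ambiguous` are hypothetical. The last function only illustrates the idea behind overprecision correction (widening an insertion or deletion to cover every reference position where it could equally be placed) and should not be read as the actual VOCA procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    sequence: str   # S: reference sequence identifier
    position: int   # P: number of positions to advance from the start boundary
    deletion: str   # D: reference bases to delete (spelled out here)
    insertion: str  # I: bases to insert


def parse_spdi(text: str) -> Spdi:
    """Parse 'S:P:D:I' (assumed format; deletion given as bases)."""
    seq, pos, deleted, inserted = text.split(":")
    return Spdi(seq, int(pos), deleted, inserted)


def apply_spdi(reference: str, v: Spdi) -> str:
    """Advance P positions, delete D, insert I; return the variant sequence."""
    end = v.position + len(v.deletion)
    if reference[v.position:end] != v.deletion:
        raise ValueError("deletion does not match the reference")
    return reference[:v.position] + v.insertion + reference[end:]


def expand_ambiguous(reference: str, v: Spdi) -> Spdi:
    """Simplified illustration of overprecision correction (an assumption,
    not NCBI's VOCA): widen a pure insertion or deletion to span the whole
    reference interval in which it could be placed equivalently."""
    alt = apply_spdi(reference, v)
    d, i, pos = v.deletion, v.insertion, v.position
    # Trim bases shared by the deleted and inserted strings.
    while d and i and d[0] == i[0]:
        d, i, pos = d[1:], i[1:], pos + 1
    while d and i and d[-1] == i[-1]:
        d, i = d[:-1], i[:-1]
    if d and i:
        return Spdi(v.sequence, pos, d, i)  # substitution or delins: no rolling
    if not d and not i:
        return v                            # no change to the reference
    # An insertion in the reference is a deletion in the variant sequence,
    # so roll the changed block along whichever sequence contains it.
    longer = reference if d else alt
    n = len(d or i)
    left, right = pos, pos + n
    while left > 0 and longer[left - 1] == longer[left - 1 + n]:
        left -= 1
    while right < len(longer) and longer[right] == longer[right - n]:
        right += 1
    ref_end = right if d else right - n
    alt_end = right - n if d else right
    return Spdi(v.sequence, left, reference[left:ref_end], alt[left:alt_end])


if __name__ == "__main__":
    toy_reference = "GCAAAATG"             # made-up sequence with an A homopolymer
    variant = parse_spdi("toy_ref:3:A:")    # delete one A inside the run
    print(apply_spdi(toy_reference, variant))
    print(expand_ambiguous(toy_reference, variant))
```

In this toy case, deleting any single A from the run yields the same variant sequence, so every such input expands to the same widened description (delete `AAAA`, insert `AAA` at position 2). That is the sense in which an overprecise position is replaced by the full affected reference subsequence.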