An open resource for accurately benchmarking small variant and reference calls

Zook, Justin M.; McDaniel, Jennifer; Olson, Nathan D.; Wagner, Justin; Parikh, Hemang; Heaton, Haynes; Irvine, Sean A.; Trigg, Len; Truty, Rebecca; McLean, Cory Y.; Vega, Francisco; Xiao, Chunlin; Sherry, Stephen T.; Salit, Marc

doi:10.1038/s41587-019-0074-6

Cited by 304 publications

(377 citation statements)

References 29 publications

Supporting

Mentioning

360

Contrasting

Unclassified


“…To enable the community to benchmark these methods, the Genome in a Bottle Consortium (GIAB) here developed benchmark SV calls and benchmark regions for the son (HG002/NA24385) in a broadly consented and available Ashkenazi Jewish trio from the Personal Genome Project, 7 which are disseminated as National Institute of Standards and Technology (NIST) Reference Material 8392. 8,9 Many approaches have been developed to detect SVs from different sequencing technologies.…”

Section: Introduction (mentioning)

confidence: 99%

“…18,19 Finally, optical mapping and electronic mapping provide an orthogonal approach capable of determining the approximate size and location of insertions, deletions, inversions, and translocations while spanning even very large SVs. [20][21][22] GIAB recently published benchmark sets for small variants for seven genomes, 9,23 and the Global Alliance for Genomics and Health Benchmarking Team established best practices for using these and other benchmark sets to benchmark germline variants. 24 These benchmark sets are widely used in developing, optimizing, and demonstrating new technologies and bioinformatics methods, as well as part of clinical laboratory validation.…”

Section: Introduction (mentioning)

confidence: 99%


A robust benchmark for germline structural variant detection

Zook¹,

Nf²,

Nd³

et al. 2019

Preprint

Self Cite


New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. Translating these methods to routine research and clinical practice requires robust benchmark sets. We developed the first benchmark set for identification of both false negative and false positive germline SVs, which complements recent efforts emphasizing increasingly comprehensive characterization of SVs. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods, both alignment- and de novo assembly-based, from short-, linked-, and long-read sequencing, as well as optical and electronic mapping. The final benchmark set contains 12745 isolated, sequence-resolved insertion and deletion calls ≥50 base pairs (bp) discovered by at least 2 technologies or 5 callsets, genotyped as heterozygous or homozygous variants by long reads. The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.66 Gbp and 9641 SVs supported by at least one diploid assembly. Support for SVs was assessed using svviz with short-, linked-, and long-read sequence data. In general, there was strong support from multiple technologies for the benchmark SVs, with 90% of the Tier 1 SVs having support in reads from more than one technology. The Mendelian genotype error rate was 0.3%, and genotype concordance with manual curation was >98.7%. We demonstrate the utility of the benchmark set by showing it reliably identifies both false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
GIAB is working towards a new version of the benchmark set that will use new technologies and methods such as PacBio Circular Consensus Sequencing and ultralong Oxford Nanopore sequencing to expand to more challenging genome regions and include more challenging SVs such as inversions. We are also developing a robust integration process to make calls on GRCh37 and GRCh38 for all seven GIAB samples.
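
The support criterion quoted in the abstract above ("discovered by at least 2 technologies or 5 callsets") can be illustrated with a minimal Python sketch. This is not the GIAB integration pipeline: the `CandidateSV` record, its field names, and the example technology and callset labels are hypothetical, and only the two thresholds come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSV:
    """Hypothetical record for one candidate structural variant."""
    sv_id: str
    technologies: frozenset  # technologies whose callsets contain the SV
    callsets: frozenset      # individual callsets that contain the SV


def has_sufficient_support(sv, min_technologies=2, min_callsets=5):
    """Return True if the SV is seen by >= 2 technologies OR >= 5 callsets.

    The thresholds follow the abstract; everything else is an assumption.
    """
    return (len(sv.technologies) >= min_technologies
            or len(sv.callsets) >= min_callsets)


# Made-up candidates used only to exercise the rule.
candidates = [
    CandidateSV("sv1", frozenset({"short-read", "long-read"}),
                frozenset({"a", "b"})),
    CandidateSV("sv2", frozenset({"long-read"}),
                frozenset({"a", "b", "c", "d", "e"})),
    CandidateSV("sv3", frozenset({"short-read"}),
                frozenset({"a", "b", "c"})),
]

kept_ids = [sv.sv_id for sv in candidates if has_sufficient_support(sv)]
```

Here `sv1` passes on the technology count, `sv2` passes on the callset count, and `sv3` meets neither threshold, so it is left out of `kept_ids`.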


“…We first evaluated assembly-based SNP and small-indel (<50bp) detection by comparing Aquila's calls against the Genome in a Bottle (GiaB) benchmark callsets (Zook et al 2019). The libraries with the best assembly statistics, L3 (from NA12878) and L5 (from NA24385), achieved 97.4% and 97.8% accuracy (F1 metric) for SNPs (Table 2; Supplemental Table S2) and >93% accuracy for the high-confidence set of GiaB small indels (Table 3; Supplemental Table S3).…”

Section: Assembly-based Detection Of Snps and Small Indels (mentioning)

confidence: 99%

Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads

Zhou

Zhang

Weng

et al. 2019

Preprint


Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.
Stancu et al. 2017). However, the drawback of both approaches is that they exhibit poor basepair level accuracy, leading to high error rates for SNPs and imprecise breakpoint estimation for small indels and SVs. A widely applied solution has been to supplement long reads with higher quality short read data, but these ensemble approaches are difficult to scale to larger cohorts due to the complexity of data generation, integration, and analysis, and have therefore been limited to small sample sizes in proof-of-principle studies (Rhoads and Au, 2015; Fan et al., 2017).
A solution to making long reads more accurate is to sequence the same single molecule multiple times to reduce error, for example as implemented in the PacBio circular consensus sequencing (CCS) approach (Travers et al. 2010; Larsen et al. 2014; Wenger et al. 2019). However, CCS requires several-fold oversampling of the same molecule, a currently expensive proposition for anything but small sample sizes.
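
The citation statement for this paper reports accuracy as an F1 metric. As a reminder of how that figure is commonly derived when a callset is compared against a benchmark, here is a minimal Python sketch using the standard definitions of precision, recall, and F1. The function name and the counts in the example are illustrative assumptions, not values from Aquila or GIAB.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from benchmark comparison counts.

    tp: calls that match a benchmark variant
    fp: calls inside the benchmark regions with no matching benchmark variant
    fn: benchmark variants that the callset missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because F1 is a harmonic mean, a callset cannot reach a high score by being strong on only one of precision or recall, which is why it is a common single-number summary in variant-calling comparisons.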


“…Given that small variant callers today use only one type of sequencing data, and as a result consistently make erroneous calls in certain types of regions (e.g., indel calls in low-complexity regions) due to the error modes characteristic of a single sequencing technology, it is likely that the importance of variants in such regions may be less well-understood today. In addition, currently accepted benchmarks for variant calling such as Genome-In-A-Bottle (Zook, et al, 2019) have uncharacterized regions in the genome which may carry variants of significance. Some of these regions cannot be characterized due to the reliance, solely, on one type of sequencing data (namely short reads).…”

Section: Introduction (mentioning)

confidence: 99%

HELLO: A hybrid variant calling approach

Ramachandran

Lumetta

Klee

et al. 2020

Preprint


Next Generation Sequencing (NGS) technologies that cost-effectively characterize genomic regions and identify sequence variations using short reads are the current standard for genome sequencing. However, calling small indels in low-complexity regions of the genome using NGS is challenging. Recent advances in Third Generation Sequencing (TGS) provide long reads, which call large structural variants accurately. However, these reads have context-dependent indel errors in low-complexity regions, resulting in lower accuracy of small indel calls compared to NGS reads. When both small and large structural variants need to be called, both NGS and TGS reads may be available. Integration of the two data types with unique error profiles could improve robustness of small variant calling in challenging cases. However, there isn't currently such a method integrating both types of data. We present a novel method that integrates NGS and TGS reads to call small variants. We leverage the Mixture of Experts paradigm which uses an ensemble of Deep Neural Networks (DNN), each processing a different data type to make predictions. We present improvements in our DNN design compared to previous work such as sequence processing using one-dimensional convolutions instead of image processing using two-dimensional convolutions and an algorithm to efficiently process sites with many variant candidates, which help us reduce computations. Using our method to integrate Illumina and PacBio reads, we find a reduction in the number of erroneous small variant calls of up to ~30%, compared to the state-of-the-art using only Illumina data. We also find improvements in calling small indels in low-complexity regions.

