Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

Barbitoff, Yury A.; Abasov, Ruslan; Tvorogova, Varvara E.; Glotov, Andrey S.; Predeus, Alexander V.

doi:10.1186/s12864-022-08365-3

Cited by 45 publications

(48 citation statements)

References 45 publications

“…The hap.py [ 24 ] tool was used for benchmarking, which is a reference implementation of the GA4GH recommendations for variant caller benchmarking with the “vcfeval” engine for comparison; it generated metrics as “False positive”, “False negative”, “True positive”, “Precision”, “Recall”, and “F1 score”. It was found that three metrics are the most important for variant caller performance evaluations, which are “Precision”, “Recall”, and, most importantly, “F1 score”, which is the mean of precision and recall and is commonly used to test the performance of the callers [ 56 , 57 , 58 ].…”

Section: Discussion (mentioning)

confidence: 99%
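
For reference, the three metrics named in the statement above can be computed from the true-positive, false-positive, and false-negative counts that a benchmarking tool reports. The sketch below is illustrative Python, not hap.py code; the function name and the example counts are hypothetical. It uses the conventional definition of the F1 score as the harmonic mean of precision and recall, which is lower than their arithmetic mean whenever the two differ.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from variant-comparison counts.

    tp, fp, fn are hypothetical counts of true-positive, false-positive,
    and false-negative calls against a truth set.
    """
    # Precision: fraction of reported calls that are in the truth set.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truth-set variants that were reported.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0.0 if both are zero).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these made-up counts, precision is 0.9 and recall is 0.75, so the harmonic mean falls below the arithmetic mean of 0.825.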

“…Another benefit of MinION over second-generation sequencers is its mobility and ease of use for library preparation and sequencing, as well as its low cost. There are currently many custom/academic or commercial BRCA1/2 target panels that have been established in recent years because of investigations on the use and impact of NGS in breast/ovarian cancer [ 56 , 60 , 61 , 62 ], the majority of which are based on the amplicon sequencing technique. There are currently many commercial short-read amplicon-based BRCA gene panels available that detect SNV and/or copy number variation.…”

Section: Discussion (mentioning)

confidence: 99%

Evaluation of the Available Variant Calling Tools for Oxford Nanopore Sequencing in Breast Cancer

Helal, Saad, Saad et al. 2022. Genes

The goal of biomarker testing in personalized medicine is to guide treatment toward the best possible result for each patient. Accurate and reliable identification of each individual's genomic variants is essential for the success of clinical genomics based on third-generation sequencing. Different variant calling techniques have been used and recommended by both Oxford Nanopore Technologies (ONT) and the Nanopore community. A thorough examination of the variant callers may give critical guidance for third-generation sequencing-based clinical genomics. In this study, two reference genome samples (NA12878 and NA24385) and the high-confidence variant call sets provided by Genome in a Bottle (GIAB) were used to evaluate the performance of six variant calling tools (Human-SNP-wf, Clair3, Clair, NanoCaller, Longshot, and Medaka) as an integral step in the in-house variant detection workflow. Of the six variant callers under study, Clair3 and Human-SNP-wf, which incorporates Clair3, achieved the highest performance. Results were expressed in terms of Precision, Recall, and F1-score, computed with the hap.py tool. In conclusion, our findings give important insights for accurately identifying variants from third-generation sequencing of personal genomes using the variant detection tools available for long-read sequencing.

“…Regarding the actual potential reduction in false positives, an estimation can be provided based on recent studies where multiple variant callers are evaluated ( Barbitoff et al., 2022 ; Lin et al., 2018 ) and combined ( Zhao et al., 2020 ). In ( Zhao et al., 2020 ), the authors benchmarked GATK, the Illumina DRAGEN-based caller and DeepVariant using human genome data and it was shown that the average F1-score for SNP detection across 4 datasets is 0.990 for GATK without Variant Quality Score Recalibration and 0.969 for GATK with Variant Quality Score Recalibration.…”

Section: Expected Outcomes (mentioning)

confidence: 99%

“…In addition, in ( Lin et al., 2018 ), the comparison of GATK with DeepVariant, when applied on the analysis of trios, showed that DeepVariant made fewer calls, but with a lower false positive rate. In addition, in ( Barbitoff et al., 2022 ), the F1-scores calculated for the three methods when applied on Whole Exome Sequencing were 0.996 for DeepVariant, 0.985 for GATK and 0.987 for FreeBayes. Based on these results and the aforementioned results regarding algorithm combination, we expect the overall F1-score to be >0.996.…”

Section: Expected Outcomes (mentioning)

confidence: 99%

Protocol for unbiased, consolidated variant calling from whole exome sequencing data

2022

“…In concert with the recent advances in high-throughput sequencing technology, many software tools have been developed for the computational processing of genomic sequencing data, each with their own distinct errors and biases. In fact, systematic comparisons of computational pipelines across multiple high-throughput sequencing platforms indicated high divergence and low concordance among the identified variants (O’Rawe et al 2013 ; Pirooznia et al 2014 ; Hwang et al 2015 ; Chen et al 2019 ; Kumaran et al 2019 ; Krishnan et al 2021 ; Barbitoff et al 2022 ). Such differences in performance are particularly problematic for the identification of spontaneous (de novo) mutations as well as rare variants, with differences in pipeline design leading to several-fold variation in estimated mutation rates (Pfeifer 2021 ; Bergeron et al 2022 ) as well as high rates of missed variants (Peng et al 2013 ).…”

Section: Introduction (mentioning)

confidence: 99%

Performance evaluation of six popular short-read simulators

Milhaven, Pfeifer. 2022. Heredity

High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas “gold-standard” empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design—yet, the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators—ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim—and discuss important considerations for selecting suitable models for benchmarking.
