To evaluate and compare the performance of variant calling methods and their confidence scores, a test call set must be compared against a "gold standard". Unfortunately, such comparisons are not straightforward with Variant Call Format (VCF) files, the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by differing representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, which produces misleading results. A variant caller is inherently a classification method that assigns confidence scores to putative variants, and these scores can be used to control the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are effective metrics for evaluating a test call set against a gold standard. However, for VCF data this also requires special accounting for discrepant representations. We developed a novel algorithm for comparing variant call sets that handles complex representation discrepancies through a dynamic programming method, which minimizes false positives and false negatives globally across the entire call sets for accurate performance evaluation of VCFs.
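The ROC and AUC evaluation mentioned in this abstract can be illustrated with a minimal sketch. It assumes each test call has already been matched against the gold standard (the representation-aware matching is the hard part the abstract addresses, and is not reproduced here). The function names `roc_points` and `auc` and the input layout are hypothetical, not taken from the authors' software. Because true negatives are not well defined for variant calls, the curve is built from cumulative FP and TP counts.

```python
from typing import List, Tuple


def roc_points(calls: List[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """Return cumulative (false_positives, true_positives) pairs as the
    score threshold is lowered from the most to the least confident call.

    `calls` holds (confidence_score, is_true_positive) pairs; the boolean
    is assumed to come from a prior comparison against the gold standard.
    """
    points = [(0, 0)]
    fp = tp = 0
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    i = 0
    while i < len(ordered):
        score = ordered[i][0]
        # Calls tied on score pass the threshold together.
        while i < len(ordered) and ordered[i][0] == score:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp, tp))
    return points


def auc(points: List[Tuple[int, int]]) -> float:
    """Trapezoidal area under the (FP, TP) curve, normalized to [0, 1]."""
    max_fp, max_tp = points[-1]
    if max_fp == 0 or max_tp == 0:
        return float("nan")  # undefined without both classes present
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (max_fp * max_tp)


# Toy data: three true calls and two false calls.
example_calls = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
example_points = roc_points(example_calls)
example_auc = auc(example_points)
```

Normalizing by the final FP and TP counts keeps the area comparable between call sets of different sizes, but it only measures ranking quality among the calls that were made; gold-standard variants missed entirely by the caller would need to be counted separately.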
The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyzes data from all members of a pedigree simultaneously using Mendelian segregation priors, yet providing the ability to detect de novo mutations in offspring, and is scalable to large pedigrees. We evaluated our method by simulations and analysis of whole-genome sequencing (WGS) data from a 17-individual, 3-generation CEPH pedigree sequenced to 50× average depth. Compared with singleton calling, our family caller produced more high-quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance, and by comparing to ground truth variant sets available for this sample. We identify all previously validated de novo mutations in NA12878, concurrent with a 7× precision improvement. Our results show that our method is scalable to large genomics and human disease studies.
The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyses data from all members of a pedigree simultaneously using Mendelian segregation priors, while retaining the ability to detect de novo mutations in offspring, and is scalable to large pedigrees. We evaluated our method by simulations and analysis of whole-genome sequencing (WGS) data from a 17-individual, 3-generation CEPH pedigree sequenced to 50× average depth. Compared with singleton calling, our family caller produced more high-quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance. We developed a ground truth dataset to further evaluate our calls by identifying recombination cross-overs in the pedigree and testing variants for consistency with the inferred phasing, and we show that our method significantly outperforms singleton and population variant calling in pedigrees. We identify all previously validated de novo mutations in NA12878, concurrent with a 7× precision improvement. Our results show that our method is scalable to large genomics and human disease studies and allows cost optimization by rational distribution of sequencing capacity.
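A Mendelian segregation prior that still leaves room for de novo mutations can be sketched for a single biallelic site in a trio. This is an illustrative simplification, not the authors' Bayesian network: genotypes are encoded as alternate-allele counts (0, 1, 2), each parent transmits one allele uniformly at random, and the transmitted allele flips with a small probability `mu`. The function names and the default value of `mu` are assumptions made for the example.

```python
def transmit_alt_prob(parent_gt: int, mu: float) -> float:
    """Probability that a parent with `parent_gt` alternate alleles
    (0, 1, or 2) passes an alternate allele to the child, allowing the
    transmitted allele to mutate with probability `mu`."""
    if parent_gt not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 alternate alleles")
    p_alt = parent_gt / 2.0
    return p_alt * (1.0 - mu) + (1.0 - p_alt) * mu


def child_prior(child_gt: int, mother_gt: int, father_gt: int,
                mu: float = 1e-8) -> float:
    """Prior probability of the child's genotype given both parents.

    With mu == 0 this is strict Mendelian segregation; with mu > 0,
    genotypes that violate Mendelian inheritance get a small but
    nonzero prior, so strong read evidence can still support a
    de novo call.
    """
    pm = transmit_alt_prob(mother_gt, mu)
    pf = transmit_alt_prob(father_gt, mu)
    if child_gt == 0:
        return (1.0 - pm) * (1.0 - pf)
    if child_gt == 1:
        return pm * (1.0 - pf) + (1.0 - pm) * pf
    if child_gt == 2:
        return pm * pf
    raise ValueError("genotype must be 0, 1, or 2 alternate alleles")


# Two heterozygous parents: the child is heterozygous half of the time.
het_from_two_hets = child_prior(1, 1, 1, mu=0.0)
# Two homozygous-reference parents: a heterozygous child is a de novo event.
de_novo_het = child_prior(1, 0, 0)
```

In a joint caller, a prior of this kind would be multiplied by per-sample genotype likelihoods computed from the reads, so that a Mendelian-inconsistent genotype is called only when the data outweigh the small prior.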
Tumor molecular profiling is rapidly becoming the standard clinical test for selecting targeted therapies in refractory cancer patients. DNA extracted from patient samples is enriched for cancer genes and sequenced to identify actionable somatic mutations therein. A major challenge arises when tumor-derived data is analyzed in the absence of normal tissue data, as is common in clinical scenarios. The distinction between somatic and germline variants becomes difficult, leaving clinicians to resort to crude heuristic filtering. We present here variant calling software, developed under quality system regulation protocols, capable of accurately identifying somatic mutations from targeted next-generation sequencing data. A novel Bayesian network approach models the distribution of reads harboring germline and somatic mutations, estimates the contamination from normal tissue in the sample, scores somatic mutations, and imputes germline variants, without matched normal tissue data. This approach also allows joint analysis of multiple specimens from the same patient (e.g. FFPE and ctDNA), when available, improving the limit of detection. To improve specificity, our caller can also use prior information from different databases, including somatic mutations, germline variation, and healthy controls data, in a principled fashion. We validated our method by analyzing data from the TOMA OS-Seq 131 cancer gene panel using the Illumina platform. Sample inputs ranging from 2-600 ng of DNA were sequenced to a depth of >1000X, achieving on-target rates ≤73% and uniformity ≥ 3.2 fold-80 penalty. Through adaptors with molecular barcodes we measured a median duplicate rate <2.
We analyzed somatic mutations simulated at various variant allele fractions on a background of data from Genome-in-a-Bottle consortium reference samples, data from a dilution series of two reference samples, and several commercial control and clinical samples, including matched FFPE, PBMC, and ctDNA specimens. In the absence of normal tissue, our method scores each variant by its likelihood of being somatic or germline. We show that, compared with other commonly used methods, our algorithm achieves a higher true positive rate while controlling the false discovery rate at 1%. We also show that by jointly analyzing serial samples (e.g. ctDNA) we can improve sensitivity for shared variants. In conclusion, our caller outperforms currently used academic software developed for research projects and is particularly well suited to clinical use cases.
Note: This abstract was not presented at the meeting.
Citation Format: Francisco M. De La Vega, Sean Irvine, David Ware, Kurt Gaastra, Yannick Pouiliot, Len Trigg. Accurate identification of somatic mutations in cancer patient specimens in the lack of normal tissue by targeted high-throughput sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 3576. doi:10.1158/1538-7445.AM2017-3576
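The tumor-only abstract above describes scoring each variant as somatic or germline while accounting for normal-tissue contamination. The sketch below shows one simple way such a score could be formed, and it is not the authors' Bayesian network. It assumes a diploid, copy-neutral locus, a clonal heterozygous somatic mutation, a heterozygous germline alternative, and no sequencing error. Under those assumptions a germline variant is expected at an allele fraction of 0.5 and a somatic variant at (1 - contamination) / 2. The function names and the parameter `normal_contamination` are hypothetical.

```python
import math


def binom_logpmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def somatic_log_odds(alt_reads: int, depth: int,
                     normal_contamination: float) -> float:
    """Log-likelihood ratio of 'somatic' versus 'heterozygous germline'.

    Positive values favor a somatic origin. `normal_contamination` is
    the assumed fraction of normal cells in the tumor specimen.
    """
    if not 0.0 <= normal_contamination < 1.0:
        raise ValueError("normal_contamination must be in [0, 1)")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    somatic_vaf = (1.0 - normal_contamination) / 2.0   # mutation in tumor cells only
    germline_vaf = 0.5                                 # present in every cell
    return (binom_logpmf(alt_reads, depth, somatic_vaf)
            - binom_logpmf(alt_reads, depth, germline_vaf))


# At 60% normal contamination a somatic variant is expected near 20% VAF.
likely_somatic = somatic_log_odds(alt_reads=200, depth=1000, normal_contamination=0.6)
likely_germline = somatic_log_odds(alt_reads=500, depth=1000, normal_contamination=0.6)
```

Under this simplified model a perfectly pure tumor (contamination of 0) gives a log-odds of 0 for every variant, because the two expected allele fractions coincide. Some normal-cell admixture, or additional evidence such as database priors, is what makes the distinction possible without a matched normal.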