The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole-exome sequencing (WES) and whole-genome sequencing (WGS), are often debated. WES has dominated large-scale resequencing projects because of its lower cost and easier data storage and processing. The rapid development of third-generation sequencing methods and novel exome sequencing kits calls for a robust statistical framework that allows informative and easy performance comparison of the emerging methods. In our study, we developed a set of statistical tools to systematically assess the coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for the mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contributions of the major determinants of coding sequence (CDS) coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. We also identified a ~500 kb region of the human exome that cannot be effectively characterized using short-read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.
Next-generation sequencing (NGS) is rapidly becoming an invaluable tool in human genetics research and clinical diagnostics [1-3]. Practical use of NGS methods has increased dramatically with the development of targeted sequencing approaches, such as whole-exome sequencing (WES) or targeted sequencing of gene panels.
WES emerged as an efficient alternative to whole-genome sequencing (WGS) due to both lower sequencing cost and simpler variant analysis and data storage [4]. More than 80% of all variants reported in ClinVar, and more than 89% of variants reported to be pathogenic, come from the protein-coding part of the genome; this number increases to 99% when the immediate vicinity of the CDS is included. Even allowing for sampling bias, there is general agreement that most heritable diseases appear to be caused by alterations in the protein-coding regions of the
Background: Allele frequency data from large exome and genome aggregation projects such as the Genome Aggregation Database (gnomAD) are of utmost importance to the interpretation of medical resequencing data. However, allele frequencies might differ significantly in poorly studied populations that are underrepresented in large-scale projects, such as the Russian population.
Methods: In this work, we leveraged our access to a large dataset of 694 exome samples to analyze genetic variation in Northwest Russia. We compared the spectrum of genetic variants to dbSNP build 151 and estimated the prevalence of ClinVar-based autosomal recessive (AR) disease alleles as compared to gnomAD r2.1.
Results: An estimated 9.3% of discovered variants were not present in dbSNP. We report statistically significant overrepresentation of pathogenic variants for several Mendelian disorders, including phenylketonuria (PAH, rs5030858), Wilson's disease (ATP7B, rs76151636), factor VII deficiency (F7, rs36209567), the kyphoscoliotic type of Ehlers-Danlos syndrome (FKBP14, rs542489955), and several other recessive pathologies. We also make primary estimates of monogenic disease incidence in the population, with retinal dystrophy, cystic fibrosis, and phenylketonuria being the most frequent AR pathologies.
Conclusion: Our observations demonstrate the utility of population-specific allele frequency data for the diagnosis of monogenic disorders using high-throughput technologies.
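The abstract above does not say how the incidence estimates were derived. One common approach, shown here only as a minimal sketch, assumes Hardy-Weinberg equilibrium: the pathogenic allele frequencies of a gene are summed into a combined frequency q, and the expected incidence of the recessive disease is q squared. The function names and the example counts are hypothetical and are not taken from the study.

```python
from typing import Iterable


def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Allele frequency from an allele count (AC) and total allele number (AN)."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return allele_count / allele_number


def carrier_frequency(q: float) -> float:
    """Expected heterozygous carrier frequency, 2q(1 - q), under Hardy-Weinberg."""
    return 2.0 * q * (1.0 - q)


def ar_incidence(pathogenic_allele_freqs: Iterable[float]) -> float:
    """Expected incidence of an autosomal recessive disease for one gene.

    Assumption: all listed pathogenic alleles of the gene are pooled into a
    single combined frequency q, and affected individuals carry any two of
    them, so the incidence is q squared.
    """
    q = sum(pathogenic_allele_freqs)
    return q * q


if __name__ == "__main__":
    # Hypothetical counts: 694 diploid exomes give 1388 autosomal alleles.
    q = allele_frequency(14, 2 * 694)
    print(f"allele frequency: {q:.5f}")
    print(f"carrier frequency: {carrier_frequency(q):.5f}")
    print(f"expected incidence: 1 in {1 / ar_incidence([q]):.0f}")
```

Such estimates are sensitive to the small allele counts typical of a cohort of this size, so they are best read as order-of-magnitude figures.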
Purpose: We comprehensively assessed the influence of reference minor alleles (RMAs), one of the inherent problems of the human reference genome sequence.
Methods: The variant call format (VCF) files provided by the 1000 Genomes Project and the Exome Aggregation Consortium (ExAC) were used to identify RMA sites. All coding RMA sites were checked for concordance with UniProt and for the presence of same-codon variants. RMA-corrected predictions of functional effect were obtained with the SIFT, PolyPhen-2, and PROVEAN standalone tools and compared with dbNSFP v2.9 for consistency.
Results: We systematically characterized the problem of RMAs and identified several ways in which RMAs could interfere with accurate variant discovery and annotation. We discovered a systematic bias in automated variant effect prediction at RMA loci, as well as widespread switching of functional consequences for variants located in the same codon as an RMA. As a convenient way to address the problem of RMAs, we developed a simple bioinformatic tool that identifies variation at RMA sites and provides correct annotations for all such substitutions. The tool is available free of charge at http://rmahunter.bioinf.me.
Conclusion: Correction of RMA annotation enhances the accuracy of next-generation sequencing-based methods in clinical practice.
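To make the notion of an RMA site concrete, the sketch below flags a VCF record when the reference allele is less frequent than at least one alternate allele. It assumes that the record carries the standard `AF` INFO key with one value per alternate allele. The function names are hypothetical, and this is not the logic of the tool mentioned in the abstract.

```python
from typing import Dict


def parse_info(info: str) -> Dict[str, str]:
    """Parse the key=value pairs of a VCF INFO column (flag-only keys are skipped)."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def is_rma_site(vcf_line: str) -> bool:
    """Return True if the reference allele is rarer than some alternate allele.

    Assumption: INFO/AF lists one frequency per ALT allele, so the reference
    allele frequency is 1 minus their sum. Records without AF return False.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth VCF column
    if "AF" not in info:
        return False
    alt_freqs = [float(x) for x in info["AF"].split(",")]
    ref_freq = 1.0 - sum(alt_freqs)
    return max(alt_freqs) > ref_freq


if __name__ == "__main__":
    # Hypothetical record: the ALT allele G has a frequency of 0.8.
    record = "1\t100\t.\tA\tG\t.\tPASS\tAF=0.8"
    print(is_rma_site(record))
```

For a biallelic site the check reduces to an alternate allele frequency above 0.5. A population-specific frequency field could be substituted for `AF` if the reference allele is minor only in some populations.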
Next-generation DNA sequencing technologies are rapidly transforming the world of human genomics. The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole-exome sequencing (WES) and whole-genome sequencing (WGS), are still frequently debated. In our study, we developed a set of statistical tools to systematically assess the coverage of CDS regions provided by several modern WES platforms, as well as PCR-free WGS. Using several novel metrics to characterize exon coverage in WES and WGS, we showed that some WES platforms achieve substantially less biased CDS coverage than others, with lower within- and between-interval variation and virtually absent GC-content bias. We discovered that, contrary to a common view, most of the coverage bias in WES stems from mappability limitations of short reads, as well as exome probe design. We identified a ~500 kb region of the human exome that cannot be effectively characterized using short-read technology. We also showed that the overall power for SNP and indel discovery in CDS regions is virtually indistinguishable between WGS and the best WES platforms. Our results indicate that deep WES (100x) using the least biased technologies provides effective coverage (97% of 10x q10+ bases) and CDS variant discovery similar to those of standard 30x WGS, suggesting that WES remains an efficient alternative to WGS in many applications. Our work could serve as a guide for selecting an up-to-date resequencing approach in human genomic studies.
The present study reports on the frequency and spectrum of genetic variants causative of monogenic diabetes in Russian children with non-type 1 diabetes mellitus. The study included 60 unrelated Russian children with non-type 1 diabetes mellitus diagnosed before the age of 18 years. Genetic variants were screened using whole-exome sequencing (WES) in a panel of 35 genes causative of maturity-onset diabetes of the young (MODY) and transient or permanent neonatal diabetes. Verification of the WES results was performed using PCR-direct sequencing. A total of 38 genetic variants were identified in 33 out of 60 patients (55%). The majority of patients (27/33, 81.8%) had variants in MODY-related genes: GCK (n=19), HNF1A (n=2), PAX4 (n=1), ABCC8 (n=1), KCNJ11 (n=1), GCK+HNF1A (n=1), GCK+BLK (n=1) and GCK+BLK+WFS1 (n=1). A total of 6 patients (6/33, 18.2%) had variants in MODY-unrelated genes: GATA6 (n=1), WFS1 (n=3), EIF2AK3 (n=1) and SLC19A2 (n=1). A total of 15 out of 38 variants were novel, including variants in GCK, HNF1A, BLK, WFS1, EIF2AK3 and SLC19A2. To summarize, the present study demonstrates a high frequency and a wide spectrum of genetic variants causative of monogenic diabetes in Russian children with non-type 1 diabetes mellitus. The spectrum includes previously known and novel variants in MODY-related and unrelated genes, with multiple variants in a number of patients. The prevalence of GCK variants indicates that diagnostics of monogenic diabetes in Russian children may begin with testing for MODY2. However, the remaining variants are present at low frequencies in 9 different genes, altogether amounting to ~50% of the cases and highlighting the efficiency of using WES in non-GCK-MODY cases.