Advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are often debated. WES has dominated large-scale resequencing projects because of lower cost and easier data storage and processing. Rapid development of third-generation sequencing methods and novel exome sequencing kits creates the need for a robust statistical framework that allows informative and easy performance comparison of the emerging methods. In our study we developed a set of statistical tools to systematically assess coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contribution of the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. We also identified a ~500 kb region of the human exome that could not be effectively characterized using short-read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.

Next-generation sequencing (NGS) is rapidly becoming an invaluable tool in human genetics research and clinical diagnostics 1-3. Practical use of NGS methods has dramatically increased with the development of targeted sequencing approaches, such as whole-exome sequencing (WES) or targeted sequencing of gene panels.
WES emerged as an efficient alternative to whole-genome sequencing (WGS) because of both lower sequencing cost and simpler variant analysis and data storage 4. More than 80% of all variants reported in ClinVar, and more than 89% of variants reported to be pathogenic, come from the protein-coding part of the genome; this figure rises to 99% when the immediate vicinity of the CDS is included. Even allowing for sampling bias, there is general agreement that most heritable diseases appear to be caused by alterations in the protein-coding regions of the
Although endogenous retroviruses (ERVs) are known to harbor cis-regulatory elements, their role in modulating cellular immune responses remains poorly understood. Using an RNA-seq approach, we show that several members of the ERV9 lineage, particularly LTR12C elements, are activated upon HIV-1 infection of primary CD4+ T cells. Intriguingly, HIV-1-induced ERVs harboring transcription start sites are primarily found in the vicinity of immunity genes. For example, HIV-1 infection activates LTR12C elements upstream of the interferon-inducible genes GBP2 and GBP5, which encode broad-spectrum antiviral factors. Reporter assays demonstrated that these LTR12C elements drive gene expression in primary CD4+ T cells. In line with this, HIV-1 infection triggered the expression of a unique GBP2 transcript variant by activating a cryptic transcription start site within LTR12C. Furthermore, stimulation with HIV-1-induced cytokines increased GBP2 and GBP5 expression in human cells, but not in macaque cells, which naturally lack the GBP5 gene and the LTR12C element upstream of GBP2. Finally, our findings suggest that GBP2 and GBP5 were already active against ancient viral pathogens, as they suppress the maturation of the extinct retrovirus HERV-K (HML-2). In summary, our findings uncover how human cells can exploit remnants of once-infectious retroviruses to regulate antiviral gene expression.
Background: Allele frequency data from large exome and genome aggregation projects such as the Genome Aggregation Database (gnomAD) are of utmost importance to the interpretation of medical resequencing data. However, allele frequencies might differ significantly in poorly studied populations that are underrepresented in large-scale projects, such as the Russian population.
Methods: In this work, we leveraged our access to a large dataset of 694 exome samples to analyze genetic variation in Northwest Russia. We compared the spectrum of genetic variants to dbSNP build 151 and estimated the prevalence of ClinVar-based autosomal recessive (AR) disease alleles as compared to gnomAD r2.1.
Results: An estimated 9.3% of discovered variants were not present in dbSNP. We report statistically significant overrepresentation of pathogenic variants for several Mendelian disorders, including phenylketonuria (PAH, rs5030858), Wilson's disease (ATP7B, rs76151636), factor VII deficiency (F7, rs36209567), kyphoscoliosis type of Ehlers-Danlos syndrome (FKBP14, rs542489955), and several other recessive pathologies. We also make primary estimates of monogenic disease incidence in the population, with retinal dystrophy, cystic fibrosis, and phenylketonuria being the most frequent AR pathologies.
Conclusion: Our observations demonstrate the utility of population-specific allele frequency data for the diagnosis of monogenic disorders using high-throughput technologies.
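
As a rough illustration of how an incidence estimate for an AR disorder can be derived from allele frequencies, the sketch below applies the textbook Hardy-Weinberg approximation: the combined pathogenic allele frequency q for a gene gives an expected incidence of q². This is an assumption for illustration only; the abstract does not state which model the authors used, and the function names and input frequencies are hypothetical.

```python
def ar_incidence(pathogenic_allele_freqs):
    """Expected incidence of an autosomal recessive disorder.

    Assumes Hardy-Weinberg equilibrium and that any two pathogenic
    alleles in the same gene cause disease, so incidence = q**2 where
    q is the summed frequency of all pathogenic alleles in the gene.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("combined allele frequency must lie in [0, 1]")
    return q * q


def carrier_frequency(q):
    """Expected heterozygous carrier frequency, 2q(1 - q), under the same assumption."""
    return 2.0 * q * (1.0 - q)


if __name__ == "__main__":
    # Hypothetical frequencies of two pathogenic alleles in one gene.
    freqs = [0.008, 0.002]
    incidence = ar_incidence(freqs)
    print(f"Expected incidence: about 1 in {round(1 / incidence):,}")
    print(f"Carrier frequency: {carrier_frequency(sum(freqs)):.4f}")
```

Estimates of this kind are sensitive to how completely the pathogenic alleles of a gene are catalogued and to departures from random mating, so they are best read as order-of-magnitude figures.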
Next-generation DNA sequencing technologies are rapidly transforming the world of human genomics. Advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are still frequently debated. In our study we developed a set of statistical tools to systematically assess coverage of CDS regions provided by several modern WES platforms, as well as PCR-free WGS. Using several novel metrics to characterize exon coverage in WES and WGS, we showed that some of the WES platforms achieve substantially less biased CDS coverage than others, with lower within- and between-interval variation and virtually absent GC-content bias. We discovered that, contrary to a common view, most of the coverage bias in WES stems from mappability limitations of short reads, as well as exome probe design. We identified a ~500 kb region of the human exome that could not be effectively characterized using short-read technology. We also showed that the overall power for SNP and indel discovery in CDS regions is virtually indistinguishable for WGS and the best WES platforms. Our results indicate that deep WES (100x) using the least biased technologies provides similar effective coverage (97% of 10x q10+ bases) and CDS variant discovery to the standard 30x WGS, suggesting that WES remains an efficient alternative to WGS in many applications. Our work could serve as a guide for selection of an up-to-date resequencing approach in human genomic studies.
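
To make a figure such as "97% of 10x q10+ bases" concrete, the sketch below computes the fraction of bases covered at or above a depth threshold, together with the coefficient of variation as one generic measure of coverage evenness. It assumes the per-base depths were already computed from reads with mapping quality of at least 10. The function names are hypothetical, and the coefficient of variation is a stand-in for illustration, not one of the novel metrics the authors describe.

```python
from statistics import mean, pstdev


def fraction_covered(depths, min_depth=10):
    """Fraction of bases whose depth is at least min_depth.

    `depths` is a sequence of per-base read depths over the CDS,
    assumed to count only reads that passed a MAPQ >= 10 filter.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(d >= min_depth for d in depths) / len(depths)


def coefficient_of_variation(depths):
    """Population standard deviation of depth divided by mean depth.

    Lower values indicate more even coverage.
    """
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth is zero")
    return pstdev(depths) / m


if __name__ == "__main__":
    # Hypothetical per-base depths for a short interval.
    depths = [0, 5, 10, 30, 100]
    print(f"Fraction at >=10x: {fraction_covered(depths):.2f}")
    print(f"Coefficient of variation: {coefficient_of_variation(depths):.2f}")
```

Comparing platforms at a fixed threshold in this way only makes sense when the same regions and the same read filters are applied to each dataset.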