Dmitrii E. Polev scite author profile

Advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are often debated. WES dominated large-scale resequencing projects because of lower cost and easier data storage and processing. Rapid development of 3rd generation sequencing methods and novel exome sequencing kits predicates the need for a robust statistical framework allowing informative and easy performance comparison of the emerging methods. In our study we developed a set of statistical tools to systematically assess coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contribution from the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. We also identified the ~500 kb region of human exome that could not be effectively characterized using short read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified main determinants of WES and WGS performance. Overall, our study points out avenues for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.

Next-generation sequencing (NGS) is rapidly becoming an invaluable tool in human genetics research and clinical diagnostics [1-3]. Practical use of NGS methods has dramatically increased with the development of targeted sequencing approaches, such as whole-exome sequencing (WES) or targeted sequencing of gene panels. WES emerged as an efficient alternative to whole-genome sequencing (WGS) due to both lower sequencing cost and simplification of variant analysis and data storage [4]. More than 80% of all variants reported in ClinVar, and more than 89% of variants reported to be pathogenic, come from the protein-coding part of the genome; this number increases to 99% when immediate CDS vicinity is included. Even allowing for the sampling bias, there is an overall agreement that most heritable diseases appear to be caused by alterations in the protein-coding regions of the


Disruption of Transcriptional Coactivator Sub1 Leads to Genome-Wide Re-distribution of Clustered Mutations Induced by APOBEC in Active Yeast Genes

Lada, Kliver, Dhar, et al. 2015. PLoS Genet

Mutations in genomes of species are frequently distributed non-randomly, resulting in mutation clusters, including recently discovered kataegis in tumors. DNA editing deaminases play a prominent role in the etiology of these mutations. To gain insight into the enigmatic mechanisms of localized hypermutagenesis that lead to cluster formation, we analyzed the single nucleotide variation (SNV) data obtained by whole-genome sequencing of drug-resistant mutants induced in yeast diploids by AID/APOBEC deaminase and base analog 6-HAP. Deaminase from sea lamprey, PmCDA1, induced robust clusters, while 6-HAP induced a few weak ones. We found that PmCDA1, AID, and APOBEC1 deaminases preferentially mutate the beginning of actively transcribed genes. Inactivation of transcription initiation factor Sub1 strongly reduced deaminase-induced can1 mutation frequency, but, surprisingly, did not decrease the total SNV load in genomes. However, the SNVs in the genomes of the sub1 clones were re-distributed, and the effect of mutation clustering in the regions of transcription initiation was even more pronounced. At the same time, the mutation density in the protein-coding regions was reduced, resulting in a decrease of phenotypically detected mutants. We propose that the induction of clustered mutations by deaminases involves: a) the exposure of ssDNA strands during transcription and loss of protection of ssDNA due to the depletion of ssDNA-binding proteins, such as Sub1, and b) attainment of conditions favorable for APOBEC action in a subpopulation of cells, leading to enzymatic deamination within the currently expressed genes. This model is applicable to both the initial and the later stages of oncogenic transformation and explains variations in the distribution of mutations and kataegis events in different tumor cells.


Sequencing, biochemical characterization, crystal structure and molecular dynamics of cellobiohydrolase Cel7A from Geotrichum candidum 3C

et al. 2015


The ascomycete Geotrichum candidum is a versatile and efficient decay fungus that is involved, for example, in biodeterioration of compact discs; notably, the 3C strain was previously shown to degrade filter paper and cotton more efficiently than several industrial enzyme preparations. Glycoside hydrolase (GH) family 7 cellobiohydrolases (CBHs) are the primary constituents of industrial cellulase cocktails employed in biomass conversion, and feature tunnel-enclosed active sites that enable processive hydrolytic cleavage of cellulose chains. Understanding the structure-function relationships defining the activity and stability of GH7 CBHs is thus of keen interest. Accordingly, we report the comprehensive characterization of the GH7 CBH secreted by G. candidum (GcaCel7A). The bimodular cellulase consists of a family 1 cellulose-binding module (CBM) and linker connected to a GH7 catalytic domain that shares 64% sequence identity with the archetypal industrial GH7 CBH of Hypocrea jecorina (HjeCel7A). GcaCel7A shows activity on Avicel cellulose similar to HjeCel7A, with less product inhibition, but has a lower temperature optimum (50°C versus 60-65°C, respectively). Five crystal structures, with and without bound thio-oligosaccharides, show conformational diversity of tunnel-enclosing loops, including a form with partial tunnel collapse at subsite -4 not reported previously in GH7. Also, the first O-glycosylation site in a GH7 crystal structure is reported, on a loop where the glycan probably influences loop contacts across the active site and interactions with the cellulose surface. The GcaCel7A structures indicate higher loop flexibility than HjeCel7A, in accordance with sequence modifications. However, GcaCel7A retains small fluctuations in molecular simulations, suggesting high processivity and low endo-initiation probability, similar to HjeCel7A.


Whole‐exome sequencing provides insights into monogenic disease prevalence in Northwest Russia

Barbitoff, Skitchenko, Poleshchuk, et al. 2019. Molec Gen & Gen Med

Background: Allele frequency data from large exome and genome aggregation projects such as the Genome Aggregation Database (gnomAD) are of ultimate importance to the interpretation of medical resequencing data. However, allele frequencies might significantly differ in poorly studied populations that are underrepresented in large‐scale projects, such as the Russian population. Methods: In this work, we leveraged our access to a large dataset of 694 exome samples to analyze genetic variation in Northwest Russia. We compared the spectrum of genetic variants to dbSNP build 151, and made estimates of ClinVar‐based autosomal recessive (AR) disease allele prevalence as compared to gnomAD r. 2.1. Results: An estimated 9.3% of discovered variants were not present in dbSNP. We report statistically significant overrepresentation of pathogenic variants for several Mendelian disorders, including phenylketonuria (PAH, rs5030858), Wilson's disease (ATP7B, rs76151636), factor VII deficiency (F7, rs36209567), kyphoscoliosis type of Ehlers‐Danlos syndrome (FKBP14, rs542489955), and several other recessive pathologies. We also make primary estimates of monogenic disease incidence in the population, with retinal dystrophy, cystic fibrosis, and phenylketonuria being the most frequent AR pathologies. Conclusion: Our observations demonstrate the utility of population‐specific allele frequency data to the diagnosis of monogenic disorders using high‐throughput technologies.
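
The abstract above does not say how incidence was estimated. As an illustration only, the sketch below shows one common way to turn pathogenic allele frequencies into an expected autosomal recessive incidence, assuming Hardy-Weinberg equilibrium and random mating. The function names and the example frequencies are hypothetical and are not taken from the paper.

```python
import math


def carrier_frequency(q: float) -> float:
    """Expected heterozygous carrier frequency 2pq under Hardy-Weinberg."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return 2.0 * q * (1.0 - q)


def ar_incidence(pathogenic_allele_freqs: list[float]) -> float:
    """Expected incidence of one autosomal recessive disorder.

    Assumption (not from the source): all pathogenic alleles of the gene
    are pooled into a single frequency q, and affected individuals carry
    two pathogenic alleles, so the expected incidence is q squared.
    """
    q = sum(pathogenic_allele_freqs)
    if not 0.0 <= q <= 1.0:
        raise ValueError("summed allele frequency must be between 0 and 1")
    return q * q


# Hypothetical frequencies for two pathogenic alleles of one gene.
example_freqs = [0.004, 0.006]
incidence = ar_incidence(example_freqs)
carriers = carrier_frequency(sum(example_freqs))
print(f"expected incidence: 1 in {round(1 / incidence)}")
print(f"expected carrier frequency: {carriers:.4f}")
```

Pooling alleles per gene counts compound heterozygotes as affected, which is why the frequencies are summed before squaring rather than squared one by one.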


Genome-wide sequence analyses of ethnic populations across Russia

et al. 2020



Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage
