The UK Biobank Exome Sequencing Consortium (UKB-ESC) is a unique private/public partnership between the UK Biobank and eight biopharma companies that will sequence the exomes of all ~500,000 UK Biobank participants. Here we describe early results from the exome sequence data generated by this consortium for the first ~200,000 UKB subjects and the key features of this project that enabled the UKB-ESC to come together and generate this data.
Exome sequencing data from the first 200,643 UKB enrollees are now accessible to the research community. Approximately 10M variants were observed within the targeted regions, including 8,086,176 SNPs, 370,958 indels and 1,596,984 multi-allelic variants. Of the ~8M variants observed, 84.5% are coding variants and include 2,139,318 (25.3%) synonymous, 4,549,694 (53.8%) missense, and 453,733 (5.4%) predicted loss-of-function (LOF) variants (initiation codon loss, premature stop codons, stop codon loss, splicing and frameshift variants) affecting at least one coding transcript. These open access data provide a rich resource of coding variants for rare variant genetic studies and are particularly valuable for drug discovery efforts that utilize rare, functionally consequential variants.
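The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported figures. It assumes, as the text does not state explicitly, that the quoted percentages use the SNP plus indel count (the "~8M variants") as their denominator; the variable names and the `pct` helper are illustrative.

```python
# Variant counts exactly as reported in the abstract.
snps = 8_086_176
indels = 370_958
multi_allelic = 1_596_984

# Total variants within the targeted regions ("approximately 10M").
total = snps + indels + multi_allelic

# ASSUMPTION: the "~8M variants" used as the percentage denominator
# are the SNPs plus indels; the abstract does not say this explicitly.
denominator = snps + indels

# Coding variant classes as reported in the abstract.
coding = {
    "synonymous": 2_139_318,
    "missense": 4_549_694,
    "predicted LOF": 453_733,
}


def pct(count, denom):
    """Return count as a percentage of denom, rounded to one decimal."""
    return round(100 * count / denom, 1)


shares = {name: pct(count, denominator) for name, count in coding.items()}
coding_share = pct(sum(coding.values()), denominator)

print(total)         # 10054118
print(shares)        # {'synonymous': 25.3, 'missense': 53.8, 'predicted LOF': 5.4}
print(coding_share)  # 84.5
```

Under that assumed denominator, the three class percentages and the 84.5% coding share all match the reported values, and the three variant types sum to 10,054,118, in line with "approximately 10M".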
The UKB-ESC was formed to address the need for large-scale human genetics data to drive drug discovery, and to enhance the UK Biobank with a valuable data resource that will be available to the broad biomedical research community. We describe the rationale for the use of human genetics in drug discovery as well as lessons learned from the formation and implementation of the UKB-ESC.
Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power. A central challenge for joint analysis is that different WGS data processing pipelines cause substantial differences in variant calling in combined datasets, necessitating computationally expensive reprocessing. This approach is no longer tenable given the scale of current studies and data volumes. Here, we define WGS data processing standards that allow different groups to produce functionally equivalent (FE) results, yet still innovate on data processing pipelines. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results and produce significantly less variability than sequencing replicates. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for community-wide human genetics studies.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) enters human host cells via angiotensin-converting enzyme 2 (ACE2) and causes coronavirus disease 2019 (COVID-19). Here, through a genome-wide association study, we identify a variant (rs190509934, minor allele frequency 0.2–2%) that downregulates ACE2 expression by 37% (P = 2.7 × 10^−8) and reduces the risk of SARS-CoV-2 infection by 40% (odds ratio = 0.60, P = 4.5 × 10^−13), providing human genetic evidence that ACE2 expression levels influence COVID-19 risk. We also replicate the associations of six previously reported risk variants, of which four were further associated with worse outcomes in individuals infected with the virus (in/near LZTFL1, MHC, DPP9 and IFNAR2). Lastly, we show that common variants define a risk score that is strongly associated with severe disease among cases and modestly improves the prediction of disease severity relative to demographic and clinical factors alone.
Riboswitches are important gene regulatory elements frequently encountered in bacterial mRNAs. The recently discovered nadA riboswitch contains two similar, tandemly arrayed aptamer domains, with the first domain possessing high affinity for nicotinamide adenine dinucleotide (NAD+). For the second domain, which comprises the ribosomal binding site in a putative regulatory helix, no ligand-induced structural modulation has been detected thus far, and therefore the identity of the cognate ligand and the regulation mechanism have remained unclear. Here, we report crystal structures of both riboswitch domains, each bound to NAD+. Furthermore, we demonstrate that ligand binding to domain 2 requires significantly higher concentrations of NAD+ (or ADP-retaining analogs) compared to domain 1. Using a fluorescence spectroscopic approach, we further shed light on the structural features that are responsible for the different ligand affinities, and describe the Mg2+-dependent, distinct folding and pre-organization of their binding pockets. Finally, we speculate about possible scenarios for nadA RNA gene regulation as a putative two-concentration sensor module for a time-controlled signal that is primed and stalled by the gene regulation machinery at low ligand concentrations (domain 1), and finally triggers repression of translation as soon as high ligand concentrations are reached in the cell (domain 2).
Abstract. Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce "functionally equivalent" (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results, including single nucleotide variant (SNV), insertion/deletion (indel) and structural variation (SV) calls, and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide "big-data" human genetics studies.
Main text. Over the past few years, a wave of large-scale WGS-based human genetics studies has been launched by various institutes and funding programs worldwide, aimed at elucidating the genetic basis of a variety of human traits. These projects will generate hundreds of thousands of publicly available deep (>20x) WGS datasets from diverse human populations. Indeed, at the time of writing, >150,000 human genomes have already been sequenced by three NIH programs: NHGRI Centers for Common Disease Genomics [1] (CCDG), NHLBI Trans-Omics for Precision Medicine [2] (TOPMed), and NIMH Whole Genome Sequencing in Psychiatric Disorders [3] (WGSPD). Systematic aggregation and co-analysis of these (and other) genomic datasets will enable increasingly well-powered studies of human traits, population history and genome evolution, and will provide population-scale reference databases that expand upon the groundbreaking efforts of the 1000 Genomes Project [4,5], Haplotype Reference Consortium [6], ExAC [7] and GnomAD [8]. Our ability as a field to harness these collective data to their full analytic potential depends on the availability of high quality variant calls from large populations of in...