Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n=363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations, and additional sets of large-effect (> 0.1 sd) protein-altering, HLA, and copy-number variant associations. Through Mendelian Randomization analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build ‘multi-PRS’ models for diseases using 35 PRSs simultaneously, which improve chronic kidney disease, type 2 diabetes, gout, and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n=135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases, and improve genetic risk stratification for common diseases.
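As a rough illustration of the scoring idea described above, the sketch below computes a polygenic risk score as a weighted sum of allele dosages and then combines several per-biomarker scores into a single 'multi-PRS' value. The variant weights, the biomarker names, and the combination weights are all made up for illustration; in practice such weights would be estimated from data, and nothing here reproduces the authors' fitted models.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def multi_prs(prs_values, prs_weights, intercept=0.0):
    """Linear combination of several per-biomarker PRSs into one score."""
    if len(prs_values) != len(prs_weights):
        raise ValueError("prs_values and prs_weights must have the same length")
    return intercept + sum(v * w for v, w in zip(prs_values, prs_weights))


# Hypothetical per-variant weights for two hypothetical biomarkers.
variant_weights = {
    "biomarker_a": [0.5, -0.25, 0.125],
    "biomarker_b": [0.0, 0.25, 0.5],
}

# One individual's dosages at the same three variants.
person = [2, 1, 0]

per_biomarker = {
    name: polygenic_score(person, w) for name, w in variant_weights.items()
}

# Hypothetical combination weights (one per biomarker PRS).
combined = multi_prs(
    [per_biomarker["biomarker_a"], per_biomarker["biomarker_b"]],
    [2.0, 4.0],
)
print(per_biomarker, combined)
```

The two-stage structure is the point of the sketch: each biomarker PRS summarizes many variants, and the disease-level score then needs only one weight per biomarker.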
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear models, logistic regression, and the Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
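The abstract does not spell out the steps of BASIL, so the following is only a simplified sketch of the general idea it names: screen a small batch of variables, fit the lasso on that subset with an ordinary solver, then check the optimality (KKT) conditions on all remaining variables and repeat if any are violated. It is written in Python rather than R, handles a single penalty value instead of a full regularization path, uses a plain coordinate-descent solver as the stand-in for "any existing lasso solver", and is not the snpnet implementation. All function names and the toy data are assumptions.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_subset(X, y, lam, active, beta, sweeps=500, tol=1e-10):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 on `active` only.

    X is a list of columns. Updates `beta` in place and returns the residual.
    """
    n = len(y)
    resid = list(y)
    for j, bj in beta.items():
        for i in range(n):
            resid[i] -= bj * X[j][i]
    for _ in range(sweeps):
        max_change = 0.0
        for j in active:
            col = X[j]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # an all-zero column can never enter the model
            old = beta.get(j, 0.0)
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm * old
            new = soft_threshold(rho, lam) / norm
            if new != old:
                for i in range(n):
                    resid[i] += (old - new) * col[i]
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return resid


def batch_screening_lasso(X, y, lam, batch_size=1, max_iter=50, kkt_tol=1e-8):
    """Grow the working set in batches until no excluded variable violates KKT."""
    n, p = len(y), len(X)
    beta, active = {}, []
    resid = list(y)
    for _ in range(max_iter):
        # Screening score: absolute correlation of each column with the residual.
        score = [abs(sum(c * r for c, r in zip(X[j], resid))) / n for j in range(p)]
        violators = [j for j in range(p)
                     if j not in active and score[j] > lam + kkt_tol]
        if not violators:
            break  # KKT conditions hold for every excluded variable
        violators.sort(key=lambda j: score[j], reverse=True)
        active.extend(violators[:batch_size])
        resid = fit_subset(X, y, lam, active, beta)
    return {j: b for j, b in beta.items() if b != 0.0}


# Toy data: three orthogonal columns, four samples; y = 2*x0 + 0.05*x2.
X = [[1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]
y = [2.05, 1.95, -2.05, -1.95]
coefs = batch_screening_lasso(X, y, lam=0.5)
print(coefs)
```

Only the screening and KKT steps touch every column; the solver itself sees just the working set, which is what would let such a scheme operate on data larger than memory if the columns were streamed from disk.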
Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we apply truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study. We systematically identify key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 ( GPR151 ) and rs150090666 ( PDE3B ), that substantially contribute to obesity-related traits and experimentally demonstrate the role of these genes in adipocyte biology. Our approach to dissect components of genetic associations across the human phenome will accelerate biomedical hypothesis generation by providing insights on previously unexplored latent structures.
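To make the decomposition step concrete, here is a minimal pure-Python sketch of a truncated singular value decomposition applied to a small phenotype-by-variant matrix of summary statistics. The toy matrix, the power-iteration solver, and the use of squared singular-vector entries as a "contribution" measure are all assumptions made for illustration; the authors' actual pipeline and scoring may differ.

```python
import math


def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def truncated_svd(A, k, iters=500):
    """Top-k singular triplets via power iteration on A^T A with deflation.

    The fixed start vector could, in pathological cases, be orthogonal to
    the leading singular vector; a random start would avoid that.
    """
    A = [list(row) for row in A]
    n_cols = len(A[0])
    components = []
    for _ in range(k):
        At = transpose(A)
        v = [1.0 / (j + 1) for j in range(n_cols)]
        v = [x / norm(v) for x in v]
        for _ in range(iters):
            w = matvec(At, matvec(A, v))
            nw = norm(w)
            if nw == 0.0:
                break
            v = [x / nw for x in w]
        Av = matvec(A, v)
        sigma = norm(Av)
        if sigma == 0.0:
            break  # nothing left to decompose
        u = [x / sigma for x in Av]
        components.append((sigma, u, v))
        # Deflate: remove the component just found before searching again.
        A = [[A[i][j] - sigma * u[i] * v[j] for j in range(n_cols)]
             for i in range(len(A))]
    return components


# Hypothetical summary-statistic matrix: 3 phenotypes (rows) x 2 variants (columns).
Z = [[3.0, 0.0],
     [0.0, 2.0],
     [0.0, 0.0]]

components = truncated_svd(Z, k=2)
singular_values = [s for s, _, _ in components]

# Assumed contribution measure: squared entries of the unit-norm phenotype vector.
phenotype_contrib = [[x * x for x in u] for _, u, _ in components]
print(singular_values, phenotype_contrib)
```

Because each singular vector has unit norm, the squared entries of a component sum to one, so they can be read as the share of that component attributable to each phenotype (or, using the other vector, each variant).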
Suicide accounts for nearly 800,000 deaths per year worldwide, with rates of both deaths and attempts rising. Family studies have estimated substantial heritability of suicidal behavior; however, collecting the sample sizes necessary for successful genetic studies has remained a challenge. We utilized two different approaches in independent datasets to characterize the contribution of common genetic variation to suicide attempt. The first is a patient-reported suicide attempt phenotype collected as part of an online mental health survey taken by a subset of participants (n = 157,366) in the UK Biobank. After quality control, we leveraged a genotyped set of unrelated, white British ancestry participants including 2433 cases and 334,766 controls, a group that included those who did not participate in the survey or were not explicitly asked about attempting suicide. The second leveraged electronic health record (EHR) data from the Vanderbilt University Medical Center (VUMC, 2.8 million patients, 3250 cases) and machine learning to derive probabilities of attempting suicide in 24,546 genotyped patients. We identified significant and comparable heritability estimates of suicide attempt from both the patient-reported phenotype in the UK Biobank (h²SNP = 0.035, p = 7.12 × 10⁻⁴) and the clinically predicted phenotype from VUMC (h²SNP = 0.046, p = 1.51 × 10⁻²). A significant genetic overlap was demonstrated between the two measures of suicide attempt in these independent samples through polygenic risk score analysis (t = 4.02, p = 5.75 × 10⁻⁵) and genetic correlation (rg = 1.073, SE = 0.36, p = 0.003). Finally, we show significant but incomplete genetic correlation of suicide attempt with insomnia (rg = 0.34–0.81) as well as several psychiatric disorders (rg = 0.26–0.79). This work demonstrates the contribution of common genetic variation to suicide attempt.
It points to a genetic underpinning to clinically predicted risk of attempting suicide that is similar to the genetic profile from a patient reported outcome. Lastly, it presents an approach for using EHR data and clinical prediction to generate quantitative measures from binary phenotypes that can improve power for genetic studies.
Since its first proposal in statistics (Tibshirani, 1996), the lasso has been an effective method for simultaneous variable selection and estimation. A number of packages have been developed to solve the lasso efficiently. However, as large datasets become more prevalent, many algorithms are constrained by efficiency or memory bounds. In this paper, we propose a meta-algorithm, batch screening iterative lasso (BASIL), that can take advantage of any existing lasso solver and build a scalable lasso solution for large datasets. We also introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) for large-scale single nucleotide polymorphism (SNP) datasets that are widely studied in genetics. We demonstrate results on a large genotype-phenotype dataset from the UK Biobank, where we achieve state-of-the-art heritability estimation on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.
Summary: Large biobanks linking phenotype to genotype have led to an explosion of genetic association studies across a wide range of phenotypes. Sharing the knowledge generated by these resources with the scientific community remains a challenge due to patient privacy and the vast amount of data. Here, we present Global Biobank Engine (GBE), a web-based tool that enables exploration of the relationship between genotype and phenotype in biobank cohorts, such as the UK Biobank. GBE supports browsing for results from genome-wide association studies, phenome-wide association studies, gene-based tests and genetic correlation between phenotypes. We envision GBE as a platform that facilitates the dissemination of summary statistics from biobanks to the scientific and clinical communities. Availability and implementation: GBE currently hosts data from the UK Biobank and can be found freely available at biobankengine.stanford.edu.
Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 38 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein-truncating, missense, and copy-number variants. We tested these loci for enrichment in specific single cell types in kidney, liver, and pancreas relevant to disease aetiology. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 38 PRSs simultaneously. We found substantially improved prediction of incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, liver fat percentage, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to the biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease.
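The abstract does not say which Mendelian randomization estimator was used. As a generic illustration of how variant-level associations can be turned into a causal effect estimate, the sketch below implements the standard fixed-effect inverse-variance-weighted (IVW) estimator alongside the per-variant Wald ratio. The effect sizes and standard errors are invented for the example.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Per-variant causal estimate: outcome effect divided by exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate and its standard error.

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome^2.
    """
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Hypothetical summary statistics for three instrument variants.
bx = [0.1, 0.2, 0.4]      # variant effects on the biomarker (exposure)
by = [0.05, 0.1, 0.2]     # variant effects on the disease (outcome)
se_y = [0.01, 0.01, 0.02]  # standard errors of the outcome effects

ratios = [wald_ratio(x, y_) for x, y_ in zip(bx, by)]
estimate, std_err = ivw_estimate(bx, by, se_y)
print(ratios, estimate, std_err)
```

In this toy case every variant implies the same ratio, so the pooled estimate matches it; with real data the IVW estimate is a precision-weighted average of the per-variant ratios, and its validity rests on the usual instrumental-variable assumptions.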
Copy-number variations (CNVs) represent a significant proportion of the genetic differences between individuals and many CNVs associate causally with syndromic disease and clinical outcomes. Here, we characterize the landscape of copy-number variation and its phenome-wide effects in a sample of 472,228 array-genotyped individuals from the UK Biobank. In addition to population-level selection effects against genic loci conferring high mortality, we describe genetic burden from potentially pathogenic and previously uncharacterized CNV loci across more than 3,000 quantitative and dichotomous traits, with separate analyses for common and rare classes of variation. Specifically, we highlight the effects of CNVs at two well-known syndromic loci 16p11.2 and 22q11.2, previously uncharacterized variation at 9p23, and several genic associations in the context of acute coronary artery disease and high body mass index. Our data constitute a deeply contextualized portrait of population-wide burden of copy-number variation, as well as a series of dosage-mediated genic associations across the medical phenome.
types in a cohort of 332,584 participants from the UK Biobank [18]. We replicate well-established syndromic effects of common CNVs, namely 22q11.2 deletion (DiGeorge) syndrome and two variants of 16p11.2 deletion syndrome, and highlight associations for body mass index (BMI), acute coronary artery disease (CAD), and related adipose and cardiovascular phenotypes. Summary statistics from traditional genome-wide associations for common CNVs as well as from gene-level aggregate burden tests of rare variants across all phenotypes are available for download on the Global Biobank Engine [19].
Material and Methods
CNVs were called using PennCNV v.1.0.4 on raw signal intensity data from each genotyping array. Phenotype data were derived from data-fields collected for UK Biobank corresponding to various body measurements, biomarkers, disease diagnoses, and medical procedures from medical records, as well as a questionnaire about lifestyle and medical history. Summary-level data from all statistical tests described here, as well as more thorough documentation on phenotyping, are available on the Global Biobank Engine [19] and can be found in related publications [20].
CNV Calling in UK Biobank
Methods for genetic data acquisition and quality control as performed by the UK Biobank have been previously described [18]. In brief, two similar arrays were used for targeted genotyping within the study population: the UK BiLEVE Axiom Array (n = 49,950) by Affymetrix and the UK Biobank Axiom Array (n = 438,427), which was custom designed by Applied Biosystems. Samples and array markers were subject to threshold-based filtration and quality control prior to public release. Specifically, markers were tested for discordance across control replicates, departures from Hardy-Weinberg equilibrium, as well as effects due to batch, plate, array, and sex; affected markers were set as missing in affected batches or