Fine-mapping aims to identify causal variants impacting complex traits. Several recent methods improve fine-mapping accuracy by prioritizing variants in enriched functional annotations. However, these methods can only use information at genome-wide significant loci (or a small number of functional annotations), severely limiting the benefit of functional data. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy using genome-wide functional data for a broad set of coding, conserved, regulatory and LD-related annotations. PolyFun prioritizes variants in enriched functional annotations by specifying prior causal probabilities for fine-mapping methods such as SuSiE or FINEMAP, employing special procedures to ensure robustness to model misspecification and winner's curse. In simulations with in-sample LD, PolyFun + SuSiE and PolyFun + FINEMAP were well-calibrated and identified >20% more variants with posterior causal probability >0.95 than their non-functionally informed counterparts (and >33% more fine-mapped variants than previous functionally-informed fine-mapping methods). In simulations with mismatched reference LD, PolyFun + SuSiE remained well-calibrated when reducing the maximum number of assumed causal SNPs per locus, which reduces absolute power but still produces large relative improvements. In analyses of 49 UK Biobank traits (average N=318K) with in-sample LD, PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement vs. SuSiE; 223 variants were fine-mapped for multiple genetically uncorrelated traits, indicating pervasive pleiotropy. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 28 (hair color) to 3,400 (height) to 2 million (number of children). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures.
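The polygenic localization step described above (a minimal set of SNPs explaining 50% of common SNP heritability) can be illustrated with a simple greedy selection. This is only a sketch under stated assumptions, not PolyFun's actual procedure: the function name `minimal_snp_set` and the toy heritability values are hypothetical, and the real analysis works from posterior mean per-SNP heritabilities genome-wide.

```python
def minimal_snp_set(per_snp_h2, fraction=0.5):
    """Return indices of the fewest SNPs whose summed per-SNP heritability
    reaches `fraction` of the total (hypothetical helper, greedy sketch)."""
    # Rank SNPs from largest to smallest per-SNP heritability.
    order = sorted(range(len(per_snp_h2)),
                   key=lambda i: per_snp_h2[i], reverse=True)
    target = fraction * sum(per_snp_h2)
    chosen, running = [], 0.0
    for i in order:
        if running >= target:
            break  # the target share of heritability is already covered
        chosen.append(i)
        running += per_snp_h2[i]
    return chosen


# Toy per-SNP heritabilities (made-up values, for illustration only).
toy_h2 = [0.4, 0.1, 0.3, 0.2]
print(minimal_snp_set(toy_h2))
```

Taking the largest contributions first yields the smallest possible set for a given target, which is why a highly concentrated trait needs few SNPs and a highly polygenic trait needs many.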
Performing genetic studies in multiple human populations can identify disease risk alleles that are common in one population but rare in others [1], with the potential to illuminate pathophysiology, health disparities, and the population genetic origins of disease alleles. We analyzed 9.2 million single nucleotide polymorphisms (SNPs) in each of 8,214 Mexicans and Latin Americans: 3,848 with type 2 diabetes (T2D) and 4,366 non-diabetic controls. In addition to replicating previous findings [2–4], we identified a novel locus associated with T2D at genome-wide significance spanning the solute carriers SLC16A11 and SLC16A13 (P=3.9×10⁻¹³; odds ratio (OR)=1.29). The association was stronger in younger, leaner people with T2D, and replicated in independent samples (P=1.1×10⁻⁴; OR=1.20). The risk haplotype carries four amino acid substitutions, all in SLC16A11; it is present at ≈50% frequency in Native American samples and ≈10% in East Asian, but rare in European and African samples. Analysis of an archaic genome sequence indicated the risk haplotype introgressed into modern humans via admixture with Neandertals. The SLC16A11 mRNA is expressed in liver, and V5-tagged SLC16A11 protein localizes to the endoplasmic reticulum. Expression of SLC16A11 in heterologous cells alters lipid metabolism, most notably causing an increase in intracellular triacylglycerol levels. Despite T2D having been well studied by genome-wide association studies (GWAS) in other populations, analysis in Mexican and Latin American individuals identified SLC16A11 as a novel candidate gene for T2D with a possible role in triacylglycerol metabolism.
Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involves European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either training data from European samples with a large sample size or training data from the target population with a small sample size, but not both. Here, we introduce a multi-ethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics with a large sample size (N_eff=40k) and Latino training data with a small sample size (N_eff=8k). We attained a >70% relative improvement in prediction accuracy (from R²=0.027 to R²=0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. Application to predict T2D in a South Asian UK Biobank cohort using European (N_eff=40k) and South Asian (N_eff=16k) training data attained a >70% relative improvement in prediction accuracy, and application to predict height in an African UK Biobank cohort using European (N=113k) and African (N=2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
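The reported gain can be checked directly from the two R² values quoted above. The sketch below also shows one way two ancestry-specific scores might be combined. The weighted-sum form, the function `combine_scores`, and its weights are assumptions made for illustration only; the abstract does not state the functional form of the combination.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


def combine_scores(score_eur, score_target, w_eur, w_target):
    """Hypothetical weighted sum of two per-individual risk scores.
    The weights would need to be chosen on held-out data (assumption)."""
    return [w_eur * a + w_target * b
            for a, b in zip(score_eur, score_target)]


# R² values quoted in the abstract: 0.027 (single source) vs 0.047 (combined).
print(f"{relative_improvement(0.027, 0.047):.1%}")  # prints 74.1%
```

The 74.1% figure is consistent with the stated ">70% relative improvement".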
Fine-mapping aims to identify causal variants impacting complex traits. Several recent methods improve fine-mapping accuracy by prioritizing variants in enriched functional annotations. However, these methods can only use information at genome-wide significant loci (and/or a small number of functional annotations), severely limiting the benefit of functional data. We propose PolyFun, a computationally scalable framework to improve fine-mapping accuracy using genome-wide functional data for a broad set of coding, conserved, regulatory and LD-related annotations. PolyFun prioritizes variants in enriched functional annotations by specifying prior causal probabilities for fine-mapping methods such as SuSiE or FINEMAP, employing special procedures to ensure robustness to model misspecification and winner's curse. In simulations, PolyFun + SuSiE and PolyFun + FINEMAP were well-calibrated and identified >20% more variants with posterior causal probability >0.95 than their non-functionally informed counterparts (and >33% more fine-mapped variants than previous functionally-informed fine-mapping methods). In analyses of 47 UK Biobank traits (average N=317K), PolyFun + SuSiE identified 3,025 fine-mapped variant-trait pairs with posterior causal probability >0.95, a >32% improvement vs. SuSiE; 223 variants were fine-mapped for multiple genetically uncorrelated traits, indicating pervasive pleiotropy. We used posterior mean per-SNP heritabilities from PolyFun + SuSiE to perform polygenic localization, constructing minimal sets of common SNPs causally explaining 50% of common SNP heritability; these sets ranged in size from 25 (hair color) to 3,400 (height) to 550,000 (chronotype). In conclusion, PolyFun prioritizes variants for functional follow-up and provides insights into complex trait architectures.
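To make "specifying prior causal probabilities" concrete, the sketch below turns per-SNP heritability estimates within one locus into a prior that sums to one. It assumes the prior is proportional to the estimated per-SNP heritability, which the abstract does not spell out. The function name, the clipping of negative estimates, and the uniform fallback are illustrative choices, not PolyFun's documented behavior.

```python
def prior_causal_probabilities(per_snp_h2):
    """Normalize per-SNP heritability estimates in a locus into prior
    causal probabilities (hypothetical helper; proportionality assumed)."""
    if not per_snp_h2:
        return []
    # Negative estimates are clipped to zero so priors stay non-negative.
    clipped = [max(h, 0.0) for h in per_snp_h2]
    total = sum(clipped)
    if total == 0:
        # No signal from annotations: fall back to a uniform prior.
        return [1.0 / len(clipped)] * len(clipped)
    return [h / total for h in clipped]


# Toy estimates for a three-SNP locus (made-up values).
print(prior_causal_probabilities([2.0, 1.0, 1.0]))  # prints [0.5, 0.25, 0.25]
```

A fine-mapping method that accepts per-variant priors could then use these values in place of a flat prior, so variants in enriched annotations start with more prior weight.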
Recent work has highlighted the importance of accounting for linkage disequilibrium (LD)-dependent genetic architectures in analyses of heritability [1-5]. Two models incorporating LD-dependent architectures have been proposed for analyses of functional enrichment: the baseline-LD model [4] used by stratified LD score regression [4,6] (S-LDSC), and the LDAK model [1,3]. Although both models include LD-dependent effects, they produce very different estimates of functional enrichment (e.g. 9.35x±0.80 in ref. 4 and 1.34x±0.26 in ref. 3 for conserved regions), leading to different interpretations of the functional architecture of complex traits. We performed a comprehensive set of formal model comparisons and empirical analyses to reconcile these findings. Each of these analyses supports the higher functional enrichment estimates of S-LDSC with the baseline-LD model; each paragraph below is detailed in a corresponding section of the Supplementary Note (also see Supplementary Tables 1-10 and Supplementary Figures 1-23 for detailed analyses).
We defined six heritability models, including the infinitesimal model that ref. 3 called the "GCTA model" [7], the baseline-LD model [4] combining functional annotations [6] with LD-dependent and minor allele frequency (MAF)-dependent architectures, and the LDAK model [3] combining LD-dependent and MAF-dependent architectures; notably, the baseline-LD and LDAK models employ very different LD-dependent architectures. For comparison purposes, we also defined the "α-model" [1] comprising only the MAF-dependent part of the LDAK model, the "Gazal-LD model" comprising only the LD-dependent and MAF-dependent parts of the baseline-LD model (analogous to the LDAK model), and the
* Correspondence should be addressed to S.G.