Abstract: Since 2005, genome-wide association (GWA) datasets have been largely biased toward sampling European ancestry individuals, and recent studies have shown that GWA results estimated from European ancestry individuals apply heterogeneously in non-European ancestry individuals. Here, we argue that enrichment analyses which aggregate SNP-level association statistics at multiple genomic scales—to genes and pathways—have been overlooked and can generate biologically interpretable hypotheses regarding the genetic basi…
“…As discussed in ref. 43, it is more ideal to consider the ancestry-trait-specific Bonferroni-corrected significance threshold. In our study, we only consider the Taiwanese population, and the maximum number of tested SNPs is 5,981,581 for all traits.…”
To explore the complex genetic architecture of common diseases and traits, we conducted a comprehensive PheWAS of ten diseases and 34 quantitative traits in the community-based Taiwan Biobank (TWB). We identified 995 significantly associated loci, with 135 novel loci specific to the Taiwanese population. Further analyses highlighted the genetic pleiotropy of loci related to complex disease and associated quantitative traits. An extensive analysis of glycaemic phenotypes (T2D, fasting glucose and HbA1c) identified 115 significant loci, with four novel genetic variants (HACL1, RAD21, ASH1L and GAK). Transcriptomic data also strengthen the relevance of the findings to metabolic disorders, contributing to a better understanding of pathogenesis. In addition, genetic risk scores were constructed and validated for absolute risk prediction of T2D in the Taiwanese population. In conclusion, our data-driven approach, which requires no a priori hypothesis, is useful for novel gene discovery and validation, as well as disease risk prediction, in a unique non-European population.
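The Bonferroni-corrected threshold mentioned in the excerpt above can be worked out directly from the quoted SNP count. The following is a minimal sketch: the family-wise error rate of 0.05 is an assumption (the excerpt does not state it), and the function name `bonferroni_threshold` is hypothetical.

```python
# Bonferroni-corrected significance threshold for a single-population GWAS.
# ASSUMPTION: a family-wise error rate of 0.05; the excerpt does not state it.
ALPHA = 0.05

# Maximum number of tested SNPs, as quoted in the excerpt.
N_TESTED_SNPS = 5_981_581

# Conventional genome-wide significance threshold, shown for comparison only.
CONVENTIONAL_THRESHOLD = 5e-8


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test p-value threshold alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTED_SNPS)
print(f"Bonferroni threshold: {threshold:.3e}")
print(f"Stricter than 5e-8:   {threshold < CONVENTIONAL_THRESHOLD}")
```

Under these assumptions the threshold is stricter than the conventional 5e-8 cutoff, because the number of tested SNPs exceeds the one million independent tests that the conventional cutoff implies.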
“…Importantly, our method assumes only that causal genes for complex traits are shared across ancestries while making no assumptions on underlying eQTL architectures across ancestries. This is an important feature of our method considering recent findings that SNP-level replication across genetic ancestries is weaker than gene-level replication 36, and that only ∼30% of SNP-gene expression associations are shared between European- and African-American ancestry 39. Through extensive simulations, we demonstrate that MA-FOCUS’ ability to identify causal genes is superior to baseline approaches and is robust to data-dependent limitations (see Methods).…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Instead, MA-FOCUS assumes only that the causal genes for a focal trait or disease are shared across ancestries. It is expected that gene-level effects are likely more transferable across ancestry groups than SNP-level effects as genes are inherently a more meaningful biological unit 36. As a result, MA-FOCUS leverages cross-ancestry heterogeneity in LD patterns and eQTL associations to identify causal genes with improved precision and accuracy when compared with alternative approaches.…”
Transcriptome-wide association studies (TWAS) are a powerful approach to identify genes whose expression associates with complex disease risk. However, non-causal genes can exhibit association signals due to confounding by linkage disequilibrium patterns (LD) and eQTL pleiotropy at genomic risk regions which necessitates fine-mapping of TWAS signals. Here, we present MA-FOCUS, a multi-ancestry framework for the improved identification of genes underlying traits of interest. We demonstrate that by leveraging differences in ancestry-specific patterns of LD and eQTL signals, MA-FOCUS consistently outperforms single-ancestry fine-mapping approaches with equivalent total sample size across multiple metrics. We perform 15 blood trait TWAS using genome-wide summary statistics (average N_EA=511k, N_AA=13k) and lymphoblastoid cell line eQTL data from cohorts of primarily European and African continental ancestries. We recapitulate evidence demonstrating shared genetic architectures for eQTL and blood traits between the two ancestry groups and observe that gene-level effects correlate 20% more strongly across ancestries compared with SNP-level effects. We perform fine-mapping using MA-FOCUS and find evidence that genes at TWAS risk regions are more likely to be shared across ancestries rather than ancestry-specific. Using multiple lines of evidence to validate our findings, we find gene sets produced by MA-FOCUS are more enriched in hematopoietic categories compared to alternative approaches (P=1.73e-16). Our work demonstrates that including, and appropriately accounting for, genetic diversity can drive deeper insights into the genetic architecture of complex traits.
“…Individuals with HBA1C readings of 42-48 mmol/mol, a range associated with prediabetes, were not included in the analysis. Ancestry Mismatch Experiment: Individuals were first divided on the basis of ancestry, as in Smith et al 2021, identifying 349,411 individuals of self-identified European descent, and 4,967 individuals of African descent. The latter of which were identified both by self-identification and by an ADMIXTURE analysis as described in (Smith et al 2021).…”
Section: Unknown Class Example: Wheat Seeds Dataset
Citation type: mentioning
confidence: 99%
“…Ancestry Mismatch Experiment: Individuals were first divided on the basis of ancestry, as in Smith et al 2021, identifying 349,411 individuals of self-identified European descent, and 4,967 individuals of African descent. The latter of which were identified both by self-identification and by an ADMIXTURE analysis as described in (Smith et al 2021). Applying the HBA1C filter described above resulted in 8,631 individuals in the European/elevated cohort, 268 individuals in the African/elevated cohort, 243,283 individuals in the European/normal cohort and 2,532 individuals in the African/normal cohort.…”
Section: Unknown Class Example: Wheat Seeds Dataset
Citation type: mentioning
Machine learning has become an important tool across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question thus arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe reliability scores, a new concept for scientific machine learning studies that assesses the ability of a classifier to produce a reliable classification for a given instance. We develop a specific implementation of a reliability score, based on our work in Sugden et al. 2018 in which we introduced SWIF(r), a generative classifier for detecting selection in genomic data. We call our implementation the SWIF(r) Reliability Score (SRS), and demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that are missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how interpretability tools for machine learning like the SRS can allow researchers to interrogate their data thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. 
We hope that this tool, and the surrounding discussion, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological understanding.
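The cohort sizes quoted in the ancestry mismatch excerpt above can be summarized with a short calculation. This is only a descriptive sketch built from the four counts in the excerpt; the dictionary layout and the helper name `elevated_fraction` are hypothetical, and the derived fractions are not figures reported by the cited paper.

```python
# Cohort sizes after the HbA1c filter, as quoted in the excerpt.
cohorts = {
    "European": {"elevated": 8_631, "normal": 243_283},
    "African": {"elevated": 268, "normal": 2_532},
}


def elevated_fraction(counts: dict[str, int]) -> float:
    """Fraction of a cohort that falls in the elevated-HbA1c class."""
    total = counts["elevated"] + counts["normal"]
    return counts["elevated"] / total


# Per-ancestry totals and class balance.
totals = {name: sum(counts.values()) for name, counts in cohorts.items()}
fractions = {name: elevated_fraction(counts) for name, counts in cohorts.items()}

for name in cohorts:
    print(f"{name}: n={totals[name]:,}, elevated fraction={fractions[name]:.4f}")

# How many times larger the European cohort is than the African cohort.
size_ratio = totals["European"] / totals["African"]
print(f"European/African sample-size ratio: {size_ratio:.1f}")
```

The two cohorts differ both in size and in class balance, which is the kind of train/test mismatch the excerpt describes.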