The methodology of transcriptome‐wide association studies (TWAS) has become popular in integrating a reference expression quantitative trait (eQTL) data set with an independent main GWAS data set to identify (putatively) causal genes, shedding mechanistic insights to biological pathways from genetic variants to a GWAS trait mediated by gene expression. Statistically TWAS is a (two‐sample) 2‐stage least squares (2SLS) method in the framework of instrumental variables analysis for causal inference: in Stage 1 it uses the reference eQTL data to impute a genes expression for the main GWAS data, then in Stage 2 it tests for association between the imputed gene expression and the GWAS trait; if an association is detected in Stage 2, a (putatively) causal relationship between the gene and the GWAS trait is claimed. If a nonlinear model or a generalized linear model (GLM) is fitted in Stage 2 (e.g., for a binary GWAS trait), it is known that using only imputed gene expression, as in standard TWAS, in general does not lead to a consistent (i.e., asymptotically unbiased) estimate for the causal effect; accordingly, a variation of 2SLS, called two‐stage residual inclusion (2SRI), has been proposed to yield better estimates (e.g., being consistent under suitable conditions). Our main goal is to investigate whether it is necessary or even better to apply 2SRI, instead of the standard 2SLS. In addition, due to the use of imputed gene expression (i.e., with measurement errors), it is known that in general some correction to the standard error estimate of the causal effect estimate has to be applied, while in the standard TWAS no correction is applied. Is this an issue? We also compare one‐sample 2SLS with two‐sample 2SLS (i.e., the standard TWAS). We used the Alzheimer's Disease Neuroimaging Initiative (ADNI) data and simulated data mimicking the ADNI data to address the above questions. At the end, we conclude that, in practice with the large sample sizes and small effect sizes of genetic variants, the standard TWAS performs well and is recommended.
Orienting the causal relationship between pairs of traits is a fundamental task in scientific research with significant implications in practice, such as in prioritizing molecular targets and modifiable risk factors for developing therapeutic and interventional strategies for complex diseases. A recent method, called Steiger’s method, using a single SNP as an instrument variable (IV) in the framework of Mendelian randomization (MR), has since been widely applied. We report the following new contributions. First, we propose a single SNP-based alternative, overcoming a severe limitation of Steiger’s method in simply assuming, instead of inferring, the existence of a causal relationship. We also clarify a condition necessary for the validity of the methods in the presence of hidden confounding. Second, to improve statistical power, we propose combining the results from multiple, and possibly correlated, SNPs as multiple instruments. Third, we develop three goodness-of-fit tests to check modeling assumptions, including those required for valid IVs. Fourth, by relaxing one of the three IV assumptions in MR, we propose several methods, including an Egger regression-like approach and its multivariable version (analogous to multivariable MR), to account for horizontal pleiotropy of the SNPs/IVs, which is often unavoidable in practice. All our methods can simultaneously infer both the existence and (if so) the direction of a causal relationship, largely expanding their applicability over that of Steiger’s method. Although we focus on uni-directional causal relationships, we also briefly discuss an extension to bi-directional relationships. Through extensive simulations and an application to infer the causal directions between low density lipoprotein (LDL) cholesterol, or high density lipoprotein (HDL) cholesterol, and coronary artery disease (CAD), we demonstrate the superior performance and advantage of our proposed methods over Steiger’s method and bi-directional MR. In particular, after accounting for horizontal pleiotropy, our method confirmed the well known causal direction from LDL to CAD, while other methods, including bi-directional MR, might fail.
Transcriptome-wide association study (TWAS) has become popular in integrating a reference eQTL dataset with an independent main GWAS dataset to identify (putatively) causal genes, shedding mechanistic insights to biological pathways from genetic variants to a GWAS trait mediated by gene expression. Statistically TWAS is a (two-sample) 2-stage least squares (2SLS) method in the framework of instrumental variables analysis for causal inference: in Stage 1 it uses the reference eQTL data to impute a gene’s expression for the main GWAS data, then in Stage 2 it tests for association between the imputed gene expression and the GWAS trait; if an association is detected in Stage 2, a (putatively) causal relationship between the gene and the GWAS trait is claimed. If a non-linear model or a generalized linear model (GLM) is fitted in Stage 2 (e.g. for a binary GWAS trait), it is known that using only imputed gene expression, as in standard TWAS, in general does not lead to a consistent (i.e. asymptotically unbiased) estimate for the causal effect; accordingly, a variation of 2SLS, called two-stage residual inclusion (2SRI), has been proposed to yield better estimates (e.g. being consistent under suitable conditions). Our main goal is to investigate whether it is necessary or even better to apply 2SRI, instead of the standard 2SLS. In addition, due to the use of imputed gene expression (i.e. with measurement errors), it is known that in general some correction to the standard error estimate of the causal effect estimate has to be applied, while in the standard TWAS no correction is applied. Is this an issue? We also compare one-sample 2SLS with two-sample 2SLS (i.e. the standard TWAS). We used the ADNI data and simulated data mimicking the ADNI data to address the above questions. At the end, we conclude that, in practice with the large sample sizes and small effect sizes of genetic variants, the standard TWAS performs well and is recommended.
Orienting the causal relationship between pairs of traits is a fundamental task in scientific research with significant implications in practice, such as in prioritizing molecular targets and modifiable risk factors for developing therapeutic and interventional strategies for complex diseases. A recent method, called Steiger’s method, using a single SNP as an instrument variable (IV) in the framework of Mendelian randomization (MR), has since been widely applied. We report the following new contributions. First, we propose a single SNP-based alternative, overcoming a severe limitation of Steiger’s method in simply assuming, instead of inferring, the existence of a causal relationship. We also clarify a condition necessary for the validity of the methods in the presence of hidden confounding. Second, to improve statistical power, we propose combining the results from multiple, and possibly correlated, SNPs. as multiple instruments. Third, we develop three goodness-of-fit tests to check modeling assumptions, including those required for valid IVs. Fourth, by relaxing one of the three IV assumptions in MR, we propose methods, including one Egger regression-like approach and its multivariable version (analogous to multivariable MR), to account for horizontal pleiotropy of the SNPs/IVs, which is often unavoidable in practice. All our methods can simultaneously infer both the existence and (if so) the direction of a causal relationship, largely expanding their applicability over that of Steiger’s method. Although we focus on uni-directional causal relationships, we also briefly discuss an extension to bi-directional relationships. Through extensive simulations and an application to infer the causal directions between low density lipoprotein (LDL) cholesterol, or high density lipoprotein (HDL) cholesterol, and coronary artery disease (CAD), we demonstrate the superior performance and advantage of our proposed methods over Steiger’s method and bi-directional MR. In particular, after accounting for horizontal pleiotropy, our method confirmed the well known causal direction from LDL to CAD, while other methods, including bi-directional MR, failed.Author SummaryIn spite of its importance, due to technical challenges, orienting causal relationships between pairs of traits has been largely under-studied. Mendelian randomization (MR) Steiger’s method has become increasingly used in the last two years. Here we point out several limitations with MR Steiger’s method and propose alternative approaches. First, MR Steiger’s method is based on using only one single SNP as the instrument variable (IV), for which we propose a correlation ratio-based method, called Causal Direction-Ratio, or simply CD-Ratio. An advantage of CD-Ratio is its inference of both the existence and (if so) the direction of a causal relationship, in contrast to MR Steiger’s prior assumption of the existence and its poor performance if the assumption is violated. Furthermore, CD-Ratio can be extended to combine the results from multiple, possibly correlated, SNPs with improved statistical power. Second, we propose two methods, called CD-Egger and CD-GLS, for multiple and possibly correlated SNPs while allowing horizontal pleiotropy. Third, we propose three goodness-of-fit tests to check modeling assumptions for the three proposed methods. Finally, we introduce multivariable CD-Egger, analogous to multivariable MR, as a more robust approach, and an extension of CD-Ratio to cases with possibly bi-directional causal relationships. Our numerical studies demonstrated superior performance of our proposed methods over MR Steiger and bi-directional MR. Our proposed methods, along with freely available software, are expected to be useful in practice for causal inference.
Transcriptomewide association studies (TWAS) integrate genomewide association study(GWAS) data with gene expression (GE) data to identify (putative) causal genes for complex traits. There are two stages in TWAS: in Stage 1, a model is built to impute gene expression from genotypes, and in Stage 2, genetrait association is tested using imputed gene expression. Despite many successes with TWAS, in the current practice one only assumes a linear relationship between GE and the trait, which however may not hold, leading to loss of power. In this study, we extend the standard TWAS by considering a quadratic effect of GE, in addition to the usual linear effect. We train imputation models for both linear and quadratic gene expression levels in Stage 1, then include both the imputed linear and quadratic expression levels in Stage 2. We applied both the standard TWAS and our approach first to the ADNI gene expression data and the IGAP Alzheimer’s disease GWAS summary data, then to the GTEx (V8) gene expression data and the UK Biobank individuallevel GWAS data for lipids, followed by validation with different GWAS data, suitable model checking and more robust TWAS methods. In all these applications, the new TWAS approach was able to identify additional genes associated with Alzheimer’s disease (AD), LDL and HDL cholesterol levels, suggesting its likely power gains and thus the need to account for potentially nonlinear effects of gene expression on complex traits.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.