A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS

Dey, Rounak; Schmidt, Ellen M.; Abecasis, Gonçalo R.; Lee, Seunggeun

doi:10.1016/j.ajhg.2017.05.014

Cited by 128 publications

(132 citation statements)

References 40 publications

Supporting

Mentioning

113

Contrasting

“…A common goal of EHR‐based analyses is to study the associations between specific phenotypes and variants at a particular gene region or across the genome, and this analysis is often performed using linear or logistic regression or using mixed linear model association (MLMA) analysis. Firth‐corrected logistic regression may prove useful for modeling rare binary outcomes or settings in which there is strong covariate separation, and its application to PheWAS is demonstrated in Fritsche et al. Recently, Dey et al proposed a fast alternative to Firth‐penalized regression to stabilize estimation for PheWAS studies using saddle‐point approximation (SPA) that is useful for handling extremely unbalanced case‐control data. These methods can be applied in many other modeling settings as well.…”

Section: Statistical Issues Related To Biobank Research (mentioning)

confidence: 99%

The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities

Beesley, Salvatore, Fritsche, et al. 2019

Statistics in Medicine

Biobanks linked to electronic health records provide rich resources for health‐related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large‐scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis‐generating studies of disease‐treatment, disease‐exposure, and disease‐gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank‐based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with two different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some standing open problems. This work complements and extends recent reviews about biobank‐based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.

“…We consider $J$ case–control studies, where the $j$th study has sample size $n_{j}$. Within each individual study, we follow the regression model and testing procedure described in Dey et al (). For the $i$th subject in the $j$th study, let $Y_{i}^{(j)} = 1$ or $0$ denote the case–control status, $X_{i}^{(j)}$ denote the $k \times 1$ vector of nongenetic covariates (including the intercept) and $G_{i}^{(j)} = 0, 1, 2$ denote the number of minor alleles of the variant to be tested.…”

Section: Methods (mentioning)

confidence: 99%

Robust meta‐analysis of biobank‐based genome‐wide association studies with unbalanced binary phenotypes

Dey, Nielsen, Fritsche, et al. 2019

Genetic Epidemiology

Self Cite

With the availability of large‐scale biobanks, genome‐wide scale phenome‐wide association studies are instrumental in discovering novel genetic variants associated with clinical phenotypes. As an increasing number of such association results from different biobanks become available, methods to meta‐analyse those association results are of great interest. Because the binary phenotypes in biobank‐based studies are mostly unbalanced in their case–control ratios, very few methods can provide well‐calibrated tests for associations. For example, traditional Z‐score‐based meta‐analysis often results in conservative or anticonservative Type I error rates in such unbalanced scenarios. We propose two meta‐analysis strategies that can efficiently combine association results from biobank‐based studies with such unbalanced phenotypes, using the saddlepoint approximation‐based score test method. Our first method involves sharing the overall genotype counts from each study, and the second method involves sharing an approximation of the distribution of the score test statistic from each study using cubic Hermite splines. We compare our proposed methods with a traditional Z‐score‐based meta‐analysis strategy using numerical simulations and real data applications, and demonstrate the superior performance of our proposed methods in terms of Type I error control.

“…Recently, researchers at Geisinger Health System and the Michigan Genomics Initiative (MGI) conducted separate genome-wide PheWAS using clinical data from EHR[48,49]. Verma et al used PheWAS to investigate all common variants on the Illumina HumanCoreExome chip and clinical laboratory measures from ~12,000 European American individuals[48].…”

Section: Genome-wide Phewas (mentioning)

confidence: 99%

“…Subsequently, they tested the significant SNPs from the clinical lab PheWAS with 541 diagnosis codes[48]. Dey et al demonstrated the application of a new statistical method (Table 1) for PheWAS and tested ~30 million imputed SNPs with 1500 EHR based PheWAS codes[49]. Dey et al also proposed a new method for binary outcomes, called SPAtest, which is a variation of logistic regression that estimates p-values using saddlepoint approximation.…”

Section: Genome-wide Phewas (mentioning)

confidence: 99%

“…Dey et al also proposed a new method for binary outcomes, called SPAtest, which is a variation of logistic regression that estimates p-values using saddlepoint approximation. The authors demonstrate that this approximation method is more computationally efficient than traditional regression methods[49]. This approach can be computationally efficient for large-scale genome-wide PheWAS, especially for studies with an unbalanced case-control ratio[49].…”

Section: Genome-wide Phewas (mentioning)

confidence: 99%

Current Scope and Challenges in Phenome-Wide Association Studies

Verma, Ritchie 2017

Curr Epidemiol Rep

Purpose of Review: Over many decades, researchers have been designing studies to investigate the relationship between genotypes and phenotypes to gain an understanding about the effect of genetics on disease. Recently, a high-throughput approach called phenome-wide association studies (PheWAS) has been extensively used to identify associations between genetic variants and many diseases and traits simultaneously. In this review, we describe the value of PheWAS along with methodological issues and challenges in interpretation for current applications of PheWAS. Recent Findings: PheWAS have uncovered a paradigm to identify new associations for genetic loci across many diseases. The application of PheWAS has been effective with phenotype data from electronic health records, epidemiological studies, and clinical trials data. Summary: The key strength of a PheWAS is to identify the association of one or more genetic variants with multiple phenotypes, which can showcase interconnections among the phenotypes due to shared genetic associations. While the PheWAS approach appears promising, there are a number of challenges that need to be addressed to provide additional robustness to PheWAS findings.
