Motivation: The emergence of genetic data coupled to longitudinal electronic medical records (EMRs) offers the possibility of phenome-wide association scans (PheWAS) for disease–gene associations. We propose a novel method to scan phenomic data for genetic associations using International Classification of Diseases (ICD9) billing codes, which are available in most EMR systems. We have developed a code translation table to automatically define 776 different disease populations and their controls using prevalent ICD9 codes derived from EMR data. As a proof of concept of this algorithm, we genotyped the first 6005 European–Americans accrued into BioVU, Vanderbilt's DNA biobank, at five single nucleotide polymorphisms (SNPs) with previously reported associations with seven diseases: atrial fibrillation, Crohn's disease, carotid artery stenosis, coronary artery disease, multiple sclerosis, systemic lupus erythematosus and rheumatoid arthritis. The PheWAS software generated case and control populations across all ICD9 code groups for each of these five SNPs, and disease–SNP associations were analyzed. The primary outcome of this study was replication of seven previously known SNP–disease associations for these SNPs. Results: The PheWAS algorithm replicated four of seven known SNP–disease associations, with P-values between 2.8 × 10⁻⁶ and 0.011. The PheWAS algorithm also identified 19 previously unknown statistical associations between these SNPs and diseases at P < 0.01. This study indicates that PheWAS analysis is a feasible method to investigate SNP–disease associations. Further evaluation is needed to determine the validity of these associations and the appropriate statistical thresholds for clinical significance. Availability: The PheWAS software and code translation table are freely available at http://knowledgemap.mc.vanderbilt.edu/research. Contact: josh.denny@vanderbilt.edu
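The scan described in this abstract (map billing codes to disease groups, split patients into cases and controls per group, test one SNP against each group) can be sketched roughly as follows. This is an illustration only, not the published PheWAS software: the three-row `CODE_TABLE`, the helper names, the choice of an allelic chi-square test, and the rule that every patient without a code in a group counts as a control are all assumptions made for the example.

```python
import math

# Hypothetical excerpt of a code translation table (ICD9 prefix -> disease group).
# The published table defines 776 groups; these rows are illustrative only.
CODE_TABLE = {
    "427.3": "atrial fibrillation",
    "555": "Crohn's disease",
    "714": "rheumatoid arthritis",
}


def phenotype_for(icd9_code):
    """Map a billing code to a disease group via its longest matching prefix."""
    for end in range(len(icd9_code), 0, -1):
        group = CODE_TABLE.get(icd9_code[:end])
        if group is not None:
            return group
    return None


def allelic_chi2(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square (1 df) on a 2x2 allele-count table; returns (chi2, p)."""
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty margin carries no information
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def phewas(codes_by_patient, minor_alleles_by_patient):
    """Test one SNP (minor-allele count 0/1/2 per patient) against every group."""
    groups_by_patient = {
        pid: {g for g in map(phenotype_for, codes) if g is not None}
        for pid, codes in codes_by_patient.items()
    }
    all_groups = set().union(*groups_by_patient.values())
    results = {}
    for group in sorted(all_groups):
        # Order: case minor, case major, control minor, control major.
        counts = [0, 0, 0, 0]
        for pid, minor in minor_alleles_by_patient.items():
            # Simplifying assumption: anyone without the group is a control.
            offset = 0 if group in groups_by_patient.get(pid, set()) else 2
            counts[offset] += minor
            counts[offset + 1] += 2 - minor
        results[group] = allelic_chi2(*counts)
    return results
```

Calling `phewas(codes_by_patient, minor_alleles_by_patient)` returns a dictionary from disease group to a `(chi2, p)` pair, which could then be filtered at a threshold such as the P < 0.01 mentioned above.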
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10⁻⁶ (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
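The abstract reports a P-value cutoff corresponding to a false discovery rate below 0.1 but does not say which procedure was used. As one common possibility, the Benjamini-Hochberg step-up procedure can be sketched as below; the function name and the choice of procedure are assumptions for illustration, not a description of the study's analysis.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return booleans marking which hypotheses are rejected at the given FDR."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under rank/m * fdr.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

The largest rejected p-value then plays the role of a study-wide significance cutoff, similar in spirit to the threshold quoted above.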
Background. Immune checkpoint inhibitors (ICIs) have dramatically improved clinical outcomes in multiple cancer types and are increasingly being used in early disease settings and in combinations. However, ICIs can also cause severe or even fatal immune-mediated adverse events (irAEs). Here, we identify and characterize significant cardiovascular irAEs (CV-irAEs) associated with ICIs. Methods. We used VigiBase, the WHO's global Individual Case Safety Report database, to identify drug-AEs related to ICIs (n:31,321) and related to other drugs (n:16,343,451) through 01/2018. We evaluated the association between ICIs and CV events using the Reporting Odds Ratio (ROR) and the Information Component (IC). The IC is a Bayesian disproportionality indicator that compares observed and expected values to find drug-AE associations. IC025 is the lower end of the IC 95% credibility interval, and an IC025 > 0 is considered statistically significant. Findings. Using this agnostic approach, we identified multiple CV entities over-reported after ICI treatment compared to the entire database. ICI treatment was associated with higher reporting of myocarditis (n:122, ROR: 11.21 [9.36–13.43], IC025:3.2), pericardial diseases (n:95, ROR: 3.8 [3.08–4.62], IC025:1.63), and vasculitis (n:82, ROR: 1.56 [1.25–1.94], IC025:0.03), including temporal arteritis (n:18, ROR: 12.99 [8.12–20.77], IC025:2.59). These CV-irAEs affected mostly men (58–67%), with a wide age range (20–90 years), and occurred early after ICI administration (40–80% within one month of first ICI administration). Pericardial disorders were reported more often in patients with lung cancer (56.3%), whereas myocarditis and vasculitis were more commonly reported in patients with melanoma (40.7% and 60%, respectively; p<0.001). Vision was impaired in 27.8% of temporal arteritis cases. CV-irAEs were serious in the majority of cases (>80%), with fatalities occurring in 50% of myocarditis cases, 21.1% of pericardial disorders and 6.1% of vasculitis (p<0.0001). Among myocarditis cases, fatality was more frequent with ICI combination therapy than with ICI monotherapy (65.6% vs. 44.4%, p:0.04). Interpretation. ICIs may lead to severe and disabling inflammatory CV-irAEs early during therapy. Besides life-threatening myocarditis, these toxicities include pericardial disorders, as well as temporal arteritis with a risk for blindness.
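The two disproportionality measures named in the Methods can be illustrated from a 2x2 table of report counts. The sketch below uses the standard ROR with a Wald confidence interval on the log scale and the commonly used shrinkage form of the IC point estimate, log2((observed + 0.5) / (expected + 0.5)). The function names are made up for this example, and the IC025 credibility bound reported in the abstract is not computed here.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """ROR with a Wald 95% CI from a 2x2 table of report counts.

    a: drug of interest with the event    b: drug of interest without it
    c: other drugs with the event         d: other drugs without it
    All four counts are assumed to be positive.
    """
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(ROR)
    log_ror = math.log(ror)
    return ror, math.exp(log_ror - z * se), math.exp(log_ror + z * se)


def information_component(a, b, c, d):
    """IC point estimate: log2 of shrunken observed-to-expected report counts."""
    n_total = a + b + c + d
    # Expected count if drug and event were reported independently.
    expected = (a + b) * (a + c) / n_total
    return math.log2((a + 0.5) / (expected + 0.5))
```

An ROR interval that excludes 1, or a positive IC lower bound, is what flags an event as over-reported for the drug relative to the rest of the database.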
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for > 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly, producing large type I error rates, in the analysis of phenotypes with unbalanced case-control ratios. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of the generalized mixed model. The computation cost depends linearly on sample size, and hence the method is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
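To illustrate why a saddlepoint approximation helps with unbalanced traits, the sketch below approximates the upper tail of a score statistic S = Σ gᵢ(Yᵢ − μᵢ) with independent Bernoulli outcomes, using the cumulant generating function and a Barndorff-Nielsen style tail formula. It is a toy version under strong assumptions (fixed fitted means, no relatedness or mixed-model adjustment, one-sided tail, genotypes coded 0/1/2, and an observed statistic strictly inside its attainable range) and is not the SAIGE implementation; all function names are hypothetical.

```python
import math


def normal_tail(x):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def spa_upper_tail(g, mu, s):
    """Saddlepoint approximation to P(S >= s), S = sum_i g_i * (Y_i - mu_i)."""

    def k0(t):  # cumulant generating function K(t)
        return sum(math.log(1 - m + m * math.exp(gi * t)) - t * gi * m
                   for gi, m in zip(g, mu))

    def k1(t):  # first derivative K'(t)
        return sum(gi * m / ((1 - m) * math.exp(-gi * t) + m) - gi * m
                   for gi, m in zip(g, mu))

    def k2(t):  # second derivative K''(t)
        total = 0.0
        for gi, m in zip(g, mu):
            p = m / ((1 - m) * math.exp(-gi * t) + m)  # tilted success probability
            total += gi * gi * p * (1 - p)
        return total

    # Solve K'(t) = s by bisection; K' is increasing because K'' > 0.
    lo, hi = -30.0, 30.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if k1(mid) < s:
            lo = mid
        else:
            hi = mid
    t_hat = (lo + hi) / 2
    if abs(t_hat) < 1e-8:
        return 0.5  # s sits at the mean, where the tail formula is singular
    w = math.copysign(math.sqrt(2 * (t_hat * s - k0(t_hat))), t_hat)
    v = t_hat * math.sqrt(k2(t_hat))
    return normal_tail(w + math.log(v / w) / w)
```

With 100 carriers and a 1% event probability, for example, the statistic is strongly right-skewed, and the normal approximation gives a far smaller tail probability than the saddlepoint value; this is the kind of miscalibration the abstract attributes to unbalanced case-control ratios.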
This R package is freely available under the GNU General Public License (GPL-3) from http://phewascatalog.org. It is implemented in native R and is platform independent.
Many modern human genomes retain DNA inherited from interbreeding with archaic hominins, such as Neanderthals, yet the influence of this admixture on human traits is largely unknown. We analyzed the contribution of common Neanderthal variants to over 1,000 electronic health record (EHR)-derived phenotypes in ~28,000 adults of European ancestry. We discovered and replicated associations of Neanderthal alleles with neurological, psychiatric, immunological, and dermatological phenotypes. Neanderthal alleles together explain a significant fraction of the variation in risk for depression and skin lesions resulting from sun exposure (actinic keratosis), and individual Neanderthal alleles are significantly associated with specific human phenotypes, including hypercoagulation and tobacco use. Our results establish that archaic admixture influences disease risk in modern humans, provide hypotheses about the effects of hundreds of Neanderthal haplotypes and demonstrate the utility of EHR data in evolutionary analyses.
Objective: To compare three groupings of Electronic Health Record (EHR) billing codes for their ability to represent clinically meaningful phenotypes and to replicate known genetic associations. The three tested coding systems were the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes, the Agency for Healthcare Research and Quality Clinical Classification Software for ICD-9-CM (CCS), and manually curated "phecodes" designed to facilitate phenome-wide association studies (PheWAS) in EHRs. Methods and materials: We selected 100 disease phenotypes and compared the ability of each coding system to accurately represent them without performing additional groupings. The 100 phenotypes included 25 randomly chosen clinical phenotypes pursued in prior genome-wide association studies (GWAS) and another 75 common disease phenotypes mentioned across free-text problem lists from 189,289 individuals. We then evaluated the performance of each coding system in replicating known associations for 440 SNP-phenotype pairs. Results: Out of the 100 tested clinical phenotypes, phecodes exactly matched 83, compared to 53 for ICD-9-CM and 32 for CCS. ICD-9-CM codes were typically too detailed (requiring custom groupings) while CCS codes were often not granular enough. Among 440 tested known SNP-phenotype associations, use of phecodes replicated 153 SNP-phenotype pairs compared to 143 for ICD-9-CM and 139 for CCS. Phecodes also generally produced stronger odds ratios and lower p-values for known associations than ICD-9-CM and CCS. Finally, evaluation of several SNPs via PheWAS identified novel potential signals, some seen only when using the phecode approach. Among them, rs7318369 in PEPD was associated with gastrointestinal hemorrhage. Conclusion: Our results suggest that the phecode groupings better align with clinical diseases mentioned in clinical practice or for genomic studies. ICD-9-CM, CCS, and phecode groupings all worked for PheWAS-type studies, though the phecode groupings produced superior results.