FlashPCA2: principal component analysis of biobank-scale genotype datasets

Abraham, Gad; Qiu, Yixuan; Inouye, Michael

doi:10.1101/094714

Cited by 108 publications

(137 citation statements)

References 11 publications

Mentioning: 136

“…We compared three different principal component analysis (PCA) methods using our simulated genotype data, namely flashPCA2 (or pruned PCA, with a recommended pruning step and a projection step, see URL and Ref. 19 ), exact PCA (implemented in GCTA using all the variants without pruning, see Ref. 20 ), and projection PCA (proj.…”

Section: Supplementary Note 6 Principal Component Analysis (mentioning)

confidence: 99%

A resource-efficient tool for mixed model association analysis of large-scale data

et al. 2019


Here we reiterate the fastGWA model

$$\mathbf{y} = \mathbf{x}_{snp}\beta_{snp} + \mathbf{X}_c\boldsymbol{\beta}_c + \mathbf{g} + \mathbf{e} \qquad [\mathrm{S1}]$$

where $\mathbf{y}$ is an $n \times 1$ vector of mean centred phenotypes with $n$ being the sample size; $\mathbf{x}_{snp}$ is a vector of mean-centred genotype variables of a variant of interest with its effect $\beta_{snp}$; $\mathbf{X}_c$ is the incidence matrix of fixed covariates with their corresponding coefficients $\boldsymbol{\beta}_c$; $\mathbf{g}$ is a vector of the total genetic effects captured by pedigree relatedness with $\mathbf{g} \sim N(\mathbf{0}, \boldsymbol{\pi}\sigma_g^2)$; $\boldsymbol{\pi}$ is the family relatedness matrix based on pedigree structure; $\mathbf{e}$ is a vector of residuals with $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$. The variance-covariance matrix of $\mathbf{y}$ is $\mathbf{V} = \boldsymbol{\pi}\sigma_g^2 + \mathbf{I}\sigma_e^2$ and the generalized least squares estimate of […]. Therefore, to test whether $\beta_{snp} = 0$, we first need to estimate the variance components $\sigma_g^2$ and $\sigma_e^2$. As in most existing MLM-based association tools 1-7 , to avoid running the variance estimation analysis repeatedly for each target variant, we estimate $\sigma_g^2$ and $\sigma_e^2$ under the null model […], assuming the effect of a single variant on $\hat{\sigma}_g^2$ is negligible. The REML log-likelihood ($L$) function of model [2] can be written as […]. Conventional REML algorithms such as the average information (AI) 8 involve the computations of $\mathbf{V}^{-1}$, $\mathbf{P}$ and $\mathbf{P}\boldsymbol{\pi}$, which is computationally intensive when $n$ is large even if $\boldsymbol{\pi}$ is sparse. Here we describe an algorithm (termed as fastGWA-REML) that uses grid search to estimate $\sigma_g^2$ without the need to compute $\mathbf{V}^{-1}$, $\mathbf{P}$ and $\mathbf{P}\boldsymbol{\pi}$. For ease of computation, we first adjust the phenotype for covariates by linear regression (let $\mathbf{y}_{adj}$ denote a vector of phenotypes after adjustment). We can rewrite $L$ as −[…], with $\mathbf{1}$ being an $n \times 1$ vector of 1's. All the elements in $L$ including $|\mathbf{V}|$, $\mathbf{V}^{-1}\mathbf{1}$ and $\mathbf{V}^{-1}\mathbf{y}_{adj}$ can be computed efficiently by the Cholesky decomposition of $\mathbf{V}$ (without the need of computing $\mathbf{V}^{-1}$) in sparse matrix setting.
Because the computation of $L$ is extremely fast, we can use a grid search to obtain an estimate of $\sigma_g^2$ (note that $\hat{\sigma}_e^2$ can be computed as $\hat{\sigma}_p^2 - \hat{\sigma}_g^2$ with $\hat{\sigma}_p^2$ being the empirical variance of phenotype after adjustment). The rationale underlying this grid-search method is similar to that in Runcie et al. 9 . We compute the log-likelihood scores given a grid of possible values of $\hat{\sigma}_g^2$ (e.g., $\hat{\sigma}_g^2 \in [0, 1.6\hat{\sigma}_p^2]$ with 100 steps, i.e., a step size of $0.016\hat{\sigma}_p^2$). Note that we define an upper limit to be larger than $\hat{\sigma}_p^2$ to accommodate rare scenarios where the estimate of $\hat{\sigma}_g^2$ from the fastGWA model can be larger than $\hat{\sigma}_p^2$ if the true heritability is large in the presence of substantial common environmental effects. Next, we refine the search in a window around the $\hat{\sigma}_g^2$ value that produces the highest log-likelihood score (denoted by $\hat{\sigma}_{g(max)}^2$) with a window size of $0.2\hat{\sigma}_{g(max)}^2$ and 16 steps. For example, if $\hat{\sigma}_{g(max)}^2 = 0.16\hat{\sigma}_p^2$, we will refine the search in $\hat{\sigma}_g^2 \in [0.144\hat{\sigma}_p^2, 0.176\hat{\sigma}_p^2]$ with 16 steps (i.e., a step size of $0.002\hat{\sigma}_p^2$). We repeat this process iteratively until the difference in $\hat{\sigma}_g^2$ with the highest log-likelihood score between two adjacent iterations is smalle...
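The coarse-then-refine grid search described above can be sketched as follows. This is a minimal illustration, not the fastGWA implementation: the function names, the toy log-likelihood, and the stopping tolerance `tol` are assumptions (the source text is cut off before stating the convergence threshold), and "16 steps" is read here as 17 evenly spaced points.

```python
def linspace(lo, hi, steps):
    """Return steps + 1 evenly spaced points from lo to hi inclusive."""
    return [lo + (hi - lo) * k / steps for k in range(steps + 1)]


def grid_search_variance(loglik, var_p, tol=1e-6, max_iter=50):
    """Maximise loglik over the genetic variance by coarse-then-refined grids.

    loglik: callable returning the log-likelihood for a candidate variance.
    var_p:  empirical phenotypic variance after covariate adjustment.
    tol:    assumed relative stopping tolerance (not given in the source).
    """
    # Coarse pass: 100 steps over [0, 1.6 * var_p], as in the text.
    best = max(linspace(0.0, 1.6 * var_p, 100), key=loglik)
    for _ in range(max_iter):
        # Window of total width 0.2 * best centred on best, split into 16 steps.
        half = 0.1 * best
        new_best = max(linspace(best - half, best + half, 16), key=loglik)
        if abs(new_best - best) < tol * var_p:
            return new_best
        best = new_best
    return best


# Hypothetical concave stand-in for the REML log-likelihood, peaked at 0.3.
def toy_loglik(v):
    return -(v - 0.3) ** 2


estimate = grid_search_variance(toy_loglik, var_p=1.0)
print(estimate)
```

Because the window width is proportional to the current best value, the final resolution is roughly 1% of the estimate rather than arbitrarily fine; a real implementation would plug in the REML log-likelihood evaluated via the sparse Cholesky factor.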


“…Genome-wide association analyses were conducted on the simulated data with 6 different methods. The simulated phenotypes were pre-adjusted by the top 10 PCs computed from a set of LD-pruned variants using flashPCA2 67 (Supplementary Note 6 and Supplementary Figure 20).…”

Section: Assessing False Positive Rate and Statistical Power (mentioning)

confidence: 99%

A resource-efficient tool for mixed model association analysis of large-scale data

Jiang, Zheng, et al. 2019

Preprint


The genome-wide association study (GWAS) has been widely used as an experimental design to detect associations between genetic variants and a phenotype. Two major confounding factors, population stratification and relatedness, could potentially lead to inflated GWAS test-statistics and thereby spurious associations. Mixed linear model (MLM)-based approaches can be used to account for sample structure. However, genome-wide association (GWA) analyses in biobank samples such as the UK Biobank (UKB) often exceed the capability of most existing MLM-based tools especially if the number of traits is large. Here, we developed an MLM-based tool (called fastGWA) that controls for population stratification by principal components and relatedness by a sparse genetic relationship matrix for GWA analyses of biobank-scale data. We demonstrated by extensive simulations that fastGWA is reliable, robust and highly resource-efficient. We then applied fastGWA to 2,173 traits on 456,422 array-genotyped and imputed individuals and 2,048 traits on 46,191 whole-exome-sequenced individuals in the UKB.
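The citing statement for this paper says the simulated phenotypes were pre-adjusted by the top 10 PCs. One common way to do such an adjustment is to take least-squares residuals of the phenotype on an intercept plus the PC columns; the sketch below assumes that approach (the source does not spell out the procedure) and uses hypothetical toy data with two PCs instead of ten. It relies only on the standard library and uses modified Gram-Schmidt, which requires linearly independent columns.

```python
def residualize(y, covariates):
    """Least-squares residuals of y on an intercept plus the given columns.

    covariates: list of columns (e.g. principal components), each of len(y).
    Assumes the intercept and the columns are linearly independent.
    """
    n = len(y)
    columns = [[1.0] * n] + [list(col) for col in covariates]
    basis = []
    for col in columns:
        v = col[:]
        # Remove the components along the orthonormal vectors found so far.
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    resid = list(y)
    # Project y off the column space one orthonormal direction at a time.
    for q in basis:
        coef = sum(a * b for a, b in zip(resid, q))
        resid = [a - coef * b for a, b in zip(resid, q)]
    return resid


# Hypothetical toy data: 6 samples, 2 "PCs".
pcs = [[-2.0, -1.0, 0.0, 0.0, 1.0, 2.0],
       [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]]
pheno = [1.2, 0.7, 2.9, 2.1, 4.3, 4.8]
adjusted = residualize(pheno, pcs)
print(adjusted)
```

The adjusted phenotype is uncorrelated with every PC by construction, so any association detected afterwards cannot be explained by a linear effect of those PCs.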


“…We compare our method to GWAS using logistic regression, defining hypertension cases as belonging to Stage 2 or higher as done in Warren et al (). We recalculated principal components using FlashPCA after filtering individuals and SNPs through quality control (QC) filters, because the subset of individuals we analyze is exclusively of British ancestry, and the original principal components were calculated before filtering (Abraham, Qiu, & Inouye, ). Our hypertension GWAS analysis includes the following covariates: sex, center, age, age², body mass index (BMI), and the top 10 principal components to adjust for ancestry/relatedness.…”

Section: Results (mentioning)

confidence: 99%

Ordered multinomial regression for genetic association analysis of ordinal phenotypes at Biobank scale

German, Sinsheimer, Klimentidis, et al. 2019

Genetic Epidemiology


Logistic regression is the primary analysis tool for binary traits in genome‐wide association studies (GWAS). Multinomial regression extends logistic regression to multiple categories. However, many phenotypes more naturally take ordered, discrete values. Examples include (a) subtypes defined from multiple sources of clinical information and (b) derived phenotypes generated by specific phenotyping algorithms for electronic health records (EHR). GWAS of ordinal traits have been problematic. Dichotomizing can lead to a range of arbitrary cutoff values, generating inconsistent, hard to interpret results. Using multinomial regression ignores trait value hierarchy and potentially loses power. Treating ordinal data as quantitative can lead to misleading inference. To address these issues, we analyze ordinal traits with an ordered, multinomial model. This approach increases power and leads to more interpretable results. We derive efficient algorithms for computing test statistics, making ordinal trait GWAS computationally practical for Biobank scale data. Our method is available as a Julia package OrdinalGWAS.jl. Application to a COPDGene study confirms previously found signals based on binary case–control status, but with more significance. Additionally, we demonstrate the capability of our package to run on UK Biobank data by analyzing hypertension as an ordinal trait.
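To make the idea of an ordered multinomial model concrete, the sketch below computes category probabilities under the widely used proportional-odds (cumulative logit) form, where P(Y ≤ j) = logistic(θ_j − η) for increasing thresholds θ_j and linear predictor η. The abstract only states that an "ordered, multinomial model" is used, so this specific parameterisation, the function name, and the example numbers are assumptions for illustration and not the OrdinalGWAS.jl API.

```python
import math


def category_probs(eta, thresholds):
    """Return P(Y = j) for each ordered category under a cumulative-logit model.

    eta:        linear predictor (covariate and genotype effects combined).
    thresholds: strictly increasing cut points theta_1 < ... < theta_{J-1}.
    """
    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cum = [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities give P(Y = j).
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


# Hypothetical example with four ordered categories.
probs = category_probs(eta=0.5, thresholds=[-1.0, 0.0, 1.5])
print(probs)
```

A single coefficient vector shifts all cumulative probabilities together, which is how the ordering of the categories is used: a larger η moves probability mass toward the higher categories.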
