2020
DOI: 10.1371/journal.pgen.1008773

Scalable probabilistic PCA for large-scale genetic variation data

Abstract: Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently…
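The scalability claimed in the abstract comes from fitting a probabilistic PCA model with iterative updates that touch the genotype matrix only through products with thin k-column matrices, rather than forming a features-by-features covariance. A minimal generic sketch of this idea (the zero-noise-limit EM update for PCA; the NumPy implementation and variable names are illustrative, not ProPCA's actual code):

```python
import numpy as np

def em_pca(X, k, n_iter=50, seed=0):
    """EM iteration for the top-k PCs (zero-noise limit of probabilistic PCA).

    A generic sketch: each iteration costs O(n * d * k), i.e. a few passes
    over the data matrix, instead of building the d x d covariance.
    X is an (n_samples, n_features) centered matrix.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, k))
    for _ in range(n_iter):
        # E-step: latent coordinates given the current loadings
        Z = np.linalg.solve(W.T @ W, W.T @ X.T)        # (k, n)
        # M-step: re-estimate loadings given the latent coordinates
        W = X.T @ Z.T @ np.linalg.inv(Z @ Z.T)         # (d, k)
    # Orthonormalize the converged loadings to get principal directions
    Q, _ = np.linalg.qr(W)
    return Q

# Synthetic low-rank data with a clear top-2 signal
rng = np.random.default_rng(1)
signal = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 50))
X = 3.0 * signal + 0.05 * rng.standard_normal((200, 50))
Xc = X - X.mean(axis=0)
Q = em_pca(Xc, k=2)
```

Each iteration reads the data only through `X @ W`-style products, which is what makes this family of methods attractive when the matrix has hundreds of thousands of rows.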

Cited by 40 publications (41 citation statements)
References 48 publications
“…Even with a very large sample size, the inferred PCs are likely to be mostly noise beyond a few strong (often geographic) signals. Results from the UKB white British (WB) sample highlight this point (Figure 3): beyond the first 8 strongest PCs, PCs computed from a sample of 272,519 individuals (25) appear to be mainly driven by sampling noise and local LD within chromosomes. The noise can mask subtle population structure that can lead to confounding in GWAS even after PC adjustment (26).…”
Section: Introduction
confidence: 94%
“…For tool developers, it would be beneficial to consider using binarized counts in addition to original counts for developing new analysis tools. For tool users, binarized counts can be used for exploratory data analysis because some efficient computational tools are applicable to binary counts but not original counts, e.g., scalable probabilistic principal component analysis [108]. Given the relative advantages and disadvantages of using original, imputed, and binarized counts in scRNA-seq data analysis, systematic benchmarking of the three strategies is critical [111].…”
Section: Input Data: Original Vs Imputed Vs Binarized Counts
confidence: 99%
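As a concrete illustration of the binarization strategy the excerpt above describes (the counts below are synthetic, not from the cited benchmark), detection status can stand in for raw counts before a PCA step:

```python
import numpy as np

# Hypothetical example: binarize a cells x genes count matrix to
# detected/not-detected before PCA.
rng = np.random.default_rng(0)
counts = rng.poisson(0.5, size=(100, 300))   # synthetic cells x genes counts
binary = (counts > 0).astype(np.float64)     # 1 if gene detected in cell
centered = binary - binary.mean(axis=0)      # center each gene
# Top PCs of the binarized matrix via a plain SVD (fine at this small scale;
# scalable probabilistic PCA becomes relevant for much larger matrices)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :10] * s[:10]                     # cell scores on the top 10 PCs
```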
“…it is not negative semidefinite and has imaginary PCs. A model-based framework based on probabilistic PCA (Hastie et al, 2015, Meisner et al, 2021, Agrawal et al, 2020) would likely be able to generate consistent F-statistics and PCs, while incorporating sampling error and missing data.…”
Section: A B D C E
confidence: 99%
“…Probabilistic PCA is one class of approaches that aim to separate the population structure from sampling noise (e.g. Agrawal et al, 2020). It seems likely that probabilistic PCA would yield a representation of the data that corresponds more closely with F-statistics than regular PCA.…”
Section: Estimated Vs Observed Allele Frequencies
confidence: 99%
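The point made in the last two excerpts, that probabilistic PCA models sampling noise explicitly, shows up in the classic Tipping-Bishop EM updates, where a scalar noise variance is estimated alongside the loadings. A generic sketch under that model (illustrative only, not the cited implementation):

```python
import numpy as np

def ppca_em(X, k, n_iter=200, seed=0):
    """EM for probabilistic PCA: x ~ N(W z, sigma^2 I), z ~ N(0, I_k).

    Generic Tipping-Bishop updates on a centered (n, d) matrix X; the
    estimated sigma^2 absorbs variance attributed to sampling noise
    rather than to the k-dimensional structure.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, k))
    sigma2 = 1.0
    for _ in range(n_iter):
        M = W.T @ W + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        EZ = X @ W @ Minv                      # rows are E[z_i]
        Szz = n * sigma2 * Minv + EZ.T @ EZ    # sum_i E[z_i z_i^T]
        W_new = X.T @ EZ @ np.linalg.inv(Szz)
        sigma2 = (np.sum(X ** 2)
                  - 2.0 * np.trace(W_new.T @ X.T @ EZ)
                  + np.trace(Szz @ W_new.T @ W_new)) / (n * d)
        W = W_new
    return W, sigma2

# Synthetic data: rank-2 structure plus Gaussian noise of variance 0.01
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 2))
Z = rng.standard_normal((500, 2))
X = Z @ A.T + 0.1 * rng.standard_normal((500, 50))
Xc = X - X.mean(axis=0)
W, noise_var = ppca_em(Xc, k=2)
```

On data like this, `noise_var` recovers roughly the injected noise variance while the columns of `W` span the structured directions, which is the structure-versus-noise separation the excerpts appeal to.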