TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

Bose, Aritra; Kalantzis, Vassilis; Kontopoulou, Eugenia-Maria; Elkady, Mai; Paschou, Peristera; Drineas, Petros

doi:10.1093/bioinformatics/btz157

Cited by 35 publications

(41 citation statements)

References 30 publications


“…There are also many focuses of recent PCA algorithms (Additional file 19). The randomized subspace iteration algorithm, which is a hybrid of Krylov and Rand methodologies, was developed based on randomized SVD [133,134]. In pass-efficient or one-pass randomized SVD, some tricks to reduce the number of passes have been considered [135,136].…”

Section: Future Perspective (mentioning)

confidence: 99%

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Tsuyuzaki, Sato, et al. 2019

Preprint


Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but large-scale scRNA-seq datasets require long computational times and a large memory capacity. In this work, we review 21 fast and memory-efficient PCA implementations (10 algorithms) and evaluate their application using 4 real and 18 synthetic datasets. Our benchmarking showed that some PCA algorithms are faster, more memory efficient, and more accurate than others. In consideration of the differences in the computational environments of users and developers, we have also developed guidelines to assist with selection of appropriate PCA implementations.


“…Recently, the advent of large population-scale genetic datasets, such as the UK biobank data, has prompted research on developing scalable algorithms to compute PCA on very large data (Bycroft et al 2018). It is now possible to efficiently approximate PCA on very large datasets thanks to software such as FastPCA (fast mode of EIGENSOFT), FlashPCA2, PLINK 2.0 (approx mode), bigstatsr/bigsnpr, TeraPCA and ProPCA (Galinsky et al 2016; Abraham et al 2017; Chang et al 2015; Privé et al 2018; Bose et al 2019; Agrawal et al 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Efficient toolkit implementing best practices for principal component analysis of population genetic data

Privé, Luu, Blum, et al. 2019

Preprint


Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias when projecting PCA from a reference dataset to another independent dataset, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. For example, we show that PC19 to PC40 in the UK Biobank capture LD structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank. We provide evidence for a shrinkage bias when projecting PCs computed with data from the 1000 Genomes project. Although PC1 to PC4 suffer from only moderate shrinkage (1.01-1.09), PC5 (resp. PC10) for example suffers from a shrinkage factor of 1.50 (resp. 3.14). We provide a fast way to project new individuals that is not affected by this shrinkage bias. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.


“…There are also many focuses of recent PCA algorithms (Additional file 23). The randomized subspace iteration algorithm, which is a hybrid of Krylov and Rand methodologies, was developed based on randomized SVD [133,134]. In pass-efficient or one-pass randomized SVD, some tricks to reduce the number of passes have been considered [135,136].…”

Section: Future Perspective (mentioning)

confidence: 99%

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Tsuyuzaki, Sato, et al. 2020

Genome Biol


Background: Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but for large-scale scRNA-seq datasets, computation time is long and consumes large amounts of memory. Results: In this work, we review the existing fast and memory-efficient PCA algorithms and implementations and evaluate their practical application to large-scale scRNA-seq datasets. Our benchmark shows that some PCA algorithms based on Krylov subspace and randomized singular value decomposition are fast, memory-efficient, and more accurate than the other algorithms. Conclusion: We develop a guideline to select an appropriate PCA implementation based on the differences in the computational environment of users and developers.
