2022
DOI: 10.1038/s41598-022-14395-4

Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

Abstract: Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify, and ch…

Cited by 78 publications (67 citation statements)
References 103 publications
“…Current incentive structures encourage simplified narratives of progress based on neat metrics that can be tuned to make one's algorithms look 5 Similar issues on downplaying judgment have been found in studies using classical statistical analysis based on p-values and confidence intervals (Ioannidis 2005; Berman et al. 2018). A recent study also illustrates how assumed mechanistic objectivity in Principal Component Analysis, a method used both in ML and statistical analysis, may require thousands of population genetics studies to be reevaluated (Elhaik 2022). The focus of this paper is on ML, but these shared issues illustrate how some parts of the analysis are also relevant to other fields.…”
Section: Discussion (mentioning)
confidence: 92%
“…When moving beyond toy data, there exists a considerable amount of work advocating for the adoption of newer graph-based methods for the analysis of biological data. For example, Elhaik (2022) demonstrates that the validity of the results produced from one commonly used linear dimensionality reduction algorithm, PCA, can be easily called into question due to artifacts within the data and the ease by which it can be manipulated to produce favorable results (6). In addition, there is an ongoing debate around how useful a projection is when the amount of explained variance for the first few axes is very low (arbitrarily defined as less than 60%).…”
Section: Results (mentioning)
confidence: 99%
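
The "explained variance for the first few axes" mentioned in the statement above can be computed directly from the singular values of the centered data matrix. The following is a minimal sketch, assuming NumPy and a randomly generated stand-in for a genotype matrix; the function name `explained_variance_ratio` and the synthetic data are illustrative assumptions and do not come from the cited papers.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the fraction of total variance captured by each principal component.

    Hypothetical helper for illustration; X has shape (samples, features).
    """
    Xc = X - X.mean(axis=0)                   # center each feature (column)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, in descending order
    var = s ** 2                              # proportional to per-component variance
    return var / var.sum()

# Synthetic stand-in for a genotype matrix (assumption): 50 samples x 200 markers,
# each entry an allele count of 0, 1, or 2. The data carry no real structure.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 200)).astype(float)

ratio = explained_variance_ratio(X)
top2 = ratio[:2].sum()
print(f"PC1 + PC2 explain {top2:.1%} of the total variance")
```

With unstructured data like this, the first two components capture only a small share of the variance, which is the situation in which the usefulness of a two-dimensional projection is debated.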
“…A considerable amount of work has demonstrated that many genomic, transcriptomic, and metagenomic datasets are better understood if the underlying geometry of the input space is considered. This better understanding can be reached since these datasets do not often meet the "Assumption of Linearity" used by methods such as Non-Negative Matrix Factorization, Principal Components Analysis and its generalization, Principal Coordinates Analysis (5)(6)(7)(8). Finally, these methods rarely explore correlations and other dependencies between taxa.…”
Section: Introduction (mentioning)
confidence: 99%
“…Population structure was also examined using principal component analysis (PCA) implemented in EIGENSOFT SmartPCA v18140 [81]. A recent study has shown that PCA can yield highly biased results and should not be used as a first hypothesis generator in population genetic analyses [82]. Here, PCA was used to corroborate k cluster estimation that was based primarily on phylogenetic reconstruction methods assuming bifurcation or reticulation, and admixture models to characterize ancestral genetic structure.…”
Section: Methods (mentioning)
confidence: 99%