Multivariate methods are incredibly beneficial for population genetic analyses where the number of measured variables (genetic loci) can easily exceed the number of sampled individuals. Discriminant analysis of principal components (DAPC) has become a popular method for visualising population structure in genotype datasets due to its simplicity, its computational speed, and its freedom from demographic assumptions. Despite the popularity of DAPC in population genetic studies, there has been little discussion on best practice and parameterisation. Unappreciated, perhaps, is the fact that unlike principal component analysis (PCA), which is a hypothesis-free method, discriminant analysis (DA) is a hypothesis-driven method. That is, when performing a DA, a researcher is making an explicit hypothesis about how variation in a set of predictor variables is organised among pre-defined groups in a sample set. Parameter choice is critical to ensure the results produced by a DA are biologically meaningful. In a DAPC, one of the most important parameter choices is the number of PC axes, p_axes, to use as predictors in a DA of among-population differences. Yet there are no clear guidelines on how researchers should choose p_axes. In this work, I propose that the value of p_axes is a deterministic feature of a genotype dataset based on population genetic theory. For k discrete populations, only the first k - 1 PC axes are expected to be biologically informative and capture population structure. DAs fit using more than the first k - 1 PC axes are over-parameterised and may discriminate groups using biologically uninformative predictors.
Using samples drawn from simulated metapopulations, I show that DAPCs parameterised with the appropriate k - 1 PC axes: (1) are more parsimonious; (2) capture the maximal amount of among-population variation using biologically relevant predictors; (3) are less sensitive to unintended interpretations of population structure; and (4) are more generally applicable to independent sample sets.
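To make the k - 1 rule concrete, the following is a minimal NumPy sketch of a DAPC-style workflow: simulate genotypes for k populations, run a PCA, retain only the first k - 1 PC axes, and fit a Fisher discriminant analysis on those scores. It is not the author's implementation. The function names, the toy allele-frequency model, and all parameter values (`spread`, sample sizes, number of loci) are assumptions chosen for illustration only.

```python
import numpy as np


def simulate_genotypes(k, n_per_pop, n_loci, rng, spread=0.15):
    """Toy model (an assumption): diploid biallelic genotypes coded 0/1/2,
    with each population's allele frequencies perturbed around a shared base."""
    base = rng.uniform(0.2, 0.8, size=n_loci)
    freqs = np.clip(base + rng.normal(0.0, spread, size=(k, n_loci)), 0.05, 0.95)
    geno = np.vstack(
        [rng.binomial(2, freqs[p], size=(n_per_pop, n_loci)) for p in range(k)]
    )
    pops = np.repeat(np.arange(k), n_per_pop)
    return geno.astype(float), pops


def pca_scores(geno, p_axes):
    """Project individuals onto the first p_axes principal components (via SVD)."""
    centred = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :p_axes] * s[:p_axes]


def discriminant_scores(scores, pops):
    """Fisher discriminant analysis on PC scores.
    Solves B v = lambda W v by whitening with the Cholesky factor of W."""
    grand_mean = scores.mean(axis=0)
    p = scores.shape[1]
    within = np.zeros((p, p))
    between = np.zeros((p, p))
    for g in np.unique(pops):
        x = scores[pops == g]
        dev = x - x.mean(axis=0)
        within += dev.T @ dev
        diff = (x.mean(axis=0) - grand_mean)[:, None]
        between += len(x) * (diff @ diff.T)
    inv_chol = np.linalg.inv(np.linalg.cholesky(within))
    evals, vecs = np.linalg.eigh(inv_chol @ between @ inv_chol.T)
    order = np.argsort(evals)[::-1]  # largest among/within ratio first
    loadings = inv_chol.T @ vecs[:, order]
    return (scores - grand_mean) @ loadings


rng = np.random.default_rng(42)
k = 3
geno, pops = simulate_genotypes(k=k, n_per_pop=30, n_loci=200, rng=rng)
p_axes = k - 1  # the rule proposed in the abstract
ld = discriminant_scores(pca_scores(geno, p_axes), pops)
print(ld.shape)
```

Passing a larger `p_axes` to `pca_scores` would give the over-parameterised fit that the abstract warns against, so the two fits can be compared on the same simulated sample.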
The biological world is beautifully complex, characterized by variation in multiple dimensions. Multivariate statistics play a pivotal role in helping us make sense of this multidimensionality and developing a deeper appreciation of biology. Describing population genetic patterns, for example, becomes increasingly difficult with many sampled individuals, genetic markers and populations. However, ordination methods can summarize variation across multiple loci to create new synthetic axes and reduce dimensionality. Such new axes of variation