2018
DOI: 10.1111/2041-210x.12968

A fast likelihood solution to the genetic clustering problem

Abstract: The investigation of genetic clusters in natural populations is an ubiquitous problem in a range of fields relying on the analysis of genetic data, such as molecular ecology, conservation biology and microbiology. Typically, genetic clusters are defined as distinct panmictic populations, or parental groups in the context of hybridisation. Two types of methods have been developed for identifying such clusters: model‐based methods, which are usually computer‐intensive but yield results which can be interpreted i…

Cited by 111 publications (122 citation statements)
References 51 publications
“…For the analysis of the baseline simulations, we systematically varied the cutoffs and reporting rate, as described in S2 Table. Detailed information on the simulation process is available in S1 Text. Following Beugin et al [35], we quantify the ability of our method to correctly identify clusters of cases linked by transmission, through measuring 1) the true positive rate (TPR) or sensitivity, i.e. the proportion of pairs of cases belonging to the same transmission tree who are inferred to be in the same outbreak cluster, 2) the true negative rate (TNR) or specificity, i.e.…”
Section: Simulations
Type: mentioning
confidence: 99%
“…the proportion of pairs of cases belonging to the same transmission tree who are inferred to be in the same outbreak cluster, 2) the true negative rate (TNR) or specificity, i.e. the proportion of pairs of cases not belonging to the same transmission tree who are assigned to different outbreak clusters and 3) the mean between TPR and TNR, which is proportional to the Rand index, a common criterion used to evaluate clustering methods [35,70]. We also compare the estimates of the reproduction number and importation rate to the values used in the simulation.…”
Section: Simulations
Type: mentioning
confidence: 99%
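
The pairwise measures described in the statement above can be illustrated with a short sketch. It is not code from either paper: the function name `pairwise_rates` and the label lists are hypothetical, and it simply counts, over all pairs of cases, whether the true grouping and the inferred grouping agree.

```python
from itertools import combinations


def pairwise_rates(true_labels, inferred_labels):
    """Return (TPR, TNR, mean of the two) computed over all pairs of cases.

    Hypothetical helper for illustration only. A pair is 'positive' when
    both cases share the same true group (e.g. the same transmission tree).
    """
    if len(true_labels) != len(inferred_labels):
        raise ValueError("label lists must have the same length")

    tp = fn = tn = fp = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1  # truly together, inferred together
        elif same_true:
            fn += 1  # truly together, inferred apart
        elif same_inferred:
            fp += 1  # truly apart, inferred together
        else:
            tn += 1  # truly apart, inferred apart

    # Rates are undefined (NaN) when there are no pairs of the relevant kind.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr, (tpr + tnr) / 2


# Hypothetical example: five cases, two true groups, one case misassigned.
tpr, tnr, mean_rate = pairwise_rates([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

Working on pairs rather than on cluster labels avoids having to match inferred cluster names to true ones, which is why pair-counting measures are convenient for comparing clusterings.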
“…Population structure was assessed using the function snapclust in the R package adegenet (Beugin, Gayet, Pontier, Devillard, & Jombart, ), the discriminant analysis of principal components (DAPC) as implemented in the adegenet R package (Jombart, Devillard, & Balloux, ), and structure version 2.3 (Pritchard, Stephens, & Donnelly, ). structure and snapclust may produce similar individual membership probability plots, but they have totally different approaches to the genetic clustering problem: while structure uses a Bayesian approach with Markov chain Monte Carlo (MCMC) method to estimate allele frequencies in each cluster and population memberships for every individual, snapclust is a fast likelihood optimization method combining both model‐based and geometric clustering approaches, which uses the Expectation‐Maximization (EM) algorithm to assign genotypes to populations and detect admixture patterns.…”
Section: Methods
Type: mentioning
confidence: 99%
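
To make the EM idea in the statement above concrete, here is a minimal sketch of a generic EM loop for a mixture of populations. It is not the snapclust implementation (which is R code in adegenet); it assumes diploid, biallelic loci coded as 0, 1 or 2 copies of one allele, Hardy-Weinberg proportions within each cluster, and a starting partition supplied by the caller (for instance from a geometric clustering step). The name `em_cluster` and all parameters are hypothetical.

```python
import math


def em_cluster(genotypes, init_groups, k, n_iter=50, eps=1e-6):
    """Generic EM sketch for genetic clustering (illustrative assumptions only).

    genotypes   : list of individuals, each a list of allele counts (0, 1, 2)
    init_groups : initial hard assignment of each individual to 0..k-1
    Returns (membership probabilities, per-cluster allele frequencies).
    """
    n, n_loci = len(genotypes), len(genotypes[0])
    # Start from hard memberships given by the initial partition.
    resp = [[1.0 if init_groups[i] == c else 0.0 for c in range(k)]
            for i in range(n)]
    freqs = []
    for _ in range(n_iter):
        # M-step: weighted allele frequencies and cluster weights.
        weights = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        freqs = []
        for c in range(k):
            row = []
            for locus in range(n_loci):
                copies = sum(resp[i][c] * genotypes[i][locus] for i in range(n))
                p = copies / (2 * weights[c]) if weights[c] > 0 else 0.5
                row.append(min(max(p, eps), 1 - eps))  # keep logs finite
            freqs.append(row)
        # E-step: membership probabilities from Hardy-Weinberg likelihoods.
        for i in range(n):
            log_post = []
            for c in range(k):
                ll = math.log(max(weights[c] / n, eps))
                for locus, g in enumerate(genotypes[i]):
                    p = freqs[c][locus]
                    ll += (math.log(math.comb(2, g))
                           + g * math.log(p) + (2 - g) * math.log(1 - p))
                log_post.append(ll)
            top = max(log_post)  # subtract the max for numerical stability
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return resp, freqs


# Hypothetical toy data: six individuals, four loci, two clear groups.
toy = [[2, 2, 0, 0], [2, 2, 0, 0], [2, 1, 0, 0],
       [0, 0, 2, 2], [0, 0, 2, 2], [0, 1, 2, 2]]
memberships, allele_freqs = em_cluster(toy, [0, 0, 0, 1, 1, 1], k=2)
```

Each row of `memberships` sums to one and plays the role of the individual membership probabilities mentioned in the statement. A fixed number of iterations is used here for brevity; a real implementation would normally stop when the log-likelihood stops improving.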
“…Corander et al (2003) and Corander, Waldmann, Marttinen, and Sillanpää (2004) implemented a split-and-merge algorithm in their program BAPS to estimate K. Patterson, Price, and Reich (2006) proposed an eigenanalysis method, implemented in SmartPCa software, to estimate K as 1 plus the number of significant eigenvalues explaining the variation of genotype data. Jombart et al (2010) and Beugin, Gayet, Pontier, Devillard, and Jombart (2018) used Akaike information criterion (AIC: Akaike, 1998), Bayesian Information Criterion (BIC: Schwarz, 1978), Kullback Information Criterion (KIC: Cavanaugh, 1999) and their variants to assess the best supported model, and therefore the most likely number of populations. These and other methods were demonstrated to yield good estimates of K in some simple scenarios (e.g., Gao, Bryc, & Bustamante, 2011), but can be highly inaccurate in difficult situations such as many source populations (e.g., K > 10), unbalanced sample sizes (Wang, 2017), hierarchical population structures (Evanno et al, 2005), weak differentiation or low marker information (Gao et al, 2011), and high admixture.…”
Type: mentioning
confidence: 99%
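
The model-selection step mentioned in the statement above can be sketched as follows. The example assumes that each candidate K has already been fitted and that its maximised log-likelihood and number of free parameters are known; the numbers below are hypothetical, not results from any cited study. AIC and BIC use their standard definitions; the KIC penalty of 3 per parameter is the form commonly attributed to Cavanaugh (1999) and should be read as an assumption here.

```python
import math


def information_criterion(log_lik, n_params, n_obs, kind="aic"):
    """Return a penalised-likelihood criterion (lower is better)."""
    if kind == "aic":
        penalty = 2 * n_params
    elif kind == "bic":
        penalty = n_params * math.log(n_obs)
    elif kind == "kic":
        penalty = 3 * n_params  # assumed form, see the note above
    else:
        raise ValueError(f"unknown criterion: {kind}")
    return -2 * log_lik + penalty


def best_k(fits, n_obs, kind="aic"):
    """Pick the K whose fitted model minimises the chosen criterion.

    fits maps K -> (maximised log-likelihood, number of free parameters).
    """
    return min(
        fits,
        key=lambda k: information_criterion(fits[k][0], fits[k][1], n_obs, kind),
    )


# Hypothetical fits for K = 1, 2, 3 on 100 individuals.
fits = {1: (-1200.0, 10), 2: (-1100.0, 21), 3: (-1095.0, 32)}
chosen = {kind: best_k(fits, n_obs=100, kind=kind) for kind in ("aic", "bic", "kic")}
```

In this toy example the gain in log-likelihood from K = 2 to K = 3 is too small to offset the extra parameters, so all three criteria select K = 2. As the statement notes, such criteria can still be inaccurate in harder settings.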