Background: Determining population structure helps us understand connections among different populations and how they evolve over time. This knowledge is important for studies ranging from evolutionary biology to large-scale variant-trait association studies. Current approaches to determining population structure include model-based, statistical, and distance-based ancestry inference methods. In this work, we outline an approach that identifies population structure from k-mer frequencies using principal component analysis (PCA). This approach can be classified as statistical; however, prior work employing PCA has used multilocus genotype data (SNPs, microsatellites, or haplotypes), while here we analyze k-mer frequencies. K-mer frequencies can be viewed as a summary statistic of a genome and have the advantage of being easily derived from a genome by counting the number of times each k-mer occurs in a sequence. No genetic assumptions must be met to generate k-mers, whereas current population structure approaches often depend on several genetic assumptions and require careful selection of ancestry-informative markers to identify populations.
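As an illustration of how a k-mer frequency profile can be derived by counting, the following is a minimal sketch and not the pipeline used in this study. The function name `kmer_frequencies`, the restriction to the alphabet A, C, G, T, and the toy sequence are assumptions made for the example; a real genome would normally be processed with a dedicated k-mer counter.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every k-mer over ACGT in `sequence`.

    Hypothetical helper for illustration only. Windows that contain
    characters other than A, C, G, T (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Slide a window of width k across the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every profile has the same keys.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# Toy sequence (not real data): 9 overlapping 2-mers, 3 of which are "AC".
profile = kmer_frequencies("ACGTACGTAC", 2)
```

Because every profile is reported over the same fixed set of k-mers, profiles from different genomes can be stacked directly as rows of a samples-by-k-mers matrix.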
Results: In this work, we show that PCA is able to determine population structure just from the frequency of k-mers found in the genome. The application of PCA and a clustering algorithm to k-mer profiles of genomes provides an easy approach to detecting the number and composition of populations (clusters) present in the dataset. We describe this approach and show that the results are comparable to those found by a model-based approach using genetic markers. We validate our method using 48 human genomes from populations identified by the 1000 Human Genomes Project. We also compare our results to those from mash, which determines relationships among individuals using the number of matched k-mers between sequences.
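The PCA step applied to k-mer profiles can be sketched as follows. This is a minimal illustration using NumPy's singular value decomposition, not the implementation used in this study; the function name `pca_project` and the four-sample toy matrix are assumptions, and the clustering algorithm, which this abstract does not name, would then be run on the returned scores.

```python
import numpy as np


def pca_project(profiles, n_components=2):
    """Project k-mer profiles onto their leading principal components.

    `profiles` is a samples-by-k-mers matrix of frequencies. Returns the
    sample scores and the fraction of variance explained per component.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(profiles, dtype=float)
    # Center each k-mer column so the components describe variation
    # around the mean profile.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix gives the principal axes (rows of Vt).
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


# Toy profiles (not real data): rows 0-1 and rows 2-3 mimic two populations.
toy_profiles = [
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.32, 0.19, 0.11],
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.18, 0.31, 0.39],
]
scores, explained = pca_project(toy_profiles, n_components=2)
```

In this toy case the two groups fall on opposite sides of the first principal component, which is the kind of separation a downstream clustering step would pick up.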
Conclusions: This study shows that PCA, together with a clustering algorithm, is able to detect population structure from k-mer frequencies and can identify samples of admixed and non-admixed origin. In contrast, mash (based on the number of k-mer matches) was highly sensitive to the parameters of k-mer length and sketch size. Using k-mer frequencies to determine population structure has the potential to avoid some challenges of existing methods.