Background:Determining population structure helps us understand connections among different populations and how they evolve over time. This knowledge is important for studies ranging from evolutionary biology to large-scale variant-trait association studies. Current approaches to determining population structure include model-based approaches, statistical approaches, and distance-based ancestry inference approaches. In this work, we outline an approach that identifies population structure from k-mer frequencies using principal component analysis (PCA). This approach can be classified as statistical; however, prior work employing PCA has used multilocus genotype data (SNPs, microsatellites, or haplotypes), while here we analyze k-mer frequencies. K-mer frequencies can be viewed as a summary statistic of a genome and have the advantage of being easily derived from a genome by counting the number of times a k-mer occurred in a sequence. No genetic assumptions must be met to generate k-mers, whereas current population structure approaches often depend on several genetic assumptions and require careful selection of ancestry informative markers to identify populations. Results: In this work, we show that PCA is able to determine population structure just from the frequency of k-mers found in the genome. The application of PCA and a clustering algorithm to k-mer profiles of genomes provides an easy approach to detecting the number and composition of populations (clusters) present in the dataset. We describe this approach and show that the results are comparable to those found by a model-based approach using genetic markers. We validate our method using 48 human genomes from populations identified by the 1000 Human Genomes Project. We also compare our results to those from mash, which determines relationships among individuals using the number of matched k-mers between sequences. Conclusions:This study shows that PCA, together with a clustering algorithm, is able to detect population structure from k-mer frequencies and can identify samples of admixed and non-admixed origin. In contrast, mash (based on the number of k-mer matches) was highly sensitive to the parameters of k-mer length and sketch size. Using k-mer frequencies to determine population structure has the potential to avoid some challenges of existing methods.
Background: Phylogenies enrich our understanding of how genes, genomes, and species evolve. Traditionally, alignment-based methods are used to construct phylogenies from genetic sequence data; however, this process can be time-consuming when analyzing the large amounts of genomic data available today. Additionally, these analyses face challenges due to differences in genome structure, synteny, and the need to identify similarities in the face of repeated substitutions resulting in loss of phylogenetic information contained in the sequence. Alignment Free (AF) approaches using k-mers (short subsequences) can be an efficient alternative due to their indifference to positional rearrangements in a sequence. However, these approaches may be sensitive to k-mer length and the distance between samples.Results: In this paper, we analyzed the sensitivity of an AF approach based on k-mer frequencies to these challenges using cosine and Euclidean distance metrics for both assembled genomes and unassembled sequencing reads. Quantification of the sensitivity of this AF approach for phylogeny reconstruction to branch length and k-mer length provides a better understanding of the necessary parameter ranges for accurate phylogeny reconstruction. Our results show that a frequency-based AF approach can result in accurate phylogeny reconstruction when using whole genomes, but not stochastically sequenced reads, so long as longer k-mers are used. Conclusions: In this study, we have shown an AF approach for phylogeny reconstruction is robust in analyzing assembled genome data for a range of numbers of substitutions using longer k-mers. Using simulated reads randomly selected from the genome by the Illumina sequencer had a detrimental effect on phylogeny estimation. Additionally, filtering out infrequent k-mers improved the computational efficiency of the method while preserving the accuracy of the results thus suggesting the feasibility of using only a subset of data to improve computational efficiency in cases where large sets of genome-scale data are analyzed.
Background: Phylogenies enrich our understanding of how genes, genomes, and species evolve. Traditionally, alignment-based methods are used to construct phylogenies from genetic sequence data; however, this process can be time-consuming when analyzing the large amounts of genomic data available today. Additionally, these analyses face challenges due to differences in genome structure, synteny, and the need to identify similarities in the face of repeated substitutions resulting in loss of phylogenetic information contained in the sequence. Alignment Free (AF) approaches using k-mers (short subsequences) can be an efficient alternative due to their indifference to positional rearrangements in a sequence. However, these approaches may be sensitive to k-mer length and the distance between samples.Results: In this paper, we analyzed the sensitivity of an AF approach based on k-mer frequencies to these challenges using cosine and Euclidean distance metrics for both assembled genomes and unassembled sequencing reads. Quantification of the sensitivity of this AF approach for phylogeny reconstruction to branch length and k-mer length provides a better understanding of the necessary parameter ranges for accurate phylogeny reconstruction. Our results show that a frequency-based AF approach can result in accurate phylogeny reconstruction when using whole genomes, but not stochastically sequenced reads, so long as longer k-mers are used. Conclusions: In this study, we have shown an AF approach for phylogeny reconstruction is robust in analyzing assembled genome data for a range of numbers of substitutions using longer k-mers. Using simulated reads randomly selected from the genome by the Illumina sequencer had a detrimental effect on phylogeny estimation. Additionally, filtering out infrequent k-mers improved the computational efficiency of the method while preserving the accuracy of the results thus suggesting the feasibility of using only a subset of data to improve computational efficiency in cases where large sets of genome-scale data are analyzed.
The position of some taxa on the Tree of Life remains controversial despite the increase in genomic data used to infer phylogenies. While analyzing large datasets alleviates stochastic errors, it does not prevent systematic errors in inference, caused by both biological (e.g., incomplete lineage sorting, hybridization) and methodological (e.g., incorrect modeling, erroneous orthology assessments) factors. In our study, we systematically investigated factors that could result in these controversies, using the treeshrew (Scandentia, Mammalia) as a study case. Recent studies have narrowed the phylogenetic position of treeshrews to three competing hypotheses: sister to primates and flying lemurs (Primatomorpha), sister to rodents and lagomorphs (Glires), or sister to a clade comprising all of these. We sampled 50 mammal species including three treeshrews, a selection of taxa from the potential sister groups, and outgroups. Using a large diverse set of loci, we assessed support for the alternative phylogenetic position of treeshrews. A plurality of loci support treeshrews as sister to rodents and lagomorphs; however, only a few loci exhibit strong support for any hypothesis. Surprisingly, we found that a subset of loci that strongly support the monophyly of Primates, support treeshrews as sister to primates and flying lemurs. The overall small magnitude of differences in phylogenetic signal among the alternative hypotheses suggests that these three groups diversified nearly simultaneously. However, with our large dataset and approach to examining support, we provide evidence for the hypothesis of treeshrews as sister to rodents and lagomorphs, while demonstrating why support for alternate hypotheses has been seen in prior work. We also suggest that locus selection can unwittingly bias results.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.