Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are π and dXY, which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in π and dXY calculation: systematic bias generated by missing data of various types. Many popular methods for calculating π and dXY operate on data encoded in the variant call format (VCF), which condenses genetic data by omitting invariant sites. When calculating π and dXY using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of π and dXY that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user‐friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of π and dXY in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of π and dXY regardless of the form or amount of missing data. In summary, our software solves a long‐standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
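The proportional bias described above can be illustrated with a toy calculation. This is a minimal sketch, not pixy's implementation: the function names, the `MISSING` marker, and the toy data are hypothetical. It assumes π is aggregated as total pairwise differences divided by total pairwise comparisons, and that the bias-free alternative is to drop missing genotypes from both counts.

```python
from math import comb

MISSING = None  # hypothetical marker for an uncalled haploid genotype


def site_counts(alleles, missing_as_ref):
    """Return (differences, comparisons) for one biallelic site.

    alleles holds 0 (reference), 1 (alternate) or MISSING,
    one entry per haploid genotype.
    """
    if missing_as_ref:
        # The implicit assumption: missing genotypes are reference calls.
        called = [0 if a is MISSING else a for a in alleles]
    else:
        # Missing genotypes contribute to neither count.
        called = [a for a in alleles if a is not MISSING]
    n = len(called)
    c1 = sum(called)
    c0 = n - c1
    return c0 * c1, comb(n, 2)


def pi(sites, missing_as_ref):
    """Aggregate pi over sites as total differences / total comparisons."""
    diffs = comps = 0
    for alleles in sites:
        d, c = site_counts(alleles, missing_as_ref)
        diffs += d
        comps += c
    return diffs / comps if comps else float("nan")


variable = [0, 0, 1, 1]
invariant = [0, 0, 0, 0]
missing = [MISSING] * 4

# Ten fully genotyped sites: two variable, eight invariant.
full_sites = [variable] * 2 + [invariant] * 8
# The same region with half of the sites missing (one variable, four invariant).
masked_sites = [variable] + [missing] + [invariant] * 4 + [missing] * 4

pi_true = pi(full_sites, missing_as_ref=False)
pi_naive = pi(masked_sites, missing_as_ref=True)
pi_aware = pi(masked_sites, missing_as_ref=False)
print(pi_true, pi_naive, pi_aware)
```

In this toy region half of the sites are missing, so the estimate that treats missing genotypes as reference calls comes out at half the true value, while the estimate that excludes them matches it.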
This is the author manuscript accepted for publication and has undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record.
Humans have undergone large migrations over the past hundreds to thousands of years, exposing ourselves to new environments and selective pressures. Yet, evidence of ongoing or recent selection in humans is difficult to detect. Many of these migrations also resulted in gene flow between previously separated populations. These recently admixed populations provide unique opportunities to study rapid evolution in humans. Developing methods based on distributions of local ancestry, we demonstrate that this sort of genetic exchange has facilitated detectable adaptation to a malaria parasite in the admixed population of Cabo Verde within the last ~20 generations. We estimate that the selection coefficient is approximately 0.08, one of the highest inferred in humans. Notably, we show that this strong selection at a single locus has likely affected patterns of ancestry genome-wide, potentially biasing demographic inference. Our study provides evidence of adaptation in a human population on historical timescales.
Many summary statistics are based on the comparison of DNA sequences.
Two important summary statistics in this class are π, the average number of nucleotide differences between genotypes drawn from the same population (Nei and Li 1979); and dXY, the average number of nucleotide differences between genotypes drawn from two different populations (Nei and Li 1979). These two summary statistics underlie a large variety of descriptive and inferential procedures in population genetics. For example, π is often used as an estimator of the central population genetic parameter θ (and is thus sometimes styled as θπ). Similarly, dXY is a key statistic for exploring patterns of divergence between populations, particularly in the context of divergence with gene flow (Noor and Bennett 2009; Cruickshank and Hahn 2014; Burri 2017).

Calculation of π and dXY

For a single biallelic locus, π is usually calculated using one of three expressions shown in Equation 1, all of which are exactly equivalent:

(Eq. 1)

(Nei and Li 1979; Gillespie 2004; Hahn 2019)

where k_ij corresponds to the count of allelic differences between the ith and jth haploid genotypes, n is the number of samples, and c_0 and c_1 are the respective counts of the two alleles at the locus. Note that the last expression is simply the sample-size correct...
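The definitions above can be checked numerically. The sketch below assumes three standard per-site forms that are consistent with the symbols k_ij, n, c_0 and c_1: summed pairwise differences over the number of pairs, the allele-count product over the number of pairs, and 2pq scaled by n/(n-1). The exact typeset form of Equation 1 is an assumption here, not a quotation, and all function names are hypothetical. The `dxy` function follows the verbal definition of dXY given above for a single site.

```python
from itertools import combinations
from math import comb


def pi_pairwise(alleles):
    """Sum of k_ij over all pairs i < j, divided by the number of pairs."""
    n = len(alleles)
    diffs = sum(a != b for a, b in combinations(alleles, 2))
    return diffs / comb(n, 2)


def pi_allele_counts(alleles):
    """Allele-count product c_0 * c_1, divided by the number of pairs."""
    n = len(alleles)
    c1 = sum(alleles)
    c0 = n - c1
    return c0 * c1 / comb(n, 2)


def pi_corrected_heterozygosity(alleles):
    """2pq multiplied by n / (n - 1), with p and q the allele frequencies."""
    n = len(alleles)
    p = sum(alleles) / n
    q = 1 - p
    return n / (n - 1) * 2 * p * q


def dxy(pop_x, pop_y):
    """Average differences between one genotype from each population."""
    diffs = sum(a != b for a in pop_x for b in pop_y)
    return diffs / (len(pop_x) * len(pop_y))


# One biallelic site, five haploid genotypes coded 0 (ref) / 1 (alt).
site = [0, 0, 0, 1, 1]
forms = (
    pi_pairwise(site),
    pi_allele_counts(site),
    pi_corrected_heterozygosity(site),
)
between = dxy([0, 0, 1], [1, 1])
print(forms, between)
```

All three functions return the same value for any fully genotyped biallelic site with at least two samples; the pairwise form costs O(n^2) per site, whereas the two count-based forms need only the allele counts.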
Crossing over is well known to have profound effects on patterns of genetic diversity and genome evolution. Far less direct attention has been paid to another distinct outcome of meiotic recombination: noncrossover gene conversion (NCGC). Crossing over and NCGC both shuffle combinations of alleles, and this degradation of linkage disequilibrium (LD) has major evolutionary consequences, ranging from immediate effects on nucleotide diversity to long-term consequences that shape genome evolution, species formation and species persistence. Unlike simple crossing over, NCGC has the potential to alter allele frequencies. Gene conversion can also occur in genomic regions where crossing over does not, and it purportedly exhibits more uniform rates across genomes. Considerable progress has been made towards understanding the mechanisms of gene conversion, and this progress enables us to begin exploring how gene conversion affects processes such as molecular evolution and interspecies gene flow. These topics are timely with the recent shift in focus from a primarily neutral null model of molecular evolution and speciation to one incorporating base levels of selection, making it all the more crucial to understand the basis and evolutionary implications of linkage. Here, we discuss the impact of gene conversion on genome structure and evolution and the current methods for detecting these events. We provide a comprehensive review of how gene conversion breaks down LD and affects both short- and long-term evolutionary processes, and we contrast its impact to that expected from crossing over alone.
Eukaryotic genomes show tremendous size variation across taxa. Proximate explanations for genome size variation include differences in ploidy and amounts of noncoding DNA, especially repetitive DNA. Ultimate explanations include selection on physiological correlates of genome size such as cell size, which in turn influence body size, resulting in the often-observed correlation between body size and genome size. In this study, we examined body size and repetitive DNA elements in relationship to the evolution of genome size in North American representatives of a single beetle family, the Lampyridae (fireflies). The 23 species considered represent an excellent study system because of the greater than 5-fold range of genome sizes, documented here using flow cytometry, and the 3-fold range in body size, measured using pronotum width. We also identified common genomic repetitive elements using low-coverage sequencing. We found a positive relationship between genome size and repetitive DNA, particularly retrotransposons. Both genome size and these elements were evolving as expected given phylogenetic relatedness. We also tested whether genome size varied with body size and found no relationship. Together, our results suggest that genome size is evolving neutrally in fireflies.
Throughout human history, large-scale migrations have facilitated the formation of populations with ancestry from multiple previously separated populations. This process leads to subsequent shuffling of genetic ancestry through recombination, producing variation in ancestry between populations, among individuals in a population, and along the genome within an individual. Recent methodological and empirical developments have elucidated the genomic signatures of this admixture process, bringing previously understudied admixed populations to the forefront of population and medical genetics. Under this theme, we present a collection of recent PLOS Genetics publications that exemplify recent progress in human genetic admixture studies, and we discuss potential areas for future work.
Over the past 50 years, geneticists have made great strides in understanding how our species' evolutionary history gave rise to current patterns of human genetic diversity classically summarized by Lewontin in his 1972 paper, ‘The Apportionment of Human Diversity’. One evolutionary process that requires special attention in both population genetics and statistical genetics is admixture: gene flow between two or more previously separated source populations to form a new admixed population. The admixture process introduces ancestry-based structure into patterns of genetic variation within and between populations, which in turn influences the inference of demographic histories, identification of genetic targets of selection and prediction of complex traits. In this review, we outline some challenges for admixture population genetics, including limitations of applying methods designed for populations without recent admixture to the study of admixed populations. We highlight recent studies and methodological advances that aim to overcome such challenges, leveraging genomic signatures of admixture that occurred in the past tens of generations to gain insights into human history, natural selection and complex trait architecture. This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.