Founder events play a critical role in shaping genetic diversity, fitness and disease risk in a population. Yet our understanding of the prevalence and distribution of founder events in humans and other species remains incomplete, as most existing methods require large sample sizes or phased genomes. We therefore developed ASCEND, which measures the correlation in allele sharing between pairs of individuals across the genome to infer the age and strength of founder events. We show that ASCEND can reliably estimate the parameters of founder events under a range of demographic scenarios. We then apply ASCEND to two species with contrasting evolutionary histories: ~460 worldwide human populations and ~40 modern dog breeds. In humans, we find that over half of the analyzed populations have evidence for recent founder events, associated with geographic isolation, modes of sustenance, or cultural practices such as endogamy. Notably, island populations have lower population sizes than continental groups, and most hunter-gatherer, nomadic and indigenous groups have evidence of recent founder events. Many present-day groups, including Native Americans, Oceanians and South Asians, have experienced more extreme founder events than Ashkenazi Jews, who have high rates of recessive diseases due to their known history of founder events. Using ancient genomes, we show that the strength of founder events differs markedly across geographic regions and time, with three major founder events related to the peopling of the Americas and a trend of decreasing strength of founder events in Europe following the Neolithic transition and steppe migrations. In dogs, we estimate extreme founder events in most breeds that occurred in the last 25 generations, concordant with the establishment of many dog breeds during Victorian times. Our analysis highlights a widespread history of founder events in humans and dogs and elucidates some of the demographic and cultural practices related to these events.
Founder events play a critical role in shaping genetic diversity, impacting the fitness of a species and disease risk in humans. Yet our understanding of the prevalence and distribution of founder events in humans and other species remains incomplete, as most existing methods for characterizing founder events require large sample sizes or phased genomes. To learn about the frequency and evolutionary history of founder events, we introduce ASCEND (Allele Sharing Correlation for the Estimation of Non-equilibrium Demography), a flexible two-locus method to infer the age and strength of founder events. This method uses the correlation in allele sharing across the genome between pairs of individuals to recover signatures of past bottlenecks. By performing coalescent simulations, we show that ASCEND can reliably estimate the parameters of founder events under a range of demographic scenarios, with genotype or sequence data. We apply ASCEND to ~5,000 worldwide human samples (~3,500 present-day and ~1,500 ancient individuals), and ~1,000 domesticated dog samples. In both species, we find pervasive evidence of founder events in the recent past. In humans, over half of the populations surveyed in our study had evidence for a founder event in the past 10,000 years, associated with geographic isolation, modes of sustenance, and historical invasions and epidemics. We document that island populations have historically maintained lower population sizes than continental groups, ancient hunter-gatherers had stronger founder events than Neolithic Farmers or Steppe Pastoralists, and periods of epidemics such as smallpox were accompanied by major population crashes. Many present-day groups, including Central & South Americans, Oceanians and South Asians, have experienced founder events stronger than those estimated in Ashkenazi Jews, who have high rates of recessive diseases due to their history of founder events.
In dogs, we uncovered extreme founder events in most groups, more than ten times stronger than the median strength of founder events in humans. These founder events occurred during the last 25 generations and are likely related to the establishment of dog breeds during Victorian times. Our results highlight a widespread history of founder events in humans and dogs, and provide insights about the demographic and cultural processes underlying these events.
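The statistic at the heart of the method, the correlation in allele sharing between a pair of individuals as a function of distance along the genome, can be illustrated with a toy computation. The following is a minimal sketch, not the ASCEND implementation: the function names, the 0/1/2 genotype coding, the per-SNP sharing score and the binning scheme are all assumptions made for illustration, and the model fitting that would turn the resulting curve into an age and strength estimate is not shown.

```python
from collections import defaultdict
from statistics import mean


def allele_sharing(g1, g2):
    """Per-SNP sharing score for two diploid genotypes coded 0/1/2.

    Assumed scoring (illustrative only): 1.0 for identical genotypes,
    0.5 for a one-allele difference, 0.0 for opposite homozygotes.
    """
    return [1.0 - abs(a - b) / 2.0 for a, b in zip(g1, g2)]


def sharing_correlation_by_distance(positions, sharing, bin_width, max_dist):
    """Correlation of sharing scores between SNP pairs, binned by distance.

    `positions` must be sorted (e.g. genetic map positions). Returns a dict
    mapping bin index -> mean normalized covariance of the pairs in that bin.
    """
    mu = mean(sharing)
    var = mean((s - mu) ** 2 for s in sharing)
    if var == 0:
        raise ValueError("sharing scores are constant; correlation undefined")
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[j] - positions[i]
            if d > max_dist:
                break  # positions are sorted, so later SNPs are farther away
            b = int(d // bin_width)
            sums[b] += (sharing[i] - mu) * (sharing[j] - mu)
            counts[b] += 1
    return {b: sums[b] / counts[b] / var for b in sorted(counts)}


# Toy data: four SNPs, one pair of individuals (hypothetical values).
positions = [0.0, 1.0, 2.0, 3.0]
sharing = allele_sharing([0, 0, 2, 2], [0, 0, 0, 0])
curve = sharing_correlation_by_distance(positions, sharing, 1.0, 3.0)
```

In this toy case, neighbouring SNPs tend to have the same sharing score while distant ones differ, so the correlation is positive in the nearest bin and negative in the farther ones. With real data the curve would be averaged over many pairs of individuals before any inference is attempted.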
Motivation: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. UPP (Ultra-large multiple sequence alignment using Phylogeny-aware Profiles) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. Results: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity, and is among the most accurate otherwise. Availability: https://github.com/gillichu/sepp Supplementary information: Supplementary information is available at Bioinformatics online.
Motivation: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. UPP (Ultra-large multiple sequence alignment using Phylogeny-aware Profiles) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. Results: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity, and is among the most accurate otherwise. Availability: https://github.com/gillichu/sepp Contact: warnow@illinois.edu
Phylogenetic placement is the problem of placing “query” sequences into an existing tree (called a “backbone tree”). One of the most accurate phylogenetic placement methods to date is the maximum likelihood-based method pplacer, using RAxML to estimate numeric parameters on the backbone tree and then adding the given query sequence to the edge that maximizes the probability that the resulting tree generates the query sequence. Unfortunately, this way of running pplacer fails to return valid outputs on many moderately large backbone trees, and so is limited to backbone trees with at most ∼10,000 leaves. SCAMPP is a technique to enable pplacer to run on larger backbone trees, which operates by finding a small “placement subtree” specific to each query sequence, within which the query sequence is placed using pplacer. That approach matched the scalability and accuracy of APPLES-2, the previously most scalable method. Here, we explore a different aspect of pplacer's strategy: the technique used to estimate numeric parameters on the backbone tree. We confirm anecdotal evidence that using FastTree instead of RAxML to estimate numeric parameters on the backbone tree enables pplacer to scale to much larger backbone trees, almost (but not quite) matching the scalability of APPLES-2 and pplacer-SCAMPP. We then evaluate the combination of these two techniques: SCAMPP and the use of FastTree. We show that this combined approach, pplacer-SCAMPP-FastTree, has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree, and achieves better accuracy than the comparably scalable methods. Availability: https://github.com/gillichu/PLUSplacer-taxtastic.
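The idea of restricting placement to a small, query-specific subtree can be sketched in a few lines. The abstract does not say how the placement subtree is chosen, so the rule used here is an assumption for illustration only: pick the backbone leaf whose sequence has the smallest Hamming distance to the query, then collect the leaves closest to it by path length until a size limit is reached. All names (`placement_subtree_leaves`, `adj`, `leaf_seqs`, `max_leaves`) are hypothetical and do not come from SCAMPP or pplacer.

```python
import heapq


def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def placement_subtree_leaves(adj, leaf_seqs, query, max_leaves):
    """Select up to `max_leaves` backbone leaves for one query sequence.

    adj:       node -> list of (neighbor, branch_length) pairs (undirected tree)
    leaf_seqs: leaf name -> aligned sequence
    Assumed rule (not taken from the source): start at the leaf most similar
    to the query, then expand outward in order of path length.
    """
    start = min(leaf_seqs, key=lambda leaf: hamming(leaf_seqs[leaf], query))
    best = {start: 0.0}
    heap = [(0.0, start)]
    visited = set()
    chosen = []
    while heap and len(chosen) < max_leaves:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node in leaf_seqs:
            chosen.append(node)
        for nbr, length in adj[node]:
            nd = d + length
            if nbr not in visited and nd < best.get(nbr, float("inf")):
                best[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return chosen


# Toy backbone tree ((A,B)u,(C,D)v) with made-up branch lengths.
adj = {
    "A": [("u", 1.0)],
    "B": [("u", 1.0)],
    "C": [("v", 1.0)],
    "D": [("v", 5.0)],
    "u": [("A", 1.0), ("B", 1.0), ("v", 1.5)],
    "v": [("C", 1.0), ("D", 5.0), ("u", 1.5)],
}
leaf_seqs = {"A": "AAAA", "B": "AAAT", "C": "TTTT", "D": "TTTA"}
subtree = placement_subtree_leaves(adj, leaf_seqs, "AAAA", 3)
```

A placement tool would then be run only on the backbone tree induced by the selected leaves, which keeps the likelihood computation small regardless of the size of the full backbone tree.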