Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology as it encodes information about the events and forces that have influenced a species. However, current methods are limited, with the most accurate able to process no more than a hundred samples. With data sets consisting of millions of genomes being collected, there is a need for scalable and efficient inference methods to fully utilise these resources. We introduce an algorithm to infer whole-genome histories with comparable accuracy to the state-of-the-art but able to process four orders of magnitude more sequences. The approach also provides an “evolutionary encoding” of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.
The sequencing of modern and ancient genomes from around the world has revolutionised our understanding of human history and evolution. However, the general problem of how best to characterise the full complexity of ancestral relationships from the totality of human genomic variation remains unsolved. Patterns of variation in each data set are typically analysed independently, and often using parametric models or data reduction techniques that cannot capture the full complexity of human ancestry. Moreover, variation in sequencing technology, data quality and in silico processing, coupled with complexities of data scale, limit the ability to integrate data sources. Here, we introduce a non-parametric approach to inferring human genealogical history that overcomes many of these challenges and enables us to build the largest genealogy of both modern and ancient humans yet constructed. The genealogy provides a lossless and compact representation of multiple datasets, addresses the challenges of missing and erroneous data, and benefits from using ancient samples to constrain and date relationships. Using simulations and empirical analyses, we demonstrate the power of the method to recover relationships between individuals and populations, as well as to identify descendants of ancient samples. Finally, we show how applying a simple nonparametric estimator of ancestor geographical location to the inferred genealogy recapitulates key events in human history. Our results demonstrate that whole-genome genealogies are a powerful means of synthesising genetic data and provide rich insights into human evolution.
Objectives: In tests on known individuals, macroscopic sex estimation has between 70% and 98% accuracy. However, the materials used to create and test these methods are overwhelmingly modern. As sexual dimorphism is dependent on multiple factors, it is unclear whether macroscopic methods have similar success on earlier materials, which differ in lifestyle and nutrition. This research aims to assess the accuracy of commonly used traits by comparing macroscopic sex estimates to genetic sex in medieval English material. Materials and Methods: Sixty-six individuals from the 13th to 16th century Hospital of St John the Evangelist, Cambridge, were assessed. Genetic sex was determined using a shotgun approach. Eighteen skeletal traits were examined, and macroscopic sex estimates were derived from the os coxae, skull, and os coxae and skull combined. Each trait was tested for accuracy to explore errors in the sex estimates. Results: The combined estimate (97.7%) outperformed the os coxae only estimate (95.7%), which outperformed the skull only estimate (90.4%). Accuracy rates for individual traits varied: Phenice traits were most accurate, whereas supraorbital margins, frontal bossing, and gonial flaring were least accurate. The preauricular sulcus and arc composé showed a bias in accuracy between sexes. Discussion: Macroscopic sex estimates are accurate when applied to medieval material from Cambridge. However, low trait accuracy rates may relate to differences in dimorphism between the samples from which the methods were derived and the St John's collection. Given the sex bias, the preauricular sulcus, frontal bossing, and arc composé should be reconsidered as appropriate traits for sex estimation for this group.
In the fourth millennium BCE, a cultural phenomenon of monumental burial structures spread along the Atlantic façade. Megalithic burials have been targeted for aDNA analyses, but a gap remains in East Anglia, where Neolithic structures were generally earthen or timber. An early Neolithic (3762–3648 cal. BCE) burial monument at the site of Trumpington Meadows, Cambridgeshire, UK, contained the partially articulated remains of at least three individuals. To determine whether this monument fits a pattern present in megalithic burials regarding sex bias, kinship, diet and relationship to modern populations, teeth and ribs were analysed for DNA and carbon and nitrogen isotopic values, respectively. Whole ancient genomes were sequenced from two individuals to mean genomic coverages of 1.6X and 1.2X, and genotypes were imputed. Results show that they were brothers from a small population genetically and isotopically similar to previously published British Neolithic individuals, with a level of genome-wide homozygosity consistent with a small island population sourced from continental Europe, but bearing no signs of recent inbreeding. The first Neolithic whole genomes from a monumental burial in East Anglia confirm that this region was connected with the larger pattern of Neolithic megaliths in the British Isles and the Atlantic façade.
A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an "evolutionary encoding" of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.