We have studied a genome-wide set of single-nucleotide polymorphism (SNP) allele frequency measures for African-American, East Asian, and European-American samples. For this analysis we derived a simple, closed mathematical formulation for the spectrum of expected allele frequencies when the sampled populations have experienced nonstationary demographic histories. The direct calculation generates the spectrum orders of magnitude faster than coalescent simulations do and allows us to generate spectra for a large number of alternative histories on a multidimensional parameter grid. Model-fitting experiments using this grid reveal significant population-specific differences among the demographic histories that best describe the observed allele frequency spectra. European and Asian spectra show a bottleneck-shaped history: a reduction of effective population size in the past followed by a recent phase of size recovery. In contrast, the African-American spectrum shows a history of moderate but uninterrupted population expansion. These differences are expected to have profound consequences for the design of medical association studies. The analytical methods developed for this study, i.e., a closed mathematical formulation for the allele frequency spectrum, correcting the ascertainment bias introduced by shallow SNP sampling, and dealing with variable sample sizes provide a general framework for the analysis of public variation data.
The analysis of statistical distributions of genetic variations has a rich history in classical population genetic studies (Crow and Kimura 1970), and recent genome-scale data collection projects have positioned the field to apply, challenge, and improve traditional theory by examining data from thousands of loci simultaneously. The two most frequently studied distributions of nucleotide sequence variation are the marker density (MD), or mismatch distribution (Li 1977; Rogers and Harpending 1992; i.e., the distribution of the number of polymorphic sites observed when a collection of sequences of a given length are compared), and the allele frequency spectrum (AFS; Ewens 1972; i.e., the distribution of diallelic polymorphic sites according to the number of chromosomes that carry a given allele within a [...]
[...] the effects of recombination or mutation rate heterogeneity as we show below.
Modeling the distribution of allele frequency: Prior study of the AFS has been restricted to properties of summary statistics such as Tajima's D (Tajima 1989), or the proportion of rare- to medium-frequency alleles (Fu and Li 1993). There has been very little analysis of the general shape of observed spectral distributions. The analytical shape of the AFS, under a stationary history of constant effective population size, was derived by Fu (1995) who showed that, within n samples, the expected number of mutations of size i is inversely proportional to i. Important properties of the coalescent process under deterministically changing population size have been derived in publications of Griffiths and Tavare (199...
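The stationary-history result attributed to Fu (1995) in the text above can be illustrated with a minimal Python sketch. It assumes the standard form of that result, in which the expected number of sites carrying the mutation on i of n sampled chromosomes is theta / i. The parameter name `theta` (the population-scaled mutation rate) and both function names are illustrative choices, not taken from the source, and the sketch covers only the constant-size case, not the nonstationary histories the paper goes on to model.

```python
from fractions import Fraction


def expected_afs_constant_size(n, theta=1.0):
    """Expected site counts for mutation sizes i = 1 .. n-1 in a sample of n.

    Assumes constant effective population size, so the expected count
    for size i is theta / i (inversely proportional to i).
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    return [theta / i for i in range(1, n)]


def normalized_afs(n):
    """Expected proportion of polymorphic sites in each size class.

    theta cancels in the ratio, so the shape depends only on n.
    Exact fractions are used to avoid floating-point rounding.
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    harmonic = sum(Fraction(1, i) for i in range(1, n))
    return [Fraction(1, i) / harmonic for i in range(1, n)]


if __name__ == "__main__":
    for size, share in enumerate(normalized_afs(4), start=1):
        print(size, share)
```

Because the normalized spectrum does not depend on theta, an observed spectrum can be compared against this baseline shape without first estimating the mutation rate; departures from the 1/i shape are what a demographic model would then need to explain.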
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon 'genomic hitchhiking'. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
The joint degree matrix of a graph gives the number of edges between vertices of degree i and degree j for every pair (i, j). One can perform restricted swap operations to transform a graph into another with the same joint degree matrix. We prove that the space of all realizations of a given joint degree matrix over a fixed vertex set is connected via these restricted swap operations. This was claimed before, but there is an error in the previous proof, which we illustrate by example. We also give a simplified proof of the necessary and sufficient conditions for a matrix to be a joint degree matrix. Finally, we address some of the issues concerning the mixing time of the corresponding MCMC method to sample uniformly from these realizations.
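The definition of the joint degree matrix given above can be made concrete with a short Python sketch. It assumes a simple undirected graph supplied as a list of vertex pairs with no self-loops or repeated edges. The function name `joint_degree_matrix` and the sparse dictionary representation keyed by (i, j) with i <= j are illustrative choices, not taken from the source.

```python
from collections import Counter


def joint_degree_matrix(edges):
    """Count edges between vertices of degree i and degree j.

    `edges` is an iterable of (u, v) pairs describing a simple
    undirected graph. The result maps (i, j) with i <= j to the
    number of edges whose endpoints have degrees i and j.
    """
    edges = list(edges)

    # First pass: compute the degree of every vertex.
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Second pass: classify each edge by the degrees of its endpoints.
    jdm = Counter()
    for u, v in edges:
        i, j = sorted((degree[u], degree[v]))
        jdm[(i, j)] += 1
    return dict(jdm)


if __name__ == "__main__":
    # Path a - b - c - d: the end vertices have degree 1, the inner ones degree 2.
    path = [("a", "b"), ("b", "c"), ("c", "d")]
    print(joint_degree_matrix(path))
```

Two graphs on the same vertex set are realizations of the same joint degree matrix exactly when this function returns equal dictionaries for both, which gives a simple way to check that a candidate swap operation preserves the matrix.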
Single-nucleotide polymorphisms (SNPs) constitute the great majority of variations in the human genome, and as heritable variable landmarks they are useful markers for disease mapping and resolving population structure. Redundant coverage in overlaps of large-insert genomic clones, sequenced as part of the Human Genome Project, comprises a quarter of the genome, and it is representative in terms of base compositional and functional sequence features. We mined these regions to produce 500,000 high-confidence SNP candidates as a uniform resource for describing nucleotide diversity and its regional variation within the genome. Distributions of marker density observed at different overlap length scales under a model of recombination and population size change show that the history of the population represented by the public genome sequence is one of collapse followed by a recent phase of mild size recovery. The inferred times of collapse and recovery are Upper Paleolithic, in agreement with archaeological evidence of the initial modern human colonization of Europe.