Reduced representation genome-sequencing approaches based on restriction digestion are enabling large-scale marker generation and facilitating genomic studies in a wide range of model and nonmodel systems. However, sampling chromosomes based on restriction digestion may introduce a bias in allele frequency estimation due to polymorphisms in restriction sites. To explore the effects of this nonrandom sampling and its sensitivity to different evolutionary parameters, we developed a coalescent-simulation framework to mimic the biased recovery of chromosomes in restriction-based short-read sequencing experiments (RADseq). We analysed simulated DNA sequence datasets and compared known values from simulations with those that would be estimated using a RADseq approach from the same samples. We compare these 'true' and 'estimated' values of commonly used summary statistics: π, Watterson's θ (θ_W), Tajima's D and F_ST. We show that loci with missing haplotypes have estimated summary statistic values that can deviate dramatically from true values and are also enriched for particular genealogical histories. These biases are sensitive to nonequilibrium demography, such as bottlenecks and population expansion. In silico digests with 102 completely sequenced Drosophila melanogaster genomes yielded results similar to our findings from coalescent simulations. Though the potential of RADseq for marker discovery and trait mapping in nonmodel systems remains undisputed, our results urge caution when applying this technique to make population genetic inferences.
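As a rough illustration of how missing haplotypes can shift these statistics, the sketch below computes π (mean pairwise differences) and Watterson's θ (segregating sites divided by the harmonic number a_n) on a toy locus, first with all haplotypes and then with one haplotype dropped. The haplotype strings, the choice of which haplotype is lost, and the function names are all hypothetical; this is not the authors' coalescent framework.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Mean number of pairwise differences (pi) over all haplotype pairs."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


def wattersons_theta(haplotypes):
    """Watterson's estimator: S / a_n, with a_n = sum of 1/i for i in 1..n-1."""
    n = len(haplotypes)
    # A site is segregating if more than one allele is observed in its column.
    seg_sites = sum(len(set(column)) > 1 for column in zip(*haplotypes))
    a_n = sum(1.0 / i for i in range(1, n))
    return seg_sites / a_n


# Hypothetical locus: four sampled haplotypes, '1' marks a derived allele.
true_sample = ["0000", "0001", "0011", "0111"]
# Assumption for illustration: the last haplotype carries a restriction-site
# mutation and is therefore not recovered by the sequencing protocol.
recovered = true_sample[:3]

pi_true = nucleotide_diversity(true_sample)
theta_true = wattersons_theta(true_sample)
pi_est = nucleotide_diversity(recovered)
theta_est = wattersons_theta(recovered)

print(f"true:      pi={pi_true:.3f}  theta_W={theta_true:.3f}")
print(f"estimated: pi={pi_est:.3f}  theta_W={theta_est:.3f}")
```

In this toy case both estimates fall when the most divergent haplotype is lost, which is the direction of bias one would expect when dropout removes a distinct lineage.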
MCPH) was notified by an elementary school that on May 23, an unvaccinated teacher had reported receiving a positive test result for SARS-CoV-2, the virus that causes COVID-19. The teacher reported becoming symptomatic on May 19, but continued to work for 2 days before receiving a test on May 21. On occasion during this time, the teacher read aloud unmasked to the class despite school requirements to mask while indoors. Beginning May 23, additional cases of COVID-19 were reported among other staff members, students, parents, and siblings connected to the school. To characterize the outbreak, on May 26, MCPH initiated case investigation and contact tracing that included whole genome sequencing (WGS) of available specimens. A total of 27 cases were identified, including that of the teacher. During May 23-26, among the teacher's 24 students, 22 students, all ineligible for vaccination because of age, received testing for SARS-CoV-2; 12 received positive test results. The attack rate in the two rows seated closest to the teacher's desk was 80% (eight of 10) and was 28% (four of 14) in the three back rows (Fisher's exact test; p = 0.036). During May 24-June 1, six of 18 students in a separate grade at the school, all also too young for vaccination, received positive SARS-CoV-2 test results. Eight additional cases were also identified, all in parents and siblings of students in these two grades. Among these additional cases, three were in persons fully vaccinated in accordance with CDC recommendations (1). Among the 27 total cases, 22 (81%) persons reported symptoms; the most frequently reported symptoms were fever (41%), cough (33%), headache (26%), and sore throat (26%). WGS of all 18 available specimens identified the B.1.617.2 (Delta) variant. Vaccines are effective against the Delta variant (2), but risk of transmission remains elevated among unvaccinated persons in schools without strict adherence to prevention strategies. 
In addition to vaccination for eligible persons, strict adherence to nonpharmaceutical prevention strategies, including masking, routine testing, facility ventilation, and staying home when symptomatic, is important to ensure safe in-person learning in schools (3).
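The reported comparison of attack rates by seating position (eight of 10 in the two front rows versus four of 14 in the three back rows) can be checked with a short calculation. The sketch below is a minimal standard-library implementation of the two-sided Fisher's exact test, assuming the usual definition that sums the probabilities of all tables no more likely than the observed one; the function name is illustrative and this is not the software the investigators used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Integer weights avoid floating-point ties when comparing tables.
    weights = {
        k: comb(row1, k) * comb(row2, col1 - k)
        for k in range(lowest, highest + 1)
    }
    observed = weights[a]
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(n, col1)


# Front rows: 8 positive, 2 not positive; back rows: 4 positive, 10 not positive.
p_value = fisher_exact_two_sided(8, 2, 4, 10)
print(round(p_value, 3))  # 0.036
```

The result agrees with the p = 0.036 given in the report.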
It has become clear that hybridization between species is much more common than previously recognized. As a result, we now know that the genomes of many modern species, including our own, are a patchwork of regions derived from past hybridization events. Increasingly, researchers are interested in disentangling which regions of the genome originated from each parental species using local ancestry inference methods. Due to the diverse effects of admixture, this interest is shared across disparate fields, from human genetics to research in ecology and evolutionary biology. However, local ancestry inference methods are sensitive to a range of biological and technical parameters that can impact accuracy. Here we present paired simulation and ancestry inference pipelines, mixnmatch and ancestryinfer, to help researchers plan and execute local ancestry inference studies. mixnmatch can simulate arbitrarily complex demographic histories in the parental and hybrid populations, selection on hybrids, and technical variables such as coverage and contamination. ancestryinfer takes as input sequencing reads from simulated or real individuals, and implements an efficient local ancestry inference pipeline. We perform a series of simulations with mixnmatch to pinpoint factors that influence accuracy in local ancestry inference and highlight useful features of the two pipelines. mixnmatch is a powerful tool for simulations of hybridization while ancestryinfer facilitates local ancestry inference on real or simulated data.
The California Conservation Genomics Project (CCGP) is a unique, critically important step forward in the use of comprehensive landscape genetic data to modernize natural resource management at a regional scale. We describe the CCGP, including all aspects of project administration, data collection, current progress, and future challenges. The CCGP will generate, analyze, and curate a single high-quality reference genome and 100-150 resequenced genomes for each of 153 species projects (representing 235 individual species) that span the ecological and phylogenetic breadth of California’s marine, freshwater, and terrestrial ecosystems. The resulting portfolio of roughly 20,000 resequenced genomes will be analyzed with identical informatic and landscape genomic pipelines, providing a comprehensive overview of hotspots of within-species genomic diversity, potential and realized corridors connecting these hotspots, regions of reduced diversity requiring genetic rescue, and the distribution of variation critical for rapid climate adaptation. After two years of concerted effort, full funding ($12M USD) has been secured, species identified, and funds distributed to 68 laboratories and 114 investigators drawn from all 10 University of California campuses. The remaining phases of the CCGP include completion of data collection and analyses, and delivery of the resulting genomic data and inferences to state and federal regulatory agencies to help stabilize species declines. The aspirational goals of the CCGP are to identify geographic regions that are critical to long-term preservation of California biodiversity, prioritize those regions based on defensible genomic criteria, and provide foundational knowledge that informs management strategies at both the individual species and ecosystem levels.
The evolution of reproductive barriers is fundamental to the formation of new species and can help us understand the diversification of life on Earth. These reproductive barriers often take the form of hybrid incompatibilities, where genes derived from two different species no longer interact properly. Theory predicts that incompatibilities involving multiple genes should be common and that rapidly evolving genes will be more likely to cause incompatibilities, but empirical evidence has lagged behind these predictions. Here, we describe a mitonuclear incompatibility involving three genes within respiratory Complex I in naturally hybridizing swordtail fish. Individuals with specific mismatched protein combinations fail to complete embryonic development while those heterozygous for the incompatibility have reduced function of Complex I and unbalanced representation of parental alleles in the mitochondrial proteome. We localize the protein-protein interactions that underlie the incompatibility and document accelerated evolution and introgression in the genes involved. This work thus provides a precise characterization of the genetic architecture, physiological impacts, and evolutionary origin of a multi-gene incompatibility impacting naturally hybridizing species.
Exposure to different mutagens leaves distinct mutational patterns that can allow prediction of pathogen replication niches (Ruis 2022). We therefore hypothesised that analysis of SARS-CoV-2 mutational spectra might show lineage-specific differences, dependent on the dominant site(s) of replication and onward transmission, and could therefore rapidly infer virulence of emergent variants of concern (VOC; Konings 2021). Through mutational spectrum analysis, we found a significant reduction in G>T mutations in Omicron, which replicates in the upper respiratory tract (URT), compared to other lineages, which replicate in both upper and lower respiratory tracts (LRT). Mutational analysis of other viruses and bacteria indicates a robust, generalisable association of high G>T mutations with replication within the LRT. Monitoring G>T mutation rates over time, we found early separation of Omicron from Beta, Gamma and Delta, while the mutational burden in Alpha varied consistent with changes in transmission source as social restrictions were lifted. This supports the use of mutational spectra to infer niches of established and emergent pathogens.
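To make the quantity being compared concrete, the sketch below tallies substitution types from a list of (reference, alternate) base pairs and reports the G>T proportion for two lineages. The mutation lists and the function name are hypothetical, and the calculation ignores sequence context and genome composition; it is a simplified illustration, not the authors' spectrum pipeline.

```python
from collections import Counter


def substitution_spectrum(mutations):
    """Return the proportion of each substitution type, keyed like 'G>T'."""
    counts = Counter(f"{ref}>{alt}" for ref, alt in mutations if ref != alt)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical (reference, alternate) base pairs for two lineages.
lineage_a = [("C", "T"), ("G", "T"), ("G", "T"), ("A", "G"), ("C", "T")]
lineage_b = [
    ("C", "T"), ("C", "T"), ("A", "G"), ("G", "T"),
    ("C", "T"), ("T", "C"), ("C", "T"), ("G", "A"),
]

gt_a = substitution_spectrum(lineage_a).get("G>T", 0.0)
gt_b = substitution_spectrum(lineage_b).get("G>T", 0.0)
print(f"G>T proportion: lineage A = {gt_a:.3f}, lineage B = {gt_b:.3f}")
```

A real comparison would also need a statistical test and enough mutations per lineage for the proportions to be stable.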
Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.
Motivation: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the coronavirus disease 2019 (COVID-19) pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic.
Results: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences.
Availability and implementation: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments.
Supplementary information: Supplementary data are available at Bioinformatics online.
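For readers unfamiliar with the parsimony criterion that such optimization minimizes, the sketch below scores a single site on a toy binary tree using Fitch's algorithm, a standard textbook method. The tree, its nested-tuple encoding, and the function name are hypothetical; the abstract does not describe matOptimize's internals, and this is not its implementation.

```python
def fitch(node):
    """Return (candidate state set, minimum substitution count) for a subtree.

    A leaf is a one-character string; an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return {node}, 0
    left_states, left_cost = fitch(node[0])
    right_states, right_cost = fitch(node[1])
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra substitution.
        return shared, left_cost + right_cost
    # The children disagree: one substitution is needed on this branch pair.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical five-tip tree for one alignment column.
toy_tree = (("A", "A"), ("G", ("G", "A")))
states, score = fitch(toy_tree)
print(score)  # 2
```

A tree optimizer based on parsimony searches for rearrangements that lower this score summed over all sites.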