Cryptococcus gattii recently emerged as the causative agent of cryptococcosis in healthy individuals in western North America, despite previous characterization of the fungus as a pathogen in tropical or subtropical regions. As a foundation to study the genetics of virulence in this pathogen, we sequenced the genomes of a strain (WM276) representing the predominant global molecular type (VGI) and a clinical strain (R265) of the major genotype (VGIIa) causing disease in North America. We compared these C. gattii genomes with each other and with the genomes of representative strains of the two varieties of Cryptococcus neoformans that generally cause disease in immunocompromised people. Our comparisons included chromosome alignments, analysis of gene content and gene family evolution, and comparative genome hybridization (CGH). These studies revealed that the genomes of the two representative C. gattii strains (genotypes VGI and VGIIa) are colinear for the majority of chromosomes, with some minor rearrangements. However, multiortholog phylogenetic analysis and an evaluation of gene/sequence conservation support the existence of speciation within the C. gattii complex. More extensive chromosome rearrangements were observed upon comparison of the C. gattii and the C. neoformans genomes. Finally, CGH revealed considerable variation in clinical and environmental isolates as well as changes in chromosome copy numbers in C. gattii isolates displaying fluconazole heteroresistance.
IMPORTANCE Isolates of Cryptococcus gattii are currently causing an outbreak of cryptococcosis in western North America, and most of the cases occurred in the absence of coinfection with HIV. This pattern is therefore in stark contrast to the current global burden of one million annual cases of cryptococcosis, caused by the related species Cryptococcus neoformans, in the HIV/AIDS population. The genome sequences of two outbreak-associated major genotypes of C. gattii reported here provide insights into genome variation within and between cryptococcal species. These sequences also provide a resource to further evaluate the epidemiology of cryptococcal disease and to evaluate the role of pathogen genes in the differential interactions of C. gattii and C. neoformans with immunocompromised and immunocompetent hosts.
Motivation Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10x Genomics. Results Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. Availability and implementation https://github.com/bcgsc/ARCS/ Supplementary information Supplementary data are available at Bioinformatics online.
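The idea of using read barcodes to connect draft sequences can be illustrated with a minimal sketch. It is not the ARCS implementation: the function name `barcode_links`, the parameters `end_size` and `min_shared`, and the toy alignments are all hypothetical, and the sketch assumes that reads have already been aligned to contigs. It simply counts how many barcodes are seen at the ends of two different contigs and reports the pairs that share enough of them.

```python
from collections import defaultdict
from itertools import combinations


def barcode_links(alignments, contig_lengths, end_size=3, min_shared=2):
    """Count barcodes shared between the ends of different contigs.

    alignments: iterable of (barcode, contig, position) tuples (hypothetical format).
    Returns {((contig_a, end_a), (contig_b, end_b)): shared_barcode_count}.
    """
    # Record, for every barcode, which contig ends its reads land in.
    ends_by_barcode = defaultdict(set)
    for barcode, contig, pos in alignments:
        length = contig_lengths[contig]
        if pos < end_size:
            ends_by_barcode[barcode].add((contig, "head"))
        if pos >= length - end_size:
            ends_by_barcode[barcode].add((contig, "tail"))

    # A barcode seen at the ends of two different contigs is evidence
    # that those ends lie close together on the same DNA molecule.
    shared = defaultdict(int)
    for ends in ends_by_barcode.values():
        for a, b in combinations(sorted(ends), 2):
            if a[0] != b[0]:
                shared[(a, b)] += 1

    # Keep only pairs supported by enough barcodes.
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data: barcodes B1 and B2 both cover the tail of c1 and the head of c2.
contig_lengths = {"c1": 10, "c2": 10}
alignments = [
    ("B1", "c1", 9), ("B1", "c2", 0),
    ("B2", "c1", 8), ("B2", "c2", 1),
    ("B3", "c1", 0),
]
links = barcode_links(alignments, contig_lengths)
print(links)
```

In this toy case the only supported link joins the tail of `c1` to the head of `c2`; a real scaffolder would additionally have to resolve orientation conflicts and ambiguous links.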
Motivation In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and the whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation https://github.com/bcgsc/ntedit Supplementary information Supplementary data are available at Bioinformatics online.
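A rough sketch of what Bloom filter-based base editing can look like is given below. It is a simplified illustration under stated assumptions, not ntEdit's algorithm: the class `BloomFilter`, the helpers `support` and `polish`, and all parameter values are hypothetical, only substitutions are handled (no indels), and the k-mers are loaded from a known error-free sequence rather than from real reads. A base is replaced only when an alternative base makes strictly more of the k-mers that overlap it appear in the filter.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def support(seq, i, base, k, bf):
    """Number of k-mers overlapping position i that are in bf if seq[i] were `base`."""
    lo = max(0, i - k + 1)
    hi = min(i, len(seq) - k)
    window = seq[lo:i] + base + seq[i + 1:hi + k]
    return sum(window[j:j + k] in bf for j in range(len(window) - k + 1))


def polish(draft, bf, k):
    """Greedy left-to-right substitution fixing guided by k-mer presence."""
    seq = draft
    for i in range(len(seq)):
        current = support(seq, i, seq[i], k, bf)
        best = max("ACGT", key=lambda b: support(seq, i, b, k, bf))
        if support(seq, i, best, k, bf) > current:
            seq = seq[:i] + best + seq[i + 1:]
    return seq


# Toy example: the filter holds the k-mers of the true sequence,
# and the draft carries one substitution (A -> T at index 7).
k = 5
truth = "ACGTTGCAAGGCTTAC"
bf = BloomFilter()
for j in range(len(truth) - k + 1):
    bf.add(truth[j:j + k])

draft = "ACGTTGCTAGGCTTAC"
polished = polish(draft, bf, k)
print(polished)
```

The appeal of a Bloom filter here is that membership queries use a fixed amount of memory regardless of k-mer length, at the price of occasional false positives.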
Despite major advances in DNA sequencing technologies, we do not yet have complete genome sequences. Producing high-quality, contiguous, draft assemblies de novo is of paramount importance as it informs on genetic content and organization of the genome (Pagani et al. 2012). The past decade has seen improvements in sequence throughput, a substantially lower DNA sequencing cost and increased read lengths. Whereas the base accuracy of short (currently ~250 bp) reads such as those from Illumina has improved (>99%), the base accuracy of long sequence read platforms (Pacific Biosciences, Oxford Nanopore) remains too low for generating reference-grade genome assemblies without read error correction. Gap-filling tools designed to help finish draft genomes in an automated fashion, including our own (Paulino et al. 2015), have recently been developed (Tsai, Otto, and Berriman 2010; Boetzer and Pirovano 2012). They are typically designed to work with short sequencing reads, not high-quality long sequences from other draft assemblies. In many projects that employ short sequence reads for de novo assembly, a k-mer graph assembly approach is often favored, as it effectively discards errors and spurious sequences, albeit at the cost of long-range information loss and limited ability to resolve long repeats. However, researchers routinely produce several assembly drafts, varying the k-mer length (k) in search of the most contiguous assembly. This multitude of assembly drafts comprises sequences with untapped potential, representing a wealth of information for gap-filling and scaffolding. Here, I make available two bioinformatics software tools, Cobbler and RAILS (Rene L Warren 2016), to exploit this information for automated finishing and scaffolding with long DNA sequences, respectively. They can be used to scaffold and finish high-quality draft genome assemblies with any long, preferably high-quality, sequences such as scaftigs/contigs from another genome draft. They both rely on accurate, long DNA sequences to patch gaps in existing genome assembly drafts. More specifically, Cobbler is a utility to automatically patch gaps (ambiguous regions in a draft assembly, represented by N's). It does so by first aligning the long sequences to the assembly, tallying the alignments and replacing N's with the sequences from these long DNA sequences. RAILS is an all-in-one scaffolder and gap-filler. Its process is similar to that of Cobbler. It scaffolds a given genome draft with the help of long DNA sequences (contig sequences are ordered/oriented using alignment information) using the scaffolding engine I originally developed for SSAKE (René L. Warren et al. 2007) and LINKS (Warren et al. 2015). The newly created gaps are automatically filled with the DNA string of the provided long DNA sequences. In a simulated long sequences experiment (1, 2.5, 5, 15 kbp sequences) designed from the human genome reference, Cobbler closed >65% of gaps in a human genome assembly draft (Table 1; test provided with the distribution, corre...
We assembled the 9.8 Gbp genome of western redcedar (WRC, Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis, one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci exhibiting zero diversity, and the ratio of π at zero- to four-fold degenerate sites is relatively high (~0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. While overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Non-reference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.
Summary The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies still do not reach reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. Supplementary information Supplementary data are available at Bioinformatics online.
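To make the notion of an ordered minimizer sketch concrete, here is a small illustrative sketch. It is not ntJoin's code: the functions `kmer_hash`, `minimizer_sketch` and `shared_anchors`, the hash choice, and the values of `k` and `w` are assumptions made for the example, and the graph construction step is left out. For each window of `w` consecutive k-mers the k-mer with the smallest hash is kept, and minimizers shared by two sequences then act as alignment-free anchors.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minimizer_sketch(seq, k, w):
    """Return the ordered list of (hash, position) minimizers of seq.

    For each window of w consecutive k-mers, keep the k-mer with the
    smallest hash (leftmost on ties); consecutive duplicates are merged.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        pos = start + min(range(w), key=window.__getitem__)
        if not sketch or sketch[-1][1] != pos:
            sketch.append((hashes[pos], pos))
    return sketch


def shared_anchors(draft, reference, k, w):
    """Pairs (draft_pos, reference_pos) of minimizers present in both sketches."""
    ref_pos = {h: p for h, p in minimizer_sketch(reference, k, w)}
    return [(p, ref_pos[h]) for h, p in minimizer_sketch(draft, k, w) if h in ref_pos]


# Toy example: the draft is a piece of the reference starting at offset 4,
# so every shared minimizer should map 4 positions further along the reference.
reference = "ACGTTGCAAGGCTTACCGATAGTC"
draft = reference[4:20]
anchors = shared_anchors(draft, reference, k=5, w=4)
print(anchors)
```

Because only a subset of k-mers is retained and their order is preserved, comparing sketches is much cheaper than base-level alignment while still revealing the relative order of sequences.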
Background: The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. Methods: By combining the predictive power of multiple in silico HLA predictors, we have previously identified prevalent HLA class I and class II alleles, including DPA1*02:02, in two small cohorts at the COVID-19 pandemic onset. Since then, newer and larger patient cohorts with controls and associated demographic and clinical data have been deposited in public repositories. Here, we report on HLA-I and HLA-II alleles, along with their associated risk significance in one such cohort of 126 patients, including COVID-19 positive (n=100) and negative patients (n=26). Results: We recapitulate an enrichment of DPA1*02:02 in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher's exact test [FET] p=0.0174). Having this allele, however, does not appear to put this cohort's patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes reveals nominally significant risk associations with A*11:01 (FET p=0.0078), C*04:01 (FET p=0.0087) and DQA1*01:02 (FET p=0.0121). Conclusions: While enrichment of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with the latter three alleles tend to fare worse overall. This is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni p = 0.0023), and may hold potential clinical value.
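For readers unfamiliar with the statistics named in this abstract, the following minimal sketch shows how a two-sided Fisher's exact test on a 2×2 table and a Bonferroni adjustment can be computed. The function names are hypothetical and the counts are a toy table, not the study's data, since the abstract does not give the per-group allele counts or the number of tests corrected for. The two-sided p-value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as extreme (no more probable) than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def bonferroni(p, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * num_tests)


# Toy table (hypothetical counts): carriers / non-carriers in two groups.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(p, bonferroni(p, num_tests=5))
```

The Bonferroni adjustment multiplies each nominal p-value by the number of hypotheses tested, which is why associations that look significant in isolation can fall below the threshold after correction.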
SUMMARY Spruces (Picea spp.) are coniferous trees widespread in boreal and mountainous forests of the northern hemisphere, with large economic significance and enormous contributions to global carbon sequestration. Spruces harbor very large genomes with high repetitiveness, hampering their comparative analysis. Here, we present and compare the genomes of four different North American spruces: the genome assemblies for Engelmann spruce (Picea engelmannii) and Sitka spruce (Picea sitchensis) together with improved and more contiguous genome assemblies for white spruce (Picea glauca) and for a naturally occurring introgress of these three species known as interior spruce (P. engelmannii × glauca × sitchensis). The genomes were structurally similar, and a large part of scaffolds could be anchored to a genetic map. The composition of the interior spruce genome indicated asymmetric contributions from the three ancestral genomes. Phylogenetic analysis of the nuclear and organelle genomes revealed a topology indicative of ancient reticulation. Different patterns of expansion of gene families among genomes were observed and related to presumed diversifying ecological adaptations. We identified rapidly evolving genes that harbored high rates of non‐synonymous polymorphisms relative to synonymous ones, indicative of positive selection and its hitchhiking effects. These gene sets were mostly distinct between the genomes of ecologically contrasted species, and signatures of convergent balancing selection were detected. Stress and stimulus response was identified as the most frequent function assigned to expanding gene families and rapidly evolving genes. These two aspects of genomic evolution were complementary in their contribution to divergent evolution of presumed adaptive nature.
These more contiguous spruce giga‐genome sequences should strengthen our understanding of conifer genome structure and evolution, as their comparison offers clues into the genetic basis of adaptation and ecology of conifers at the genomic level. They will also provide tools to better monitor natural genetic diversity and improve the management of conifer forests. The genomes of four closely related North American spruces indicate that their high similarity at the morphological level is paralleled by the high conservation of their physical genome structure. Yet, the evidence of divergent evolution is apparent in their rapidly evolving genomes, supported by differential expansion of key gene families and large sets of genes under positive selection, largely in relation to stimulus and environmental stress response.