Software to call single-nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While a growing number of programs can read VCF data, many extract only the genotypes and omit the data associated with each genotype that describe its quality. We created the R package vcfR to address this issue. We developed a VCF file exploration tool implemented in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. Functions to read VCF files into R and write them back out, as well as functions to extract portions of the data and to plot summary statistics of the data, are implemented. vcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools are included to integrate sequence (FASTA) and annotation (GFF) data for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. Use of these tools is intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. vcfR thus provides essential, novel tools currently not available in R.
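To make the distinction between genotypes and their per-genotype quality data concrete, the following is a minimal Python sketch of how one VCF data line can be split so that every FORMAT field is kept alongside the genotype. It is not vcfR's API; the function name `parse_vcf_record` and the example record are hypothetical, and only the standard tab-separated VCF column layout is assumed.

```python
# Minimal sketch, not vcfR: split one VCF data line into its fixed columns
# and one dictionary of FORMAT fields per sample.
# The record below is invented for illustration only.

FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_record(line):
    """Return (fixed_fields, samples) for a single tab-separated VCF data line."""
    cols = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, cols[:8]))
    # Column 9 names the per-genotype fields, e.g. GT (genotype),
    # DP (read depth) and GQ (genotype quality).
    format_keys = cols[8].split(":")
    # Every later column is one sample, with values in the same order.
    samples = [dict(zip(format_keys, s.split(":"))) for s in cols[9:]]
    return fixed, samples


record = "chr1\t1042\t.\tA\tG\t50\tPASS\tDP=31\tGT:DP:GQ\t0/1:14:99\t1/1:17:48"
fixed, samples = parse_vcf_record(record)

genotypes = [s["GT"] for s in samples]    # what genotype-only readers keep
depths = [int(s["DP"]) for s in samples]  # quality data that would be lost
```

Keeping the whole per-sample dictionary, rather than only `GT`, is what allows later filtering on depth or genotype quality.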
The U.S. Endangered Species Act (ESA) allows listing of subspecies and other groupings below the rank of species. This provides the U.S. Fish and Wildlife Service and the National Marine Fisheries Service with a means to target the most critical unit in need of conservation. While roughly one-quarter of listed taxa are subspecies, these management agencies are hindered by uncertainties about taxonomic standards during listing or delisting activities. In a review of taxonomic publications and societies, we found few subspecies lists and none that stated standardized criteria for determining subspecific taxa. Lack of criteria is attributed to a centuries-old debate over species and subspecies concepts. However, the critical need to resolve this debate for ESA listings led us to propose that minimal biological criteria to define disjunct subspecies (legally or taxonomically) should include the discreteness and significance criteria of Distinct Population Segments (as defined under the ESA). Our subspecies criteria are in stark contrast to those proposed by supporters of the Phylogenetic Species Concept and provide a clear distinction between species and subspecies. Efforts to eliminate or reduce ambiguity associated with subspecies-level classifications will assist with ESA listing decisions. Thus, we urge professional taxonomic societies to publish and periodically update peer-reviewed species and subspecies lists. This effort must be paralleled throughout the world for efficient taxonomic conservation to take place.
While the benefits of targeted sequencing are greatest in plants with large genomes, nearly all comparative projects can benefit from the improved throughput offered by targeted multiplex DNA sequencing, particularly as the amount of data produced from a single instrument approaches a trillion bases per run.
The ascomycete fungus Tolypocladium inflatum, a pathogen of beetle larvae, is best known as the producer of the immunosuppressant drug cyclosporin. The draft genome of T. inflatum strain NRRL 8044 (ATCC 34921), the isolate from which cyclosporin was first isolated, is presented along with comparative analyses of the biosynthesis of cyclosporin and other secondary metabolites in T. inflatum and related taxa. Phylogenomic analyses reveal previously undetected and complex patterns of homology between the nonribosomal peptide synthetase (NRPS) that encodes cyclosporin synthetase (simA) and those of other secondary metabolites with activities against insects (e.g., beauvericin and destruxins), and demonstrate the roles of module duplication and gene fusion in diversification of NRPSs. The secondary metabolite gene cluster responsible for cyclosporin biosynthesis is described. In addition to genes necessary for cyclosporin biosynthesis, it harbors a gene for a cyclophilin, which is a member of a family of immunophilins known to bind cyclosporin. Comparative analyses support a lineage-specific origin of the cyclosporin gene cluster rather than horizontal gene transfer from bacteria or other fungi. RNA-Seq transcriptome analyses in a cyclosporin-inducing medium delineate the boundaries of the cyclosporin cluster and reveal high levels of expression of the gene cluster cyclophilin. In medium containing insect hemolymph, weaker but significant upregulation of several genes within the cyclosporin cluster, including the highly expressed cyclophilin gene, was observed. T. inflatum also represents the first reference draft genome of Ophiocordycipitaceae, a third family of insect pathogenic fungi within the fungal order Hypocreales, and supports parallel and qualitatively distinct radiations of insect pathogens. The T. inflatum genome provides additional insight into the evolution and biosynthesis of cyclosporin and lays a foundation for further investigations of the role of secondary metabolite gene clusters and their metabolites in fungal biology.
Background: Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform.
Results: We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic.
Conclusions: Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array, which is more SNPs than are needed to use genomic selection in tree breeding programs. Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change.
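The "~200,000 true SNPs" estimate can be reproduced approximately from the counts reported in the abstract. The short Python sketch below assumes, as a simplification not stated in the abstract, that the assayed SNPs are representative of the whole candidate set, so the validation rate can be applied to all 278,979 candidates. The variable names are illustrative.

```python
# Rough check of the validation arithmetic, using only counts reported
# in the abstract. Assumption: the assayed SNPs are a representative
# sample of all candidate SNPs.

assayed_snps = 8067        # SNPs placed on the Infinium array
validated_snps = 5847      # called successfully and polymorphic
candidate_snps = 278_979   # unique putative SNPs in the database

validation_rate = validated_snps / assayed_snps
expected_true_snps = candidate_snps * validation_rate

print(f"validation rate: {validation_rate:.1%}")
print(f"expected true SNPs: {expected_true_snps:,.0f}")
```

Under that assumption the projection lands slightly above 200,000, consistent with the abstract's rounded figure.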
Population genetic analysis is a powerful tool to understand how pathogens emerge and adapt. However, determining the genetic structure of populations requires a range of subtle skills that are often not explicitly described in book chapters or review articles on population genetics. What is a good sampling strategy? How many isolates should I sample? How do I include positive and negative controls in my molecular assays? What marker system should I use? This review attempts to address many of these practical questions, which are rarely answered by books or reviews on the topic but instead emerge from discussions with colleagues and from practical experience. A further complication for microbial or pathogen populations is the frequent observation of clonality or partial clonality. Clonality invariably makes analyses of population data difficult because many assumptions underlying the theory from which analysis methods were derived are often violated. This review provides practical guidance on how to navigate the complex web of data analyses for pathogens that may violate typical population genetics assumptions. We also provide resources and examples for analysis in the R programming environment.
Conservation and management of natural populations requires accurate and inexpensive genotyping methods. Traditional microsatellite, or simple sequence repeat (SSR), marker analysis remains a popular genotyping method because of the comparatively low cost of marker development, ease of analysis and high power of genotype discrimination. With the availability of massively parallel sequencing (MPS), it is now possible to sequence microsatellite-enriched genomic libraries in multiplex pools. To test this approach, we prepared seven microsatellite-enriched, barcoded genomic libraries from diverse taxa (two conifer trees, five birds) and sequenced these on one lane of the Illumina Genome Analyzer using paired-end 80-bp reads. In this experiment, we screened 6.1 million sequences and identified 356,958 unique microreads that contained di- or trinucleotide microsatellites. Examination of four species shows that our conversion rate from raw sequences to polymorphic markers compares favourably to Sanger- and 454-based methods. The advantage of multiplexed MPS is that the staggering capacity of modern microread sequencing is spread across many libraries; this reduces sample preparation and sequencing costs to less than $400 (USD) per species. This price is sufficiently low that microsatellite libraries could be prepared and sequenced for all 1373 organisms listed as 'threatened' and 'endangered' in the United States for under $0.5 M (USD).
Background: Science-based wildlife management relies on genetic information to infer population connectivity and identify conservation units. The most commonly used genetic marker for characterizing animal biodiversity and identifying maternal lineages is the mitochondrial genome. Mitochondrial genotyping figures prominently in conservation and management plans, with much of the attention focused on the non-coding displacement ("D") loop. We used massively parallel multiplexed sequencing to sequence complete mitochondrial genomes from 40 fishers, a threatened carnivore that possesses low mitogenomic diversity. This allowed us to test a key assumption of conservation genetics, specifically, that the D-loop accurately reflects genealogical relationships and variation of the larger mitochondrial genome.
Results: Overall mitogenomic divergence in fishers is exceedingly low, with 66 segregating sites and an average pairwise distance between genomes of 0.00088 across their aligned length (16,290 bp). Estimates of variation and genealogical relationships from the displacement (D) loop region (299 bp) are contradicted by the complete mitochondrial genome, as well as the protein coding fraction of the mitochondrial genome. The sources of this contradiction trace primarily to the near-absence of mutations marking the D-loop region of one of the most divergent lineages, and secondarily to independent (recurrent) mutations at two nucleotide positions in the D-loop amplicon.
Conclusions: Our study has two important implications. First, inferred genealogical reconstructions based on the fisher D-loop region contradict inferences based on the entire mitogenome to the point that the populations of greatest conservation concern cannot be accurately resolved. Whole-genome analysis identifies Californian haplotypes from the northern-most populations as highly distinctive, with a significant excess of amino acid changes that may be indicative of molecular adaptation; D-loop sequences fail to identify this unique mitochondrial lineage. Second, the impact of recurrent mutation appears most acute in closely related haplotypes, due to the low level of evolutionary signal (unique mutations that mark lineages) relative to evolutionary noise (recurrent, shared mutation in unrelated haplotypes). For wildlife managers, this means that the populations of greatest conservation concern may be at the highest risk of being misidentified by D-loop haplotyping. This message is timely because it highlights the new opportunities for basing conservation decisions on more accurate genetic information.
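As an illustration of the "average pairwise distance" statistic reported above, the following Python sketch computes the mean uncorrected p-distance (proportion of differing sites) over all pairs of aligned sequences. This is an assumption for illustration: the abstract does not say which distance estimator or gap treatment the authors used, and the toy sequences are invented, not fisher data.

```python
# Minimal sketch: mean uncorrected pairwise distance among aligned sequences.
# Assumption: the abstract does not name the estimator actually used.
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(a, b))
    return differences / len(a)


def mean_pairwise_distance(seqs):
    """Average p-distance over every unordered pair of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment of three 10-bp haplotypes (hypothetical data).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
mean_distance = mean_pairwise_distance(toy_alignment)
```

With real mitogenomes the same calculation would run over all pairs of the 40 aligned sequences.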