Abstract: Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to …
“…For the case of this E. coli dataset, the phylogeny inferred by Mugsy, a reference-independent approach, was in topological agreement with other reference-dependent approaches (Table 2). In fact, kSNPv3 was one of the only methods that returned a topology that was inconsistent with all other methods (Table 2); an inconsistent kSNP phylogeny has also been reported in the analysis of other datasets (Pettengill et al, 2014). To analyze this further, we identified SNPs (n = 826) from the NASP run using simulated paired-end reads that were uniquely shared on a branch of the phylogeny that defines a monophyletic lineage (Fig.…”
Section: Pipeline Comparisons On E Coli Genomes Data Set (mentioning)
confidence: 97%
“…Previously, it has been demonstrated that different phylogenies can be obtained for the same dataset using either RAxML or FastTree2 (Pettengill et al, 2014). To test this result across multiple phylogenetic inference methods, the NASP E. coli read dataset was used.…”
Section: Phylogeny Differences For the Same Dataset (mentioning)
confidence: 99%
“…The program lyve-SET has been applied to outbreak investigations and uses raw or simulated reads to identify SNPs (Katz et al, 2013). Finally, the CFSAN SNP pipeline is a published method from the United States Food and Drug Administration that only supports the use of raw reads (Pettengill et al, 2014). There have been, to our knowledge, no published comparative studies to compare the functionality of these pipelines on a range of test datasets.…”
Whole-genome sequencing (WGS) of bacterial isolates has become standard practice in many laboratories. Applications for WGS analysis include phylogeography and molecular epidemiology, using single nucleotide polymorphisms (SNPs) as the unit of evolution. NASP was developed as a reproducible method that scales well with the hundreds to thousands of WGS data typically used in comparative genomics applications. In this study, we demonstrate how NASP compares with other tools in the analysis of two real bacterial genomics datasets and one simulated dataset. Our results demonstrate that NASP produces similar, and often better, results in comparison with other pipelines, but is much more flexible in terms of data input types, job management systems, diversity of supported tools and output formats. We also demonstrate differences in results based on the choice of the reference genome and choice of inferring phylogenies from concatenated SNPs or alignments including monomorphic positions. NASP represents a source-available, version-controlled, unit-tested method and can be obtained from tgennorth.github.io/NASP.
“…In the United States, nationwide real-time whole-genome sequencing (WGS) was implemented using the GenomeTrakr and PulseNet network to enhance listeriosis outbreak detection and investigation (14). In several outbreak investigations, the U.S. Centers for Disease Control and Prevention (CDC) had employed a whole-genome multilocus sequence typing (wgMLST) tool that targets the allelic differences in genome-wide coding regions (14), and the U.S. Food and Drug Administration (FDA) had employed a reference-based Center for Food Safety and Applied Nutrition (CFSAN) SNP Pipeline that identifies single nucleotide polymorphisms (SNPs) in the entire genome, including core genes, accessory genes, and intergenic regions (8, 11, 15). …”
In 2014, the identification of stone fruits contaminated with Listeria monocytogenes led to the subsequent identification of a multistate outbreak. Simultaneous detection and enumeration of L. monocytogenes were performed on 105 fruits, each weighing 127 to 145 g, collected from 7 contaminated lots. The results showed that 53.3% of the fruits yielded L. monocytogenes (lower limit of detection, 5 CFU/fruit), and the levels ranged from 5 to 2,850 CFU/fruit, with a geometric mean of 11.3 CFU/fruit (0.1 CFU/g of fruit). Two serotypes, IVb-v1 and 1/2b, were identified by a combination of PCR- and antiserum-based serotyping among isolates from fruits and their packing environment; certain fruits contained a mixture of both serotypes. Single nucleotide polymorphism (SNP)-based whole-genome sequencing (WGS) analysis clustered isolates from two case-patients with the serotype IVb-v1 isolates and distinguished outbreak-associated isolates from pulsed-field gel electrophoresis (PFGE)-matched, but epidemiologically unrelated, clinical isolates. The outbreak-associated isolates differed by up to 42 SNPs. All but one serotype 1/2b isolate formed another WGS cluster and differed by up to 17 SNPs. Fully closed genomes of isolates from the stone fruits were used as references to maximize the resolution and to increase our confidence in prophage analysis. Putative prophages were conserved among isolates of each WGS cluster. All serotype IVb-v1 isolates belonged to singleton sequence type 382 (ST382); all but one serotype 1/2b isolate belonged to clonal complex 5.
IMPORTANCE: WGS proved to be an excellent tool to assist in the epidemiologic investigation of listeriosis outbreaks. The comparison at the genome level contributed to our understanding of the genetic diversity and variations among isolates involved in an outbreak or isolates associated with food and environmental samples from one facility.
Fully closed genomes increased our confidence in the identification and comparison of accessory genomes. The diversity among the outbreak-associated isolates and the inclusion of PFGE-matched, but epidemiologically unrelated, isolates demonstrate the high resolution of WGS. The prevalence and enumeration data could contribute to our further understanding of the risk associated with Listeria monocytogenes contamination, especially among high-risk populations.
“…This eliminates any biases potentially introduced due to the selection of a reference and allows for the detection of SNVs not present in the reference genome. However, as noted by Pettengill et al, a reference-free approach may lead to a higher SNV false discovery rate without appropriate thresholds (177). The software package kSNP (178,179) takes a reference-free approach to identifying SNVs by breaking up each genomic data set into k-mers and comparing these k-mers.…”
SUMMARY: The number of large-scale genomics projects is increasing due to the availability of affordable high-throughput sequencing (HTS) technologies. The use of HTS for bacterial infectious disease research is attractive because one whole-genome sequencing (WGS) run can replace multiple assays for bacterial typing, molecular epidemiology investigations, and more in-depth pathogenomic studies. The computational resources and bioinformatics expertise required to accommodate and analyze the large amounts of data pose new challenges for researchers embarking on genomics projects for the first time. Here, we present a comprehensive overview of a bacterial genomics project from beginning to end, with a particular focus on the planning and computational requirements for HTS data, and provide a general understanding of the analytical concepts to develop a workflow that will meet the objectives and goals of HTS projects.