High-dimensional data generated from bacterial whole-genome sequencing provide an unprecedented scale of information that requires an appropriate statistical analysis framework to infer biological function from populations of genomes. Genome-wide association study (GWAS) methods are an appropriate framework for bacterial population genome analysis and yield a list of candidate genes associated with a phenotype, but that list carries no ranking of importance. Here, we validated a novel framework to define infection mechanism using the combination of GWAS, machine learning, and bacterial population genomics, which ranked allelic variants that accurately identified disease. This approach parsed a dataset of 1.2 million single nucleotide polymorphisms (SNPs) and indels and produced an importance-ranked list of associated alleles of porA in Campylobacter jejuni using spatiotemporal analysis over 30 years. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This framework, termed μPathML, defined intestinal and extraintestinal groups that have differential allelic porA variants that cause abortion. Divergent variants containing indels that defeated automated annotation were rescued using biological context and knowledge, which defined rare, divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in infectious disease mechanisms.
Vibrio parahaemolyticus is the most common cause of seafood-borne illness reported in the United States. Draft genomes of 132 North American clinical and oyster V. parahaemolyticus isolates were sequenced to investigate their phylogenetic and biogeographic relationships. The majority of oyster isolate sequence types (STs) were from a single harvest location; however, four were identified from multiple locations. There was population structure along the Gulf and Atlantic Coasts of North America, with what seemed to be a hub of genetic variability along the Gulf Coast, with some of the same STs occurring along the Atlantic Coast and one shared between the coastal waters of the Gulf and those of Washington state. Phylogenetic analyses found nine well-supported clades. Two clades were composed of isolates from both clinical and oyster sources. Four were composed entirely of isolates from clinical sources and three entirely of isolates from oyster sources. Each single-source clade consisted of one ST. Some human isolates lack tdh and trh and some T3SS genes, which are established virulence genes of V. parahaemolyticus. Thus, these genes are not essential for pathogenicity. However, isolates in the monophyletic groups from clinical sources were enriched in several categories of genes when compared to those from monophyletic groups of oyster isolates. These functional categories include cell signaling, transport, and metabolism. Identification of genes in these functional categories provides a basis for future in-depth pathogenicity investigations of V. parahaemolyticus. IMPORTANCE Vibrio parahaemolyticus is the most common cause of seafood-borne illness reported in the United States and is frequently associated with shellfish consumption. This study contributes to our knowledge of the biogeography and functional genomics of this species around North America.
STs shared between the Gulf Coast and the Atlantic seaboard, as well as Pacific waters, suggest possible transport via oceanic currents or large shipping vessels. STs frequently isolated from humans, but rarely if ever from the environment, are likely more competitive in the human gut than other STs. This could be due to additional functional capabilities in areas like cell signaling, transport, and metabolism, which may give these isolates an advantage in novel nutrient-replete environments like the human gut.
Enterohemorrhagic Escherichia coli (EHEC) continues to be a significant public health risk. With the onset of next-generation sequencing, whole genome sequences are a potential resource for predictive modelling of the different regulatory mechanisms of pathogens, particularly quorum sensing. We used a pangenome approach to determine EHEC genome clustering, identify the synonymous and nonsynonymous mutations across the EHEC sdiA gene, and model the associated amino acid changes. Across the EHEC population, nonsynonymous variants are notably absent from the ligand-binding site for quorum sensing, indicating that the population-wide conservation of the sdiA ligand site can be targeted for potential prophylactic purposes. Applying pathotype-wide pangenomics as a guide for determining the evolution of pharmacophore sites is a potential approach in drug discovery.
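The distinction between synonymous and nonsynonymous mutations used above can be illustrated with a minimal sketch. It assumes the standard genetic code and compares single codons; the function name `classify_codon_change` and the example codons are hypothetical and are not taken from the study's pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label a codon substitution (hypothetical helper for illustration)."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if ref == alt:
        return "identical"
    # Same encoded amino acid: synonymous; different: nonsynonymous.
    if CODON_TABLE[ref] == CODON_TABLE[alt]:
        return "synonymous"
    return "nonsynonymous"


if __name__ == "__main__":
    # Illustrative codons only, not variants reported for sdiA.
    print(classify_codon_change("CTT", "CTC"))  # both encode leucine
    print(classify_codon_change("GAA", "GTA"))  # glutamate to valine
```

A site where only synonymous changes are observed across a population, as reported for the ligand-binding site, is one whose amino acid sequence is conserved even though the underlying nucleotides vary.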
Background: Global spread of COVID-19 created an unprecedented infectious disease crisis that progressed to a pandemic with >180,000 cases in >100 countries. Reproductive number (R) is an outbreak metric estimating the transmission of a pathogen. Initial R values were published based on the early outbreak in China, with a limited number of cases having whole genome sequencing. Initial comparisons failed to show a direct relationship between viral genomic diversity and epidemic severity for SARS-CoV-2. Methods: Each country's COVID-19 outbreak status was classified according to epicurve stage (index, takeoff, exponential, decline). Instantaneous R estimates (Wallinga and Teunis method) with short and standard serial intervals were used to examine asymptomatic spread. Whole genome sequences were used to quantify the pathogen genome identity (GENI) score, which was used to estimate transmission time and epicurve stage. Transmission time was estimated based on an evolutionary rate of 2 mutations/month. Findings: The country-specific R revealed variable infection dynamics between and within outbreak stages. Outside China, R estimates revealed propagating epidemics poised to move into the takeoff and exponential stages. Population density and local temperatures had a variable relationship to the outbreaks. GENI scores differentiated countries in the index stage with cryptic transmission. Integration of incidence data with genome variation showed that cases increased directly with genome variation. Interpretation: R varied by country and by outbreak stage. Integrating the outbreak dynamic, dynamic R, and genome variation found a direct association between cases and genome variation. Synergistically, GENI provides an evidence-based transmission metric that can be determined by sequencing the virus from each case.
We calculated an instantaneous country-specific R at different stages of outbreaks and formulated a novel metric for infection dynamics using viral genome sequences to capture gaps in untraceable transmission. Integrating epidemiology with genome sequencing allows evidence-based dynamic disease outbreak tracking with predictive evidence.
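The transmission-time estimate described in the Methods reduces to simple arithmetic once a mutation count is available. The sketch below assumes only the stated rate of 2 mutations/month; how the GENI score itself is computed is not given here, so the mutation count is treated as a supplied input, and the function name `transmission_time_months` is hypothetical.

```python
MUTATIONS_PER_MONTH = 2.0  # evolutionary rate stated in the abstract


def transmission_time_months(mutation_count: int,
                             rate: float = MUTATIONS_PER_MONTH) -> float:
    """Estimate elapsed transmission time from a mutation count.

    Hypothetical helper: divides the number of mutations separating a
    sequence from its reference by an assumed constant rate.
    """
    if mutation_count < 0:
        raise ValueError("mutation_count must be non-negative")
    if rate <= 0:
        raise ValueError("rate must be positive")
    return mutation_count / rate


if __name__ == "__main__":
    # Six mutations at 2 mutations/month corresponds to three months.
    print(transmission_time_months(6))
```

The estimate is only as good as the constant-rate assumption, so it is best read as a rough ordering of cases in time rather than a precise date.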