2022
DOI: 10.1101/2022.01.11.475882
Preprint

EnteroBase: Hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations

Abstract: The definition of bacterial species is traditionally a taxonomic issue while defining bacterial populations is done with population genetics. These assignments are species specific, and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) …

Citations: cited by 6 publications (14 citation statements)
References: 125 publications
“…We observed variation in catabolic capabilities across the HC1100 representatives. This result aligns with observations of variable O-antigens presence throughout Escherichia [6] and further bolsters observations regarding the high frequency of homologous recombination in this genus [19,20]. Some of the most variable growth-supporting carbon sources included 4-hydroxyphenylacetate (74% models predicted to grow), rhamnose (73%), myo-inositol (61%), (R)-propane-1,2-diol (48%) and allose (34%) (figure 4b).…”
Section: Introduction
Citation type: supporting
confidence: 92%
“…As shown in [41,44], DTM pipelines greatly reduce the running time for ASTRAL on large taxon sets and can also improve accuracy. The divide-and-conquer pipeline presented in [12] is also used to estimate a species tree, with ASTRAL the method for constructing species trees on each subset. Although the details of the pipeline in [12] are slightly different from the specific DTM pipeline structure given in figure 2, clearly the divide-and-conquer pipeline in [12] is a DTM pipeline for species tree estimation.…”
Section: Recent Advances In Species Tree Estimation
Citation type: mentioning
confidence: 99%
“…We show how divide-and-conquer can improve many steps in a phylogenomic pipeline, starting with large-scale multiple sequence alignment (a precursor to phylogeny estimation) and ending with updating large trees. However, these are not the only recently developed divide-and-conquer methods; this issue also has a paper by Achtman et al [12] that presents another divide-and-conquer method and uses it to construct a very large bacterial tree. Thus, divide-and-conquer is a powerful technique that can be used in different ways for large-scale phylogeny and alignment estimation.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Achtman et al. [7] present a monograph on the suitability of HierCC (hierarchical clustering of core genome multi-locus sequence typing data (cgMLST)) within EnteroBase, for the identification of species and sub-species in six bacterial genera: Salmonella, Escherichia/Shigella, Clostridioides, Yersinia, Vibrio and Streptococcus. For each genus, a large representative dataset of assembled genomes is identified.…”
Section: Alternatives To Phylogenetics
Citation type: mentioning
confidence: 99%
“…It is an efficient implementation of dimensional reduction that can be run in parallel mode on a GPU processor, and provides comparable or better clustering of simulated population structure of bacterial genomes than other slower methods including principal components analysis, t-distributed stochastic neighbour embedding and uniform manifold approximation and projection. Mandrake was also applied to a gene presence/absence matrix of 20 047 genomes of S. pneumoniae and yielded very similar clustering to either PopPunk clusters based on core plus accessory genes or HC160 clusters according to EnteroBase HierCC [9]. Similar results were obtained with Salmonella, where Mandrake clusters correlated with a somewhat finer clustering than HC900 in HierCC.…”
Section: Alternatives To Phylogenetics
Citation type: mentioning
confidence: 99%