Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. These labelled contigs were used to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76), which outperformed other existing plasmid prediction tools using a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates and illustrated their applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical user interface called ‘mlplasmids’. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
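The F1-scores quoted above combine precision and recall into a single number. As a minimal sketch of how such a score is computed from contig-level counts, the function below takes true positives, false positives and false negatives, treating plasmid-derived contigs as the positive class. The function name and the example counts are illustrative assumptions, not values from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp, fp, fn are counts of contigs; which class counts as "positive"
    (here assumed to be plasmid-derived) is an illustrative choice.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 plasmid contigs correctly predicted,
# 10 chromosome contigs wrongly called plasmid, 10 plasmid contigs missed.
example = f1_score(tp=90, fp=10, fn=10)
```

Because F1 ignores true negatives, it remains informative when chromosome-derived contigs greatly outnumber plasmid-derived ones.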
Assembly of bacterial short-read whole genome sequencing (WGS) data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Long-read sequencing has emerged as a solution to resolve plasmid structures and to obtain complete genomes for most bacterial species. This information can be used to generate and label datasets from short-read based contigs as plasmid- or chromosome-derived. We investigated the use of several popular machine learning methods to classify short-read contigs with known plasmid or chromosome origin from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. Based on the resulting F1-scores, we selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium = 0.94, F1-score K. pneumoniae = 0.90, F1-score E. coli = 0.76), which outperformed other existing plasmid prediction tools using an independent set of isolates (precision E. faecium = 0.92, precision K. pneumoniae = 0.86, precision E. coli = 0.82). We demonstrated the scalability of our model by accurately predicting the plasmidome of a large collection of 1,644 E. faecium isolates with only short-read WGS available using a standard laptop with a single core. A low number of false positive predicted sequences suggests that the assignment of a particular gene of interest as plasmid- or chromosome-encoded by the models is plausible. The SVM classifiers are publicly available as a new R package called 'mlplasmids' at https://gitlab.com/sirarredondo/mlplasmids under the GNU General Public License v3.0. We additionally developed a graphical user interface using the Shiny package, which can be accessed at https://sarredondo.shinyapps.io/mlplasmids/. Single genomes can easily be predicted by uploading genome assemblies.
We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
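The classifiers described above take pentamer (5-mer) frequencies of each contig as input features. The following is a minimal sketch of how such a feature vector could be computed for a single contig. It is not the mlplasmids implementation: the function name, the fixed A/C/G/T alphabet, the single-strand counting and the skipping of windows with ambiguous bases are all assumptions made for illustration.

```python
from itertools import product

K = 5  # pentamers
ALPHABET = "ACGT"

# All 4^5 = 1024 possible pentamers in a fixed order, so that every
# contig maps to a feature vector with the same layout.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def pentamer_frequencies(sequence: str) -> list[float]:
    """Return the relative frequency of each pentamer in `sequence`.

    Assumption: windows containing characters other than A/C/G/T
    (for example N) are skipped, and only the given strand is counted.
    """
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


# Hypothetical toy contig; real contigs are typically thousands of bases long.
features = pentamer_frequencies("ACGTACGTAC")
```

A vector like `features`, computed per contig and paired with a plasmid or chromosome label, is the kind of input a classifier such as an SVM can be trained on.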
Background: Knowledge on the molecular epidemiology of Escherichia coli causing E. coli bacteremia (ECB) in the Netherlands is mostly based on extended-spectrum beta-lactamase-producing E. coli (ESBL-Ec). We determined differences in clonality and resistance and virulence gene (VG) content between non-ESBL-producing E. coli (non-ESBL-Ec) and ESBL-Ec isolates from ECB episodes with different epidemiological characteristics. Methods: A random selection of non-ESBL-Ec isolates as well as all available ESBL-Ec blood isolates was obtained from two Dutch hospitals between 2014 and 2016. Whole genome sequencing was performed to infer sequence types (STs), serotypes, acquired antibiotic resistance genes and VG scores, based on presence of 49 predefined putative pathogenic VG. Results: ST73 was most prevalent among the 212 non-ESBL-Ec (N = 26, 12.3%) and ST131 among the 69 ESBL-Ec (N = 30, 43.5%). Prevalence of ST131 among non-ESBL-Ec was 10.4% (N = 22, P value < .001 compared to ESBL-Ec). O25:H4 was the most common serotype in both non-ESBL-Ec and ESBL-Ec. Median acquired resistance gene counts were 1 (IQR 1-6) and 7 (IQR 4-9) for non-ESBL-Ec and ESBL-Ec, respectively (P value < .001). Among non-ESBL-Ec, acquired resistance gene count was highest among blood isolates from a
S. Harbarth). Other members of the MODERN WP2 study group are listed in the study group section. Clinical Microbiology and Infection. Journal homepage: www.clinicalmicrobiologyandinfection.com