Learning From Limited Data: Towards Best Practice Techniques for Antimicrobial Resistance Prediction From Whole Genome Sequencing Data

Lüftinger, Lukas; Marynen, Peter; Beisken, Stephan; Rattei, Thomas; Posch, Andreas E.

doi:10.3389/fcimb.2021.610348

Cited by 19 publications

(18 citation statements)

References 52 publications


“…This problem is described by the fact that it is generally more cost and time effective to screen for a large number of variants within an individual than it is to screen large numbers of individuals [ 62 ] as is common in the fields of, e.g., neuroimaging, genomics, motion tracking, eye tracking, and many other technology-based data collection methods that have led to a torrent of high-dimensional datasets. This is a well-known area where classical machine learning algorithms do not perform well [ 63 , 64 ]. However, despite small sample sizes being common and the fact that limited data are problematic for pattern recognition, only a limited number of papers have systematically investigated how the machine learning validation process should be designed to help avoid optimistic performance estimates.…”

Section: Discussion (mentioning)

confidence: 99%

The Relative Power of Structural Genomic Variation versus SNPs in Explaining the Quantitative Trait Growth in the Marine Teleost Chrysophrys auratus

Ruigrok, Xue, Catanach, et al. 2022. Genes

Background: Genetic diversity provides the basic substrate for evolution. Genetic variation consists of changes ranging from single base pairs (single-nucleotide polymorphisms, or SNPs) to larger-scale structural variants, such as inversions, deletions, and duplications. SNPs have long been used as the general currency for investigations into how genetic diversity fuels evolution. However, structural variants can affect more base pairs in the genome than SNPs and can be responsible for adaptive phenotypes due to their impact on linkage and recombination. In this study, we investigate the first steps needed to explore the genetic basis of an economically important growth trait in the marine teleost finfish Chrysophrys auratus using both SNP and structural variant data. Specifically, we use feature selection methods in machine learning to explore the relative predictive power of both types of genetic variants in explaining growth and discuss the feature selection results of the evaluated methods. Methods: SNP and structural variant callers were used to generate catalogues of variant data from 32 individual fish at ages 1 and 3 years. Three feature selection algorithms (ReliefF, Chi-square, and a mutual-information-based method) were used to reduce the dataset by selecting the most informative features. Following this selection process, the subset of variants was used as features to classify fish into small, medium, or large size categories using KNN, naïve Bayes, random forest, and logistic regression. The top-scoring features in each feature selection method were subsequently mapped to annotated genomic regions in the zebrafish genome, and a permutation test was conducted to see if the number of mapped regions was greater than when random sampling was applied. Results: Without feature selection, the prediction accuracies ranged from 0 to 0.5 for both structural variants and SNPs. 
Following feature selection, the prediction accuracy increased only slightly to between 0 and 0.65 for structural variants and between 0 and 0.75 for SNPs. The highest prediction accuracy for the logistic regression was achieved for age 3 fish using SNPs, although generally predictions for age 1 and 3 fish were very similar (ranging from 0–0.65 for both SNPs and structural variants). The Chi-square feature selection of SNP data was the only method that had a significantly higher number of matches to annotated genomic regions of zebrafish than would be explained by chance alone. Conclusions: Predicting a complex polygenic trait such as growth using data collected from a low number of individuals remains challenging. While we demonstrate that both SNPs and structural variants provide important information to help understand the genetic basis of phenotypic traits such as fish growth, the full complexities that exist within a genome cannot be easily captured by classical machine learning techniques. When using high-dimensional data, feature selection shows some increase in the prediction accuracy of classification models and provides the potential to identify unknown genomic correlates with growth. Our results show that both SNPs and structural variants significantly impact growth, and we therefore recommend that researchers interested in the genotype–phenotype map should strive to go beyond SNPs and incorporate structural variants in their studies as well. We discuss how our machine learning models can be further expanded to serve as a test bed to inform evolutionary studies and the applied management of species.



“…Organism–compound datasets with fewer than 100 susceptible and 100 resistant isolates were excluded. Filtered datasets were partitioned into training and test sets (80%:20%) using a genome-distance-based method [ 17 ]. This dataset partitioning method is designed to reduce similarity between the training and the test dataset.…”

Section: Methods (mentioning)

confidence: 99%
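The filtering and partitioning steps in the statement above can be illustrated with a minimal Python sketch. It is not the genome-distance-based method of reference [ 17 ], whose details are not given here. It assumes each isolate already carries a cluster identifier derived from some genome-distance clustering, and it keeps whole clusters on one side of the split so that near-identical genomes do not appear in both sets. The function names, the "S"/"R" label encoding, and the reading of the exclusion rule as "at least 100 isolates of each class" are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def has_enough_isolates(labels, min_per_class=100):
    """Keep a dataset only if both classes reach the minimum count.

    Assumes labels are the strings "S" (susceptible) and "R" (resistant).
    """
    counts = Counter(labels)
    return counts["S"] >= min_per_class and counts["R"] >= min_per_class


def cluster_aware_split(cluster_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) without splitting any cluster.

    cluster_ids[i] is the (hypothetical, precomputed) genome-distance
    cluster of isolate i. Clusters are visited in shuffled order and added
    to the test set while they still fit within the target test size.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(members[cluster]) <= target:
            test.extend(members[cluster])
        else:
            train.extend(members[cluster])
    return sorted(train), sorted(test)
```

Because clusters are never divided, the realised test share can fall somewhat below the 20% target when clusters are large; a production split would also need to check that both classes remain represented on each side.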

“…ML-based WGS-AST typically uses nucleotide k-mer representations of either input genome assemblies or raw sequencing reads [ 14 , 15 , 16 , 17 , 18 ]. K-mer sets have been successfully used for various bioinformatics analyses, ranging from species identification [ 19 ] to genome assembly [ 20 ], as they offer advantages in computing efficiency and speed.…”

Section: Introduction (mentioning)

confidence: 99%
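As a rough illustration of the nucleotide k-mer representation mentioned in the statement above, the following Python sketch counts overlapping k-mers in a single sequence and maps the counts onto a fixed vocabulary to form a feature vector. The function names and the choice to skip k-mers containing ambiguous bases are assumptions for this example; the cited pipelines may differ, for instance by using canonical k-mers or by working on raw reads.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT bases."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


def to_feature_vector(counts, vocabulary):
    """Order the counts by a fixed k-mer vocabulary; absent k-mers become 0."""
    return [counts.get(kmer, 0) for kmer in vocabulary]


counts = kmer_counts("ACGTACGT", k=3)
vector = to_feature_vector(counts, ["ACG", "CGT", "GTA", "TAC", "TTT"])
```

With 4^k possible k-mers, the vocabulary grows quickly with k, which is one source of the high-dimensional, sparse feature matrices described in the abstracts on this page.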


Genome-Wide Mutation Scoring for Machine-Learning-Based Antimicrobial Resistance Prediction

Marynen, Lüftinger, Beisken, et al. 2021. IJMS. Self-citation.

The prediction of antimicrobial resistance (AMR) based on genomic information can improve patient outcomes. Genetic mechanisms have been shown to explain AMR with accuracies in line with standard microbiology laboratory testing. To translate genetic mechanisms into phenotypic AMR, machine learning has been successfully applied. AMR machine learning models typically use nucleotide k-mer counts to represent genomic sequences. While k-mer representation efficiently captures sequence variation, it also results in high-dimensional and sparse data. With limited training data available, achieving acceptable model performance or model interpretability is challenging. In this study, we explore the utility of feature engineering with several biologically relevant signals. We propose to predict the functional impact of observed mutations with PROVEAN to use the predicted impact as a new feature for each protein in an organism’s proteome. The addition of the new features was tested on a total of 19,521 isolates across nine clinically relevant pathogens and 30 different antibiotics. The new features significantly improved the predictive performance of trained AMR models for Pseudomonas aeruginosa, Citrobacter freundii, and Escherichia coli. The balanced accuracy of the respective models of those three pathogens improved by 6.0% on average.


“…Only a subset of ML algorithms is capable of effectively making use of high-dimensional data while minimizing overfitting [ 30 ]. Likewise, rigorous validation on independently sampled datasets is required for robust estimation of model performance in the general case [ 45 , 71 ]. While the increasing availability of datasets with both NGS and AST data will help in improving performance and generalizability, more research is required to establish guidelines for sampling and validation of pAST ML models that can support clinical applications.…”

Section: Current Limitations and Perspectives (mentioning)

confidence: 99%

Predictive Antibiotic Susceptibility Testing by Next-Generation Sequencing for Periprosthetic Joint Infections: Potential and Limitations

et al. 2021. Self-citation.

Joint replacement surgeries are among the most frequent medical interventions globally. Infections of prosthetic joints are a major health challenge and typically require prolonged or even indefinite antibiotic treatment. As multidrug-resistant pathogens continue to rise globally, novel diagnostics are critical to ensure appropriate treatment and help with prosthetic joint infection (PJI) management. To this end, recent studies have shown the potential of molecular methods such as next-generation sequencing to complement established phenotypic, culture-based methods. Together with advanced bioinformatics approaches, next-generation sequencing can provide comprehensive information on pathogen identity as well as antimicrobial susceptibility, potentially enabling rapid diagnosis and targeted therapy of PJIs. In this review, we summarize current developments in next-generation-sequencing-based predictive antibiotic susceptibility testing and discuss its potential and limitations for common PJI pathogens.
