2022
DOI: 10.3389/fmicb.2022.886201

Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning

Abstract: Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we inves…
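
The abstract's point about spurious accuracies under class imbalance can be illustrated with a little arithmetic. The sketch below is not the authors' method; it only assumes the sample counts quoted in one of the citation statements on this page (175 current smokers, 1070 ex- and non-smokers) and a hypothetical classifier that always predicts the majority class. The function name `majority_baseline` is made up for this illustration.

```python
def majority_baseline(n_minority: int, n_majority: int) -> tuple[float, float]:
    """Scores of a trivial classifier that always predicts the majority class.

    Returns (accuracy, balanced_accuracy).
    """
    total = n_minority + n_majority
    # Every majority-class sample is "correct"; every minority-class sample is missed.
    accuracy = n_majority / total
    recall_majority = 1.0
    recall_minority = 0.0
    # Balanced accuracy is the mean of the per-class recalls.
    balanced_accuracy = (recall_majority + recall_minority) / 2
    return accuracy, balanced_accuracy


# Counts as quoted in a citing statement (assumed here, not re-derived from the paper).
acc, bal = majority_baseline(n_minority=175, n_majority=1070)
print(f"accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
```

A model that never identifies a single smoker would still score about 86% plain accuracy on such data, which is why imbalance-aware metrics such as balanced accuracy or AUC are the more informative ones to report.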

Cited by 5 publications (4 citation statements)
References 74 publications
“…Nevertheless, it is noteworthy that it was still possible to build a good classifier based on the salivary microbiota to differentiate between current smokers and current non-smokers in the present study. Indeed, we are not the first to do so, as in 2022, another study used data on the salivary microbiota from 175 current smokers and 1070 ex- and non-smokers to create a model that achieved an AUC of 0.81 (Díez López et al., 2022), which is comparable with our findings (Table 2). Importantly, when considering that smoking, which accounted for only 3.2% of differences in our dataset, can deliver such a robust model, this reinforces the concept of using the salivary microbiota for identification of oral diseases, as periodontitis has been shown to have a much higher impact on the composition of the salivary microbiota, than smoking (Belstrøm et al., 2021).…”
Section: Discussion | Citation type: supporting
confidence: 83%
“…Additionally, smoking habit predictions can be corrected with other traits (such as sex and age) for improved accuracy and combined for a more complete picture of a personalized epigenomic fingerprint. As a final point, efforts to discover and combine other types of molecular biomarkers of smoking, such as single nucleotide polymorphisms [90][91][92], RNA markers [84,93] and microbial DNA [94], are worth exploring.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Notably, the computer vision field has utilized data augmentation with great success [176][177][178]. Efforts to use neural networks on biological data have seen limited use of data augmentation techniques to improve model performance with research focusing on single cell RNA, methylation, and SNP data [179][180][181][182]. All of these efforts focus on generating new samples from generative models such as variational autoencoders, generative adversarial networks or deep Boltzmann machines.…”
Section: Introduction | Citation type: mentioning
confidence: 99%
“…All of these efforts focus on generating new samples from generative models such as variational autoencoders, generative adversarial networks or deep Boltzmann machines. They rely on the variable and imperfect nature of the generative process of these models to produce new samples that are somewhat different from their authentic counterparts [179,180,183,184]. This effort is excellent at building datasets that normalize sources of confounding variations such as batch effects but do not address the need for missing samples in the population.…”
Section: Introduction | Citation type: mentioning
confidence: 99%