Stijn Hawinkel scite author profile

A broken promise: microbiome differential abundance methods do not control the false discovery rate

High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim at detecting associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important to identify the microbiome as a prognostic or diagnostic biomarker or to demonstrate efficacy of prodrug or antibiotic drugs. Because of a lack of benchmarking studies in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real data shuffling algorithms. The results are consistent over the different approaches and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that can generate correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods.
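One common way to generate correlated counts with arbitrary univariate marginals is a Gaussian copula: draw correlated standard normals, map them to uniforms, then push each uniform through the quantile function of the desired count distribution. The sketch below illustrates that idea for two taxa only. It is an assumption that this resembles the tool described in the abstract; the function names (`poisson_quantile`, `correlated_counts`, `pearson`) and the Poisson marginals are hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Return the smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    # The cap on k guards against u rounding to exactly 1.0.
    while cdf < u and k < 10_000:
        k += 1
        pmf *= lam / k  # recursion: P(X = k) = P(X = k - 1) * lam / k
        cdf += pmf
    return k


def correlated_counts(n, quantile_a, quantile_b, rho, seed=0):
    """Draw n pairs of counts whose latent normals have correlation rho.

    quantile_a and quantile_b map a uniform value in [0, 1] to a count,
    so any univariate count distribution can be plugged in.
    """
    rng = random.Random(seed)
    std_normal_cdf = NormalDist().cdf
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((quantile_a(std_normal_cdf(z1)),
                      quantile_b(std_normal_cdf(z2))))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of the two count columns."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    demo = correlated_counts(
        1000,
        lambda u: poisson_quantile(u, 5.0),
        lambda u: poisson_quantile(u, 20.0),
        rho=0.8,
    )
    print("sample correlation of the counts:", round(pearson(demo), 3))
```

The correlation of the resulting counts is generally somewhat weaker than `rho`, because `rho` applies to the latent normals rather than to the discretized counts. For more than two taxa, the same scheme would use a full correlation matrix (for instance one estimated from real data) and its Cholesky factor.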

Sequence count data are poorly fit by the negative binomial distribution

Hawinkel, Rayner, Bijnens, et al. 2020. PLoS ONE

Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB-assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness-of-fit test for the NB distribution in regression models and demonstrate that the NB-assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that the NB-based tests perform worse on the features for which the NB-assumption was violated than on the features for which no significant deviation was detected. This gives an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
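For orientation, the NB distribution with mean mu and dispersion phi has variance mu + phi * mu^2, so it reduces to the Poisson when phi is 0. The sketch below computes a simple method-of-moments estimate of phi from one feature's counts. It is not the goodness-of-fit test proposed in the paper, it ignores covariates and sequencing depth, and the function name `nb_moment_dispersion` is a hypothetical choice for illustration.

```python
from statistics import mean, variance


def nb_moment_dispersion(counts):
    """Method-of-moments estimate of the NB dispersion phi.

    Solves variance = mean + phi * mean**2 for phi and truncates at 0,
    so equidispersed or underdispersed samples return 0.0.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion is undefined when all counts are zero")
    v = variance(counts)  # unbiased sample variance
    return max(0.0, (v - m) / (m * m))


if __name__ == "__main__":
    # A single large count inflates the variance far beyond the mean.
    print(nb_moment_dispersion([0, 1, 2, 3, 50]))
```

An estimate well above 0 signals overdispersion relative to the Poisson, but it says nothing about whether the NB itself fits, which is the question the abstract addresses.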

A unified framework for unconstrained and constrained ordination of microbiome read count data

et al. 2019

Model-based joint visualization of multiple compositional omics datasets

Hawinkel, Bijnens, Cao, et al. 2020

The integration of multiple omics datasets measured on the same samples is a challenging task: data come from heterogeneous sources and vary in signal quality. In addition, some omics data are inherently compositional, e.g. sequence count data. Most integrative methods are limited in their ability to handle covariates, missing values, compositional structure and heteroscedasticity. In this article we introduce a flexible model-based approach to data integration to address these current limitations: COMBI. We combine concepts such as compositional biplots and log-ratio link functions with latent variable models, and propose an attractive visualization through multiplots to improve interpretation. Using real data examples and simulations, we illustrate and compare our method with other data integration techniques. Our algorithm is available in the R-package combi.
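To see why log-ratios suit compositional data, consider the centered log-ratio (clr) transform: each log count is expressed relative to the mean log count of the same sample, so multiplying all counts by a constant (for example a deeper sequencing run) leaves the result unchanged. The sketch below is a generic illustration of that property and is not how COMBI works internally, since the abstract describes log-ratio link functions inside a latent variable model rather than a data transform. The name `clr` and the `pseudocount` parameter are assumptions made for this example.

```python
import math


def clr(counts, pseudocount=0.0):
    """Centered log-ratio transform of one sample's counts.

    A positive pseudocount is needed when the sample contains zeros,
    at the cost of exact scale invariance.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


if __name__ == "__main__":
    shallow = clr([10, 20, 40])
    deep = clr([100, 200, 400])  # same composition, ten times the depth
    print(shallow)
    print(deep)
```

Both samples yield the same clr values up to floating-point error, and the values of each sample sum to zero.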

A unified framework for unconstrained and constrained ordination of microbiome read count data

Hawinkel, Kerckhof, Bijnens, et al. 2018. Preprint

Explorative visualization techniques provide a first summary of microbiome read count datasets through dimension reduction. A plethora of dimension reduction methods exists, but many of them focus primarily on sample ordination, failing to elucidate the role of the bacterial species. Moreover, implicit but often unrealistic assumptions underlying these methods fail to account for overdispersion and differences in sequencing depth, which are two typical characteristics of sequencing data. We combine log-linear models with a dispersion estimation algorithm and flexible response function modelling into a framework for unconstrained and constrained ordination. The method allows easy filtering of technical confounders. As opposed to most existing ordination methods, the assumptions underlying the method are stated explicitly and can be verified using simple diagnostics. The combination of unconstrained and constrained ordination in the same framework is unique in the field and greatly facilitates microbiome data exploration. We illustrate the advantages of our method on simulated and real datasets, while pointing out flaws in existing methods. The algorithms for fitting and plotting are available in the R-package RCM.

Explorative visualization is a key first step in the analysis of high-dimensional ecological datasets. It provides insights into the strongest patterns in the dataset, unbiased by the researcher's prior beliefs. It can also help to formulate new hypotheses to be tested in a subsequent study. Nowadays, microbiological communities are characterized by sequencing either marker genes or the entire metagenome of a sample, and attributing the sequences to their matching operational taxonomic units (OTUs), species or other phylogenetic levels. Throughout this paper we will refer to the lowest level to which the reads are attributed as taxa.

Sample-specific variables, such as patient baseline characteristics or environmental conditions, can also be recorded. Microbiome sequencing datasets typically contain information on thousands of microbial taxa, whereas the number of samples and sample-specific variables is usually in the order of tens to hundreds. These data are thus high-dimensional, and require a dimension reduction before visualization. Apart from the biological variability, the DNA-extraction, amplification and sequencing steps introduce additional variability and technical artefacts, such as differences in sequencing depth. At best, data visualization methods must be insensitive to this technical noise, while accurately capturing the biological signal.

The first aim of such a dimension reduction is to optimally represent (dis)similarities between samples in an ordination: samples that are similar in high-dimensional space should also be represented close together in a two- or three-dimensional visualization. A second aim is to elucidate which taxa drive the (dis)similarities between samples. A final objective might be to identify...
