Background: Advances in metagenomic studies provide the opportunity to identify microbial taxa that are associated with human diseases. Multiple methods exist for association analysis, but their results can be inconsistent, which makes host-microbiome interactions difficult to interpret. To address this issue, we introduce Meta-Signer, a novel Metagenomic Signature Identifier tool based on rank aggregation of features identified by multiple machine learning models, including Random Forest, Support Vector Machines, LASSO, Multi-Layer Perceptron Neural Networks, and our recently developed Convolutional Neural Network framework (PopPhy-CNN). Meta-Signer generates ranked taxa lists by training the individual machine learning models over multiple training partitions, then aggregates these lists into a single ranked list through an optimization procedure so that the final list represents the most informative and robust microbial features. Meta-Signer can rank taxa using two input forms of the data: the relative abundances of the original taxa, and the taxa from the populated taxonomic trees generated from the original taxa. The latter form, which is made possible by our PopPhy-CNN model, allows the association between microbial features and the disease to be evaluated at different taxonomic levels.
Results: We evaluate Meta-Signer on five different human gut-microbiome datasets. We demonstrate that the features derived from Meta-Signer are more informative than those obtained from other available feature ranking methods. The highly ranked features are strongly supported by published literature.
Conclusion: Meta-Signer is capable of deriving a robust set of microbial features at multiple taxonomic levels for the prediction of host phenotype. Meta-Signer is user-friendly and customizable, allowing users to explore their datasets quickly and efficiently.
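To make the idea of rank aggregation concrete, the following minimal sketch combines several ranked taxa lists into one consensus list. It is not the optimization procedure used by Meta-Signer, which is not specified in this abstract; it substitutes a simple mean-rank (Borda-style) rule as a stand-in. The function name `aggregate_rankings`, the rule for taxa missing from a list, and the taxon lists are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_rankings(ranked_lists):
    """Combine ranked lists of taxa into one list ordered by mean rank.

    Simple Borda-style stand-in, NOT the optimization procedure of the tool.
    Each input list is ordered from most to least important.
    """
    # Collect every taxon that appears in at least one list.
    taxa = set()
    for ranking in ranked_lists:
        taxa.update(ranking)

    totals = defaultdict(float)
    for ranking in ranked_lists:
        position = {taxon: i for i, taxon in enumerate(ranking)}
        # Assumption: a taxon absent from a list is ranked just past its end.
        missing_rank = len(ranking)
        for taxon in taxa:
            totals[taxon] += position.get(taxon, missing_rank)

    n_lists = len(ranked_lists)
    # Sort by mean rank; break ties alphabetically so the result is deterministic.
    return sorted(taxa, key=lambda t: (totals[t] / n_lists, t))


# Hypothetical per-model rankings (illustrative names, not results from the paper).
rf_ranking = ["Fusobacterium", "Bacteroides", "Prevotella", "Roseburia"]
svm_ranking = ["Bacteroides", "Fusobacterium", "Roseburia", "Prevotella"]
lasso_ranking = ["Fusobacterium", "Prevotella", "Bacteroides"]

aggregated = aggregate_rankings([rf_ranking, svm_ranking, lasso_ranking])
print(aggregated)
```

In practice each model would contribute one ranked list per training partition, so the input would contain many more lists than the three shown here. A mean-rank rule is easy to reason about, but it treats every list as equally reliable, which an optimization-based aggregation need not do.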
Background
Recent metagenomic studies of the gut microbiome have linked dysbiosis to many human diseases [1,2,3]. A metagenomic sample is typically represented by its microbial taxonomic composition, using microbial taxa at one of the taxonomic levels, i.e., Super-kingdom, Phylum, Class, Order, Family, Genus, and Species. The identification of microbial taxa associated with human disease has been one of the important efforts in metagenomic data analysis [4]. Various metagenomic studies use parametric or non-parametric statistical tests to detect individual taxa that are differentially abundant between disease and control groups [5,6,7,8,9]. These types of methods can miss taxa that are only weakly associated individually but that together show a strong statistical association. To capture such group associations, several methods have been proposed that explore related taxa on a phylogenetic taxonomic tree. For example, a concept of variable fusion was introduced to bring two closely related taxa on the tree into a Lasso linear regression model [10]. OMiAT, a statistical framework, combines tests of all upper- and lower-level taxa to generate a microbiome comprehensive association mapping (MiC...