Metagenomics, the application of high-throughput DNA sequencing to surveys of environmental samples, has revolutionized our view of the taxonomic and genetic composition of complex microbial communities. An enormous richness of microbiota continues to unfold across fields ranging from biomedicine and the food industry to geology. Primary analysis of metagenomic reads allows one to infer semi-quantitative data describing the community structure. However, such compositional data possess specific statistical properties that are important to consider during preprocessing, hypothesis testing and interpretation of the results of statistical tests. Failure to account for these properties may lead to fundamentally wrong conclusions from the survey. Here we present an introduction for researchers to the field of metagenomics, covering the basic properties of microbial compositional data, including statistical power and proposed distribution models; review the publicly available software tools developed specifically for such data; and outline recommendations for applying the methods.
Introduction
Microbiota, complex communities consisting of microbial species, appear to inhabit literally any environmental niche in the world. Recent advances in molecular genetic techniques allowed the study of microbiota in a cultivation-independent way, leading to the discovery of enormous diversity. One of the most advanced and widely used techniques is metagenomic sequencing: classification and quantification of metagenomic sequences can be used