Editorial: Compositional data analysis and related methods applied to genomics—a first special issue from NAR Genomics and Bioinformatics

Erb, Ionas; Gloor, Gregory B.; Quinn, Thomas P.

doi:10.1093/nargab/lqaa103

Cited by 11 publications

(13 citation statements)

References 30 publications


“…Consequently, a positive relationship is very likely for metabarcoding datasets, and, with enough randomly assembled mock communities, the estimated slope will be very near 1. Such a comparison is inappropriate because the lack of independence among the observations renders most standard statistical analyses (e.g., ordinary regression or correlation analyses) inappropriate (see Gloor et al 2017, Erb et al 2020 mordax in Fig. 1D) but some taxa are strongly over-represented (A. brama in Fig.…”

Section: The Problem (mentioning)

confidence: 99%

Toward quantitative metabarcoding

Shelton

Gold

Jensen

et al. 2022

Preprint


Amplicon-sequence data from environmental DNA (eDNA) and microbiome studies provides important information for ecology, conservation, management, and health. At present, amplicon-sequencing studies – known also as metabarcoding studies, in which the primary data consist of targeted, amplified fragments of DNA sequenced from many taxa in a mixture – struggle to link genetic observations to underlying biology in a quantitative way, but many applications require quantitative information about the taxa or systems under scrutiny. As metabarcoding studies proliferate in ecology following decades of microbial and microbiome work using similar techniques, it becomes more important to develop ways to make them quantitative to ensure that their conclusions are adequately supported. Here we link previously disparate sets of techniques for making such data quantitative, showing that the underlying PCR mechanism explains observed patterns of amplicon data in a general way. By modeling the process through which amplicon-sequence data arises, rather than transforming the data post-hoc, we show how to estimate the starting DNA proportions from a mixture of many taxa. We illustrate how to calibrate the model using mock communities and apply the approach to simulated data and a series of empirical examples. Our approach opens the door to improve the use of metabarcoding data in a wide range of applications in ecology, public health, and related fields.
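The calibration idea in this abstract can be illustrated with a small, noise-free sketch. It assumes a simplified amplification model in which each taxon grows by a factor of (1 + efficiency) per PCR cycle; that model, the function names, the efficiencies, and the cycle count are all illustrative assumptions and are not taken from the preprint, which fits a statistical model rather than inverting exact proportions.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def amplify(start_props, efficiencies, n_cycles):
    """Expected amplicon proportions under the assumed per-cycle growth model."""
    growth = (1.0 + np.asarray(efficiencies, dtype=float)) ** n_cycles
    return closure(np.asarray(start_props, dtype=float) * growth)

def calibrate_log_bias(mock_start, mock_observed):
    """Per-taxon log bias relative to taxon 0, estimated from one mock community."""
    mock_start = np.asarray(mock_start, dtype=float)
    mock_observed = np.asarray(mock_observed, dtype=float)
    observed_log_ratio = np.log(mock_observed) - np.log(mock_observed[0])
    true_log_ratio = np.log(mock_start) - np.log(mock_start[0])
    return observed_log_ratio - true_log_ratio

def correct(observed, log_bias):
    """Undo the estimated bias and re-close to proportions."""
    return closure(np.asarray(observed, dtype=float) * np.exp(-log_bias))

# Hypothetical values, chosen only for illustration.
efficiencies = [0.95, 0.90, 0.80]
n_cycles = 30

mock_start = closure([1.0, 1.0, 1.0])            # even mock community
mock_observed = amplify(mock_start, efficiencies, n_cycles)
log_bias = calibrate_log_bias(mock_start, mock_observed)

true_start = np.array([0.2, 0.5, 0.3])           # unknown in practice
observed = amplify(true_start, efficiencies, n_cycles)
estimated_start = correct(observed, log_bias)
```

In this idealized setting the corrected proportions match the starting proportions exactly; with real reads, sampling noise and cycle-dependent effects would have to be modeled explicitly.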



“…where $q = (q_j)_{j=1}^{D}$ is the vector of individual event probabilities 4. The multinomial encodes a constraint on $n$ that leads to a mutual dependence between the parts.…”

Section: Sequencing Data Are Relative (mentioning)

confidence: 99%
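The mutual dependence mentioned in the statement above can be made concrete with a short sketch. It uses the standard covariance of multinomial counts, total × (diag(q) − q qᵀ), and checks it against simulated draws; the probabilities, the total, and the function name are hypothetical choices for illustration, not values from the cited work.

```python
import numpy as np

def multinomial_covariance(total, q):
    """Theoretical covariance of multinomial counts: total * (diag(q) - q q^T)."""
    q = np.asarray(q, dtype=float)
    return total * (np.diag(q) - np.outer(q, q))

q = np.array([0.5, 0.3, 0.2])   # hypothetical event probabilities
total = 1000                    # hypothetical fixed total count per sample

theoretical = multinomial_covariance(total, q)

# Simulate many samples; every row sums to `total` by construction.
rng = np.random.default_rng(0)
counts = rng.multinomial(total, q, size=20000)
empirical = np.cov(counts, rowvar=False)
```

Because every sample has the same total, an excess in one part must be offset by a deficit elsewhere, which shows up as negative off-diagonal covariances and as rows of the covariance matrix that sum to zero.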

“…[1]) uses scale-free methods on data occurring in form of percentages, and its log-ratio methodology [2] has been applied to relative counts as well. While the sample spaces [3] of both data types are certainly not the same, the underlying problematic is identical: direct comparisons across samples can have paradoxical effects due to the lack of a common scale [4]. We have recently proposed to make use of information geometry to analyse compositional data [5].…”

Section: Introduction (mentioning)

confidence: 99%


Power Transformations of Relative Count Data as a Shrinkage Problem

Erb

2022

Preprint

Self Cite


Here we show an application of our recently proposed information-geometric approach to compositional data analysis (CoDA). This application regards relative count data, which are, e.g., obtained from sequencing experiments. First we review in some detail a variety of necessary concepts ranging from basic count distributions and their information-geometric description over the link between Bayesian statistics and shrinkage to the use of power transformations in CoDA. We then show that powering, i.e., the equivalent to scalar multiplication on the simplex, can be understood as a shrinkage problem on the tangent space of the simplex. In information-geometric terms, traditional shrinkage corresponds to an optimization along a mixture (or m-) geodesic, while powering (or exponential shrinkage) can be optimized along an exponential (or e-) geodesic. While the m-geodesic corresponds to the posterior mean of the multinomial counts using a conjugate prior, the e-geodesic corresponds to an alternative parametrization of the posterior where prior and data contributions are weighted by geometric rather than arithmetic means. To optimize the exponential shrinkage parameter, we use mean-squared error as a cost function on the tangent space. This is just the expected squared Aitchison distance from the true parameter. We derive an analytic solution for its minimum based on the delta method and test it via simulations. We also discuss exponential shrinkage as an alternative to zero imputation for dimension reduction and data normalization.
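The contrast between arithmetic and geometric weighting described in this abstract can be sketched as follows. The weight `lam`, the function names, and the example composition are assumptions made for illustration and do not follow the preprint's notation; the sketch also does not attempt the preprint's optimization of the shrinkage parameter.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def mixture_shrink(p, prior, lam):
    """Arithmetic weighting of data and prior proportions.

    With a Dirichlet prior this has the form of the multinomial
    posterior mean, where lam = A / (n + A) for prior mass A.
    """
    return (1.0 - lam) * np.asarray(p) + lam * np.asarray(prior)

def exponential_shrink(p, prior, lam):
    """Geometric weighting of data and prior, re-closed to the simplex."""
    return closure(np.asarray(p) ** (1.0 - lam) * np.asarray(prior) ** lam)

def power(p, alpha):
    """Powering: the simplex analogue of scalar multiplication."""
    return closure(np.asarray(p, dtype=float) ** alpha)

p = closure([60.0, 30.0, 10.0])      # hypothetical observed proportions
uniform = closure(np.ones(3))        # uniform prior composition
lam = 0.25                           # hypothetical shrinkage weight

arithmetic = mixture_shrink(p, uniform, lam)
geometric = exponential_shrink(p, uniform, lam)
powered = power(p, 1.0 - lam)
```

With a uniform prior, geometric weighting reduces to powering with exponent 1 − lam, and every log-ratio between parts is scaled by that exponent. Note that the geometric form requires strictly positive proportions.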


“…The whole data analysis was conducted in R 3.4.4. Firstly we transformed the MG, MT and MP data using the central log ratio with the function clr 76 to overcome the inherent problems of compositional data 77,78 . In order to estimate the batch effect between the train and test samples, introduced by the different experimental procedure (mainly the robotic biomolecular extraction in the test samples and the read length), we regressed every entry in the MG and MT matrices with a linear model (with the function lm) as:…”

Section: Batch Effect Correction (mentioning)

confidence: 99%
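The centred log-ratio (clr) transform mentioned in the statement above can be sketched in a few lines. The citing study worked in R; this is a NumPy illustration only, not the function that study used, and the pseudocount for handling zeros and the example matrix are assumptions.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform, applied row-wise (one sample per row).

    The pseudocount is an assumed, simple way to avoid log(0);
    other zero-handling strategies exist.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the row mean of the logs divides by the geometric mean.
    return log_x - log_x.mean(axis=1, keepdims=True)

# Hypothetical count matrix: 3 samples (rows) by 4 features (columns).
counts = np.array([
    [120, 30, 0, 50],
    [240, 60, 2, 98],
    [10, 5, 1, 4],
])
transformed = clr(counts)
```

Each transformed row sums to zero, and without a pseudocount the result is unchanged when a sample is multiplied by a constant, which is why the transform is used for data that carry only relative information.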

Forecasting of a complex microbial community using meta-omics

Delogu

Kunath

Queirós

et al. 2022

Preprint


Microbial communities are complex assemblages whose dynamics are shaped by abiotic and biotic factors. A major challenge concerns correctly forecasting the community behaviour in the future. In this context, communities in biological wastewater treatment plants (BWWTPs) represent excellent model systems, because forecasting them is required to ultimately control and operate the plants in a sustainable manner. Here, we forecast the microbial community from the water-air interface of the anaerobic tank of a BWWTP via longitudinal meta-omics (metagenomics, metatranscriptomics and metaproteomics) data covering 14 months at weekly intervals. We extracted all the available time-dependent information, summarised it in 17 temporal signals (explaining 91.1% of the temporal variance) and linked them over time to rebuild the sequence of ecological phenomena behind the community dynamics. We forecasted the signals over the following five years and tested the predictions with 21 extra samples. We were able to correctly forecast five signals accounting for 22.5% of the time-dependent information in the system and generate mechanistic predictions on the ecological events in the community (e.g. a predation cycle involving bacteria, viruses and amoebas). Through the forecasting of the 17 signals and the environmental variables readings we reconstructed the gene abundance and expression for the following 5 years, showing a nearly perfect trend prediction (coefficient of determination ≥ 0.97) for the first 2 years. The study demonstrates the maturity of microbial ecology to forecast composition and gene expression of open microbial ecosystems using year-spanning interactions between community cycles and environmental parameters.

