Understanding sequencing data as compositions: an outlook and review

Quinn, Thomas P.; Erb, Ionas; Richardson, Mark F.; Crowley, Tamsyn M.

doi:10.1093/bioinformatics/bty175

Cited by 264 publications

(237 citation statements)

References 64 publications

Supporting

Mentioning

211

Contrasting

“…Most methods for analyzing RNA-Seq expression data assume that raw read counts represent absolute abundances (Quinn, Richardson, Lovell, & Crowley, 2017). However, RNA-Seq count data are not absolute and instead represent relative abundances as a type of compositional count data (Quinn, Erb, Richardson, & Crowley, 2018c; Quinn, Richardson, et al, 2017). Using methods that assume absolute values is invalid for compositional data (without first including a transformation) because the total number of reads (library size) generated from each sample varies based on factors such as sequencing performance, making comparisons of the actual count values between samples difficult (Fernandes et al, 2014; Quinn, Erb, et al, 2018c).…”

Section: Count Filtering and Log-ratio Transformations (mentioning)

confidence: 99%

Immune and environment‐driven gene expression during invasion: An eco‐immunological application of RNA‐Seq

Selechnik

Richardson

Shine

et al. 2019

Ecology and Evolution

Host–pathogen associations change rapidly during a biological invasion and are predicted to impose strong selection on immune function. It has been proposed that the invader may experience an abrupt reduction in pathogen‐mediated selection (“enemy release”), thereby favoring decreased investment into “costly” immune responses. Across plants and animals, there is mixed support for this prediction. Pathogens are not the only form of selection imposed on invaders; differences in abiotic environmental conditions between native and introduced ranges are also expected to drive rapid evolution. Here, we use RNA‐Seq to assess the expression patterns of immune and environmentally associated genes in the cane toad (Rhinella marina) across its invasive Australian range. Transcripts encoding mediators of costly immune responses (inflammation, cytotoxicity) showed a curvilinear relationship with invasion history, with highest expression in toads from oldest and newest colonized areas. This pattern is surprising given theoretical expectations of density dynamics in invasive species and may be because density influences both intraspecific competition and parasite transmission, generating conflicting effects on the strength of immune responses. Alternatively, this expression pattern may be the result of other evolutionary forces, such as spatial sorting and genetic drift, working simultaneously with natural selection. Our findings do not support predictions about immune function based on the enemy release hypothesis and suggest instead that the effects of enemy release are difficult to isolate in wild populations, especially in the absence of information regarding parasite and pathogen infection. Additionally, expression patterns of genes underlying putatively environmentally associated traits are consistent with previous genetic studies, providing further support that Australian cane toads have adapted to novel abiotic challenges.
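The citation statement for this paper notes that library size differs between samples, so raw counts need a log-ratio transformation before comparison. As a minimal sketch of that idea (not the method used by the cited authors), the centered log-ratio (CLR) transform below shows that two samples with the same proportions but different sequencing depth map to the same values. The function name `clr`, the `pseudocount` parameter, and the example counts are assumptions made for illustration only.

```python
import math


def clr(counts, pseudocount=0.0):
    """Centered log-ratio transform of one sample's counts.

    `pseudocount` is a hypothetical convenience for handling zeros;
    with the default of 0.0, every count must be strictly positive.
    """
    shifted = [c + pseudocount for c in counts]
    if any(x <= 0 for x in shifted):
        raise ValueError("CLR needs positive values; replace zeros first")
    logs = [math.log(x) for x in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


# Illustrative (made-up) counts: same composition, 10x deeper sequencing.
shallow = [10, 20, 70]
deep = [100, 200, 700]

clr_shallow = clr(shallow)
clr_deep = clr(deep)
```

Because the CLR divides each part by the sample's geometric mean, the library size cancels out when no pseudocount is added. Adding a pseudocount breaks this exact invariance, which is one reason zero handling deserves care.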

“…Alternatively, compositional data analysis as a well‐developed body of statistical methodology provides models and methods equivalent to traditional ones yet accounts for these special constraining features of relative data. The approach has been used for decades to analyze analogous types of data in the geosciences (Buccianti et al, ) and, more recently, in other disparate areas such as molecular biology to analyze sequencing data (Quinn et al, ) or physical activity epidemiology for the analysis of daily time‐use patterns (Chastin et al, ; McGregor et al, ). While the statistical theory may be unfamiliar and not typically taught in most statistics courses, recent publications and software have made the use of these techniques both feasible and accessible.…”

Section: Results (mentioning)

confidence: 99%

Analyzing Wildland Fire Smoke Emissions Data Using Compositional Data Techniques

et al. 2020

By conservation of mass, the mass of wildland fuel that is pyrolyzed and combusted must equal the mass of smoke emissions, residual char, and ash. For a given set of conditions, these amounts are fixed. This places a constraint on smoke emissions data that violates key assumptions for many of the statistical methods ordinarily used to analyze these data such as linear regression, analysis of variance, and t tests. These data are inherently multivariate, relative, and nonnegative parts of a whole and are then characterized as so‐called compositional data. This paper introduces the field of compositional data analysis to the biomass burning emissions community and provides examples of statistical treatment of emissions data. Measures and tests of proportionality, unlike ordinary correlation, allow one to coherently investigate associations between parts of the smoke composition. An alternative method based on compositional linear trends was applied to estimate trace gas composition over a range of combustion efficiency that reduced prediction error by 4% while avoiding use of modified combustion efficiency as if it were an independent variable. Use of log‐ratio balances to create meaningful contrasts between compositional parts definitively stressed differences in smoke emissions from fuel types originating in the southeastern and southwestern United States. Application of compositional statistical methods as an appropriate approach to account for the relative nature of data about the composition of smoke emissions and the atmosphere is recommended.
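The abstract above mentions log-ratio balances as contrasts between groups of compositional parts. A minimal sketch of the standard balance formula follows, assuming strictly positive parts: the scaled log of the ratio between the geometric means of two groups. The function names and the example values are hypothetical and are not taken from the cited paper.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance(numerator_parts, denominator_parts):
    """Log-ratio balance between two groups of parts.

    Computes sqrt(r*s / (r+s)) * ln(g(numerator) / g(denominator)),
    where r and s are the group sizes and g is the geometric mean.
    """
    r, s = len(numerator_parts), len(denominator_parts)
    scale = math.sqrt(r * s / (r + s))
    ratio = geometric_mean(numerator_parts) / geometric_mean(denominator_parts)
    return scale * math.log(ratio)


# Made-up parts: rescaling every part by the same factor
# leaves the balance unchanged.
b_original = balance([2.0, 8.0], [4.0])
b_rescaled = balance([20.0, 80.0], [40.0])
```

A balance depends only on ratios between parts, so it gives the same value whether the data are expressed as masses, percentages, or counts.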

“…There are many problems associated with the analysis of compositional data that cannot be handled by DMM alone (see Aitchison & Egozcue, , Gloor & Reid, , Quinn, Erb, Richardson, & Crowley, , Tsilimigras & Fodor, , van den Boogaart & Tolosana‐Delgado, ). The most intuitive challenge posed by compositional data is that spurious correlations among features can arise because of the data's inherent covariance structure (Pearson, ).…”

Section: Discussion (mentioning)

confidence: 99%

Dirichlet‐multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data

Harrison

Calder

Shastry

et al. 2020

Molecular Ecology Resources

Molecular ecology regularly requires the analysis of count data that reflect the relative abundance of features of a composition (e.g., taxa in a community, gene transcripts in a tissue). The sampling process that generates these data can be modelled using the multinomial distribution. Replicate multinomial samples inform the relative abundances of features in an underlying Dirichlet distribution. These distributions together form a hierarchical model for relative abundances among replicates and sampling groups. This type of Dirichlet‐multinomial modelling (DMM) has been described previously, but its benefits and limitations are largely untested. With simulated data, we quantified the ability of DMM to detect differences in proportions between treatment and control groups, and compared the efficacy of three computational methods to implement DMM—Hamiltonian Monte Carlo (HMC), variational inference (VI), and Gibbs Markov chain Monte Carlo. We report that DMM was better able to detect shifts in relative abundances than analogous analytical tools, while identifying an acceptably low number of false positives. Among methods for implementing DMM, HMC provided the most accurate estimates of relative abundances, and VI was the most computationally efficient. The sensitivity of DMM was exemplified through analysis of previously published data describing lung microbiomes. We report that DMM identified several potentially pathogenic, bacterial taxa as more abundant in the lungs of children who aspirated foreign material during swallowing; these differences went undetected with different statistical approaches. Our results suggest that DMM has strong potential as a statistical method to guide inference in molecular ecology.
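The hierarchical structure described in this abstract (multinomial counts drawn from proportions that themselves follow a Dirichlet distribution) can be sketched as a forward simulation. This is only an illustration of the generative model, using the Python standard library; it does not perform the inference (HMC, VI, or Gibbs sampling) that the cited paper compares. All function names, the `alpha` values, and the read depth are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw proportions from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multinomial(n_reads, proportions, rng):
    """Draw multinomial counts by assigning each read to one feature."""
    counts = [0] * len(proportions)
    features = range(len(proportions))
    for idx in rng.choices(features, weights=proportions, k=n_reads):
        counts[idx] += 1
    return counts


def simulate_replicates(alpha, n_replicates, n_reads, seed=0):
    """Simulate replicate count vectors from a Dirichlet-multinomial."""
    rng = random.Random(seed)
    return [
        sample_multinomial(n_reads, sample_dirichlet(alpha, rng), rng)
        for _ in range(n_replicates)
    ]


# Hypothetical setup: three features, four replicates, 1000 reads each.
replicates = simulate_replicates([5.0, 3.0, 2.0], n_replicates=4, n_reads=1000)
```

Each replicate sums to the fixed read depth, so the counts carry information only about relative abundances, which is the constraint the compositional framing is meant to respect.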
