Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis

Fernandes, Andrew D.; Reid, Jennifer; Macklaim, Jean M.; McMurrough, T.A.; Edgell, David R.; Gloor, Gregory B.

doi:10.1186/2049-2618-2-15

Cited by 991 publications

(973 citation statements)

References 45 publications

Supporting

Mentioning

906

Contrasting


“…Of course, many other methods exist, including but not limited to: Cuffdiff (Trapnell et al (2010)), Cuffdiff2 (Trapnell et al (2013)), NBPSeq (Di, Schafer, Cumbie, and Chang (2011)), TSPM (Auer and Doerge (2011)), baySeq (Hardcastle and Kelly (2010)), EBSeq (Leng et al (2013)), NOISeq (Tarazona, García-Alcalde, Dopazo, Ferrer, and Conesa (2011)), SAMseq (J. Li and Tibshirani (2013)), ShrinkSeq (Van De Wiel et al (2012)), DEGSeq (Wang, Feng, Wang, Wang, and Zhang (2010)), BBSeq (Y.-H. Zhou, Xia, and Wright (2011)), FDM (Singh et al (2011)), RSEM (B. Li and Dewey (2011)), Myrna (Langmead, Hansen, and Leek (2010)), PANDORA (Moulos and Hatzis (2014)), ALDEx2 (Fernandes et al (2014)), PoissonSeq (J. Li, Witten, Johnstone, and Tibshirani (2011)), and GPSeq (Srivastava and Chen (2010)). We provide code that can be easily adapted to any method that runs in R and applied to the publicly available data sets we used, as well as others.…”

Section: Cc-by-nd 40 International License Peer-reviewed) Is the Aut (mentioning)

confidence: 99%

Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

Rocke

Ruan

Zhang

et al. 2015

Preprint


Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes. Contact: dmrocke@ucdavis.edu, lruan@ucdavis.edu, yilzhang@ucdavis.edu, gt4636b@gatech.edu, bpdurbin@ucdavis.edu, saviran@ucdavis.edu. Supplementary Information: The computational tools developed for this study are freely available via our website http://dmrocke.ucdavis.edu/software.html. They can be downloaded as R code or run directly through an interactive web-based shiny application to reproduce the analysis presented here per a user's choice of dataset and the methods to be evaluated.
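The arithmetic behind the "about 1 gene" figure in the abstract above can be sketched in a few lines. This is an illustration only, not the authors' resampling procedure: it assumes that p-values are uniform on (0, 1) when every null hypothesis is true, and the function names are hypothetical.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of p-values below alpha when every null is true."""
    return n_tests * alpha


def simulate_false_positives(n_tests, alpha, seed=0):
    """Count uniform(0, 1) p-values below alpha in one simulated experiment."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


if __name__ == "__main__":
    # 10,000 tests at a cutoff of 1e-4 give an expectation of one false positive.
    print(expected_false_positives(10_000, 1e-4))
    # A single simulated experiment; the count varies from seed to seed.
    print(simulate_false_positives(10_000, 1e-4, seed=1))
```

A method whose observed count under the null is consistently far above this expectation has an inflated false positive rate in the sense the abstract describes.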



“…Log-ratio analysis of compositional data does not allow null proportions as an argument of a logarithm, thus requiring special treatment of such data (Martín-Fernández, Hron, Templ, Filzmoser, and Palarea-Albaladejo 2015a;Gloor, Macklaim, Pawlowsky-Glahn, and Egozcue 2017). The present contribution is not aimed at discussing procedures and methods for treating these zero count data, although Bayesian estimation methods show some promise (Fernandes et al 2014). Therefore, the number of genera used in the examples has been reduced to 12 by removing all genera with a total count across all samples of less than 5000, or that have zero counts in more than 100 samples.…”

Section: Example Using An 16s Rrna Gene Profiling Case (mentioning)

confidence: 99%

Linear Association in Compositional Data Analysis

Egozcue

Pawlowsky‐Glahn

Gloor

2018

AJS


With compositional data, ordinary covariation indices, designed for real random variables, fail to describe dependence. There is a need for compositional alternatives to covariance and correlation. Based on the Euclidean structure of the simplex, called Aitchison geometry, compositional association is identified to a linear restriction of the sample space when a log-contrast is constant. In order to simplify interpretation, a sparse and simple version of compositional association is defined in terms of balances which are constant across the sample. It is called b-association. This kind of association of compositional variables is extended to association between groups of compositional variables. In practice, exact b-association seldom occurs, and measures of degree of b-association are reviewed based on those previously proposed. Also, some techniques for testing b-association are studied. These techniques are applied to available oral microbiome data to illustrate both their advantages and difficulties. Both testing and measurements of b-association appear to be quite sensitive to heterogeneities in the studied populations and to outliers.


“…Here we will describe R packages specifically intended for metagenomic analysis: basic methods commonly used for a comparison of two or more groups [on the example of their implementation in ALDEx2 package (Fernandes et al, 2014)] and advanced approaches based on generalized linear models allowing both continuous and discrete factors [metagenomeSeq (Paulson et al, 2013), edgeR (McCarthy et al, 2012), DESeq2 (Love et al, 2014), MaAsLin, shotgunFunctionalizeR (Kristiansson et al, 2009)]. Finally, the methods for vector-wise rather than component-wise comparison will be introduced [HMP (La Rosa et al, 2012), vegan (Oksanen et al, 2012), micropower (Kelly et al, 2015)].…”

Section: Total Read Count Varies Between the Samples (mentioning)

confidence: 99%

“…., α n ) is a feature vector (with an additional pseudocount of 0.5 added to each component) and B(α) is the multivariate beta function. The greater the taxon abundance and the less the whole number of reads for the sample, the greater the variance (Fernandes et al, 2014). Substituting the original feature vector with several random vectors generated from the corresponding Dirichlet distribution leads to a more correct estimation of variance and thus of significance of differences.…”

Section: Component-wise Analysis (mentioning)

confidence: 99%

“…Besides stabilizing the variance, such transformation ensures the proper comparison of two components of the same vector (e.g. which of two species is higher in abundance within a single community) even if low-abundance taxa are excluded from the study (as it is often done) (Fernandes et al, 2014). The transformed abundances may be compared using either Wilcoxon or Welch's tests.…”

Section: Component-wise Analysis (mentioning)

confidence: 99%
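The two Component-wise Analysis excerpts above describe a pipeline that can be sketched as follows: add a pseudocount of 0.5 to each count, draw several random proportion vectors from the corresponding Dirichlet distribution, transform them, and compare groups with Welch's test. This is a minimal stdlib sketch, not the ALDEx2 implementation. It assumes the transformation meant is the centred log-ratio (clr), which the excerpts do not name; it reports only the Welch t statistic (no p-value); and all function names and the toy counts are hypothetical.

```python
import math
import random
from statistics import mean, variance


def dirichlet_instance(counts, rng, prior=0.5):
    """Draw one proportion vector from Dirichlet(counts + prior)."""
    # A Dirichlet draw is a set of independent Gamma draws normalised to sum to 1.
    gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]


def clr(proportions):
    """Centred log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(p) for p in proportions]
    centre = mean(logs)
    return [x - centre for x in logs]


def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def mean_welch_t(group_a, group_b, n_instances=128, seed=0):
    """Average the per-feature Welch t statistic over Monte Carlo instances."""
    rng = random.Random(seed)
    n_features = len(group_a[0])
    totals = [0.0] * n_features
    for _ in range(n_instances):
        clr_a = [clr(dirichlet_instance(s, rng)) for s in group_a]
        clr_b = [clr(dirichlet_instance(s, rng)) for s in group_b]
        for j in range(n_features):
            totals[j] += welch_t([s[j] for s in clr_a], [s[j] for s in clr_b])
    return [t / n_instances for t in totals]


if __name__ == "__main__":
    # Hypothetical toy counts: rows are samples, columns are taxa.
    group_a = [[500, 20, 30, 0], [450, 25, 35, 1], [520, 18, 28, 0]]
    group_b = [[50, 22, 31, 2], [60, 19, 33, 0], [45, 24, 29, 1]]
    print(mean_welch_t(group_a, group_b))
```

Because each Monte Carlo instance is a strictly positive proportion vector, zero counts never reach the logarithm, and averaging over instances lets the sampling uncertainty of low counts show up in the statistic.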


Guidelines to Statistical Analysis of Microbial Composition Data Inferred from Metagenomic Sequencing

Odintsova

Tyakht

Alexeev

2017

Current Issues in Molecular Biology


Metagenomics, the application of high-throughput DNA sequencing for surveys of environmental samples, has revolutionized our view on the taxonomic and genetic composition of complex microbial communities. An enormous richness of microbiota keeps unfolding in the context of various fields ranging from biomedicine and food industry to geology. Primary analysis of metagenomic reads allows to infer semi-quantitative data describing the community structure. However, such compositional data possess statistical specific properties that are important to consider during preprocessing, hypothesis testing and interpreting the results of statistical tests. Failure to account for these specifics may lead to essentially wrong conclusions as a result of the survey. Here we present a researcher introduction to the field of metagenomics with the basic properties of microbial compositional data including statistical power and proposed distribution models, perform a review of the publicly available software tools developed specifically for such data and outline the recommendations for the application of the methods. Introduction: Microbiota, complex communities consisting of microbial species, appear to inhabit literally any environmental niche in the world. Recent advances in molecular genetic techniques allowed the study of microbiota in a cultivation-independent way, leading to the discovery of enormous diversity. One of the most advanced and widely used techniques is metagenomic sequencing: classification and quantification of metagenomic sequences can be used
