We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set π0, the fraction of genes that are not differentially expressed can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely few. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0-s of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0-s do not have the expected association with N, suggesting widespread problems of experiments with controlling false discovery rate (FDR). Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.
Here we assess reproducibility and inferential quality in the field of differential HT-seq, based on analysis of datasets submitted 2008-2019 to the NCBI GEO data repository. Analysis of GEO submission file structures places an overall 59% upper limit to reproducibility. We further show that only 23% of experiments resulted in theoretically expected p value histogram shapes, although both reproducibility and p value distributions show marked improvement over time. Uniform p value histogram shapes, indicative of <100 true effects, were extremely few. Our calculations of π0, the fraction of true nulls, showed that 36% of experiments have π0 <0.5, meaning that in over a third of experiments most RNA-s were estimated to change their expression level upon experimental treatment. Both the fraction of different p value histogram types and π0 values are strongly associated with the software used for calculating these p values by the original authors, indicating widespread bias.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.