As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here, we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
Over the past two decades, advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; as a result, the vast majority of hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains one of the main challenges across analyses. Although variability in results may be introduced at various stages, such as alignment, summarisation or detection of differences in expression, one source of variability has been systematically overlooked: the consequences of choices that influence the sequencing design, which propagate through analyses and introduce an additional layer of technical variation.
In this study, we illustrate qualitative and quantitative differences in results arising from the splitting of samples across lanes, for both bulk and single-cell sequencing outputs. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and on the peaks' properties. At the single-cell level, we concentrate on the identification of cell subpopulations (cells clustered based on their expression profiles), relying on the identity of markers used for assigning cell identities; both smartSeq and 10x data are presented.
We conclude that the observed reduction in the number of unique sequenced fragments reduces the level of detail on which the different prediction approaches depend. Further, sequencing stochasticity adds a weighting bias, corroborated by variable sequencing depths.