Batch effects are a frequent challenge in deep sequencing data analysis and can lead to misleading conclusions. We present scBatch, a numerical algorithm that conducts batch effect correction on the count matrix of RNA sequencing (RNA-seq) data. Unlike traditional methods, scBatch starts by establishing an ideal correction of the sample distance matrix that effectively reflects the underlying biological subgroups, without considering the actual correction of the raw count matrix itself. It then seeks an optimal linear transformation of the count matrix to approximate the established sample pattern. The benefit of this approach is that the final result is not restricted by assumptions about the mechanism of the batch effect. As a result, the method yields good clustering and differential gene expression (DE) results. We compared scBatch with the leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data. The comparisons demonstrated that scBatch achieved better sample clustering and DE gene detection results.

Over the past decade, RNA sequencing (RNA-seq) has become a major tool for transcriptomics. Due to limitations of sequencing technology and sample preparation, technical variation exists among reads from different batches of experiments. These unwanted technical variations, or batch effects, can lead to misleading scientific findings in downstream data analysis (Hicks et al., 2017). Typically, batch effects can alter the sample patterns, causing false interpretations of cell lineage and heterogeneity. If the goal is to detect differentially expressed (DE) genes, the analysis can suffer from loss of statistical power and/or bias.
While the severity of batch effects varies across datasets, batch effect correction has been shown to be effective in general. For instance, batch effect correction on the ENCODE human and mouse tissue bulk RNA-seq data (Lin et al., 2014), where the batch effects were intense, yielded tissue clustering results that differed substantially from, and were more sensible than, those obtained before correction (Gilad and Mizrahi-Man, 2015).
In other datasets, batch effects are often more subtle. In such cases, although the true biological pattern is maintained to some extent, weak to moderate batch effects can still be observed. Hicks et al. (2017) discussed the coexistence of biological signal and technical variation, which may still compromise the downstream analysis. Correcting the batch effects can yield better clustering results (Fei et al., 2018) on data with weak to moderate batch effects that were not obvious from dimension reduction plots (Usoskin et al., 2015; Muraro et al., 2016). These previous efforts argue for the inclusion of batch effect correction as a routine procedure in data preparation.
Since the microarray era, efforts have been made to correct batch effects. Johnson et al. (2007)
proposed an e...