Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor quality cells. For meaningful analyses, such poor quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and uses its skewness as a quality measure. To validate the method, we investigated the impact of poor quality cells on downstream analyses and compared biological differences between typical and poor quality cells. Moreover, we measured the ratio of intergenic expression, which suggests genomic contamination, and the foreign-organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
Results
Wide discrepancies in gene body coverage among scRNA-seq protocols
Gene body (full transcript length) coverage describes the distribution of sequence tags over the entire transcript. We computed and analyzed the gene body coverage of different protocols (Online Methods). Remarkably, the gene body coverage differs widely among the datasets generated by different scRNA-seq protocols (Fig. 1a and b; Panels a and b in Supplementary Figs. 1-6). This indicates major variation among scRNA-seq datasets produced by different protocols.
We investigated the pattern of the gene body coverage for each single cell in individual datasets. Visualization of the gene body coverage profiles revealed two patterns. The first set of single cells showed well-clustered gene coverage distributions consistent with the target sequence of the protocol. The second set of single cells showed skewed gene body coverage distributions. The skewness could be observed as coverage bias towards a specific gene region.
In one type, there was bias towards the 3'-end of the gene body for the 5'-end and full-length sequencing protocols, indicated by high coverage at the 3'-end of the gene body (Fig. 1c). The tag-based 5'- or 3'-end sequencing methods [11][12][13][14] should have peak coverage at either the 5' or 3' end of the gene, with low or no coverage in the middle region of the gene body. In the second type, there was high coverage in the middle of the gene for 5'-end and 3'-end sequencing protocols (Fig. 1c), in contrast to the full-length sequencing protocols. In the third type, there was low coverage in the middle of the gene for full-length sequencing protocols, indicated by low coverage at the midpoint between the 5'- and 3'-ends of the gene body (Fig. 1d).
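To make the notion of a per-cell gene body coverage profile and its skewness concrete, the following is a minimal illustrative sketch, not the SkewC implementation. It assumes that each read has already been assigned a relative position along its transcript (0 for the 5'-end, 1 for the 3'-end). The function names `coverage_profile` and `profile_skewness`, the choice of 100 bins, and the simulated read positions are all hypothetical and serve only as an example.

```python
import numpy as np


def coverage_profile(rel_positions, n_bins=100):
    """Bin relative read positions (0 = 5'-end, 1 = 3'-end) into a
    coverage profile that sums to 1. Returns zeros if there are no reads."""
    counts, _ = np.histogram(rel_positions, bins=n_bins, range=(0.0, 1.0))
    total = counts.sum()
    if total == 0:
        return np.zeros(n_bins)
    return counts / total


def profile_skewness(profile):
    """Moment-based skewness of position along the gene body, weighted by
    coverage. Negative values mean coverage is concentrated near the 3'-end,
    positive values mean it is concentrated near the 5'-end."""
    profile = np.asarray(profile, dtype=float)
    total = profile.sum()
    if total == 0:
        return 0.0
    weights = profile / total
    centers = (np.arange(len(profile)) + 0.5) / len(profile)
    mean = np.sum(weights * centers)
    var = np.sum(weights * (centers - mean) ** 2)
    if var == 0:
        return 0.0
    third = np.sum(weights * (centers - mean) ** 3)
    return float(third / var ** 1.5)


# Simulated cells (assumed data, for illustration only).
rng = np.random.default_rng(0)
even_cell = rng.uniform(0.0, 1.0, size=10_000)     # even coverage
three_prime_cell = rng.beta(5.0, 1.0, size=10_000)  # 3'-end biased
five_prime_cell = rng.beta(1.0, 5.0, size=10_000)   # 5'-end biased

skew_even = profile_skewness(coverage_profile(even_cell))
skew_3p = profile_skewness(coverage_profile(three_prime_cell))
skew_5p = profile_skewness(coverage_profile(five_prime_cell))
```

In this sketch, a cell with even coverage yields a skewness close to zero, whereas the sign of the skewness indicates the gene end towards which coverage is biased. Whether a given value marks a poor quality cell depends on the coverage expected for the protocol, as described above.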