Abstract. Single-cell sequencing and proteome profiling efforts in the past few years have revealed widespread genetic and proteomic heterogeneity among tumor cells. However, defining cell types sensibly within such heterogeneous cell populations has so far been a challenging task. Single-cell technologies such as RNA sequencing and mass cytometry provide information that is inaccessible to conventional bulk measurements and have achieved significant improvements in multiparametricity at high cellular throughput. By combining these technologies with computational and mathematical techniques, it is possible to quantitatively define cellular heterogeneity, uncovering distinct phenotypic profiles that can be used, for example, to characterize tumor heterogeneity, with the potential to develop and improve therapeutic strategies.
Current Methods and Challenges in Analyzing Single-Cell Data
Various approaches have been used to characterize cellular heterogeneity from single-cell data (1,2), typically based on defining a cell type as a characteristic abundance profile of a set of molecular markers (gates). These profiles have been either defined from prior knowledge (3) or derived in a data-driven fashion by means of conventional clustering techniques (4). Other techniques addressing this task include generic dimensionality reduction (5) and clustering techniques that account for differentiation mechanisms in an algorithmic fashion (6). However, the high dimensionality and complex biological variability of these single-cell data require the development of novel computational approaches for quantification and interpretation of the results.
To date, such approaches have not sufficiently taken into account the non-linear, continuous nature of biological processes, which give rise to intermediate cell states observed during transitions between persistent cell states. Conventional clustering techniques (7) make rigid implicit assumptions about the shape of cell subpopulations and not only ignore such intermediate states but are confounded by their occurrence (8). Recent computational approaches took advantage of intermediate cell states to define cell types from single-cell data for linear and general bifurcated processes (8,9). While the latter approaches permit a wide spectrum of geometric arrangements of heterogeneous cell populations, they lack robustness and reproducibility due to the stochastic nature of the underlying algorithm. Dimensionality reduction techniques such as principal component analysis and t-distributed Stochastic Neighbor Embedding (t-SNE) shift the interpretation of nonlinearities and cell population a