The rapidly emerging diversity of single cell RNAseq datasets allows us to characterize the transcriptional behavior of cell types across a wide variety of biological and clinical conditions. With this comprehensive breadth comes a major analytical challenge. The same cell type across tissues, from different donors, or in different disease states, may appear to express different genes. A joint analysis of multiple datasets requires the integration of cells across diverse conditions. This is particularly challenging when datasets are assayed with different technologies in which real biological differences are interspersed with technical differences. We present Harmony, an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Unlike available single-cell integration methods, Harmony can simultaneously account for multiple experimental and biological factors. We develop objective metrics to evaluate the quality of data integration. In four separate analyses, we demonstrate the superior performance of Harmony to four single-cell-specific integration algorithms. Moreover, we show that Harmony requires dramatically fewer computational resources. It is the only available algorithm that makes the integration of ∼10⁶ cells feasible on a personal computer. We demonstrate that Harmony identifies both broad populations and fine-grained subpopulations of PBMCs from datasets with large experimental differences. In a meta-analysis of 14,746 cells from 5 studies of human pancreatic islet cells, Harmony accounts for variation among technologies and donors to successfully align several rare subpopulations. In the resulting integrated embedding, we identify a previously unidentified population of potentially dysfunctional alpha islet cells, enriched for genes active in the Endoplasmic Reticulum (ER) stress response.
The abundance of these alpha cells correlates across donors with the proportion of dysfunctional beta cells also enriched in ER stress response genes. Harmony is a fast and flexible general purpose integration algorithm that enables the identification of shared fine-grained subpopulations across a variety of experimental and biological conditions.
Recent technological advances¹ have enabled unbiased single cell transcriptional profiling of thousands of cells in a single experiment. Projects such as the Human Cell Atlas² (HCA) and Accelerating Medicines Partnership³,⁴ exemplify the growing body of reference datasets of primary human tissues. While individual experiments contribute incrementally to our understanding of cell types, a comprehensive catalogue of healthy and diseased cells will require the integration of multiple datasets across donors, studies, and technological platforms. Moreover, in translational research, joint analyses across tissues and clinical conditions will be essential to identify disease expanded populations. However, meaningful biological variatio...