In this work, we describe the development of Polar Gini Curve, a method for characterizing cluster markers by analyzing single-cell RNA sequencing (scRNA-seq) data. Polar Gini Curve combines gene expression and 2D coordinate (“spatial”) information to detect patterns of uniformity in any cluster of cells from scRNA-seq data. We demonstrate that Polar Gini Curve can help users characterize the shape and density distribution of cells in a particular cluster, which can be generated during routine scRNA-seq data analysis. To quantify the extent to which a gene is uniformly distributed in a cell cluster space, we combine two polar Gini curves (PGCs): one drawn upon the cell-points expressing the gene (the “foreground curve”) and the other drawn upon all cell-points in the cluster (the “background curve”). We show that genes with highly dissimilar foreground and background curves tend not to be uniformly distributed in the cell cluster and thus have spatially divergent gene expression patterns within the cluster. Genes with similar foreground and background curves tend to be uniformly distributed in the cell cluster and thus have uniform gene expression patterns within the cluster. Such quantitative attributes of PGCs can be applied to sensitively discover biomarkers across clusters from scRNA-seq data. We demonstrate the performance of the Polar Gini Curve framework in several simulation case studies. When this framework is applied to a real-world neonatal mouse heart cell dataset, the detected biomarkers may characterize novel subtypes of cardiac muscle cells. The source code and data for Polar Gini Curve can be found at http://discovery.informatics.uab.edu/PGC/ or https://figshare.com/projects/Polar_Gini_Curve/76749.
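The foreground/background comparison described above can be illustrated with a minimal sketch. The abstract does not spell out how a curve is computed, so the construction here is an assumption: at each angle, cell-points are projected onto an axis, shifted by the minimum projection of the whole cluster so that values are non-negative, and summarized by a Gini coefficient; the two curves are then compared by root-mean-square deviation. The function names (`gini`, `polar_gini_curve`, `curve_distance`), the number of angles, and the toy grid data are all hypothetical and are not taken from the authors' code.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-based formula on the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def polar_gini_curve(points, reference, n_angles=36):
    """Gini coefficient of projected coordinates at evenly spaced angles.

    Assumption: projections are shifted by the minimum projection of
    `reference` (the whole cluster) so all values are non-negative.
    `points` must be a subset of `reference`.
    """
    curve = []
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        origin = min(x * c + y * s for x, y in reference)
        curve.append(gini([x * c + y * s - origin for x, y in points]))
    return curve


def curve_distance(fg_curve, bg_curve):
    """Root-mean-square deviation between two curves of equal length."""
    sq = sum((a - b) ** 2 for a, b in zip(fg_curve, bg_curve))
    return math.sqrt(sq / len(fg_curve))


# Toy cluster: a 10 x 10 grid of cell-points (hypothetical data).
background = [(i, j) for i in range(10) for j in range(10)]
# A gene expressed evenly across the cluster (checkerboard subset).
uniform_fg = [(i, j) for i, j in background if (i + j) % 2 == 0]
# A gene expressed only in one corner of the cluster.
clustered_fg = [(i, j) for i, j in background if i < 5 and j < 5]

bg_curve = polar_gini_curve(background, background)
d_uniform = curve_distance(polar_gini_curve(uniform_fg, background), bg_curve)
d_clustered = curve_distance(polar_gini_curve(clustered_fg, background), bg_curve)

print("uniform gene distance:  ", d_uniform)
print("clustered gene distance:", d_clustered)
```

In this toy setting the evenly expressed gene yields a foreground curve close to the background curve, while the corner-restricted gene yields a larger distance, which matches the qualitative behavior the abstract describes. Shifting by the cluster-wide minimum is one design choice among several; a different normalization would change the numbers, though the ranking should still hold for cases this clear-cut.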
In this work, we design the Polar Gini Curve (PGC) technique, which combines the gene expression and the 2D embedded visual information to detect biomarkers from single-cell data. Theoretically, a Polar Gini Curve characterizes the shape and 'evenness' of the distribution of a cell-point set. To quantify whether a gene could be a marker in a cell cluster, we can combine two Polar Gini Curves: one drawn upon the cell-points expressing the gene, and the other drawn upon all cell-points in the cluster. We hypothesize that the closer these two curves are, the more likely the gene is to be a cluster marker. We demonstrate the framework in several simulation case studies. When our framework is applied to neonatal mouse heart single-cell data, the detected biomarkers may characterize novel subtypes of cardiac muscle cells. The source code and data for PGC can be found at https://figshare.com/projects/Polar_Gini_Curve/76749.

Introduction

Discovering biomarkers from single-cell gene expression data is an interesting yet challenging problem [1]. Compared to the well-established bulk gene expression data, the expression distribution in single-cell data is significantly more heterogeneous [2-4]. Therefore, as shown in [5, 6], the bulk-analysis strategies [7, 8] achieve low sensitivity in detecting markers. In addition, as embedding [9-11] and clustering [12-14] are essential components in many single-cell expression analytical pipelines [15, 16], biomarker detection techniques need to tackle the challenges and errors arising from embedding and clustering [17, 18].

From the statistical point of view, the current state-of-the-art methods take two different directions in solving the single-cell biomarker discovery problem. The first direction uses non-parametric approaches [19]. Non-parametric approaches do not attempt to construct a model characterizing the gene expression distribution [20]. They do not require many prior assumptions about the expression data. Therefore, in theory, they could be applied in most of the heterogeneous scenarios in single-cell expression. For example, the Seurat [16] and SINCERA [21] pipelines use the Mann-Whitney test [22]. The disadvantages of non-parametric approaches include the lack of a point estimator (for example, we cannot tell how large the fold-change is when comparing the expression of the same gene in two populations) and a lower true positive rate [5, 6]. On the other hand, the parametric approaches model the underlying expression distribution. For example, [23] applies Bayesian statistics, Monocle2 [11, 24] and MAST [2] apply different linear models, and [25] applies Poisson models to single-cell differential expression analysis. The parametric approaches, compared to the non-parametric ones, are significantly more sensitive [5, 6], especially in detecting markers in small cell clusters, since they may require a smaller number of cell-sa...