Motivation
Recent developments of spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue microenvironments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task. Their high computational complexity limited their scalability to the latest and future large-scale spatial expression data.
Results
We present SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian process to fit the node-level spatial gene expression to identify SVgenes. Experiments show that SOMDE is about 5-50 times faster than existing methods with comparable results. The adjustable resolution of SOMDE makes it the only method that can give results in ∼5 minutes in large datasets of more than 20,000 sequencing sites. SOMDE is available as a python package on PyPI at https://pypi.org/project/somde free for academic use.
Supplementary information
Supplementary data are available at Bioinformatics online.
Summary
Clustering is a key step in revealing heterogeneities in single-cell data. Most existing single-cell clustering methods output a fixed number of clusters without the hierarchical information. Classical hierarchical clustering provides dendrograms of cells, but cannot scale to large datasets due to high computational complexity. We present HGC, a fast Hierarchical Graph-based Clustering tool to address both problems. It combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest-neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity. Experiments showed that HGC enables multiresolution exploration of the biological hierarchy underlying the data, achieves state-of-the-art accuracy on benchmark data, and can scale to large datasets.
Availability and implementation
The R package of HGC is available at https://bioconductor.org/packages/HGC/.
Supplementary information
Supplementary data are available at Bioinformatics online.
This perspective discusses the need and directions for developing a unified information framework to enable assembling cell atlases and revolutionizing medical research on the virtual body of assembled cell systems.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.