Supervised discovery of interpretable gene programs from single-cell data

Kunes, Russell; Walle, Thomas; Nawy, Tal; Pe’er, Dana

doi:10.1101/2022.12.20.521311

Cited by 10 publications

(13 citation statements)

References 99 publications


“…Inference schemes using EM (Expectation Maximisation) or VI (Variational Inference) for updates could be implemented, but they are usually impractical for large datasets. For example, a recent scRNA-seq NMF based method implements an EM algorithm but favours the use of ADAM (a gradient descent based method) due long run times [40]. A potential extension for intNMF, given its speed, is iterative learning - e .…”

Section: Discussion (mentioning)

confidence: 99%

“…intNMF is implemented with an l2 loss, this is a practical decision (similar to MOFA+'s recommendation to use a Gaussian loss for large datasets) due to the efficient update schemes available given an l2 loss. Other cost functions such as the Kullback-Leibler divergence or l1/Poisson could also be applied, which are likely to attend less to highly expressed genes ([40,41]). Additionally there are flavours of NMF with probabilistic interpretations (primarily by adding sum to one constraints for each cell and each feature) [42], which may further enhance the interpretability of the models.…”

Section: Discussion (mentioning)

confidence: 99%


Scalable joint non-negative matrix factorisation for paired single cell gene expression and chromatin accessibility data

Morgans, Sharrocks, Iqbal, 2023. Preprint


Single cell multi-modal technologies provide powerful means to simultaneously profile components of the gene regulatory pathways of individual cells. These are now being employed to study gene regulatory mechanisms in a variety of biological systems. Tailored computational methods for integration and analysis of these data are much needed, with desirable properties in terms of efficiency (to cope with high dimensionality of the data), interpretability (for downstream biological discovery and hypothesis generation), and flexibility (to be able to easily incorporate future modalities). Existing methods cover some but not all of the desirable properties for effective integration of these data. Here we present a highly efficient method, intNMF, for representation and integration of single cell multi-modal data using joint non-negative matrix factorisation which can facilitate discovery of linked regulatory topics in each modality. We provide thorough benchmarking using large publicly available datasets against five popular existing methods. intNMF performs comparably against the current state-of-the-art, and provides advantages in terms of computational efficiency and interpretability of discovered regulatory topics in the original feature space. We illustrate this enhanced interpretability in providing insights into cell state changes associated with Alzheimer’s disease. intNMF is available as a Python package with extensive documentation and use-cases at https://github.com/wmorgans/quick_intNMF



“…Cell-intrinsic gene programs can be identified by scoring single-cell data with gene sets that represent molecular pathways or transcriptional signatures. Scoring methods apply gene signature knowledge-bases using rank-based or factorization-based approaches 11,12,13 to obtain per-cell scores that facilitate the interpretation of cellular phenotypes across a dataset. For example, Spectra 13 is a Bayesian approach that can discern cell type-specific phenotypes from global programs while refining the definition of a gene set depending on the context.…”

Section: Introduction (mentioning)

confidence: 99%

“…Scoring methods apply gene signature knowledge-bases using rank-based or factorization-based approaches 11,12,13 to obtain per-cell scores that facilitate the interpretation of cellular phenotypes across a dataset. For example, Spectra 13 is a Bayesian approach that can discern cell type-specific phenotypes from global programs while refining the definition of a gene set depending on the context. Alternatively, to obtain gene programs that span across cell types, recent methods like DIALOGUE 14, aim to uncover multicellular processes in single-cell RNAseq data.…”

Section: Introduction (mentioning)

confidence: 99%

Identification of cell types, states and programs by learning gene set representations

Hediyeh-zadeh, Whitfield, Kharbanda et al., 2023. Preprint


As single cell molecular data expand, there is an increasing need for algorithms that efficiently query and prioritize gene programs, cell types and states in single-cell sequencing data, particularly in cell atlases. Here we present scDECAF, a statistical learning algorithm to identify cell types, states and programs in single-cell gene expression data using vector representation of gene sets, which improves biological interpretation by selecting a subset of most biologically relevant programs. We applied scDECAF to scRNAseq data from PBMC, Lung, Pancreas, Brain and slide-tags snRNA of human prefrontal cortex for automatic cell type annotation. We demonstrate that scDECAF can recover perturbed gene programs in Lupus PBMC cells stimulated with IFNbeta and TGFBeta-induced cells undergoing epithelial-to-mesenchymal transition. scDECAF delineates patient-specific heterogeneity in cellular programs in Ovarian Cancer data. Using a healthy PBMC reference, we apply scDECAF to a mapped query PBMC COVID-19 case-control dataset and identify multicellular programs associated with severe COVID-19. scDECAF can improve biological interpretation and complement reference mapping analysis, and provides a method for gene set and pathway analysis in single cell gene expression data.


“…NMF-based approaches have also shown promising results in dealing with sparse SC samples (Welch et al, 2019; Argelaguet et al, 2020; Jung et al, 2020; Huizing et al, 2023). Furthermore, NMF-based methods have been used to jointly integrate SC data with molecular networks to identify types of SCs (Elyanow et al, 2020), to discover interpretable gene programs (Kunes et al, 2023) and to generate protein representations within various cellular contexts to identify therapeutic targets and nominate cell type contexts for rheumatoid arthritis and inflammatory bowel diseases (Li et al, 2023). Using matrix factorization approaches to integrate SC data with molecular networks (i.e., prior knowledge) allows us to benefit from the biologically relevant information in molecular networks and simultaneously minimize the inherent noisiness of SC data.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-omics integration of scRNA-seq time series data predicts new intervention points for Parkinson’s disease

Mihajlović, Ceddia, Malod-Dognin et al., 2023. Preprint


Parkinson’s disease (PD) is a complex neurodegenerative disorder without a cure. The onset of PD symptoms corresponds to 50% loss of midbrain dopaminergic (mDA) neurons, limiting early-stage understanding of PD. To shed light on early PD development, we study time series scRNA-seq datasets of mDA neurons obtained from patient-derived induced pluripotent stem cell differentiation. We develop a new data integration method based on Non-negative Matrix Tri-Factorization that integrates these datasets with molecular interaction networks, producing condition-specific “gene embeddings”. By mining these embeddings, we predict 193 PD-related genes that are largely supported (49.7%) in the literature and are specific to the investigated PINK1 mutation. Enrichment analysis in Kyoto Encyclopedia of Genes and Genomes pathways highlights 10 PD-related molecular mechanisms perturbed during early PD development. Finally, investigating the top 20 prioritized genes reveals 12 previously unrecognized genes associated with PD that represent interesting drug targets.


