2022
DOI: 10.1101/2022.12.20.521311
Preprint

Supervised discovery of interpretable gene programs from single-cell data

Abstract: Factor analysis can drive biological discovery by decomposing single-cell gene expression data into a minimal set of gene programs that correspond to processes executed by cells in a sample. However, matrix factorization methods are prone to technical artifacts and poor factor interpretability. We have developed Spectra, an algorithm that identifies user-provided gene programs, modifies them to dataset context as needed, and detects novel programs that together best explain expression covariation. Spectra over…

Cited by 10 publications (13 citation statements)
References 99 publications
“…Inference schemes using EM (Expectation Maximisation) or VI (Variational Inference) for updates could be implemented, but they are usually impractical for large datasets. For example, a recent scRNA-seq NMF based method implements an EM algorithm but favours the use of ADAM (a gradient descent based method) due long run times [40]. A potential extension for intNMF, given its speed, is iterative learning - e .…”
Section: Discussion (mentioning)
confidence: 99%
“…intNMF is implemented with an l2 loss, this is a practical decision (similar to MOFA+'s recommendation to use a Gaussian loss for large datasets) due to the efficient update schemes available given an l2 loss. Other cost functions such as the Kullback-Leibler divergence or l1/Poisson could also be applied, which are likely to attend less to highly expressed genes ( [40,41]). Additionally there are flavours of NMF with probabilistic interpretations (primarily by adding sum to one constraints for each cell and each feature) [42], which may further enhance the interpretability of the models.…”
Section: Discussion (mentioning)
confidence: 99%
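
To make the contrast between cost functions in the statement above concrete, here is a minimal sketch of NMF with the classic multiplicative updates under either an l2 (Frobenius) loss or a generalized Kullback-Leibler loss. It is an illustration only, not the implementation used by intNMF, MOFA+, or Spectra; the function names `nmf_multiplicative` and `reconstruction_loss` and all parameters are hypothetical.

```python
import numpy as np


def reconstruction_loss(X, W, H, loss="l2", eps=1e-10):
    """Return the l2 (squared Frobenius) or generalized KL loss of X vs. W @ H."""
    WH = W @ H + eps
    if loss == "l2":
        return float(np.sum((X - WH) ** 2))
    if loss == "kl":
        # Generalized KL divergence; assumes X is non-negative.
        return float(np.sum(X * np.log((X + eps) / WH) - X + WH))
    raise ValueError(f"unknown loss: {loss}")


def nmf_multiplicative(X, k, loss="l2", n_iter=200, eps=1e-10, seed=0):
    """Factorize non-negative X (cells x genes) into W (cells x k) and H (k x genes).

    Hypothetical helper: uses multiplicative updates, which keep W and H
    non-negative and do not increase the chosen loss.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        if loss == "l2":
            H *= (W.T @ X) / (W.T @ W @ H + eps)
            W *= (X @ H.T) / (W @ H @ H.T + eps)
        elif loss == "kl":
            WH = W @ H + eps
            H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
            WH = W @ H + eps
            W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        else:
            raise ValueError(f"unknown loss: {loss}")
    return W, H
```

The two branches differ only in how residuals are weighted: the l2 update works with absolute differences, so large entries dominate, while the KL update works with ratios `X / WH`, which is one way to see why the quoted statement expects it to attend less to highly expressed genes.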
“…Cell-intrinsic gene programs can be identified by scoring single-cell data with gene sets that represent molecular pathways or transcriptional signatures. Scoring methods apply gene signature knowledge-bases using rank-based or factorization-based approaches 11,12,13 to obtain per-cell scores that facilitate the interpretation of cellular phenotypes across a dataset. For example, Spectra 13 is a Bayesian approach that can discern cell type-specific phenotypes from global programs while refining the definition of a gene set depending on the context.…”
Section: Introduction (mentioning)
confidence: 99%
“…Scoring methods apply gene signature knowledge-bases using rank-based or factorization-based approaches 11,12,13 to obtain per-cell scores that facilitate the interpretation of cellular phenotypes across a dataset. For example, Spectra 13 is a Bayesian approach that can discern cell type-specific phenotypes from global programs while refining the definition of a gene set depending on the context. Alternatively, to obtain gene programs that span across cell types, recent methods like DIALOGUE 14 , aim to uncover multicellular processes in single-cell RNAseq data.…”
Section: Introduction (mentioning)
confidence: 99%
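
As a rough illustration of the rank-based scoring that the statements above mention, the sketch below computes a per-cell score as the mean normalized within-cell rank of the genes in a set. It is a simplified stand-in, not the method of Spectra or of any of the cited tools; the function name `rank_based_score` and its arguments are hypothetical.

```python
import numpy as np


def rank_based_score(X, gene_names, gene_set):
    """Score each cell by the mean normalized rank of the genes in gene_set.

    X is a cells x genes expression matrix and gene_names labels its columns.
    Scores lie in [0, 1]; higher means the set's genes rank higher in that cell.
    Ties are broken by column order, which is a simplification.
    """
    X = np.asarray(X, dtype=float)
    idx = [i for i, g in enumerate(gene_names) if g in gene_set]
    if not idx:
        raise ValueError("no genes from gene_set are present in gene_names")
    # Double argsort turns expression values into within-cell ranks 0..n_genes-1.
    order = X.argsort(axis=1, kind="stable")
    ranks = order.argsort(axis=1, kind="stable")
    normalized = ranks / max(X.shape[1] - 1, 1)
    return normalized[:, idx].mean(axis=1)
```

For example, `rank_based_score(X, genes, {"CD3E", "CD3D"})` would return one score per row of `X`, assuming those gene symbols appear in `genes`.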
“…NMF-based approaches have also shown promising results in dealing with sparse SC samples (Welch et al, 2019; Argelaguet et al, 2020; Jung et al, 2020; Huizing et al, 2023). Furthermore, NMF-based methods have been used to jointly integrate SC data with molecular networks to identify types of SCs (Elyanow et al, 2020), to discover interpretable gene programs (Kunes et al, 2023) and to generate protein representations within various cellular contexts to identify therapeutic targets and nominate cell type contexts for rheumatoid arthritis and inflammatory bowel diseases (Li et al, 2023). Using matrix factorization approaches to integrate SC data with molecular networks (i.e., prior knowledge) allows us to benefit from the biologically relevant information in molecular networks and simultaneously minimize the inherent noisiness of SC data.…”
Section: Introduction (mentioning)
confidence: 99%