2023
DOI: 10.1101/2023.07.21.550107
Preprint

ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

Abstract: In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To …

Citations: cited by 10 publications (18 citation statements)
References: 40 publications
“…Differences between callback and ClusterDE. Both callback and ClusterDE [6] use synthetic null variables and the knockoff filter. The key distinction between these methods is that ClusterDE takes given cell clusters and computes knockoff data to calibrate statistical null hypothesis tests between those clusters, while callback computes knockoff data on the full dataset first and uses the augmented data matrix as input to the clustering algorithm in order to calibrate the choice of clusters.…”
Section: Overview Of the Callback Algorithm (mentioning)
confidence: 99%
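
The statement above says both methods rely on a knockoff filter. As a rough illustration, the selection step of a generic knockoff+ filter can be sketched as below. This is not taken from ClusterDE or callback: the function name `knockoff_plus_threshold`, the contrast scores `w`, and the target level `q` are hypothetical, and the sketch assumes each feature already has a contrast score that is large and positive when the real data show more signal than the synthetic null.

```python
import numpy as np

def knockoff_plus_threshold(w, q):
    """Generic knockoff+ threshold (illustrative, not either package's code).

    w : contrast scores, one per feature (positive = real data beat the null).
    q : target false discovery rate.
    Returns the smallest t whose estimated FDP is <= q, or inf if none exists.
    """
    candidates = np.sort(np.abs(w[w != 0]))
    for t in candidates:
        # Negative scores beyond -t estimate how many false positives sit above t.
        fdp_hat = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_hat <= q:
            return t
    return np.inf

# Hypothetical contrast scores for 12 genes.
w = np.array([5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -0.5, 0.4, -0.3, 0.2, 6.0, 4.5])
t = knockoff_plus_threshold(w, q=0.2)
selected = w >= t
```

With these made-up scores, the features whose score is at or above the returned threshold are the ones reported as discoveries. Where the synthetic null enters the pipeline (after clustering or before it) is the distinction the quoted statement draws between the two methods.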
“…The most commonly used software packages, such as [4] and [5], perform these two steps on the same dataset. This double use of data is often referred to as “circular analysis” or “double-dipping,” and is known to result in highly inflated P-values, even in the null case when gene expression is identically distributed and there are no true groupings that distinguish cell populations [6, 7]. Due to the miscalibrated test statistics produced by circular analyses, it is challenging to assess whether the genes found to be differentially expressed between two putative cell groups are “real” or solely identified due to chance based on the way that the cells are being partitioned by the clustering algorithm that is being used.…”
Section: Main (mentioning)
confidence: 99%
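
The null case described in the statement above can be reproduced with a small simulation. The sketch below is illustrative only and assumes a simplified setting: Gaussian noise instead of scRNA-seq counts, a hand-written 2-means in place of a real clustering pipeline, and a two-sample z-test (normal approximation to Welch's t-test) as the DE test. All names (`two_means`, `count_de_genes`) are hypothetical.

```python
import math
import numpy as np

def two_means(X, rng, n_iter=50):
    """Minimal 2-means (Lloyd's algorithm); returns a 0/1 label per cell."""
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def count_de_genes(X, labels, alpha=0.05):
    """Count genes with two-sided p < alpha between the two label groups."""
    a, b = X[labels == 0], X[labels == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    z = (a.mean(axis=0) - b.mean(axis=0)) / se
    # Two-sided p-value under a standard normal reference distribution.
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return int((pvals < alpha).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))           # one homogeneous population, no true DE genes

labels_dd = two_means(X, rng)            # clusters defined from the same data
labels_rand = rng.permutation(labels_dd) # same group sizes, unrelated to the data

n_double = count_de_genes(X, labels_dd)    # double dipping
n_random = count_de_genes(X, labels_rand)  # properly calibrated baseline
print(n_double, n_random)
```

Every gene here is null, so any "DE gene" is a false positive. Testing between data-derived clusters flags many more genes than testing between the shuffled labels, which is the miscalibration the statement refers to.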
“…We highly rank genes that are close to zero expression in all other cell types and close to uniform expression in the cell type of interest. Additionally, conventional methods for detecting differential gene expression between clusters often return inflated p-values because of the double use of gene expression data, first to partition the data into clusters and then to define significance statistics along the same partitions [50]. Consequently, filtering based on p-value alone results in an increased rate of false positives, and some pipelines have turned to gene rankings instead of cutoffs [60].…”
Section: Capturing Rankings Based On Embedded Distances From Syntheti... (mentioning)
confidence: 99%
“…We highlight the importance of genes that are highly localized on the cellular manifold versus genes that are diffusely expressed, hypothesizing that such genes are beneficial in characterizing different areas of the manifold informatively. We also show that differential localization has several advantages compared to standard cluster-based differential expression pipelines, including reduced false positives [50] and the identification of genes with patterns of localization that do not conform to cluster boundaries [27].…”
Section: Introduction (mentioning)
confidence: 99%