Single-cell RNA-Seq (scRNA-Seq) profiles gene expression in individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show that UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures, such as log of counts per million, and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.
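To make the deviance-based feature selection concrete, the sketch below computes a per-gene binomial deviance against a constant-rate null (a standard approximation to the multinomial model described); genes with high deviance vary across cells more than library size alone explains. This is a minimal NumPy illustration, not the authors' reference implementation, and the helper name is ours.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-rate null.

    counts: (genes x cells) matrix of UMI counts. High-deviance genes
    deviate most from a null where each gene takes a fixed fraction of
    every cell's total counts; these are kept for dimension reduction.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=0)                 # total counts per cell
    pi = counts.sum(axis=1) / n.sum()      # null rate per gene

    def xlogy(x, y):
        # x * log(x / y), with the convention 0 * log(0) = 0
        out = np.zeros_like(x)
        nz = x > 0
        out[nz] = x[nz] * np.log(x[nz] / y[nz])
        return out

    mu = np.outer(pi, n)                   # expected counts per gene and cell
    term1 = xlogy(counts, mu)
    term2 = xlogy(n - counts, np.outer(1.0 - pi, n))
    return 2.0 * (term1 + term2).sum(axis=1)

# Example usage: keep the 2,000 highest-deviance genes before GLM-PCA
# Y = ...  (genes x cells UMI count matrix)
# keep = np.argsort(-binomial_deviance(Y))[:2000]
```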
Multiple algorithms are used to predict the impact of missense mutations on protein structure and function, using either algorithm-generated or manually curated sequence alignments. We compared the accuracy of SIFT, Align-GVGD, PolyPhen-2, and Xvar, each using its native alignment, in predicting the functional impact of well-characterized missense mutations (n = 267) within the BRCA1, MSH2, MLH1, and TP53 genes. We also evaluated the impact of the alignment employed on predictions from these algorithms (except Xvar) when each was supplied the same four alignments: those automatically generated by (1) SIFT, (2) PolyPhen-2, and (3) UniProt, and (4) a manually curated alignment tuned for Align-GVGD. The alignments differ in sequence composition and evolutionary depth. Data-based receiver operating characteristic curves employing the native alignment for each algorithm yield an area under the curve of 78-79% for all four algorithms. Predictions from the PolyPhen-2 algorithm were least dependent on the alignment employed. In contrast, Align-GVGD predicts all variants to be neutral when provided alignments with a large number of sequences. Of note, the algorithms make different predictions for the same variants even when provided the same alignment, and they do not necessarily perform best using their own alignment. Thus, researchers should consider optimizing both the algorithm and the sequence alignment employed in missense prediction.
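The accuracy comparison described here (ROC curves and area under the curve for labeled missense variants) can be reproduced with standard tooling. The sketch below uses scikit-learn; the file name and column names are hypothetical placeholders, and scores may need sign-flipping so that larger values always mean "more deleterious" (e.g., SIFT scores run in the opposite direction).

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical table: one row per missense variant, a known functional
# label (1 = deleterious, 0 = neutral), and one score column per algorithm,
# oriented so that higher scores indicate a more deleterious prediction.
variants = pd.read_csv("variant_predictions.csv")

for algo in ["SIFT", "Align-GVGD", "PolyPhen-2", "Xvar"]:
    auc = roc_auc_score(variants["deleterious"], variants[algo])
    fpr, tpr, _ = roc_curve(variants["deleterious"], variants[algo])
    print(f"{algo}: AUC = {auc:.2f}")
```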
TP53 is the most frequently altered gene in head and neck squamous cell carcinoma (HNSCC), with mutations occurring in over two-thirds of cases, but the prognostic significance of these mutations remains elusive. In the current study, we evaluated a novel computational approach termed Evolutionary Action (EAp53) to stratify patients with tumors harboring TP53 mutations as high or low risk, and validated this system in both in vivo and in vitro models. Patients with high-risk TP53 mutations had the poorest survival outcomes and the shortest time to the development of distant metastases. Tumor cells expressing high-risk TP53 mutations were more invasive and tumorigenic and exhibited a higher incidence of lung metastases. We also documented an association between the presence of high-risk mutations and decreased expression of TP53 target genes, highlighting key cellular pathways that are likely to be dysregulated by this subset of p53 mutations, which confer particularly aggressive tumor behavior. Overall, our work validated EAp53 as a novel computational tool that may be useful in the clinical prognosis of tumors harboring p53 mutations.
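The survival stratification described (EAp53 high-risk vs. low-risk groups) corresponds to a standard Kaplan-Meier comparison with a log-rank test. The following is a minimal sketch using the lifelines library under an assumed, hypothetical patient table; it illustrates the analysis pattern, not the authors' pipeline.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Hypothetical columns: time (months of follow-up), event (1 = death observed),
# risk ("high" or "low" according to the EAp53 score threshold).
df = pd.read_csv("hnscc_tp53_cohort.csv")
high = df[df["risk"] == "high"]
low = df[df["risk"] == "low"]

km = KaplanMeierFitter()
km.fit(high["time"], event_observed=high["event"], label="EAp53 high risk")
ax = km.plot_survival_function()
km.fit(low["time"], event_observed=low["event"], label="EAp53 low risk")
km.plot_survival_function(ax=ax)

# Log-rank test for a difference in survival between the two risk groups.
result = logrank_test(high["time"], low["time"],
                      event_observed_A=high["event"],
                      event_observed_B=low["event"])
print(f"log-rank p = {result.p_value:.3g}")
```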
TP53 is the most frequently altered gene in head and neck squamous cell carcinoma (HNSCC), with mutations occurring in over two-thirds of cases; however, the ability of these mutations to predict response to cisplatin-based therapy remains elusive. In the current study, we evaluate the ability of the Evolutionary Action score of TP53 coding variants (EAp53) to predict the impact of TP53 mutations on response to chemotherapy. The EAp53 approach clearly identifies a subset of high-risk TP53 mutations associated with decreased sensitivity to cisplatin both in vitro and in vivo in preclinical models of HNSCC. Furthermore, EAp53 can predict response to treatment and, more importantly, a survival benefit for a subset of head and neck cancer patients treated with platinum-based therapy. Prospective evaluation of this novel scoring system should enable more precise treatment selection for patients with HNSCC.
Background: In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Results: Methods that incorporate informative covariates are modestly more powerful than classic approaches and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, the total number of hypothesis tests, and the proportion of truly non-null hypotheses. Conclusions: Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and the informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
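As a concrete baseline for the "classic" procedures discussed, the sketch below applies Benjamini-Hochberg FDR control to a simulated vector of p values using statsmodels; the mixture of null and non-null tests is an illustrative assumption, and covariate-aware "modern" methods (e.g., IHW) are typically run through their own R/Bioconductor packages and are not shown here.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Simulated p values: a small fraction enriched near 0 ("non-null" tests)
# plus a uniform background ("null" tests). Purely illustrative.
rng = np.random.default_rng(0)
pvals = np.concatenate([
    rng.beta(0.2, 5.0, size=1_000),
    rng.uniform(size=9_000),
])

# Classic FDR control: Benjamini-Hochberg at a 5% target FDR.
rejected, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} discoveries at FDR 0.05")
```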
The rapid development of single-cell RNA-sequencing (scRNA-seq) technology, with increased sparsity compared to bulk RNA-sequencing (RNA-seq), has led to the emergence of many methods for preprocessing, including imputation methods. Here, we systematically evaluate the performance of 18 state-of-the-art scRNA-seq imputation methods using cell line and tissue data measured across experimental protocols. Specifically, we assess the similarity of imputed cell profiles to bulk samples as well as investigate whether methods recover relevant biological signals or introduce spurious noise in three downstream analyses: differential expression, unsupervised clustering, and inferring pseudotemporal trajectories. Broadly, we found significant variability in the performance of the methods across evaluation settings. While most scRNA-seq imputation methods recover biological expression observed in bulk RNA-seq data, the majority of the methods do not improve performance in downstream analyses compared to no imputation, in particular for clustering and trajectory analysis, and thus should be used with caution. Furthermore, we find that the performance of scRNA-seq imputation methods depends on many factors, including the experimental protocol, the sparsity of the data, the number of cells in the dataset, and the magnitude of the effect sizes. We summarize our results and provide a key set of recommendations for users and investigators to navigate the current space of scRNA-seq imputation methods.

Figure 1. Motivation and overview of benchmark evaluation of scRNA-seq imputation methods. (A) Dimension reduction results after applying Principal Components Analysis (PCA) from either no imputation method (no_imp, highlighted in red) or the 18 imputation methods using the null simulation data (Section 5.3), in which no structural pattern is expected. The color represents the simulated library size (defined as the total sum of counts across all relevant features) for each cell. (B) An overview of the benchmark comparison evaluating 18 scRNA-seq imputation methods.

… with model-based methods outperforming smoothing-based methods, in particular for genes with a small effect size (log2 fold-change) [14]. Another study found that imputation methods can introduce spurious correlations between imputed expression and total UMI counts [15]. Alternatively, others have shown spurious structural patterns in low-dimensional representations of imputed data [14, 18], which we also find in data where we expect no structural patterns but in which patterns associated with library size emerge after imputation (Figures 1A, S1). In contrast, others have found a subset of imputation methods to be helpful for estimating library size factors for normalization of sparse scRNA-seq data [16]. Therefore, the answer to the question of which methods can, let alone should, be used for a particular analysis is often unclear. To address this gap, we performed a systematic benchmark comparison and evaluation of 18 state-of-the-art scRNA-seq imputation methods...
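The null-simulation check behind Figure 1A (no structure is expected, yet imputation can introduce library-size-driven structure) can be sketched roughly as follows. The simulation parameters and the commented-out `impute` call are illustrative placeholders, not any specific method or simulation from the benchmark.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Null simulation: every cell draws counts from the same gene-level rates;
# only the library size (total counts per cell) differs between cells.
n_cells, n_genes = 500, 2_000
gene_rates = rng.dirichlet(np.full(n_genes, 0.1))
lib_size = rng.integers(1_000, 20_000, size=n_cells)
counts = np.vstack([rng.multinomial(n, gene_rates) for n in lib_size])

def embed(mat):
    # Log-normalize (counts per 10k) and project onto the first two PCs.
    logged = np.log1p(mat / mat.sum(axis=1, keepdims=True) * 1e4)
    return PCA(n_components=2).fit_transform(logged)

pcs_raw = embed(counts)
# pcs_imp = embed(impute(counts))   # placeholder: run any imputation method here

# Compare how strongly PC1 tracks library size before vs. after imputation;
# in a null simulation, strong post-imputation correlation is spurious structure.
print("raw:", np.corrcoef(pcs_raw[:, 0], lib_size)[0, 1])
# print("imputed:", np.corrcoef(pcs_imp[:, 0], lib_size)[0, 1])
```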