The extent to which gene fusions function as drivers of cancer remains a critical open question. Current algorithms do not sufficiently identify false-positive fusions arising during library preparation, sequencing, and alignment. Here, we introduce Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), an algorithm that uses statistical modeling to minimize false positives while increasing the sensitivity of fusion detection. In 9,946 tumor RNA-sequencing datasets from The Cancer Genome Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 fusions, 30% more than identified by other methods, while calling 10-fold fewer false-positive fusions in nontransformed human tissues. We leverage the increased precision of DEEPEST to discover fundamental cancer biology. Namely, we identify 888 candidate oncogenes based on overrepresentation in DEEPEST calls, along with 1,078 previously unreported fusions involving long intergenic noncoding RNAs, demonstrating a previously unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes in cancers, including ovarian cancer, which has had minimal treatment advances in recent decades: more than 50% of these tumors harbor gene fusions predicted to be oncogenic. Specific protein domains are enriched in DEEPEST calls, indicating a global selection for fusion functionality: kinase domains are nearly 2-fold more enriched in DEEPEST calls than expected by chance, as are domains involved in (anaerobic) metabolism and DNA binding. The statistical algorithms, population-level analytic framework, and the biological conclusions of DEEPEST call for increased attention to gene fusions as drivers of cancer and for future research into using fusions for targeted therapy.
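The abstract above reports genes and protein domains as overrepresented relative to chance without stating the test used. As a rough illustration only, the following sketch computes a fold enrichment and a one-sided binomial tail probability. It is a generic textbook test, not necessarily the statistical model used by DEEPEST, and the function names and all numbers are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fold_enrichment(observed: int, total: int, expected_rate: float) -> float:
    """Return the observed fraction divided by the fraction expected by chance."""
    return (observed / total) / expected_rate


# Hypothetical counts, chosen only for illustration (not taken from the paper):
n_fusions = 200        # fusion calls examined
n_with_feature = 12    # calls involving the gene or domain of interest
expected_rate = 0.02   # assumed chance rate under a null model

fold = fold_enrichment(n_with_feature, n_fusions, expected_rate)
p_value = binomial_upper_tail(n_with_feature, n_fusions, expected_rate)
print(f"fold enrichment: {fold:.2f}, one-sided p-value: {p_value:.2e}")
```

In a real analysis the null rate would need to come from an explicit background model, and testing many genes or domains at once would call for a multiple-testing correction.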
Precise splice junction calls are currently unavailable in scRNA-seq pipelines such as the 10x Chromium platform but are critical for understanding single-cell biology. Here, we introduce SICILIAN, a new method that assigns statistical confidence to splice junctions from a spliced aligner to improve precision. SICILIAN is a general method that can be applied to bulk or single-cell data, but has particular utility for single-cell analysis due to that data’s unique challenges and opportunities for discovery. SICILIAN’s precise splice detection achieves high accuracy on simulated data, improves concordance between matched single-cell and bulk datasets, and increases agreement between biological replicates. SICILIAN detects unannotated splicing in single cells, enabling the discovery of novel splicing regulation through single-cell analysis workflows.
Short Abstract: The extent to which gene fusions function as drivers of cancer remains a critical open question. Current algorithms do not sufficiently identify false-positive fusions arising during library preparation, sequencing, and alignment. Here, we introduce a new algorithm, DEEPEST, that uses statistical modeling to minimize false positives while increasing the sensitivity of fusion detection. In 9,946 tumor RNA-sequencing datasets from The Cancer Genome Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 fusions, 30% more than identified by other methods, while calling ten-fold fewer false-positive fusions in non-transformed human tissues. We leverage the increased precision of DEEPEST to discover new cancer biology. For example, we identify 888 new candidate oncogenes based on over-representation in DEEPEST-Fusion calls, along with 1,078 previously unreported fusions involving long intergenic noncoding RNA partners, demonstrating a previously unappreciated prevalence and potential for function. Specific protein domains are enriched in DEEPEST calls, demonstrating a global selection for fusion functionality: kinase domains are nearly 2-fold more enriched in DEEPEST calls than expected by chance, as are domains involved in (anaerobic) metabolism and DNA binding. DEEPEST also reveals a high enrichment for fusions involving known and novel oncogenes in diseases including ovarian cancer, which has had minimal treatment advances in recent decades: more than 50% of these tumors harbor gene fusions predicted to be oncogenic. The statistical algorithms, population-level analytic framework, and the biological conclusions of DEEPEST call for increased attention to gene fusions as drivers of cancer and for future research into using fusions for targeted therapy.
Significance: Gene fusions are tumor-specific genomic aberrations and are among the most powerful biomarkers and drug targets in translational cancer biology. The advent of RNA-Seq technologies over the past decade has provided a unique opportunity to detect novel fusions by deploying computational algorithms on public sequencing databases. Yet precise fusion detection algorithms are still out of reach. We develop DEEPEST, a highly specific and efficient statistical pipeline designed for mining massive sequencing databases, and apply it to all 33 tumor types and 10,500 samples in The Cancer Genome Atlas database. We systematically profile the landscape of detected fusions using classic statistical models and identify several signatures of selection for fusions in tumors.
Software availability: The DEEPEST-Fusion workflow, with a detailed readme file, is available as a GitHub repository: https://github.com/salzmanlab/DEEPEST-Fusion. In addition to the main workflow, which is based on CWL, the repository provides example input and batch scripts (for job submission on local clusters) and code for building and querying the SBT files. All custom scripts used for systematic analysis of fusions are also available in the s...
Next-generation sequencing has led to the generation of petabytes of public data with the potential to significantly advance biomedical research. The Cancer Genome Atlas (TCGA) network alone, for example, has produced more than 2.5 petabytes of data. The logistical difficulties that researchers face while accessing such large datasets continue to present challenges, however. Downloading the complete TCGA dataset to a local data store can take several weeks or more, and, traditionally, integrated analysis has required resources available only to a limited number of researchers with access to large institutional compute clusters. In 2015, the National Cancer Institute (NCI) launched three Cancer Genomics Cloud Pilots, including the Seven Bridges Cancer Genomics Cloud (CGC; cancergenomicscloud.org), to democratize access to datasets such as TCGA by colocalizing data and computational resources in the cloud. In 2017, NCI expanded this effort to the development of an NCI Cancer Research Data Commons in which the CGC and other Cloud Pilots, now known as Cloud Resources, continue to deliver cloud-based access to petabyte-scale data and analysis resources. The Seven Bridges CGC is a customizable and scalable data access and analysis platform that connects users via the web to extensive public datasets, including multi-omic data from TCGA, the Simons Genome Diversity Project, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative, the International Cancer Genome Consortium (ICGC), the Cancer Cell Line Encyclopedia, The Cancer Imaging Archive (TCIA), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The CGC enables collaborative, reproducible analysis across both public and private cohorts through access to customizable workspaces, a public toolkit containing more than 300 common analytical tools and workflows, and additional resources including an open-source Software Development Kit known as Rabix. 
Since the launch of the CGC in early 2016, more than 2,500 researchers from more than 150 institutions in 30 countries have used the platform to deploy more than 5,000 applications and perform analyses representing more than 100 years of computation time. To illustrate the potential of the CGC to provide a customizable and scalable research environment, we present a collaborative project that enables unprecedented precision in the detection of gene fusions and splice variants using a novel statistical algorithm called Machete. We describe how this software was refactored to optimize deployment to the cloud for cost-effective analysis of thousands of samples at scale. We also provide benchmarking results that demonstrate the substantial savings in wall-clock time that can be obtained by processing large datasets on the cloud.
Citation Format: Milos Jordanski, Robert Bierman, Erik Lehnert, Ana Damljanovic, Eric Freeman, Gillian Hsieh, Julia Salzman. The Seven Bridges Cancer Genomics Cloud: Enabling reproducible and cost-effective analysis in the cloud [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 5386.