Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing), with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks that score exons, introns, and splice sites, each trained on a distinct large-scale genomics dataset. These modules are combined to predict the effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with performance matching or exceeding the state of the art. Our models, available in the repository Kipoi, apply to variants, including indels, directly from VCF files. Electronic supplementary material: The online version of this article (10.1186/s13059-019-1653-z) contains supplementary material, which is available to authorized users.
Aberrant splicing is a major cause of rare diseases. However, its prediction from genome sequence alone remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we develop FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. Including intron retention typically doubles the number of detected aberrant events and led to the identification of a pathogenic intron retention in MCOLN1 causing mucolipidosis. FRASER automatically controls for latent confounders, which are widespread and affect sensitivity substantially. Moreover, FRASER is based on a count distribution and multiple testing correction, which reduces the number of calls by two orders of magnitude compared with commonly applied z score cutoffs, with a minor loss of sensitivity. The application of FRASER to rare disease diagnostics is demonstrated by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.
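The contrast between a count distribution with multiple testing correction and a fixed z score cutoff can be made concrete with a minimal sketch. This is not FRASER's implementation. It assumes a beta-binomial model for split-read counts (the abstract says only "a count distribution"), hypothetical shape parameters `a` and `b`, and the Benjamini-Hochberg procedure as the correction. All function names are illustrative.

```python
import math


def betabinom_logpmf(k, n, a, b):
    """Log probability of k out of n reads under Beta-Binomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_num - log_den


def two_sided_pvalue(k, n, a, b):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower = sum(math.exp(betabinom_logpmf(i, n, a, b)) for i in range(0, k + 1))
    upper = sum(math.exp(betabinom_logpmf(i, n, a, b)) for i in range(k, n + 1))
    return min(1.0, 2.0 * min(lower, upper))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical junction: (split reads supporting it, total reads) per sample.
    counts = [(95, 100), (97, 100), (0, 100), (92, 100)]
    a, b = 20.0, 1.0  # assumed shape parameters, not fitted values
    pvals = [two_sided_pvalue(k, n, a, b) for k, n in counts]
    padj = benjamini_hochberg(pvals)
    for (k, n), p, q in zip(counts, pvals, padj):
        print(f"k={k} n={n} p={p:.3g} padj={q:.3g}")
```

In this sketch a sample is called aberrant only when its adjusted p-value falls below a chosen false discovery rate. Because the test accounts for read depth and for the number of tests performed, such a rule tends to yield fewer calls than a fixed z score threshold applied to each junction separately.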
Highlights
• Accurate quantification of 4,000 proteins in embryo nuclei by mass spectrometry
• Data reveal stoichiometric relationships between chromatin regulators and histones
• Repressive histone PTM abundances in embryos differ from those in cultured cells
• Copy number information enables quantitative analysis of chromatin state changes
With the increased use of long-read sequencing technologies for transcriptome analyses, there is a growing need to evaluate different methodologies, including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goal is to characterize the strengths and remaining challenges of using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets from human, mouse, and manatee samples using different protocols, followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy on a variety of metrics, using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed transcripts manually curated by GENCODE. We also describe plans for experimental validation of predictions that are specific to a platform or to a computational tool. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches for transcriptome analyses.
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
Aberrant splicing is a major cause of rare diseases, yet its prediction from genome sequence remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we developed FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. This typically doubles the number of detected aberrant events and identified a pathogenic intron retention in MCOLN1. FRASER automatically controls for latent confounders, which are widespread and substantially affect sensitivity. Moreover, FRASER is based on a count distribution and multiple testing correction, reducing the number of calls by two orders of magnitude over commonly applied z score cutoffs, with a minor sensitivity loss. The application to rare disease diagnostics is demonstrated by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.
Precision medicine and sequence‐based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype‐phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex‐seq and MaPSY) involved prediction of the effect of variants, primarily single‐nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high‐throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.