Variant effect prediction has traditionally focused on training supervised models on labeled data. Recent advances in natural language processing, however, have demonstrated substantial gains on diverse tasks from pre-training on large unlabeled datasets, and, motivated by these advances, unsupervised pre-training on massive databases of protein sequences has proven to be an effective approach to extracting complex information about proteins. Such models have been shown to learn variant effects in coding regions in a zero-shot manner. In a similar vein, we here introduce GPN (Genomic Pre-trained Network), a model that learns variant effects in non-coding DNA using unsupervised pre-training on genomic DNA sequence alone. Our model is also able to learn gene structure and DNA motifs without any supervision. We demonstrate the utility of GPN by showing that it outperforms supervised deep learning models such as DeepSEA trained on vast amounts of functional genomics data in Arabidopsis thaliana, a model organism for plant biology. Additionally, GPN trained on a single genome outperforms popular conservation scores such as phyloP and PhastCons, which are computed using aligned genomes from multiple species and can be used to predict the pathogenicity of variants that perturb highly conserved positions. We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone and learn corresponding zero-shot variant effects genome-wide.
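One common way to obtain a zero-shot variant effect score from a masked language model is to mask the variant position and compare the predicted probabilities of the alternate and reference alleles. The sketch below illustrates that idea only; the abstract does not state the exact scoring rule, so the log-likelihood ratio, the function name `zero_shot_variant_score`, and the example probabilities are all assumptions made for illustration.

```python
import math

def zero_shot_variant_score(masked_probs, ref, alt):
    """Log-likelihood ratio log(P(alt) / P(ref)) at a masked position.

    masked_probs: dict mapping each nucleotide to the probability a
    (hypothetical) pre-trained model assigns to it when the position is masked.
    Negative scores mean the model finds the alternate allele less plausible
    than the reference in this sequence context.
    """
    p_ref = masked_probs[ref]
    p_alt = masked_probs[alt]
    if p_ref <= 0 or p_alt <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_alt / p_ref)

# Hypothetical model output for one masked position (not real data).
probs = {"A": 0.70, "C": 0.10, "G": 0.15, "T": 0.05}
score = zero_shot_variant_score(probs, ref="A", alt="T")
print(f"A>T score: {score:.3f}")
```

Because the score depends only on the model's predicted distribution, no labeled variants are needed, which is what makes the approach zero-shot.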
Although alternative splicing is a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. Here, we undertake the analysis of alternative splicing across numerous diverse murine cell types from two large-scale single-cell datasets, the Tabula Muris and the BRAIN Initiative Cell Census Network, while accounting for understudied technical artifacts and unannotated events. We find strong and general cell-type-specific alternative splicing, complementary to total gene expression but of similar discriminatory value, and identify a large volume of novel splicing events. We specifically highlight splicing variation across different cell types in primary motor cortex neurons, bone marrow B cells, and various epithelial cells, and we show that the implicated transcripts include many genes that do not display total expression differences. To elucidate the regulation of alternative splicing, we build a custom predictive model based on splicing factor activity, recovering several known interactions while generating new hypotheses, including potential regulatory roles for novel alternative splicing events in critical genes like Khdrbs3 and Rbfox1. We make our results available using public interactive browsers to spur further exploration by the community.
Ribosome profiling quantifies translation genome-wide by sequencing ribosome-protected fragments, or footprints. Its single-codon resolution allows identification of translation regulation, such as ribosome stalls or pauses, on individual genes. However, enzyme preferences during library preparation lead to pervasive sequence artifacts that obscure translation dynamics. Widespread over- and under-representation of ribosome footprints can dominate local footprint densities and skew estimates of elongation rates by up to fivefold. To address these biases and uncover true patterns of translation, we present choros, a computational method that models ribosome footprint distributions to provide bias-corrected footprint counts. choros uses negative binomial regression to accurately estimate two sets of parameters: (i) biological contributions from codon-specific translation elongation rates; and (ii) technical contributions from nuclease digestion and ligation efficiencies. We use these parameter estimates to generate bias correction factors that eliminate sequence artifacts. Applying choros to multiple ribosome profiling datasets, we are able to accurately quantify and attenuate ligation biases to provide more faithful measurements of ribosome distribution. We show that a pattern interpreted as pervasive ribosome pausing near the beginning of coding regions is likely to arise from technical biases. Incorporating choros into standard analysis pipelines will improve biological discovery from measurements of translation.
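To make the idea of a bias correction factor concrete, the sketch below assumes a count model with a log link, in which biological and technical terms add on the log scale. Under that assumption, dividing an observed count by the exponentiated sum of the technical coefficients removes their multiplicative effect. This is an illustration only, not the choros implementation: the function name, the coefficient values, and the exact form of the correction are hypothetical.

```python
import math

def corrected_count(observed, technical_log_effects):
    """Remove multiplicative technical bias from an observed footprint count.

    Assumes log(expected count) = biological terms + technical terms, so the
    technical contribution is a factor exp(sum of technical coefficients).
    """
    bias_factor = math.exp(sum(technical_log_effects))
    return observed / bias_factor

# Hypothetical fitted coefficients (log scale) for one footprint:
# a 5' end sequence that is over-represented 2-fold and
# a 3' end sequence that is over-represented 1.5-fold.
five_prime_effect = math.log(2.0)
three_prime_effect = math.log(1.5)

observed = 120
corrected = corrected_count(observed, [five_prime_effect, three_prime_effect])
print(f"observed={observed}, corrected={corrected:.1f}")
```

In this toy case the two technical effects together inflate the count about threefold, so the corrected value is roughly a third of the observed one.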
Although isoform diversity is acknowledged as a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. To address this issue, we have developed a suite of computational tools to investigate isoform variation by focusing on splice junction usage patterns, which can often be well characterized in spite of technical difficulties. Our method, which we name scQuint (Single-Cell QUantification of INTrons), can perform accurate quantification, dimensionality reduction, and differential splicing analysis using short-read, full-length single-cell RNA-seq data. Notably, scQuint does not require transcriptome annotations and is robust to technical artifacts. In applications across diverse mouse tissues from Tabula Muris and the primary motor cortex from the BRAIN Initiative Cell Census Network, we find evidence of strong cell-type-specific isoform variation, complementary to total gene expression, and also identify a large volume of previously unannotated splice junctions. As a community resource, we provide ways to interactively visualize and explore these results (https://github.com/songlab-cal/scquint-analysis/).
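Splice junction usage can be summarized as the share of junction-spanning reads that each intron receives within a group of alternative introns. The sketch below shows that calculation in plain Python. How introns are grouped (for example, by a shared splice site) is an assumption here, and the function name `intron_usage`, the identifiers, and the counts are hypothetical rather than taken from scQuint.

```python
from collections import defaultdict

def intron_usage(counts):
    """Convert junction read counts into within-group usage proportions.

    counts: dict mapping (group_id, intron_id) -> read count for one cell
    or one cell type. Groups with no reads are skipped because their
    proportions are undefined.
    """
    totals = defaultdict(int)
    for (group, _intron), n in counts.items():
        totals[group] += n
    return {
        key: n / totals[key[0]]
        for key, n in counts.items()
        if totals[key[0]] > 0
    }

# Hypothetical junction counts: two alternative introns in group g1,
# and one intron in group g2 with no coverage.
counts = {("g1", "i1"): 30, ("g1", "i2"): 10, ("g2", "i3"): 0}
usage = intron_usage(counts)
print(usage)
```

Working with proportions within a group, rather than raw counts, separates splicing changes from changes in total gene expression.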