Gene expression is orchestrated by distinct regulatory regions to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what are their associated features and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks, with the ultimate goal of identifying driving regions and associated genomic variations that are clinically relevant in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples (even often to cell lines) and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, inference of TF combinations from this type of models is not straightforward. Furthermore these methods are not designed to capture regulation instructions present at the sequence level, before the binding of regulators or the opening of the chromatin. Here, we probe sequence-level instructions for gene expression and develop a method to explain mRNA levels based solely on nucleotide features. Our method positions nucleotide composition as a critical component of gene expression. Moreover, our approach, able to rank regulatory regions according to their contribution, unveils a strong influence of the gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains.
Rapid and transient down-regulation of glycolytic mRNAs was observed after replenishment of nitrogen-starved yeast. Glucose sensing, protein elongation, nitrogen metabolism, and TOR signaling are factors affecting glycolytic mRNA stability via carbon/nitrogen cross-talk. Destabilization occurs by the general mRNA decay pathway.
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
Background The development of RNA sequencing (RNAseq) and the corresponding emergence of public datasets have created new avenues of transcriptional marker search. The long non-coding RNAs (lncRNAs) constitute an emerging class of transcripts with a potential for high tissue specificity and function. Therefore, we tested the biomarker potential of lncRNAs on Mesenchymal Stem Cells (MSCs), a complex type of adult multipotent stem cells of diverse tissue origins, that is frequently used in clinics but which is lacking extensive characterization. Results We developed a dedicated bioinformatics pipeline for the purpose of building a cell-specific catalogue of unannotated lncRNAs. The pipeline performs ab initio transcript identification, pseudoalignment and uses new methodologies such as a specific k-mer approach for naive quantification of expression in numerous RNAseq data. We next applied it on MSCs, and our pipeline was able to highlight novel lncRNAs with high cell specificity. Furthermore, with original and efficient approaches for functional prediction, we demonstrated that each candidate represents one specific state of MSCs biology. Conclusions We showed that our approach can be employed to harness lncRNAs as cell markers. More specifically, our results suggest different candidates as potential actors in MSCs biology and propose promising directions for future experimental investigations.
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure of functional information allowing to quantify the expression of known or novel transcripts in tissues. However, transcript quantification commonly relies on alignment methods requiring a lot of computational resources and processing time, which does not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify accurately gene expression in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers into RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measure of gene expression with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
Background: Using the Cap Analysis of Gene Expression technology, the FANTOM5 consortium provided one of the most comprehensive maps of Transcription Start Sites (TSSs) in several species. Strikingly, ∼ 72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers.Results: Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at short tandem repeats (STRs) corresponding to homopolymers of thymidines (T). Additional analyses confirm that these CAGEs are truly associated with transcriptionally active chromatin marks. Furthermore, we train a sequence-based deep learning model able to predict CAGE signal at T STRs with high accuracy (∼ 81%). Extracting features learned by this model reveals that transcription at T STRs is mostly directed by STR length but also instructions lying in the downstream sequence. Excitingly, our model also predicts that genetic variants linked to human diseases affect this STR-associated transcription.Conclusions: Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism. We also provide a new metric that can be considered in future studies of STR-related complex traits.
Anaplastic large cell lymphomas associated with ALK translocation have a good outcome after CHOP treatment; however, the 2-year relapse rate remains at 30%. Microarray gene-expression profiling of 48 samples obtained at diagnosis was used to identify 47 genes that were differentially expressed between patients with early relapse/progression and no relapse. In the relapsing group, the most significant overrepresented genes were related to the regulation of the immune response and T-cell activation while those in the non-relapsing group were involved in the extracellular matrix. Fluidigm technology gave concordant results for 29 genes, of which FN1, FAM179A, and SLC40A1 had the strongest predictive power after logistic regression and two classification algorithms. In parallel with 39 samples, we used a Kallisto/Sleuth pipeline to analyze RNA sequencing data and identified 20 genes common to the 28 genes validated by Fluidigm technology—notably, the FAM179A and FN1 genes. Interestingly, FN1 also belongs to the gene signature predicting longer survival in diffuse large B-cell lymphomas treated with CHOP. Thus, our molecular signatures indicate that the FN1 gene, a matrix key regulator, might also be involved in the prognosis and the therapeutic response in anaplastic lymphomas.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.