2023 · Preprint
DOI: 10.1101/2023.01.31.526427
Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction

Abstract: RNA splicing is an important post-transcriptional process of gene expression in eukaryotic organisms. Here, we developed a novel language model, SpliceBERT, pre-trained on the precursor messenger RNA sequences of 72 vertebrates to improve sequence-based modelling of RNA splicing. SpliceBERT is capable of generating embeddings that preserve the evolutionary information of nucleotides and functional characteristics of splice sites. Moreover, the pre-trained model can be utilized to prioritize potential splice-di…

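Since the abstract centers on extracting nucleotide embeddings from the pre-trained model, a minimal sketch of that step may be helpful. It assumes the released SpliceBERT checkpoint loads through HuggingFace transformers with a BERT-style tokenizer over space-separated nucleotides; the checkpoint path is a placeholder, not the authors' distribution path.

```python
# Minimal sketch: per-nucleotide embeddings from a BERT-style pre-mRNA
# language model such as SpliceBERT. Checkpoint path is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "/path/to/SpliceBERT-checkpoint"  # assumption: HF-compatible weights

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

# Assumption: the tokenizer expects space-separated single nucleotides,
# a common convention for nucleotide language models.
seq = "ACGUACGUACGUACGU"
inputs = tokenizer(" ".join(seq), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: one embedding vector per token (plus special tokens).
embeddings = outputs.last_hidden_state.squeeze(0)
print(embeddings.shape)  # (sequence length + special tokens, hidden_dim)
```

These per-nucleotide vectors are what downstream analyses (e.g., splice-site characterization or variant prioritization) would consume.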
Cited by 10 publications (14 citation statements) · References 73 publications (74 reference statements)
“…Notably, we elected to not fine-tune weights of the gLM on each downstream task, which is how gLMs have been previously benchmarked 23,24,26,30,40. While gLM performance would likely improve with fine-tuning, the scope of this study was to strictly gauge the knowledge of cis-regulatory biology learned during pre-training.…”
Section: Discussion
confidence: 99%
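The probing protocol this citing study describes, keeping the pre-trained gLM frozen and training only a lightweight head on its embeddings, can be sketched as follows. All class and parameter names here are illustrative, not the cited study's actual code, and the pooling choice is an assumption.

```python
# Sketch of a linear-probe setup over a frozen genomic language model (gLM).
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear classification head over mean-pooled, frozen gLM embeddings."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze: probe pre-trained knowledge only
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, **inputs):
        with torch.no_grad():                # no gradients through the backbone
            h = self.backbone(**inputs).last_hidden_state
        return self.head(h.mean(dim=1))      # mean-pool over tokens, then classify

# Only the head's parameters reach the optimizer:
# probe = LinearProbe(model, hidden_dim=512, n_classes=2)
# opt = torch.optim.Adam(probe.head.parameters(), lr=1e-3)
```

The design choice is deliberate: because the backbone never updates, downstream performance reflects only what was learned during pre-training.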
“…The utility of gLMs pre-trained on whole genomes for studying the non-coding genome has been limited. Previous benchmarks have largely considered gLMs that have been fine-tuned (i.e., adjusting the weights of the gLM) on each downstream task 23,24,26,30,40. In each benchmark, a fine-tuned gLM has demonstrated improved predictions on a host of downstream prediction tasks, often based on the classification of functional elements, such as histone marks or promoter annotations.…”
Section: Introduction
confidence: 99%
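For contrast with the frozen-probe sketch above, the fine-tuning regime this passage refers to updates every weight of the gLM on the downstream task. A minimal sketch, again with a placeholder checkpoint path and label count:

```python
# Contrasting sketch: full fine-tuning, where the gLM backbone is trainable.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/pretrained-gLM",  # placeholder checkpoint
    num_labels=2,               # e.g., functional element vs. background
)

# Every parameter, backbone included, is passed to the optimizer here.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative step (batch comes from a task-specific DataLoader):
# loss = model(**batch_inputs, labels=batch_labels).loss
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```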
“…Existing transformer models for predicting RNA splicing from sequence (e.g., SpliceBERT 8) have generally been trained on transcriptomic reference databases such as RefSeq or GENCODE, thus ignoring the large amount of information encoded by cell-type-specific RNA processing patterns. Additionally, long-read sequencing has emerged as the gold standard in accurately capturing the complexity of RNA splicing events, yet the current datasets derived from long-read sequencing are relatively small and primarily observational.…”
Section: Introduction
confidence: 99%