2023 · Preprint
DOI: 10.1101/2023.01.31.526427
Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction

Abstract: RNA splicing is an important post-transcriptional process of gene expression in eukaryotic organisms. Here, we developed a novel language model, SpliceBERT, pre-trained on the precursor messenger RNA sequences of 72 vertebrates to improve sequence-based modelling of RNA splicing. SpliceBERT is capable of generating embeddings that preserve the evolutionary information of nucleotides and functional characteristics of splice sites. Moreover, the pre-trained model can be utilized to prioritize potential splice-di…

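Since the abstract centers on extracting nucleotide embeddings from the pre-trained model, a minimal sketch of that step may be helpful. It assumes the released SpliceBERT checkpoint loads through HuggingFace transformers with a BERT-style tokenizer over space-separated nucleotides; the checkpoint path is a placeholder, not the authors' distribution path.

```python
# Minimal sketch: per-nucleotide embeddings from a BERT-style pre-mRNA
# language model such as SpliceBERT. Checkpoint path is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "/path/to/SpliceBERT-checkpoint"  # assumption: HF-compatible weights

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

# Assumption: the tokenizer expects space-separated single nucleotides,
# a common convention for nucleotide language models.
seq = "ACGUACGUACGUACGU"
inputs = tokenizer(" ".join(seq), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: one embedding vector per token (plus special tokens).
embeddings = outputs.last_hidden_state.squeeze(0)
print(embeddings.shape)  # (sequence length + special tokens, hidden_dim)
```

These per-nucleotide vectors are what downstream analyses (e.g., splice-site characterization or variant prioritization) would consume.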
Cited by 10 publications (14 citation statements) · References 73 publications (74 reference statements)
“…Notably, we elected to not fine-tune weights of the gLM on each downstream task, which is how gLMs have been previously benchmarked 23,24,26,30,40. While gLM performance would likely improve with fine-tuning, the scope of this study was to strictly gauge the knowledge of cis-regulatory biology learned during pre-training.…”
Section: Discussion
confidence: 99%
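The probing protocol this citing study describes, keeping the pre-trained gLM frozen and training only a lightweight head on its embeddings, can be sketched as follows. All class and parameter names here are illustrative, not the cited study's actual code, and the pooling choice is an assumption.

```python
# Sketch of a linear-probe setup over a frozen genomic language model (gLM).
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear classification head over mean-pooled, frozen gLM embeddings."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze: probe pre-trained knowledge only
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, **inputs):
        with torch.no_grad():                # no gradients through the backbone
            h = self.backbone(**inputs).last_hidden_state
        return self.head(h.mean(dim=1))      # mean-pool over tokens, then classify

# Only the head's parameters reach the optimizer:
# probe = LinearProbe(model, hidden_dim=512, n_classes=2)
# opt = torch.optim.Adam(probe.head.parameters(), lr=1e-3)
```

The design choice is deliberate: because the backbone never updates, downstream performance reflects only what was learned during pre-training.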
“…The utility of gLMs pre-trained on whole genomes for studying the non-coding genome has been limited. Previous benchmarks have largely considered gLMs that have been fine-tuned (i.e., adjusting the weights of the gLM) on each downstream task 23,24,26,30,40. In each benchmark, a fine-tuned gLM has demonstrated improved predictions on a host of downstream prediction tasks, often based on the classification of functional elements, such as histone marks or promoter annotations.…”
Section: Introduction
confidence: 99%
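For contrast with the frozen-probe sketch above, the fine-tuning regime this passage refers to updates every weight of the gLM on the downstream task. A minimal sketch, again with a placeholder checkpoint path and label count:

```python
# Contrasting sketch: full fine-tuning, where the gLM backbone is trainable.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/pretrained-gLM",  # placeholder checkpoint
    num_labels=2,               # e.g., functional element vs. background
)

# Every parameter, backbone included, is passed to the optimizer here.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative step (batch comes from a task-specific DataLoader):
# loss = model(**batch_inputs, labels=batch_labels).loss
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```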
“…Existing transformer models for predicting RNA splicing from sequence (e.g., SpliceBERT 8) have generally been trained on transcriptomic reference databases such as RefSeq or GENCODE, thus ignoring the large amount of information encoded by cell-type-specific RNA processing patterns. Additionally, long-read sequencing has emerged as the gold standard in accurately capturing the complexity of RNA splicing events, yet the current datasets derived from long-read sequencing are relatively small and primarily observational.…”
Section: Introduction
confidence: 99%