2021 (preprint)
DOI: 10.1101/2021.05.24.445464

ProteinBERT: A universal deep-learning model of protein sequence and function

Abstract: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements th…
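To make the dual-objective pretraining described in the abstract concrete, here is a minimal sketch of how a masked-token (local) prediction loss can be combined with a multi-label GO annotation (global) prediction loss in one objective. It is written in PyTorch purely for illustration; the layer sizes, vocabulary size, GO-term count, and class names are assumed placeholders, not ProteinBERT's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; ProteinBERT's actual vocabulary, hidden sizes,
# and GO-term count differ.
VOCAB_SIZE = 26
NUM_GO_TERMS = 8943

class DualTaskHead(nn.Module):
    """Toy head combining a per-residue (local) masked-token output with a
    per-protein (global) multi-label GO annotation output."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.token_head = nn.Linear(hidden_dim, VOCAB_SIZE)   # local output
        self.go_head = nn.Linear(hidden_dim, NUM_GO_TERMS)    # global output

    def forward(self, residue_states, protein_state):
        # residue_states: (batch, seq_len, hidden); protein_state: (batch, hidden)
        return self.token_head(residue_states), self.go_head(protein_state)

def pretraining_loss(token_logits, token_targets, go_logits, go_targets):
    # Cross-entropy over corrupted/masked residues plus binary cross-entropy
    # over the protein's GO labels, summed into one self-supervised objective.
    lm_loss = nn.functional.cross_entropy(
        token_logits.transpose(1, 2), token_targets, ignore_index=-100)
    go_loss = nn.functional.binary_cross_entropy_with_logits(go_logits, go_targets)
    return lm_loss + go_loss
```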


Cited by 48 publications (74 citation statements). References 33 publications.
“…Another viable option is recurrent and attention-based neural networks, which have enough computational power to describe relevant dependencies in protein sequences [108, 109, 110]. However, while modern neural networks have been successfully applied to the annotation of protein families [111, 112], their performance in modeling short protein sequence fragments has yet to be evaluated.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Among them, UDSMProt [60], an LSTM sequence model trained on unlabeled Swiss-Prot protein sequences in a self-supervised autoregressive manner, has shown remarkable performance on protein-level classification tasks after fine-tuning. ProteinBERT [61], another convolution- and attention-based model pre-trained on sequence-correction and GO annotation prediction tasks, has shown impressive performance on protein-level regression tasks after fine-tuning. We want to explore the possibility of combining ACP-MHCNN with fine-tuned versions of these pre-trained models for ACP identification in future work.…”
Section: Discussion (mentioning)
Confidence: 99%
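The statement above concerns fine-tuning pretrained protein language models for protein-level tasks such as ACP identification. The sketch below shows one generic way to attach a task head to a pretrained encoder; the `ProteinTaskHead` class, the encoder interface, and the embedding size are illustrative assumptions, not the UDSMProt or ProteinBERT APIs.

```python
import torch
import torch.nn as nn

class ProteinTaskHead(nn.Module):
    """Generic fine-tuning wrapper: a pretrained encoder producing one pooled
    embedding per protein, followed by a small task-specific classifier."""
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                # pretrained; updated during fine-tuning
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        pooled = self.encoder(token_ids)      # (batch, embed_dim), assumed interface
        return self.classifier(pooled)        # protein-level logits

# Typical usage (names and hyperparameters are placeholders):
# model = ProteinTaskHead(pretrained_encoder, embed_dim=512, num_classes=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = nn.functional.cross_entropy(model(batch_tokens), batch_labels)
```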
“…Second, we use UniRef50 (62) clustering to split the data, to model a challenging use-case in which an unseen sequence has low sequence similarity to anything that has been previously annotated. Note that there are alternative methods for splitting (48, 63, 64), such as reserving the most recently annotated proteins for evaluating models. This approach, which is used in CAFA and CASP (63, 64), helps ensure a fair competition because labels for the evaluation data are not available to participants, or to the scientific community at large, until after the competition submissions are due.…”
Section: A Machine-Learning Compatible Dataset for Protein Function Prediction (mentioning)
Confidence: 99%
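Cluster-based splitting, as described in the statement above, keeps near-duplicate sequences on the same side of the train/test boundary. Below is a minimal sketch assuming a precomputed protein-to-cluster mapping (e.g., UniRef50 cluster IDs); `cluster_split` is a hypothetical helper, not the cited dataset's actual procedure.

```python
import random
from collections import defaultdict

def cluster_split(protein_to_cluster, test_fraction=0.2, seed=0):
    """Split proteins by cluster ID (e.g., UniRef50 membership) so that every
    member of a cluster lands in the same partition, preventing highly similar
    sequences from leaking between train and test."""
    clusters = defaultdict(list)
    for protein, cluster_id in protein_to_cluster.items():
        clusters[cluster_id].append(protein)
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_test = int(len(cluster_ids) * test_fraction)
    test_ids = set(cluster_ids[:n_test])
    train = [p for c in cluster_ids[n_test:] for p in clusters[c]]
    test = [p for c in test_ids for p in clusters[c]]
    return train, test

# Example: cluster_split({"P12345": "UniRef50_A", "Q67890": "UniRef50_A"})
# keeps both members of UniRef50_A on the same side of the split.
```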
“…Beyond functional annotation, deep learning has enabled significant advances in protein structure prediction (31-36), predicting the functional effects of mutations (37-40), and protein design (41-47). A key departure from traditional approaches is that researchers have started to incorporate vast amounts of raw, uncurated sequence data into model training, an approach which also shows promise for functional prediction (48). Of particular relevance to the present work is Bileschi et al. (2019) (49), where it is shown that models with residual layers (50) of dilated convolutions (51) can precisely and efficiently categorise protein domains.…”
Section: Introduction (mentioning)
Confidence: 99%
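The statement above refers to residual layers of dilated convolutions for protein domain classification. The following is a minimal sketch of one such block in PyTorch; the kernel size, bottleneck layout, and activation are illustrative assumptions rather than the exact published architecture.

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """One residual block of dilated 1-D convolutions over a protein sequence."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation keeps the sequence length unchanged for kernel_size=3,
        # so blocks with growing dilation can be stacked to widen the receptive field.
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, seq_len)
        h = self.act(self.conv1(x))
        h = self.conv2(h)
        return self.act(x + h)

# A stack with dilations such as (1, 2, 4, 8) covers increasingly long-range
# context while remaining cheap to evaluate on full-length sequences.
```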