2022
DOI: 10.1093/bioinformatics/btac020

ProteinBERT: a universal deep-learning model of protein sequence and function

Abstract: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. …
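The pretraining scheme described in the abstract combines two tasks over two representations: a local (per-residue) track trained to recover corrupted sequence tokens, and a global (whole-protein) track trained to recover masked GO annotations. Below is a minimal conceptual sketch of such a dual-task, dual-track model in Keras. It is not the actual ProteinBERT implementation: all dimensions, layer choices, and names (VOCAB_SIZE, N_GO_TERMS, the single convolutional layer standing in for the full encoder) are illustrative assumptions.

```python
# Conceptual sketch (not the actual ProteinBERT code): a model with a
# local per-residue language-modeling head and a global multi-label
# GO-annotation head, mirroring the two pretraining tasks in the abstract.
# All sizes and layer choices are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 26      # amino-acid tokens + special tokens (assumed)
N_GO_TERMS = 8943    # number of GO annotation labels (assumed)
SEQ_LEN = 512
D_LOCAL, D_GLOBAL = 128, 512

seq_in = layers.Input(shape=(SEQ_LEN,), dtype="int32", name="sequence")
go_in = layers.Input(shape=(N_GO_TERMS,), name="go_annotations")

# Local track: per-residue representations of the (corrupted) sequence.
x = layers.Embedding(VOCAB_SIZE, D_LOCAL)(seq_in)
x = layers.Conv1D(D_LOCAL, 9, padding="same", activation="gelu")(x)

# Global track: a fixed-size representation seeded by the (noised) GO input
# and enriched with pooled information from the local track.
g = layers.Dense(D_GLOBAL, activation="gelu")(go_in)
pooled = layers.Dense(D_GLOBAL)(layers.GlobalAveragePooling1D()(x))
g = layers.Add()([g, pooled])

# Two pretraining outputs: recover masked residues, recover masked GO terms.
lm_out = layers.Dense(VOCAB_SIZE, activation="softmax", name="lm")(x)
go_out = layers.Dense(N_GO_TERMS, activation="sigmoid", name="go")(g)

model = tf.keras.Model([seq_in, go_in], [lm_out, go_out])
model.compile(
    optimizer="adam",
    loss={"lm": "sparse_categorical_crossentropy", "go": "binary_crossentropy"},
)
```

In ProteinBERT itself the local and global tracks exchange information at every layer of the network; the sketch collapses that interaction into a single pooling step purely for brevity.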

Cited by 330 publications (271 citation statements)
References 32 publications
“…In genetics, which covers protein functions and how they are affected by mutations, deep learning is also a game changer, but that whole field will not be discussed here. Interested readers may refer to this recent work and the references therein for gene function prediction (Brandes et al., 2022) and to this comparative study for the prediction of protein physical and chemical properties (Xu et al., 2020). Non-coding variants can be statistically associated with phenotypic traits or diseases, but their mechanistic role cannot be immediately inferred.…”
Section: Survey Methodology (mentioning)
Confidence: 99%
“…Language models have also been employed in repertoire analysis. Before that, language models had been applied intensively to general protein sequences (129–132). BERTMHC (133) showed that utilizing the pretrained model of (129) actually increases performance on the peptide-MHC (class II) binding prediction task.…”
Section: Embedding Methods Based on Representation Learning (mentioning)
Confidence: 99%
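The transfer-learning recipe behind this kind of result, freezing a pretrained protein language model and training a small classifier on its embeddings, can be sketched as follows. This is a hedged illustration, not BERTMHC itself: embed_sequences is a hypothetical placeholder for whichever pretrained encoder is used, and the peptides and labels are toy data.

```python
# Sketch of transfer learning for binding prediction: frozen pretrained
# embeddings as features, plus a small downstream classifier.
# `embed_sequences` is a hypothetical stand-in for a real encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def embed_sequences(seqs):
    # Placeholder: in practice, run the frozen pretrained model here and
    # mean-pool its per-residue hidden states. Random features are used
    # only so this sketch runs end to end.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(seqs), 768))

peptides = ["SIINFEKL", "GILGFVFTL", "NLVPMVATV", "KLGGALQAK"]
binds = np.array([1, 1, 0, 0])  # toy labels: 1 = binder, 0 = non-binder

X = embed_sequences(peptides)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, binds, test_size=0.5, random_state=0, stratify=binds
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```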
“…For example, one group of researchers has used concepts from semantic processing, e.g., the frequency of correlated words, to identify potential mutagenic sites in viruses including SARS-CoV-2 (64). An emerging approach to deep sequence learning is to transform protein sequences into embeddings that reflect their semantic structure, using the BERT (bidirectional encoder representations from transformers) neural network architecture, which Google developed to handle natural language search (65–68). An example of this approach is k-means clustering of “ProtBERT” SARS-CoV-2 protein embeddings generated by pretraining a BERT model on millions of UniProt sequences, which can be used to identify mutational hot spots within the genome that may give rise to future variants (69).…”
Section: Can Deep Sequence Learning Help? (mentioning)
Confidence: 99%
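A minimal sketch of that embed-then-cluster pipeline, assuming the publicly available Rostlab/prot_bert checkpoint on Hugging Face (the cited study's exact model, preprocessing, and choice of k may differ):

```python
# Embed protein sequences with a BERT encoder pretrained on UniProt
# (Rostlab/prot_bert), then group the embeddings with k-means.
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

sequences = ["MFVFLVLLPLVSSQ", "MFVFLVLLPLVSSE", "MKTAYIAKQRQISF"]  # toy data
spaced = [" ".join(s) for s in sequences]  # ProtBERT expects spaced residues

with torch.no_grad():
    batch = tokenizer(spaced, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state          # (n, len, 1024)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean-pool per sequence

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy())
print(labels)  # cluster assignment per sequence
```

Mean-pooling over non-padding positions yields one fixed-size vector per sequence; k-means then groups sequences whose embeddings, and hence learned semantics, are similar.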