2022
DOI: 10.1109/tpami.2021.3095381

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Abstract: Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduc…

Citations: Cited by 877 publications (1,417 citation statements)
References: 77 publications

“…Following previous work (Heinzinger et al., 2019; Elnaggar et al., 2021), we began with a data set introduced by DeepLoc (Almagro Armenteros et al., 2017) for training (13,858 proteins) and testing (2,768 proteins). All proteins have experimental evidence for one of ten location classes (nucleus, cytoplasm, extracellular space, mitochondrion, cell membrane, endoplasmic reticulum, plastid, Golgi apparatus, lysosome/vacuole, peroxisome).…”
Section: Standard Set: DeepLoc (mentioning)
confidence: 99%
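
To make the ten-class setup concrete, here is a minimal sketch of a localization classifier built on top of fixed per-protein embeddings. The class list comes from the quoted passage; the 1024-dimensional input (matching ProtT5 embeddings) and the two-layer network are illustrative assumptions, not the cited papers' exact architecture:

```python
import torch
import torch.nn as nn

# The ten DeepLoc location classes named in the quoted passage.
CLASSES = ["nucleus", "cytoplasm", "extracellular space", "mitochondrion",
           "cell membrane", "endoplasmic reticulum", "plastid",
           "Golgi apparatus", "lysosome/vacuole", "peroxisome"]

class LocalizationHead(nn.Module):
    """Illustrative feed-forward head mapping a fixed-size per-protein
    embedding (e.g., mean-pooled ProtT5, 1024-d) to ten location classes."""
    def __init__(self, embed_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(hidden, len(CLASSES)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # raw logits; apply softmax for probabilities

# Usage on a dummy batch of 4 per-protein embeddings:
head = LocalizationHead()
logits = head(torch.randn(4, 1024))
print(logits.shape)  # torch.Size([4, 10])
```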
“…We compared embeddings from five main pre-trained pLMs and one additional pLM (Table 1): (1) SeqVec (Heinzinger et al., 2019) is a bidirectional LSTM based on ELMo (Peters et al., 2018) that was trained on UniRef50 (Suzek et al., 2015). (2) ProtBert (Elnaggar et al., 2021) is an encoder-only model based on BERT (Devlin et al., 2019) that was trained on BFD (Steinegger & Söding, 2018). (3) ProtT5-XL-UniRef50 (Elnaggar et al., 2021) (for simplicity: ProtT5) is based on the encoder-decoder model T5 (Raffel et al., 2020), trained on BFD and fine-tuned on UniRef50; only its encoder is used to generate embeddings.…”
Section: Models (mentioning)
confidence: 99%
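
As a concrete example of how such pre-trained pLM embeddings are obtained, here is a minimal sketch using the publicly released ProtT5-XL-UniRef50 checkpoint on Hugging Face (Rostlab/prot_t5_xl_uniref50). The toy sequence and the mean-pooling step are illustrative choices, not prescribed by the quoted paper:

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the published ProtT5-XL-UniRef50 checkpoint (encoder only).
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MSKGEELFTGVVPILVELDGDVNGHKF"  # toy example sequence
# ProtT5 expects space-separated residues; rare amino acids map to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, L+1, 1024)

# Per-residue embeddings (drop the trailing </s> special token),
# then mean-pool to a single 1024-d per-protein embedding.
per_residue = hidden[0, : len(sequence)]
per_protein = per_residue.mean(dim=0)
print(per_protein.shape)  # torch.Size([1024])
```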