2020
DOI: 10.1101/2020.07.12.199554
Preprint

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

Abstract: Motivation: Natural Language Processing (NLP) continues improving substantially through auto-regressive (AR) and auto-encoding (AE) Language Models (LMs). These LMs require expensive computing resources for self-supervised or un-supervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Computational biology and bioinformatics provide vast gold-mines of structured and sequentially ordered text data leading to extrao…


Cited by 422 publications (740 citation statements)
References 130 publications
“…Relationship Between Pretrained Model Size and Downstream Performance: Table 1 and Figure 1 show that there is no clear connection between increasing the number of parameters (in the pretrained model only) and downstream performance, contrary to the philosophy behind 567 million parameter NLP-inspired models for protein representations (Elnaggar et al., 2020). This is true even for variants of CPCProt, which were trained using the same self-supervised objective.…”
Section: Discussion
confidence: 95%
“…Inter-residue distance: predictions from trRosetta (30 by N by N, where N is protein size), giving indirect access to evolutionary multiple sequence alignments. Bert embeddings: attention heads from the last attention layer of the ProtBert-BFD100 model (16 by N by N, where N is protein size). Table S3: Generated features for all 9 major feature classes. Some features are scaled and normalized to a reasonable range.…”
Section: Multiple Sequence Alignment
confidence: 99%
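
The excerpt above treats the 16 attention heads of ProtBert-BFD's final attention layer as a 16 by N by N pairwise feature map per protein. Below is a minimal sketch of how such maps could be pulled out with the Hugging Face transformers release of the model; the checkpoint name Rostlab/prot_bert_bfd, the toy sequence, and the stripping of the [CLS]/[SEP] positions are assumptions for illustration, not the cited paper's exact pipeline.

# Minimal sketch: last-layer attention maps (16 x N x N) from ProtBert-BFD.
# Assumptions: public Rostlab/prot_bert_bfd checkpoint, 16 heads in the final layer.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd", output_attentions=True)
model.eval()

sequence = "MKTAYIAKQR"                              # toy protein, N = 10
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # ProtBert expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a per-layer tuple; each entry has shape (batch, heads, L, L)
last_layer = out.attentions[-1][0]                   # (16, L, L) including [CLS]/[SEP]
att_maps = last_layer[:, 1:-1, 1:-1]                 # drop special tokens -> (16, N, N)
print(att_maps.shape)                                # torch.Size([16, 10, 10])

Maps of this shape can then be stacked with other pairwise channels (for example the 30 trRosetta distance bins mentioned in the same table) along the channel dimension.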
“…Convolutional neural network (CNN)-based approaches pre-train weights of convolutional layers on large datasets that can be fine-tuned on smaller datasets 75. Transformer-based approaches, frequently used in natural language processing, have been applied to functional predictions of variants in proteins 85,86.…”
Section: Opportunities In Rare Variant Evaluation
confidence: 99%
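
As a concrete, hypothetical illustration of the transformer-based variant scoring referred to above, a protein masked language model can be queried for the log-odds of the mutant versus the wild-type residue at a masked position, a common zero-shot proxy for variant effect. The sketch below assumes ProtBert-BFD as the scoring model; it is not the exact method of the cited works, and the sequence, position, and substitution are illustrative only.

# Hypothetical sketch: masked-LM log-odds score for a single amino-acid substitution.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")
model.eval()

def variant_log_odds(sequence: str, pos: int, wt: str, mut: str) -> float:
    """log P(mut) - log P(wt) at 0-based position pos, with that residue masked."""
    residues = list(sequence)
    assert residues[pos] == wt, "wild-type residue does not match the sequence"
    residues[pos] = tokenizer.mask_token                    # "[MASK]"
    inputs = tokenizer(" ".join(residues), return_tensors="pt")
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]        # vocabulary logits at the masked site
    log_probs = torch.log_softmax(logits, dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Toy call: score a Y->C substitution at position 5 (illustrative values only).
print(variant_log_odds("MKTAYIAKQR", pos=4, wt="Y", mut="C"))

A strongly negative score means the model considers the mutant residue much less likely than the wild type in its sequence context, which is often taken as a signal of potential functional impact.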