2020
DOI: 10.48550/arxiv.2007.06225
Preprint

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Cited by 97 publications (215 citation statements)
References 0 publications
“…• However, the same is not true for pretraining with a large-scale protein sequence prediction task (Elnaggar et al, 2020). Pretraining with this task in fact mostly deteriorates the performance on the downstream semantic parsing tasks, suggesting, contrary to some recent claims (Lu et al, 2021), that pretrained representations do not transfer universally and that there has to be a certain kind and degree of similarity between the pretraining and downstream tasks for successful transfer.…”
mentioning
confidence: 79%
“…As we mentioned in related work, pre-trained LM, such as SeqVec [23] and ProtBert [24], already proved their performance to capture rudimentary features of proteins such as secondary structures, biological activities, and functions [22,21]. Especially, it was shown that SeqVec [23] is better than ProtBert [24] to extract high-level features related functions for PFP [5]. SeqVec [23] is utilized as a protein sequence encoder.…”
Section: Protein Sequence Encoding
mentioning
confidence: 96%
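The quoted passage treats pre-trained protein language models such as SeqVec and ProtBert as plug-in sequence encoders. As a rough illustration of that usage (not the cited papers' exact pipeline), the sketch below pulls per-residue embeddings from ProtBert through the Hugging Face transformers library; the model identifier Rostlab/prot_bert and the space-separated, X-masked input format follow the public ProtTrans release, while the mean pooling into a single per-protein vector is an illustrative assumption.

# Minimal sketch: ProtBert (ProtTrans) as a protein sequence encoder.
# Assumes the Hugging Face model id "Rostlab/prot_bert"; not the cited papers' exact setup.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence
# ProtBert expects space-separated residues; rare amino acids (U, Z, O, B) map to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings, dropping the [CLS] and [SEP] special tokens.
residue_embeddings = outputs.last_hidden_state[0, 1:-1]   # (L, 1024)
# One fixed-size protein vector, here via mean pooling (an illustrative choice).
protein_embedding = residue_embeddings.mean(dim=0)         # (1024,)
print(residue_embeddings.shape, protein_embedding.shape)

Per-residue or pooled embeddings of this kind are what the downstream predictors mentioned in the quote (e.g. protein function prediction) consume as input features.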
“…With the advent of transformers [19], which are attention-based model, in Natural Language Processing (NLP), various attention-based LMs were applied to protein sequence embedding [20,21,22,23,24]. As protein sequences can be considered as sentences, these learned the relationship between amino acids constituting the sequence and learned contextual information.…”
Section: Protein Sequence Feature Extraction
mentioning
confidence: 99%
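The quote frames a protein sequence as a sentence whose tokens are amino acids, with self-attention learning contextual relationships among residues. A minimal, self-contained toy of that idea (a small Transformer encoder over the 20-letter amino-acid alphabet, not any of the cited architectures, with positional encodings and the masked-language-model pretraining objective omitted for brevity):

# Toy illustration of "protein sequence as sentence": each amino acid is a token and a
# self-attention encoder produces contextual per-residue representations.
# Didactic sketch only; not the setup of any cited model.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

d_model = 64
embed = nn.Embedding(len(vocab), d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens = torch.tensor([[vocab[aa] for aa in sequence]])  # shape (1, L)

contextual = encoder(embed(tokens))  # (1, L, d_model) contextual residue vectors
print(contextual.shape)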
“…We compare with CNN-based models and GNN-based models which learn the protein annotations using 3D structures from scratch. For fair comparison, we do not include LSTM-based or transformer-based methods, as they all pre-train their models using millions of protein sequences and only fine-tune their models on 3D structures (Bepler & Berger, 2019;Alley et al, 2019;Rao et al, 2019;Strodthoff et al, 2020;Elnaggar et al, 2020).…”
Section: Model Quality Assessment
mentioning
confidence: 99%