2020
DOI: 10.1101/2020.07.12.199554
Preprint

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

Abstract: Motivation: Natural Language Processing (NLP) continues improving substantially through auto-regressive (AR) and auto-encoding (AE) Language Models (LMs). These LMs require expensive computing resources for self-supervised or un-supervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Computational biology and bioinformatics provide vast gold-mines of structured and sequentially ordered text data leading to extrao…


Cited by 422 publications (740 citation statements)
References 130 publications
“…Relationship Between Pretrained Model Size and Downstream Performance: Table 1 and Figure 1 show that there is no clear connection between increasing the number of parameters (in the pretrained model only) and downstream performance, contrary to the philosophy behind 567 million parameter NLP-inspired models for protein representations (Elnaggar et al., 2020). This is true even for variants of CPCProt, which were trained using the same self-supervised objective.…”
Section: Discussion
confidence: 95%
“…Inter-residue distance: predictions from trRosetta (30 by N by N, where N is protein size), giving indirect access to evolutionary multiple sequence alignments. Bert embeddings: attention heads from the last attention layer of the ProtBert-BFD100 model (16 by N by N, where N is protein size). Table S3: Generated features for all 9 major feature classes. Some features are scaled and normalized to a reasonable range.…”
Section: Multiple Sequence Alignment
confidence: 99%
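
The excerpt above treats the 16 attention heads of ProtBert-BFD's final attention layer as a 16 by N by N pairwise feature map per protein. Below is a minimal sketch of how such maps could be pulled out with the Hugging Face transformers release of the model; the checkpoint name Rostlab/prot_bert_bfd, the toy sequence, and the stripping of the [CLS]/[SEP] positions are assumptions for illustration, not the cited paper's exact pipeline.

# Minimal sketch: last-layer attention maps (16 x N x N) from ProtBert-BFD.
# Assumptions: public Rostlab/prot_bert_bfd checkpoint, 16 heads in the final layer.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd", output_attentions=True)
model.eval()

sequence = "MKTAYIAKQR"                              # toy protein, N = 10
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # ProtBert expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a per-layer tuple; each entry has shape (batch, heads, L, L)
last_layer = out.attentions[-1][0]                   # (16, L, L) including [CLS]/[SEP]
att_maps = last_layer[:, 1:-1, 1:-1]                 # drop special tokens -> (16, N, N)
print(att_maps.shape)                                # torch.Size([16, 10, 10])

Maps of this shape can then be stacked with other pairwise channels (for example the 30 trRosetta distance bins mentioned in the same table) along the channel dimension.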
“…Convolutional neural network (CNN)-based approaches pre-train weights of convolutional layers on large datasets that can be fine-tuned on smaller datasets 75. Transformer-based approaches, frequently used in natural language processing, have been applied to functional predictions of variants in proteins 85,86.…”
Section: Opportunities In Rare Variant Evaluation
confidence: 99%
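
As a concrete, hypothetical illustration of the transformer-based variant scoring referred to above, a protein masked language model can be queried for the log-odds of the mutant versus the wild-type residue at a masked position, a common zero-shot proxy for variant effect. The sketch below assumes ProtBert-BFD as the scoring model; it is not the exact method of the cited works, and the sequence, position, and substitution are illustrative only.

# Hypothetical sketch: masked-LM log-odds score for a single amino-acid substitution.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd")
model.eval()

def variant_log_odds(sequence: str, pos: int, wt: str, mut: str) -> float:
    """log P(mut) - log P(wt) at 0-based position pos, with that residue masked."""
    residues = list(sequence)
    assert residues[pos] == wt, "wild-type residue does not match the sequence"
    residues[pos] = tokenizer.mask_token                    # "[MASK]"
    inputs = tokenizer(" ".join(residues), return_tensors="pt")
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]        # vocabulary logits at the masked site
    log_probs = torch.log_softmax(logits, dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Toy call: score a Y->C substitution at position 5 (illustrative values only).
print(variant_log_odds("MKTAYIAKQR", pos=4, wt="Y", mut="C"))

A strongly negative score means the model considers the mutant residue much less likely than the wild type in its sequence context, which is often taken as a signal of potential functional impact.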