ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Elnaggar, Ahmed; Heinzinger, Michael; Dallago, Christian; Rehawi, Ghalia; Wang, Yu; Jones, Llion; Gibbs, Tom; Fehér, Tamás; Angerer, Christoph; Steinegger, Martin; Bhowmik, Debsindhu; Rost, Burkhard

doi:10.1109/tpami.2021.3095381

Cited by 877 publications

(1,417 citation statements)

References 77 publications

Supporting

Mentioning

1,154

Contrasting

Unclassified

Order By: Relevance

“…Following previous work (Heinzinger et al, 2019;Elnaggar et al, 2021), we began with a data set introduced by DeepLoc (Almagro Armenteros et al, 2017) for training (13 858 proteins) and testing (2 768 proteins). All proteins have experimental evidence for one of ten location classes (nucleus, cytoplasm, extracellular space, mitochondrion, cell membrane, Endoplasmatic Reticulum, plastid, Golgi apparatus, lysosome/vacuole, peroxisome).…”

Section: Standard Setdeeplocmentioning

confidence: 99%

“…We compared embeddings from five main and a sixth additional pre-trained pLMs (Table 1): (1) SeqVec (Heinzinger et al, 2019) is a bidirectional LSTM based on on ELMo (Peters et al, 2018) that was trained on UniRef50 (Suzek et al, 2015). (2) ProtBert (Elnaggar et al, 2021) is an encoder-only model based on BERT (Devlin et al, 2019) that was trained on BFD (Steinegger & Söding, 2018). (3) ProtT5-XL-UniRef50 (Elnaggar et al, 2021) (for simplicity: ProtT5) is an encoder-only model based on T5 (Raffel et al, 2020) that was trained on BFD and fine-tuned on Uniref50.…”

Section: Modelsmentioning

confidence: 99%

“…(2) ProtBert (Elnaggar et al, 2021) is an encoder-only model based on BERT (Devlin et al, 2019) that was trained on BFD (Steinegger & Söding, 2018). (3) ProtT5-XL-UniRef50 (Elnaggar et al, 2021) (for simplicity: ProtT5) is an encoder-only model based on T5 (Raffel et al, 2020) that was trained on BFD and fine-tuned on Uniref50. (4) ESM-1b (Rives et al, 2021) is a transformer model that was trained on UniRef50.…”

Section: Modelsmentioning

confidence: 99%

“…Protein Language Models (pLMs) better represent sequences. Recently, protein sequence representations (embeddings) have been learned from databases (Steinegger & Söding, 2018;Consortium, 2021) using language models (LMs) (Bepler & Berger, 2019;Alley et al, 2019;Heinzinger et al, 2019;Rives et al, 2021;Elnaggar et al, 2021) initially used in natural language processing (NLP) (Peters et al, 2018;Devlin et al, 2019;Raffel et al, 2020). Models trained on protein embeddings via transfer learning tend to be outperformed by approaches using MSAs (Rao et al, 2019;Heinzinger et al, 2019).…”

Section: Introductionmentioning

confidence: 99%

“…However, embeddingbased solutions can outshine HBI (Littmann et al, 2021) and advanced protein structure prediction methods (Bhattacharya et al, 2020;Rao et al, 2020;Weißenow et al, 2021). Yet, for location prediction, embedding-based models (Heinzinger et al, 2019;Elnaggar et al, 2021;Littmann et al, 2021) remained inferior to the state-of-the-art (SOTA) using MSAs, such as DeepLoc (Almagro Armenteros et al, 2017).…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Light Attention Predicts Protein Location from the Language of Life

Staerk

Dallago

Heinzinger

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information that is resource expensive to generate. We showcase using embeddings from protein language models for competitive localization predictions not relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax weighted aggregation mechanism with linear complexity in sequence length referred to as light attention (LA). The method significantly outperformed the state-of-the-art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web-service and as a stand-alone application at embed.protein.properties.

show abstract

Section: Standard Setdeeplocmentioning

confidence: 99%

Section: Modelsmentioning

confidence: 99%

Section: Modelsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Light Attention Predicts Protein Location from the Language of Life

Staerk

Dallago

Heinzinger

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

Pre‐Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding

Jin

Jiang³

et al. 2022

Advanced Science

View full text Add to dashboard Cite

The latest biological findings observe that the motionless “lock‐and‐key” theory is not generally applicable and that changes in atomic sites and binding pose can provide important information for understanding drug binding. However, the computational expenditure limits the growth of protein trajectory‐related studies, thus hindering the possibility of supervised learning. A spatial‐temporal pre‐training method based on the modified equivariant graph matching networks, dubbed ProtMD which has two specially designed self‐supervised learning tasks: atom‐level prompt‐based denoising generative task and conformation‐level snapshot ordering task to seize the flexibility information inside molecular dynamics (MD) trajectories with very fine temporal resolutions is presented. The ProtMD can grant the encoder network the capacity to capture the time‐dependent geometric mobility of conformations along MD trajectories. Two downstream tasks are chosen to verify the effectiveness of ProtMD through linear detection and task‐specific fine‐tuning. A huge improvement from current state‐of‐the‐art methods, with a decrease of 4.3% in root mean square error for the binding affinity problem and an average increase of 13.8% in the area under receiver operating characteristic curve and the area under the precision‐recall curve for the ligand efficacy problem is observed. The results demonstrate a strong correlation between the magnitude of conformation's motion in the 3D space and the strength with which the ligand binds with its receptor.

show abstract

Explainable Deep Hypergraph Learning Modeling the Peptide Secondary Structure Prediction

et al. 2023

View full text Add to dashboard Cite

Accurately predicting peptide secondary structures remains a challenging task due to the lack of discriminative information in short peptides. In this study, PHAT is proposed, a deep hypergraph learning framework for the prediction of peptide secondary structures and the exploration of downstream tasks. The framework includes a novel interpretable deep hypergraph multi-head attention network that uses residue-based reasoning for structure prediction. The algorithm can incorporate sequential semantic information from large-scale biological corpus and structural semantic information from multi-scale structural segmentation, leading to better accuracy and interpretability even with extremely short peptides. The interpretable models are able to highlight the reasoning of structural feature representations and the classification of secondary substructures. The importance of secondary structures in peptide tertiary structure reconstruction and downstream functional analysis is further demonstrated, highlighting the versatility of our models. To facilitate the use of the model, an online server is established which is accessible via http://inner.wei-group.net/PHAT/. The work is expected to assist in the design of functional peptides and contribute to the advancement of structural biology research.

show abstract

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Cited by 877 publications

References 77 publications

Light Attention Predicts Protein Location from the Language of Life

Light Attention Predicts Protein Location from the Language of Life

Pre‐Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding

Explainable Deep Hypergraph Learning Modeling the Peptide Secondary Structure Prediction

Contact Info

Product

Resources

About