Preprint, 2021
DOI: 10.1101/2021.09.03.458869

Protein embeddings and deep learning predict binding residues for various ligand classes

Abstract: One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress, many binding sites remain obscure. Here, we propose bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model ProtT5 as input. Using o…
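To make the described setup concrete, below is a minimal, hypothetical sketch of a per-residue classifier head one could place on top of such embeddings. The architecture (layer sizes, kernel width, dropout) is an illustrative assumption, not the published bindEmbed21 network; it only shows how 1024-dimensional per-residue ProtT5 vectors can be mapped to probabilities for the three ligand classes named in the abstract.

```python
# Hedged sketch (not the published bindEmbed21 architecture): a small
# convolutional head mapping per-residue pLM embeddings to binding
# probabilities for three ligand classes. Layer sizes, kernel widths,
# and dropout are illustrative assumptions.
import torch
import torch.nn as nn

class BindingHead(nn.Module):
    def __init__(self, emb_dim: int = 1024, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        # 1D convolutions slide along the residue axis; input (batch, emb_dim, L)
        self.net = nn.Sequential(
            nn.Conv1d(emb_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(hidden, n_classes, kernel_size=5, padding=2),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, L, emb_dim) -> logits: (batch, L, n_classes)
        return self.net(emb.transpose(1, 2)).transpose(1, 2)

head = BindingHead()
logits = head(torch.randn(1, 50, 1024))  # one toy protein, 50 residues
probs = torch.sigmoid(logits)            # metal ion / nucleic acid / small molecule
```

Using independent sigmoid outputs rather than a softmax lets a single residue score positively for more than one ligand class.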

Cited by 11 publications (12 citation statements). References 58 publications (166 reference statements).
“…5, Fig. 6), although for SAV effect predictions, embedding-based methods are still not outperforming the MSA-based SOTA, as they do for other prediction tasks (Elnaggar et al 2021; Littmann et al 2021a, b, c; Stärk et al 2021). Embedding-based predictions are blazingly fast, thereby saving compute and, ultimately, energy when applied to daily sequence analysis.…”
Section: Discussion
confidence: 95%
“…1 in (Elnaggar et al 2021)]. Embeddings have succeeded as exclusive input to predicting secondary structure and subcellular location at performance levels almost reaching (Alley et al 2019; Heinzinger et al 2019; Rives et al 2021) or even exceeding (Elnaggar et al 2021; Littmann et al 2021c; Stärk et al 2021) state-of-the-art (SOTA) methods using EI from MSAs as input. Embeddings even succeed in substituting sequence similarity for homology-based annotation transfer (Littmann et al 2021a, b) and in predicting the effect of mutations on protein–protein interactions (Zhou et al 2020).…”
Section: Introduction
confidence: 99%
“…In their simplest form, embeddings mirror the last "hidden" states/values of pLMs. In analogy to NLP models implicitly learning grammar, embeddings from pLMs capture some aspects of the language of life as written in protein sequences (Alley et al, 2019; Heinzinger et al, 2019; Ofer et al, 2021; Rives et al, 2021), which suffices as exclusive input to many methods predicting aspects of protein structure and function (Asgari and Mofrad, 2015; Alley et al, 2019; Heinzinger et al, 2019; Littmann et al, 2021a; Littmann et al, 2021b; Littmann et al, 2021c; Elnaggar et al, 2021; Heinzinger et al, 2021; Marquet et al, 2021; Rives et al, 2021).…”
Section: Many Prediction Methods Available
confidence: 99%
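The excerpt above notes that, in their simplest form, embeddings are the last hidden states of a pLM. As a hedged illustration (not the authors' pipeline), the following sketch extracts exactly those per-residue vectors from the publicly available ProtT5 encoder; the checkpoint name follows the Rostlab release on Hugging Face, and the toy sequence is made up.

```python
# Hedged sketch: per-residue embeddings as the encoder's last hidden states.
# Assumptions: the public Rostlab ProtT5 checkpoint and a toy sequence.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
encoder = T5EncoderModel.from_pretrained(name).float()  # fp16 checkpoint; cast for CPU
encoder.eval()

sequence = "MSEEKLVNM"  # toy example, not from the paper
# ProtT5 expects space-separated residues, with rare amino acids mapped to X
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
batch = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    out = encoder(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# Drop the trailing special token: one 1024-d vector per residue
residue_embeddings = out.last_hidden_state[0, : len(sequence)]  # shape (L, 1024)
print(residue_embeddings.shape)
```

These (L, 1024) matrices are the "last hidden states" the excerpt refers to, and they are what embedding-based predictors consume in place of MSAs.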
“…Although falling substantially short of AlphaFold2 (Jumper et al, 2021). Per-protein embeddings outperform the best MSA-based methods in the prediction of sub-cellular location (Staerk et al, 2021), signal peptides (Teufel et al, 2021) and binding residues (Littmann et al, 2021c).…”
Section: Protein Language Models Capture Crucial Constraints
confidence: 95%
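The excerpt above distinguishes per-protein from per-residue embeddings. A common recipe, shown here as a sketch under stated assumptions rather than the cited papers' exact pipelines, is to mean-pool the per-residue matrix into one fixed-length vector before feeding a protein-level classifier.

```python
# Hedged sketch: mean-pooling per-residue vectors into a per-protein
# embedding (one common recipe; the cited methods may pool differently,
# e.g. with learned attention).
import torch

def per_protein_embedding(residue_embeddings: torch.Tensor) -> torch.Tensor:
    """Average an (L, 1024) per-residue matrix into a single 1024-d vector."""
    return residue_embeddings.mean(dim=0)

protein_vec = per_protein_embedding(torch.randn(120, 1024))  # toy input
print(protein_vec.shape)  # torch.Size([1024])
```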