2021
DOI: 10.1016/j.csbj.2021.03.022
The language of proteins: NLP, machine learning & protein sequences

Abstract: Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences be…

Cited by 225 publications (195 citation statements)
References 82 publications
“…Every year, algorithms improve natural language processing (NLP), in particular by feeding large text corpora into Deep Learning (DL)-based Language Models (LMs). These advances have been transferred to protein sequences by learning to predict masked or missing amino acids using large databases of raw protein sequences as input (Alley et al 2019; Bepler and Berger 2019a, 2021; Elnaggar et al 2021; Heinzinger et al 2019; Madani et al 2020; Ofer et al 2021; Rao et al 2020; Rives et al 2021). Processing the information learned by such protein LMs (pLMs), e.g., by constructing 1024-dimensional vectors of the last hidden layers, yields a representation of protein sequences referred to as embeddings [Fig.…”
Section: Introduction (mentioning)
confidence: 99%
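The statement above describes turning the last hidden layer of a pLM into a fixed-size sequence representation (an "embedding"). Below is a minimal sketch of one common way to do this, mean-pooling the last hidden layer over residues. It assumes the HuggingFace transformers package and the Rostlab/prot_bert checkpoint (whose hidden size happens to be 1024); the example sequence is arbitrary.

```python
# Minimal sketch: per-protein embeddings from a protein language model.
# Assumes the HuggingFace `transformers` package and the Rostlab/prot_bert
# checkpoint; any pLM exposing hidden states would work similarly.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Return a fixed-size embedding: mean of the last hidden layer over residues."""
    # ProtBert expects space-separated residues; map rare amino acids to X.
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, length + 2, 1024)
    # Drop the [CLS]/[SEP] special tokens, then average over residue positions.
    return hidden[0, 1:-1].mean(dim=0)  # shape: (1024,)

vector = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vector.shape)  # torch.Size([1024])
```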
“…Protein language models (pLMs) decode aspects of the language of life. In analogy to the recent leaps in Natural Language Processing (NLP), protein language models (pLMs) learn to "predict" masked amino acids given their context using no other annotation than the amino acid sequences of 10^7-10^9 proteins (Alley et al., 2019; Asgari & Mofrad, 2015; Bepler & Berger, 2019; Elnaggar et al., 2021; Heinzinger et al., 2019; Madani et al., 2020; Ofer et al., 2021; Rao et al., 2019; Rives et al., 2021; Wu et al., 2021). Toward this end, NLP words/tokens correspond to amino acids, while sentences correspond to full-length proteins in the current pLMs.…”
Section: Introduction (mentioning)
confidence: 99%
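To illustrate the masked-residue objective described above (tokens correspond to amino acids, sentences to whole proteins), here is a minimal sketch that hides one residue and lets the model predict it. It again assumes the HuggingFace transformers package and the Rostlab/prot_bert checkpoint; the sequence and masked position are arbitrary.

```python
# Minimal sketch of the masked-residue objective: treat each amino acid as a
# token, mask one, and ask the pLM to predict it from its sequence context.
# Assumes HuggingFace `transformers` and the Rostlab/prot_bert checkpoint.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = list("MKTAYIAKQRQISFVKSHFSRQ")
masked_pos = 5                                # residue we hide from the model
sequence[masked_pos] = tokenizer.mask_token   # "[MASK]"
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # (1, length + 2, vocab_size)

# Locate the masked position in the encoded input and take the top prediction.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print("predicted residue:", tokenizer.convert_ids_to_tokens(predicted_id))
```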
“…Embeddings extract the information learned by the pLMs. In analogy to LMs in NLP implicitly learning grammar, pLM embeddings decode some aspects of the language of life as written in protein sequences (Heinzinger et al., 2019; Ofer et al., 2021), which suffices as exclusive input to many methods predicting aspects of protein structure and function without any further optimization of the pLM using a second step of supervised training (Alley et al., 2019; Asgari & Mofrad, 2015; Elnaggar et al., 2021; Heinzinger et al., 2019; Madani et al., 2020; Rao et al., 2019; Rives et al., 2021), or by refining the pLM through another supervised task (Bepler & Berger, 2019). Embeddings can outperform homology-based inference based on the traditional sequence comparisons optimized over five decades (Littmann, Bordin, et al., 2021; …”
Section: Introduction (mentioning)
confidence: 99%
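As a sketch of the "embeddings as exclusive input" idea described above, the snippet below keeps the pLM frozen and feeds its per-protein vectors into a plain logistic-regression classifier. It reuses the embed helper from the first sketch; the sequences and labels are hypothetical placeholders, not real annotations.

```python
# Minimal sketch: frozen pLM embeddings as features for a simple supervised
# predictor. `embed` is the helper sketched earlier; the sequences and labels
# below are hypothetical placeholders standing in for a real annotated dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sequences = ["MKTAYIAKQR", "GAVLIPFMWST", "QNHKRDECYG", "MSTNPKPQRK"]
labels = [1, 0, 0, 1]  # placeholder binary property, e.g., membrane vs. soluble

# One fixed-size vector per protein: shape (n_proteins, 1024).
X = np.stack([embed(seq).numpy() for seq in sequences])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```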
“…As all these approaches bypass the need for an intermediate structure and use a simple sequence representation, they are not the core of this review. For further reading, we refer the reader to recent reviews [80,81].…”
Section: One-hot Encoding (mentioning)
confidence: 99%