2021
DOI: 10.1016/j.csbj.2021.03.022
The language of proteins: NLP, machine learning & protein sequences

Abstract: Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences be…

Cited by 225 publications (195 citation statements)
References 82 publications
“…Every year, algorithms improve natural language processing (NLP), in particular by feeding large text corpora into Deep Learning (DL)-based Language Models (LMs). These advances have been transferred to protein sequences by learning to predict masked or missing amino acids using large databases of raw protein sequences as input (Alley et al 2019; Bepler and Berger 2019a, 2021; Elnaggar et al 2021; Heinzinger et al 2019; Madani et al 2020; Ofer et al 2021; Rao et al 2020; Rives et al 2021). Processing the information learned by such protein LMs (pLMs), e.g., by constructing 1024-dimensional vectors of the last hidden layers, yields a representation of protein sequences referred to as embeddings [Fig.…”
Section: Introduction (mentioning)
confidence: 99%
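The statement above describes turning the last hidden layer of a pLM into a fixed-size sequence representation (an "embedding"). Below is a minimal sketch of one common way to do this, mean-pooling the last hidden layer over residues. It assumes the HuggingFace transformers package and the Rostlab/prot_bert checkpoint (whose hidden size happens to be 1024); the example sequence is arbitrary.

```python
# Minimal sketch: per-protein embeddings from a protein language model.
# Assumes the HuggingFace `transformers` package and the Rostlab/prot_bert
# checkpoint; any pLM exposing hidden states would work similarly.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Return a fixed-size embedding: mean of the last hidden layer over residues."""
    # ProtBert expects space-separated residues; map rare amino acids to X.
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, length + 2, 1024)
    # Drop the [CLS]/[SEP] special tokens, then average over residue positions.
    return hidden[0, 1:-1].mean(dim=0)  # shape: (1024,)

vector = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vector.shape)  # torch.Size([1024])
```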
“…Protein language models (pLMs) decode aspects of the language of life. In analogy to the recent leaps in Natural Language Processing (NLP), protein language models (pLMs) learn to "predict" masked amino acids given their context using no other annotation than the amino acid sequences of 10^7-10^9 proteins (Alley et al., 2019; Asgari & Mofrad, 2015; Bepler & Berger, 2019; Elnaggar et al., 2021; Heinzinger et al., 2019; Madani et al., 2020; Ofer et al., 2021; Rao et al., 2019; Rives et al., 2021; Wu et al., 2021). Toward this end, NLP words/tokens correspond to amino acids, while sentences correspond to full-length proteins in the current pLMs.…”
Section: Introduction (mentioning)
confidence: 99%
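To illustrate the masked-residue objective described above (tokens correspond to amino acids, sentences to whole proteins), here is a minimal sketch that hides one residue and lets the model predict it. It again assumes the HuggingFace transformers package and the Rostlab/prot_bert checkpoint; the sequence and masked position are arbitrary.

```python
# Minimal sketch of the masked-residue objective: treat each amino acid as a
# token, mask one, and ask the pLM to predict it from its sequence context.
# Assumes HuggingFace `transformers` and the Rostlab/prot_bert checkpoint.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = list("MKTAYIAKQRQISFVKSHFSRQ")
masked_pos = 5                                # residue we hide from the model
sequence[masked_pos] = tokenizer.mask_token   # "[MASK]"
inputs = tokenizer(" ".join(sequence), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # (1, length + 2, vocab_size)

# Locate the masked position in the encoded input and take the top prediction.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print("predicted residue:", tokenizer.convert_ids_to_tokens(predicted_id))
```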
“…Embeddings extract the information learned by the pLMs. In analogy to LMs in NLP implicitly learning grammar, pLM embeddings decode some aspects of the language of life as written in protein sequences (Heinzinger et al., 2019; Ofer et al., 2021), which suffices as exclusive input to many methods predicting aspects of protein structure and function without any further optimization of the pLM using a second step of supervised training (Alley et al., 2019; Asgari & Mofrad, 2015; Elnaggar et al., 2021; Heinzinger et al., 2019; Madani et al., 2020; Rao et al., 2019; Rives et al., 2021), or by refining the pLM through another supervised task (Bepler & Berger, 2019). Embeddings can outperform homology-based inference based on the traditional sequence comparisons optimized over five decades (Littmann, Bordin, et al., 2021; …”
Section: Introduction (mentioning)
confidence: 99%
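As a sketch of the "embeddings as exclusive input" idea described above, the snippet below keeps the pLM frozen and feeds its per-protein vectors into a plain logistic-regression classifier. It reuses the embed helper from the first sketch; the sequences and labels are hypothetical placeholders, not real annotations.

```python
# Minimal sketch: frozen pLM embeddings as features for a simple supervised
# predictor. `embed` is the helper sketched earlier; the sequences and labels
# below are hypothetical placeholders standing in for a real annotated dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sequences = ["MKTAYIAKQR", "GAVLIPFMWST", "QNHKRDECYG", "MSTNPKPQRK"]
labels = [1, 0, 0, 1]  # placeholder binary property, e.g., membrane vs. soluble

# One fixed-size vector per protein: shape (n_proteins, 1024).
X = np.stack([embed(seq).numpy() for seq in sequences])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```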
“…As all these approaches bypass the need for an intermediate structure and use a simple sequence representation, they are not the core of this review. For further reading, we refer the reader to recent reviews [80,81].…”
Section: One-hot Encoding (mentioning)
confidence: 99%