2023
DOI: 10.48550/arxiv.2301.06568
Preprint

Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling


Cited by 18 publications (30 citation statements: 1 supporting, 29 mentioning, 0 contrasting). References 0 publications.
“…As a baseline, the receptive field was represented as a one-hot encoding of the amino acids, i.e., no pLM was used. Then, multiple pLMs were considered in this study: ESM-1 small and ESM-1b 31, ESM-2 34, ProtT5-XL-U50 32, CARP-640M 50, Ankh-base 35, and Ankh-large 35 (see Suppl. Table 4 for more details).…”
Section: Methods | Citation type: mentioning | Confidence: 99%
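The one-hot baseline mentioned in this statement is simple to reproduce. Here is a minimal sketch; the alphabet ordering and function name are illustrative choices, not taken from the cited study:

```python
import numpy as np

# 20 standard amino acids; non-standard residues are left as all-zero rows.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence.upper()):
        if aa in AA_INDEX:
            encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

# A receptive field of width w centred on position p is then the slice
# encoding[max(0, p - w // 2) : p + w // 2 + 1].
print(one_hot_encode("MKT").shape)  # (3, 20)
```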
“…Recently, LMPhosSite utilized protein language models (pLMs) to improve phosphosite prediction by adding single-position embeddings as input features 28. pLMs are pretrained models that yield enriched, structure-aware sequence representations, instead of merely encoding the amino acid composition of a receptive field in a protein [29][30][31][32][33][34][35]. They have demonstrated value in various tasks, such as few-shot contact map prediction 36, protein structure prediction 34, zero-shot mutation impact prediction 37, or phylogenetic relationship modelling 38.…”
Section: Introduction | Citation type: mentioning | Confidence: 99%
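Per-residue embeddings of the kind these statements describe can be extracted from Ankh roughly as follows. This is a minimal sketch, assuming the ElnaggarLab/ankh-base checkpoint on the Hugging Face Hub and the transformers library; it is not code from the cited papers:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumed checkpoint name; Ankh is built on a T5-style encoder.
tokenizer = AutoTokenizer.from_pretrained("ElnaggarLab/ankh-base")
model = T5EncoderModel.from_pretrained("ElnaggarLab/ankh-base").eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Ankh tokenizes per residue, so the sequence is passed as a list of characters.
inputs = tokenizer([list(sequence)], is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, L + 1, d); the final token is </s>

per_residue = hidden[0, : len(sequence)]  # one structure-aware vector per amino acid
print(per_residue.shape)
```

A single-position feature of the kind LMPhosSite adds would then simply be per_residue[p] for the site of interest.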
“…TAPE (Rao et al., 2019) employed self-supervised pretraining on large protein sequence datasets and fine-tuning on specific tasks to predict protein properties. Ankh (Elnaggar et al., 2023) utilized protein sequences as input and generated predictions related to protein structure and function. Combination of sequence and structure: some other methods merged both sequence and 3D structure information.…”
Section: A1 Related Work | Citation type: mentioning | Confidence: 99%
“…TAPE (Rao et al., 2019) employed self-supervised pretraining on large protein sequence datasets and fine-tuning on specific tasks to predict protein properties. Ankh (Elnaggar et al., 2023) utilized protein sequences as input and generated predictions related to protein structure and function. ProGen2 (Madani et al., 2023) generated protein sequences conditioned on input sequences and controllable tags specifying protein properties.…”
Section: Related Work | Citation type: mentioning | Confidence: 99%
“…Some notable contributions include AlphaFold2, RoseTTAFold, ESMFold, OmegaFold, and EMBER2, which have successfully estimated amino acid sequence-to-structure mapping [7, 8, 9, 10, 11]. More generalized models such as ProtBERT, ProtT5, Ankh, and xTrimoPGLM offer highly effective contextualized sequence representations that map intuitively to protein function, gene ontology, physicochemical properties, and more [12, 13, 14]. Interestingly, some pLM projects have opted for different vocabularies outside of the traditional single-letter amino acid code.…”
Section: Introduction | Citation type: mentioning | Confidence: 99%