2020
DOI: 10.1101/2020.06.15.153643
Preprint

Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks

Abstract: The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family classification a…
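Below is a minimal, illustrative sketch of the pretrain-then-fine-tune pattern the abstract describes: a small Transformer encoder over amino-acid tokens with a classification head for protein family prediction. The architecture, dimensions, vocabulary handling, and the ProteinFamilyClassifier name are assumptions made for illustration, not the authors' published configuration.

# Hypothetical sketch (not the authors' model): Transformer encoder over
# amino-acid tokens, mean-pooled and fed to a protein-family classification head.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                  # 20 standard residues
PAD = 0                                                # padding token id (assumed)
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

class ProteinFamilyClassifier(nn.Module):
    def __init__(self, n_families, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB) + 1, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_families)     # fine-tuned classification head

    def forward(self, tokens):                          # tokens: (batch, seq_len) long
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(pos)
        h = self.encoder(h, src_key_padding_mask=tokens.eq(PAD))
        return self.head(h.mean(dim=1))                 # mean-pool residues, then classify

def encode(seq, max_len=512):
    ids = [VOCAB[aa] for aa in seq[:max_len]]
    return torch.tensor(ids + [PAD] * (max_len - len(ids)))

model = ProteinFamilyClassifier(n_families=100)
logits = model(encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").unsqueeze(0))
print(logits.shape)                                     # torch.Size([1, 100])

In the workflow the abstract outlines, the encoder weights would first be learned with a self-supervised objective on unlabeled sequences (see the masked-residue sketch further below) and only then fine-tuned with labeled family annotations.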

Cited by 65 publications (79 citation statements)
References 49 publications
“…Self-supervised pretraining has been shown to boost model performance for natural language processing and computer vision tasks (Devlin et al., 2019; Chen et al., 2020). Recent research has also shown the potential benefits of self-supervised pretraining on protein related tasks (Rao et al., 2019; Nambiar et al., 2020), such as contact prediction. However, to date, no work has explored self-supervised pretraining on MHC–peptide related tasks.…”
Section: Results
confidence: 99%
“…Such models are trained to predict words masked out in a sentence or to predict the next word or sentence following some context. Similar techniques have also been applied to proteins (Rao et al., 2019; Nambiar et al., 2020; Heinzinger et al., 2019). Since these models do not require labels to train, they can be trained on very large corpora of protein sequences across many species.…”
Section: Introduction
confidence: 99%
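As a concrete illustration of the masked-token objective described in the excerpt above, the sketch below corrupts roughly 15% of the residues in a batch of protein sequences and trains an encoder to recover them; no labels are required, so any large sequence corpus can be used. All names, hyperparameters, and the toy sequences are assumptions for illustration, not any cited paper's exact setup.

# Minimal masked-residue (BERT-style) pretraining sketch for protein sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 0, 1                                       # special token ids (assumed)
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(VOCAB) + 2

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace residues with MASK; return corrupted inputs and reconstruction targets."""
    targets = tokens.clone()
    maskable = tokens.ne(PAD)
    chosen = (torch.rand(tokens.shape, device=tokens.device) < mask_prob) & maskable
    if not chosen.any():                               # guarantee at least one masked position
        chosen = maskable & (torch.cumsum(maskable.long(), dim=-1) == 1)
    targets[~chosen] = -100                            # loss is computed only at masked positions
    return tokens.masked_fill(chosen, MASK), targets

class MaskedResidueModel(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)  # predicts the identity of masked residues

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.embed(tokens) + self.pos(pos),
                         src_key_padding_mask=tokens.eq(PAD))
        return self.lm_head(h)                         # (batch, seq_len, VOCAB_SIZE)

# One self-supervised training step on a toy batch; no labels are needed.
seqs = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAGK"]
batch = torch.stack([torch.tensor([VOCAB[a] for a in s] + [PAD] * (32 - len(s)))
                     for s in seqs])
model = MaskedResidueModel()
inputs, targets = mask_tokens(batch)
logits = model(inputs)
loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), targets.view(-1), ignore_index=-100)
loss.backward()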
“…Recent studies show that models based on natural language processing inspired techniques such as Transformer, [217] BERT, [218] and GPT-2 [219] can learn features from a large corpus of protein sequences in a self-supervised fashion, with applications in a variety of downstream tasks. [220,221] Besides a linear sequence of amino acids, proteins can also be modeled as a graph to capture both structure and sequence information. Graph neural networks [222] are powerful deep learning architectures for learning representations of nodes and edges from such data.…”
Section: Discussion
confidence: 99%
“…Natural language processing models, specifically language modeling techniques, have also made an impact in the domain of COVID-19 vaccine discovery. Pre-trained transformers were used to predict protein interaction (Nambiar et al., 2020) and model molecular reactions in carbohydrate chemistry (Pesciullesi et al., 2020), which can be utilized in the process of vaccine development. Chen et al. discussed the use case of an LSTM-based seq-2-seq model for predicting the secondary structure of certain SARS-CoV-2 proteins (Karpov et al., 2019).…”
Section: COVID-19 Vaccine Discovery
confidence: 99%