2020
DOI: 10.1101/2020.09.04.283929
Preprint
Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization

Abstract: Pretrained embedding representations of biological sequences which capture meaningful properties can alleviate many problems associated with supervised learning in biology. We apply the principle of mutual information maximization between local and global information as a self-supervised pretraining signal for protein embeddings. To do so, we divide protein sequences into fixed-size fragments, and train an autoregressive model to distinguish between subsequent fragments from the same protein and fragments from…
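The sketch below is a minimal illustration (not the authors' released code) of the setup described in the abstract, written in PyTorch: each protein is cut into fixed-size fragments, an autoregressive encoder summarizes the fragments seen so far, and an InfoNCE-style loss asks the model to pick the true next fragment of the same protein out of a batch that also contains fragments from other proteins. The module names, GRU encoders, and dimensions are all illustrative assumptions.

```python
# Hedged sketch of a local/global mutual-information (InfoNCE-style) objective
# over protein fragments. All names and architectural choices are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentNCE(nn.Module):
    def __init__(self, vocab_size=26, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.frag_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True)    # fragment -> local vector
        self.ar_enc = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # autoregressive global context
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def encode_fragments(self, frags):
        # frags: (batch, n_frags, frag_len) integer-encoded residues
        b, n, L = frags.shape
        _, h = self.frag_enc(self.embed(frags.reshape(b * n, L)))
        return h.squeeze(0).reshape(b, n, -1)                            # (b, n, hidden)

    def forward(self, frags):
        z = self.encode_fragments(frags)                                 # local fragment embeddings
        c, _ = self.ar_enc(z)                                            # context over fragments seen so far
        pred = self.proj(c[:, :-1, :])                                   # predict fragment t+1 from context at t
        target = z[:, 1:, :]                                             # true next fragments
        b, t, _ = pred.shape
        # score each context against every candidate next fragment in the batch;
        # the matching protein (diagonal) is the positive, other proteins are negatives
        logits = torch.einsum("ith,jth->tij", pred, target)              # (t, b, b)
        labels = torch.arange(b, device=frags.device).expand(t, b)
        return F.cross_entropy(logits.reshape(t * b, b), labels.reshape(t * b))

# Usage sketch: a batch of 8 proteins, each split into 6 fragments of 32 residues.
model = FragmentNCE()
frags = torch.randint(0, 26, (8, 6, 32))
loss = model(frags)
loss.backward()
```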

Cited by 56 publications (59 citation statements)
References 42 publications
“…Alternatives to the masked language modeling objective have also been explored, such as conditional generation (Madani et al, 2020) and contrastive loss functions (Lu et al, 2020). Most relevant to our work, Sturmfels et al (2020) and Sercu et al (2020) study alternative learning objectives using sets of sequences for supervision.…”
Section: Related Work
confidence: 99%
“…These tasks are not directly useful, but are intended to teach the model transferable skills and representations, and are designed so the models learn autonomously without expert labels. Several self-supervised learning methods have been applied to protein sequences, and have been effective in teaching the models features that are useful for downstream analyses (Alley et al, 2019; Heinzinger et al, 2019; Rao et al, 2019, 2021; Lu et al, 2020; Rives et al, 2021). However, the majority of these tasks directly repurpose methods from natural language processing (Alley et al, 2019; Heinzinger et al, 2019; Rao et al, 2019; Rives et al, 2021), and it is unclear what kinds of features the tasks induce the models to learn in the context of protein sequences.…”
Section: Introduction
confidence: 99%
“…Since we have only sequence, we need to capture the structural propensities of a given sequence. Many potential choices for embeddings apropos to structurally informed tasks are computationally expensive [11, 29, 30, 31, 32, 33, 34]. For this paper we have selected a well vetted language model [11] to cast the protein residues to a latent space with the intention of recovering the underlying grammars behind protein sequences.…”
Section: Methods
confidence: 99%
“…Often these models will learn these representations by being trained to predict randomly masked residues within a protein sequence. Multiple studies have shown the merits of these models when performing protein structure prediction, remote homology and protein design [30, 35, 29, 32, 31, 33, 34]. Here, we have used the pretrained LSTM PFam model from [11].…”
Section: Methods
confidence: 99%
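For context on the masked-residue objective mentioned in the excerpt above, here is a hedged sketch (illustrative only, not the cited work's actual code or the pretrained LSTM PFam model referenced as [11]): random positions of an integer-encoded protein sequence are replaced with a mask token, and a small bidirectional LSTM is trained to recover the original residues at those positions. Vocabulary size, dimensions, and all names are assumptions.

```python
# Hedged sketch of masked-residue pretraining for protein sequences in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 26      # 20 amino acids plus special tokens (illustrative)
MASK_ID = 25    # hypothetical <mask> token id

class MaskedResidueLSTM(nn.Module):
    def __init__(self, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, VOCAB)

    def forward(self, seqs):
        h, _ = self.lstm(self.embed(seqs))   # per-residue hidden states (the learned embeddings)
        return self.head(h)                  # logits over residue identities

def masked_lm_loss(model, seqs, mask_prob=0.15):
    mask = torch.rand(seqs.shape) < mask_prob        # choose positions to hide
    corrupted = seqs.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], seqs[mask]) # loss only on masked positions

# Usage sketch: 4 toy sequences of length 100.
model = MaskedResidueLSTM()
seqs = torch.randint(0, 25, (4, 100))
loss = masked_lm_loss(model, seqs)
loss.backward()
```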