2023
DOI: 10.1101/2023.02.22.529597
Preprint

Retrieved Sequence Augmentation for Protein Representation Learning

Abstract: Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in function and structure, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSAs) to supply evolutionary knowledge. Despite their success, the heavy computational overhead of MSAs, as well as de novo and orphan proteins, remain major challenges in protein representation learning. In this work,…
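The abstract is truncated before the method details, but the core idea of retrieved sequence augmentation, finding related sequences by embedding similarity and feeding them to the model alongside the query instead of building an MSA, can be sketched roughly. Below is a minimal Python illustration under those assumptions; `embed` and `retrieve_top_k` are hypothetical names, and the stand-in embedding is a random projection of residue counts, not the paper's actual encoder or retriever.

```python
import numpy as np

# Hypothetical embedding function: in practice this would be a protein
# language model encoder (e.g. a mean-pooled transformer embedding);
# here a fixed random projection of residue counts is a stand-in.
def embed(sequence: str, dim: int = 64, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)       # same seed -> same projection
    proj = rng.standard_normal((26, dim))
    counts = np.zeros(26)
    for aa in sequence:
        counts[ord(aa) - ord("A")] += 1
    v = counts @ proj
    return v / (np.linalg.norm(v) + 1e-8)   # unit-normalize

def retrieve_top_k(query: str, database: list[str], k: int = 2) -> list[str]:
    """Dense retrieval by cosine similarity; no alignment step."""
    q = embed(query)
    db = np.stack([embed(s) for s in database])
    scores = db @ q                          # cosine similarity of unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [database[i] for i in top]

# Augment the model input by concatenating retrieved sequences with the
# query, in place of constructing an MSA.
database = ["MKTAYIAKQR", "MKTAYIAKQQ", "GSHMLEDPVR", "MSTNPKPQRK"]
query = "MKTAYIAKQN"
augmented_input = [query] + retrieve_top_k(query, database, k=2)
print(augmented_input)
```

The design point this sketch captures is that retrieval replaces alignment: the retrieved sequences are passed to the downstream model as extra context without the quadratic cost of building an MSA.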

Cited by 2 publications (3 citation statements)
References 68 publications
“…Along the same lines, predicted 3Di can help speed up recently developed structural phylogenetics based on 3Di 71 . Deriving embeddings from structures will also expand the power of embedding-based alignments 72,73 , and retrieval Transformers 74 . Our proposed integration of 3D information into pLMs constitutes the first step toward building truly multi-model pLMs that capture the multiple facets of protein structure, function and evolution.…”
Section: Discussion (citation type: mentioning, confidence: 99%)
“…Constant speed-ups of LLMs [57] also decrease translation latency which in turn renders the translation from AA to 3Di an attractive alternative to searching large metagenomic datasets at the sensitivity of structure comparison while avoiding the overhead of actually having to predict 3D structures. Deriving embeddings from structures will also expand the power of embedding-based alignments [67], [68], and retrieval Transformers [69]. Our proposed integration of 3D information into pLMs may only constitute the first step towards building truly multi-model pLMs that capture the multiple facets of protein structure, function and evolution.…”
Section: Discussion (citation type: mentioning, confidence: 99%)
“…Second, while a variety of other pretraining tasks have been proposed for protein transfer learning, and different pretraining tasks could potentially learn different aspects of protein biology, we remain uncertain if they will result in significant differences from MLMs. Many pretraining tasks still aim to reconstruct natural sequences (He et al, 2021; Notin et al, 2022; Tan et al, 2023; Ma et al, 2023) and so are also likely to primarily learn coevolutionary patterns. Other tasks use structure as an additional input or target, but they generally make only modest improvements on function prediction tasks (Mansoor et al, 2021; Wang et al, 2022; Yang et al, 2023; Su et al, 2023).…”
Section: Discussion (citation type: mentioning, confidence: 99%)