2023
DOI: 10.1101/2023.11.29.569288
Preprint

Multimodal Pretraining for Unsupervised Protein Representation Learning

Viet Thanh Duy Nguyen,
Truong Son Hy

Abstract: In this paper, we introduce a framework of symmetry-preserving multimodal pretraining to learn a unified representation of proteins in an unsupervised manner, encompassing both primary and tertiary structures. Our approach involves proposing specific pretraining methods for sequences, graphs, and 3D point clouds associated with each protein structure, leveraging the power of large language models and generative models. We present a novel way to combine representations from multiple sources of information int…

Cited by 2 publications (2 citation statements)
References 84 publications
“…types at an aggregate level (Strokach et al, 2021;Høie et al, 2022;Cagiada et al, 2023;Nguyen and Hy, 2023), although some results suggest that a richer representation might be learned by combining multiple data types at the input level (Mansoor et al, 2021;Wu et al, 2023;Wang et al, 2022;Yang et al, 2022;Chen et al, 2023;Cheng et al, 2023;Zhang et al, 2023).…”
Section: GNN
confidence: 99%
“…Examples of the types of data used as input include the wild-type amino acid sequence ( Lin et al, 2022; Brandes et al, 2022 ), a multiple sequence alignment (MSA) ( Ng and Henikoff, 2001; Balakrishnan et al, 2011; Lui and Tiana, 2013; Nielsen et al, 2017; Hopf et al, 2017; Riesselman et al, 2018; Laine et al, 2019 ) or the protein structure ( Boomsma and Frellsen, 2017; Jing et al, 2021a; Hsu et al, 2022 ). Some methods have combined predictions from multiple protein data types at an aggregate level ( Strokach et al, 2021; Høie et al, 2022; Cagiada et al, 2023; Nguyen and Hy, 2023 ), although some results suggest that a richer representation might be learned by combining multiple data types at the input level ( Mansoor et al, 2021; Wu et al, 2023; Wang et al, 2022; Yang et al, 2022; Chen et al, 2023; Cheng et al, 2023; Zhang et al, 2023 ).…”
Section: Introduction
confidence: 99%