2023
DOI: 10.1101/2023.11.29.569288
Preprint

Multimodal Pretraining for Unsupervised Protein Representation Learning

Viet Thanh Duy Nguyen,
Truong Son Hy

Abstract: In this paper, we introduce a framework of symmetry-preserving multimodal pretraining to learn a unified representation of proteins in an unsupervised manner, encompassing both primary and tertiary structures. Our approach involves proposing specific pretraining methods for sequences, graphs, and 3D point clouds associated with each protein structure, leveraging the power of large language models and generative models. We present a novel way to combine representations from multiple sources of information int…

Cited by 2 publications (2 citation statements)
References 84 publications
“…types at an aggregate level (Strokach et al, 2021;Høie et al, 2022;Cagiada et al, 2023;Nguyen and Hy, 2023), although some results suggest that a richer representation might be learned by combining multiple data types at the input level (Mansoor et al, 2021;Wu et al, 2023;Wang et al, 2022;Yang et al, 2022;Chen et al, 2023;Cheng et al, 2023;Zhang et al, 2023).…”
Section: GNN
confidence: 99%
“…Examples of the types of data used as input include the wild-type amino acid sequence ( Lin et al, 2022; Brandes et al, 2022 ), a multiple sequence alignment (MSA) ( Ng and Henikoff, 2001; Balakrishnan et al, 2011; Lui and Tiana, 2013; Nielsen et al, 2017; Hopf et al, 2017; Riesselman et al, 2018; Laine et al, 2019 ) or the protein structure ( Boomsma and Frellsen, 2017; Jing et al, 2021a; Hsu et al, 2022 ). Some methods have combined predictions from multiple protein data types at an aggregate level ( Strokach et al, 2021; Høie et al, 2022; Cagiada et al, 2023; Nguyen and Hy, 2023 ), although some results suggest that a richer representation might be learned by combining multiple data types at the input level ( Mansoor et al, 2021; Wu et al, 2023; Wang et al, 2022; Yang et al, 2022; Chen et al, 2023; Cheng et al, 2023; Zhang et al, 2023 ).…”
Section: Introduction
confidence: 99%