2022
DOI: 10.1609/aaai.v36i6.20636

Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures

Abstract: The protein tertiary structure largely determines its interactions with other molecules. Despite its importance in various structure-related tasks, fully supervised data are often time-consuming and costly to obtain. Existing pre-training models mostly focus on amino-acid sequences or multiple sequence alignments, while structural information is not yet exploited. In this paper, we propose a self-supervised pre-training model for learning structure embeddings from protein tertiary structures. Native protein…

Citations: cited by 18 publications (13 citation statements)
References: 29 publications

“…We adopt the structural representation network proposed in [61] as our teacher network, utilizing the pretrained weights they provided. The teacher network is solely engaged in the process of structural information distillation, extracting representations from structural data, and does not participate in the training or inference processes of downstream tasks.…”
Section: Methods
confidence: 99%
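The statement above describes a frozen-teacher setup for structural-information distillation. As a rough illustration only, the sketch below (in PyTorch) freezes a pretrained structure encoder so it merely supplies target representations while a student network is trained against them; the class name DistillationTrainer, the teacher/student module interfaces, and the plain L2 distillation loss are assumptions for illustration, not the cited paper's actual implementation.

```python
# Minimal sketch (assumed names): a frozen, pretrained structure encoder acts as
# a teacher that only produces target representations; it is never updated and is
# not used at downstream-task inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationTrainer:
    def __init__(self, teacher: nn.Module, student: nn.Module, lr: float = 1e-4):
        self.teacher = teacher.eval()            # pretrained structure encoder, kept frozen
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student                   # e.g. a sequence encoder being trained
        self.opt = torch.optim.Adam(self.student.parameters(), lr=lr)

    def step(self, structure_batch, sequence_batch):
        with torch.no_grad():
            target = self.teacher(structure_batch)   # structural representations (targets)
        pred = self.student(sequence_batch)           # student representations
        loss = F.mse_loss(pred, target)               # simple L2 distillation loss (assumed)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

Keeping the teacher in eval mode with gradients disabled mirrors the statement's point that the teacher participates only in distillation, not in downstream training or inference.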
“…These edges are typically based on the Cα distances between the residues. GNNs utilize 𝒱 for diverse pretraining strategies like contrastive learning (Hermosilla & Ropinski, 2022; Zhang et al., 2023b;a), self-prediction (Yang et al., 2022; Chen et al., 2023) and denoising score matching (Guo et al., 2022; Wu et al., 2022a). Another way inspired by AF2 involves incorporating structure features as contact biases into the attention maps within the self-attention module, e.g., Uni-Mol (Zhou et al., 2023).…”
Section: Related Work
confidence: 99%
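The excerpt above refers to residue graphs whose edges are based on Cα distances. The following is a minimal sketch of that graph construction, assuming Cα coordinates have already been extracted; the function name residue_graph, the 10 Å default cutoff, and the toy coordinates are illustrative assumptions rather than details from any of the cited works.

```python
# Minimal sketch (assumed interface): build a residue graph whose directed edges
# connect residue pairs with Calpha-Calpha distance below a cutoff.
import numpy as np

def residue_graph(ca_coords: np.ndarray, cutoff: float = 10.0) -> np.ndarray:
    """ca_coords: (N, 3) array of Calpha coordinates for N residues.
    Returns an (E, 2) array of directed edges (i, j) with dist(i, j) < cutoff."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]   # pairwise displacements
    dist = np.linalg.norm(diff, axis=-1)                    # (N, N) distance matrix
    mask = (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)  # drop self-loops
    src, dst = np.nonzero(mask)
    return np.stack([src, dst], axis=1)

# Toy example: a 4-residue chain spaced 3.8 Å apart along the x-axis.
coords = np.array([[0.0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0]])
edges = residue_graph(coords, cutoff=8.0)
```

A GNN-based pretraining pipeline of the kind cited above would then consume these edges together with per-residue node features.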
“…However, a simple similarity measure with a pre-set threshold is insufficient to assign high-confidence protein function. DL-based PFP methods include function prediction from AA-sequence (Rao et al., 2019; Alley et al., 2019; Elnaggar et al., 2020; Dallago et al., 2021; Kulmanov & Hoehndorf, 2020; Meier et al., 2021; Biswas et al., 2021; Gelman et al., 2021; Yang et al., 2022a), 3-dimensional structure (Gligorijević et al., 2021; Smaili et al., 2021; Guo et al., 2022), evolutionary relationships and genomic context (Rao et al., 2021; Engelhardt et al., 2005), and their combinations (Gligorijević et al., 2021). Here, we mainly restrict our scope to the sequence and structure.…”
Section: Related Work
confidence: 99%