2021
DOI: 10.1038/s41467-021-22869-8

CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction

Abstract: Residue co-evolution has become the primary principle for estimating inter-residue distances of a protein, which are crucially important for predicting protein structure. Most existing approaches adopt an indirect strategy, i.e., inferring residue co-evolution based on some hand-crafted features, say, a covariance matrix, calculated from the multiple sequence alignment (MSA) of the target protein. This indirect strategy, however, cannot fully exploit the information carried by the MSA. Here, we report an end-to-end deep n…


Cited by 65 publications (54 citation statements)
References 30 publications
“…By learning from thousands of experimentally solved protein structures, ResNet greatly reduced the number of sequence homologs needed for satisfactory contact prediction, doubling or even tripling the precision over traditional methods on the CASP13 hard test proteins [ 45 ]. Recent studies have shown that ResNet was able to predict accurate contacts and correct folds for most proteins with more than 30 non-redundant sequence homologs [ 95 , 96 ]. One of the major differences between DBN and RaptorX’s ResNet is that the former predicts inter-residue contacts one by one while the latter predicts the whole contact matrix simultaneously.…”
Section: Neoantigen Identification
confidence: 99%
“…Most neural network models, including AlphaFold (AlQuraishi, 2019) and RaptorX (Xu, 2019), rely on this feature. However, due to the considerable information loss after transforming MSAs into hand-crafted features, supervised models, such as CopulaNet (Ju et al, 2021) and AlphaFold2 (Jumper et al, 2021), have been proposed to build directly on the raw MSA. The superior performance over the baselines demonstrates that residue co-evolution information can be mined from the raw sequences by the model.…”
Section: Related Work
confidence: 99%
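The hand-crafted covariance feature that these raw-MSA models bypass can be illustrated with a small sketch. Everything below (the toy alignment and function names) is a hypothetical illustration, not code from CopulaNet or any cited tool: the MSA is one-hot encoded and the Frobenius norm of each inter-column covariance block serves as a coupling score.

```python
import numpy as np

# Toy MSA: rows are homologous sequences, columns are alignment positions.
# A real pipeline would parse this from an .a3m/.fasta alignment file.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
msa = [
    "ACDEA",
    "ACDEA",
    "GCDKA",
    "GCDKA",
    "ACDEA",
]

def one_hot(msa, alphabet=ALPHABET):
    """One-hot encode an MSA into shape (n_seqs, n_cols, n_states)."""
    idx = {a: i for i, a in enumerate(alphabet)}
    n, L, q = len(msa), len(msa[0]), len(alphabet)
    X = np.zeros((n, L, q))
    for s, seq in enumerate(msa):
        for j, a in enumerate(seq):
            X[s, j, idx[a]] = 1.0
    return X

def coupling_matrix(msa):
    """Frobenius norm of the per-position-pair covariance blocks.

    This is the classic hand-crafted co-evolution feature: a large value
    suggests two columns vary together across the alignment."""
    X = one_hot(msa)
    n, L, q = X.shape
    flat = X.reshape(n, L * q)
    cov = np.cov(flat, rowvar=False)                # (L*q, L*q)
    blocks = cov.reshape(L, q, L, q)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))  # (L, L)

C = coupling_matrix(msa)
# Columns 0 and 3 co-vary (A<->E vs G<->K), so their coupling should
# exceed that of column 0 with the invariant column 1.
print(C[0, 3] > C[0, 1])  # True
```

The information loss the quote refers to is visible here: only pairwise second-order statistics survive the transformation, which is precisely what motivates feeding the raw MSA to the network instead.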
“…For protein structure prediction, the key step is to predict inter-residue contacts/distances, while the shared cornerstone of prediction is performing evolutionary coupling analysis, i.e. residue co-evolution analysis, on the constructed MSA for a target protein (Ju et al, 2021). The underlying rationale is that two residues which are spatially close in the three-dimensional structure tend to co-evolve, which in turn can be exploited to estimate contacts/distances between residues (Seemayer et al, 2014; Jones & Kandathil, 2018).…”
Section: Introduction
confidence: 99%
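The co-evolution signal described in this passage can be made concrete with a toy sketch (the alignment and function below are illustrative, not taken from the cited papers): mutual information between two alignment columns measures how statistically coupled their residue identities are, which is the raw signal that coupling-analysis methods refine into contact predictions.

```python
import math
from collections import Counter

# Toy alignment: columns 0 and 2 co-evolve (A<->E vs G<->K),
# while column 1 varies independently of both.
msa = [
    "ACE",
    "ADE",
    "GCK",
    "GDK",
    "ACE",
    "GDK",
]

def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j.

    High MI means the residue identities at the two positions are
    statistically coupled across the protein family."""
    n = len(msa)
    pi = Counter(s[i] for s in msa)
    pj = Counter(s[j] for s in msa)
    pij = Counter((s[i], s[j]) for s in msa)
    mi = 0.0
    for (a, b), c in pij.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), with counts over n rows.
        mi += (c / n) * math.log(c * n / (pi[a] * pj[b]))
    return mi

# The perfectly coupled pair (0, 2) reaches log(2); the independent
# pair (0, 1) scores much lower.
print(mutual_information(msa, 0, 2) > mutual_information(msa, 0, 1))  # True
```

Methods such as direct coupling analysis go further by disentangling direct from transitive correlations, but the basic quantity being exploited is the same column-pair coupling shown here.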
“…These huge amounts … TABLE 1 Overview of X-to-end and end-to-X deep learning approaches for protein structure prediction.
End-to-end learning: AlphaFold2 [19], where the MSA, along with templates, is fed into a translation- and rotation-equivariant transformer architecture that outputs a 3D structural model; DMPfold2 (new) [35], where the MSA, along with the precision matrix, is fed into a GRU that outputs a 3D structure.
End-to-X learning: MSA Transformer [45], a transformer architecture; rawMSA [34], where the MSA is fed into a 2D CNN (the first convolutional layer creates an embedding) that outputs a contact map; CopulaNet [46], which extracts all sequence pairs from the MSA and feeds them to a dilated resCNN; TOWER…”
Section: The Importance Of Data and Data Representations
confidence: 99%
“…Stacking dilated convolutions with increasingly large d allows operating on exponentially large receptive fields, while retaining short backpropagations [110,111,7]. In CASP14, dilated convolutions were used by several groups, including ProSPr [22], DESTINI2 [26], CopulaNet [46], PrayogRealDistance [29,30], and also EMBER, TOWER, ICOS, and LAW/MASS. Another solution lies in the self-attention mechanism, where parametric filters capture high-order dependencies between the input observations at arbitrary range and with high precision (Fig.…”
Section: Volumetric Representations
confidence: 99%