“…There is growing interest in developing protein language models (pLMs) at the scale of evolution, given the abundance of 1D amino acid sequences. Examples trained in masked language modeling (MLM) fashion include the ESM series (Rives et al., 2019; Lin et al., 2022), TAPE (Rao et al., 2019), ProtTrans (Elnaggar et al., 2021), PRoBERTa (Nambiar et al., 2020), PMLM (He et al., 2021), ProteinLM (Xiao et al., 2021), PLUS (Min et al., 2021), Adversarial MLM (McDermott et al., 2021), ProteinBERT (Brandes et al., 2022), and CARP (Yang et al., 2022a); ProtGPT2 (Ferruz et al., 2022) is trained in causal language modeling fashion, and several others follow related objectives (Melnyk et al., 2022a; Madani et al., 2021; Unsal et al., 2022; Nourani et al., 2021; Lu et al., 2020; Sturmfels et al., 2020; Strodthoff et al., 2020). These protein language models generalize across a wide range of downstream applications and can capture evolutionary information about secondary and tertiary structure from sequences alone.…”
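To make the MLM objective mentioned above concrete, the following is a minimal sketch of the sequence-corruption step used in BERT-style pretraining on amino acid sequences: a random subset of residues is replaced with a mask token, and the model is trained to recover the originals. The 15% mask rate, the token names, and the example sequence are illustrative assumptions, not details from any specific pLM.

```python
import random

# The 20 standard amino acids; real pLM vocabularies also include
# special tokens (padding, unknown residues, etc.) -- omitted here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # illustrative mask token name

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Corrupt a protein sequence for MLM pretraining.

    Returns the corrupted token list and the (position, original residue)
    pairs the model would be trained to predict.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    tokens, targets = [], []
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(MASK)        # hide this residue from the model
            targets.append((i, aa))    # ...and make it a prediction target
        else:
            tokens.append(aa)
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
tokens, targets = mask_sequence(seq)
```

A causal-LM objective (as in ProtGPT2) would instead leave the sequence intact and predict each residue from its left context only; the masking step above is what distinguishes the MLM family of models listed in the text.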