TEMPROT: protein function annotation using transformers embeddings and homology search

Oliveira, Gabriel B.; Pedrini, Hélio; Dias, Zanoni

doi:10.1186/s12859-023-05375-0

Cited by 9 publications

(6 citation statements)

References 26 publications

“…To represent an entire protein sequence by a single vector, we averaged embeddings across the sequence length dimension. For sequences exceeding 1000 amino acids, we employed a sliding window approach with a 500 amino acid size to process the data, subsequently averaging these embeddings to obtain a comprehensive representation (Oliveira, Pedrini and Dias 2023). We then concatenated the vectors of all proteins belonging to a species to create a single data point.…”

Section: Acquiring Embeddings With Esm-2 and Handling Long Protein Se... (mentioning)

confidence: 99%

TraitProtNet: Deciphering the Genome for Trait Prediction with Interpretable Deep Learning

Wang

2024

Preprint

Genome data is far from fully explored. We present TraitProtNet, an innovative deep learning framework for predictive trait profiling in fungi, leveraging genome data and pretrained language models. The use of Integrated Gradients and bioinformatic analysis provides insights into the interpretability of the model, complementing traditional omics by highlighting the difference between protein importance and expression levels. This framework offers significant potential for future applications in both agriculture and medicine.
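The long-sequence handling described in the TraitProtNet citation statement above can be sketched as follows. This is a minimal illustration, not the authors' code: `embed_residues` is a hypothetical stand-in for a protein language model such as ESM-2, the windows are assumed to be non-overlapping (the statement gives only the 500-residue window size, not the stride), a shorter final window is kept, and every window is weighted equally when averaging.

```python
import numpy as np

# Hypothetical toy lookup table: one fixed 8-dimensional vector per letter A-Z.
# A real pipeline would call a protein language model here instead.
_TABLE = np.random.default_rng(0).normal(size=(26, 8))


def embed_residues(sequence: str) -> np.ndarray:
    """Stand-in embedder: returns an array of shape (len(sequence), 8)."""
    idx = [ord(c) - ord("A") for c in sequence]
    return _TABLE[idx]


def sequence_embedding(sequence, embed=embed_residues,
                       max_len=1000, window=500, stride=None):
    """Mean-pool per-residue embeddings into one vector per protein.

    Sequences longer than max_len are split into windows; each window is
    mean-pooled and the window vectors are then averaged (assumed scheme).
    """
    if len(sequence) <= max_len:
        return embed(sequence).mean(axis=0)

    stride = window if stride is None else stride  # assumption: no overlap
    window_vectors = []
    for start in range(0, len(sequence), stride):
        chunk = sequence[start:start + window]
        window_vectors.append(embed(chunk).mean(axis=0))
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
    return np.mean(window_vectors, axis=0)


def species_vector(protein_sequences):
    """Concatenate the per-protein vectors of one species into one data point."""
    return np.concatenate([sequence_embedding(s) for s in protein_sequences])
```

Because each window is weighted equally, a short trailing window contributes as much as a full one; weighting the window vectors by window length would be a reasonable alternative if that bias matters.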

“…Hyperparameters are determined by manual grid search on the validation data (see In the MF category of mouse, DualNetGO produces slightly worse results than Graph2GO, which is not surprising because Graph2GO not only uses GAE on the PPI network, but also on the sequence similarity network which provides additional and valuable information that is not included in other models in the comparison. Several studies (Fan et al, 2020; Oliveira et al, 2023) suggest that MF is more related to sequence patterns that may not be reflected by Pfam protein domains and PPI networks. However, retrieval of sequence similarity through BLAST requires quadratic time with respect to the number of sequences, which is costly when this number scales up.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…Sequence-based methods are based on the hypothesis that if two protein sequences are similar, they will share similar functions. Such prior models assign protein functions based on similarity in raw sequences (Hamp et al, 2013) or annotated motifs (Gong et al, 2016), similarity in learned features from neural networks (Kulmanov et al, 2018; Kulmanov and Hoehndorf, 2021; Cao and Shen, 2021) or even protein language models (Wang et al, 2023; Oliveira et al, 2023). However, protein functions can not be determined when the query protein shares low sequence similarity with others, and proteins with similar functions are not necessarily similar in sequences.…”

Section: Introduction (mentioning)

confidence: 99%

DualNetGO: A Dual Network Model for Protein Function Prediction via Effective Feature Selection

Chen, Luo

2023

Preprint

Motivation: Protein-protein Interaction (PPI) networks are crucial for automatically annotating protein functions. As there are different types of evidence to define PPI networks, multiple PPI networks exist for the same set of proteins to capture their properties from different aspects, creating challenges in effectively utilizing these heterogeneous graphs for protein function prediction. Recently, several deep learning models have combined PPI networks from all evidence, or concatenated all graph embeddings. However, the lack of a delicate selection procedure prevents the effective harness of information from different PPI networks as they vary in densities, structures and noise levels. Consequently, combining protein features indiscriminately could increase the noise level, leading to decreased model performance.

Results: We develop DualNetGO, a dual network model comprised of a classifier and a selector, to predict protein functions by effectively selecting features from different sources including graph embeddings of PPI networks, protein domain and subcellular location information. Evaluation of DualNetGO on human and mouse datasets in comparison with other network-based models shows at least 4.5%, 6.2% and 14.2% improvement on Fmax in BP, MF and CC Gene Ontology categories respectively for human, and 3.3%, 10.6% and 7.7% improvement on Fmax for mouse. We further show that our model is insensitive to the choice of graph embedding method and is time- and memory-saving. These results demonstrate that combining a subset of features including PPI networks and protein attributes selected by our model is more effective in utilizing PPI network information than only using one kind of or concatenating graph embeddings from all kinds of PPI networks.

Availability and implementation: The source code of DualNetGO and some of the experiment data are available at: https://github.com/georgedashen/DualNetGO.
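The abstract reports improvements in Fmax without defining it. The sketch below assumes the protein-centric definition commonly used in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and Fmax is the largest resulting F-measure. The function name, the threshold grid, and the input layout (one row per protein, one column per GO term) are illustrative choices, not taken from DualNetGO.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (assumed CAFA-style definition).

    y_true  : (n_proteins, n_terms) binary ground-truth annotations
    y_score : (n_proteins, n_terms) predicted scores in [0, 1]
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)  # illustrative grid

    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        tp = (y_pred & y_true).sum(axis=1)   # true positives per protein
        n_pred = y_pred.sum(axis=1)          # predicted terms per protein
        n_true = y_true.sum(axis=1)          # annotated terms per protein

        covered = n_pred > 0                 # proteins with any prediction
        if not covered.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        # Recall is averaged over all proteins; the maximum guards against
        # division by zero for proteins that have no annotated terms.
        recall = (tp / np.maximum(n_true, 1)).mean()

        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For example, `fmax([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])` evaluates a case where some threshold separates the true terms from the false ones for every protein.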

“…Over the last decade or so, multiple techniques have been developed to produce automated function prediction (AFP) of proteins based on the protein primary sequence using machine learning and statistical methods [2-10]. In order to benchmark these methods against one another as well as to spur further innovation in developing new methods for predicting protein function, the Critical Assessment of Protein Function Annotation (CAFA) is regularly held with the fifth iteration (CAFA 5) having been recently concluded on Kaggle [11].…”

Section: Introduction (mentioning)

confidence: 99%

“…Among the best models that have emerged are the various sized ESM2 models produced by Meta and Prot-T5 by the Rost lab that have been trained on billions of protein sequences [18-20]. It has been shown that use of these language models can be an effective way to generate predictions about protein characteristics, including localization, protein-protein interactions, and function [8, 21-23]. We present here PROTGOAT (PROTein Gene Ontology Annotation Tool) that integrates the output of multiple diverse PLMs with literature and taxonomy information about a protein to predict its function.…”

Section: Introduction (mentioning)

confidence: 99%

PROTGOAT : Improved automated protein function predictions using Protein Language Models

Chua, Rajesh, Sinha et al.

2024

Preprint

Accurate prediction of protein function is crucial for understanding biological processes and various disease mechanisms. Current methods for protein function prediction rely primarily on sequence similarities and often miss important aspects of protein function. New developments in protein function prediction methods have recently shown exciting progress via the use of large transformer-based Protein Language Models (PLMs) that allow for the capture of nuanced relationships between amino acids in protein sequences which are crucial for understanding their function. This has enabled an unprecedented level of accuracy in predicting the functions of previously little understood proteins. We here developed an ensemble method called PROTGOAT based on embeddings extracted from multiple and diverse pre-trained PLMs and existing text information about the protein in published literature. PROTGOAT outperforms most current state-of-the-art methods, ranking fourth in the Critical Assessment of Functional Annotation (CAFA 5), a global competition benchmarking such developments among 1600 methods tested. The high performance of our method demonstrates how protein function prediction can be improved through the use of an ensemble of diverse PLMs. PROTGOAT is publicly available for academic use and can be accessed here: https://github.com/zongmingchua/cafa5
