2023
DOI: 10.1186/s12859-023-05375-0

TEMPROT: protein function annotation using transformers embeddings and homology search

Abstract: Background: Although the development of sequencing technologies has provided a large number of protein sequences, the analysis of functions that each one plays is still difficult due to the efforts of laboratorial methods, making necessary the usage of computational methods to decrease this gap. As the main source of information available about proteins is their sequences, approaches that can use this information, such as classification based on the patterns of the amino acids and the inference …

Citations: cited by 9 publications (6 citation statements)
References: 26 publications
“…To represent an entire protein sequence by a single vector, we averaged embeddings across the sequence length dimension. For sequences exceeding 1000 amino acids, we employed a sliding window approach with a 500 amino acid size to process the data, subsequently averaging these embeddings to obtain a comprehensive representation (Oliveira, Pedrini and Dias 2023). We then concatenated the vectors of all proteins belonging to a species to create a single data point.…”
Section: Acquiring Embeddings With ESM-2 and Handling Long Protein Se... (mentioning)
confidence: 99%
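The pooling scheme described in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' code: `embed_residues` is a hypothetical stand-in for a protein language model such as ESM-2, the embedding dimension of 8 is arbitrary, and the window stride is an assumption (the passage gives the 500-residue window size and the 1000-residue threshold but not the stride, so non-overlapping windows are used here).

```python
import numpy as np

WINDOW = 500    # window size stated in the quoted passage
MAX_LEN = 1000  # sequences longer than this are processed in windows


def embed_residues(seq):
    """Hypothetical stand-in for a protein language model.

    Returns one vector per residue, shape (len(seq), 8). A real model
    would be called here instead of this fixed lookup table.
    """
    table = np.random.default_rng(0).normal(size=(26, 8))
    return table[[ord(c) - ord("A") for c in seq]]


def sequence_vector(seq, embed=embed_residues, window=WINDOW,
                    max_len=MAX_LEN, stride=None):
    """Represent a whole protein as a single vector by mean pooling."""
    if len(seq) <= max_len:
        # Short sequence: average over the sequence length dimension.
        return embed(seq).mean(axis=0)
    # Assumption: non-overlapping windows unless a stride is given.
    stride = stride or window
    chunks = [seq[i:i + window] for i in range(0, len(seq), stride)]
    # Mean-pool each window, then average the window vectors.
    return np.mean([embed(c).mean(axis=0) for c in chunks], axis=0)


short_vec = sequence_vector("MKT" * 100)   # 300 residues, pooled directly
long_vec = sequence_vector("AC" * 750)     # 1500 residues, three windows
```

Note that averaging window means weights a short final window the same as a full one; whether the cited work weights windows by length is not stated in the passage.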
“…Hyperparameters are determined by manual grid search on the validation data (see In the MF category of mouse, DualNetGO produces slightly worse results than Graph2GO, which is not surprising because Graph2GO not only uses GAE on the PPI network, but also on the sequence similarity network which provides additional and valuable information that is not included in other models in the comparison. Several studies (Fan et al, 2020; Oliveira et al, 2023) suggest that MF is more related to sequence patterns that may not be reflected by Pfam protein domains and PPI networks. However, retrieval of sequence similarity through BLAST requires quadratic time with respect to the number of sequences, which is costly when this number scales up.…”
Section: Experiments Setup (mentioning)
confidence: 99%
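The quadratic-time remark in the statement above can be made concrete by counting the unordered sequence pairs in a naive all-vs-all comparison. This is only an illustration of the scaling argument: `all_vs_all_pairs` is a hypothetical helper, and real BLAST runs use indexing and heuristics, so actual runtimes will differ from a raw pair count.

```python
def all_vs_all_pairs(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Growing the dataset 10x grows the number of comparisons roughly 100x.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} sequences -> {all_vs_all_pairs(n):>13,} pairs")
```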
“…Sequence-based methods are based on the hypothesis that if two protein sequences are similar, they will share similar functions. Such prior models assign protein functions based on similarity in raw sequences (Hamp et al, 2013) or annotated motifs (Gong et al, 2016), similarity in learned features from neural networks (Kulmanov et al, 2018; Kulmanov and Hoehndorf, 2021; Cao and Shen, 2021) or even protein language models (Wang et al, 2023; Oliveira et al, 2023). However, protein functions can not be determined when the query protein shares low sequence similarity with others, and proteins with similar functions are not necessarily similar in sequences.…”
Section: Introduction (mentioning)
confidence: 99%
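The similarity hypothesis in the statement above, along with its stated limitation, can be sketched as a nearest-neighbour annotation transfer over embedding vectors. This is a toy illustration, not the method of any cited paper: `transfer_annotations`, the `min_sim` threshold of 0.5, the two-dimensional vectors, and the protein-to-term assignments are all assumptions made for the example.

```python
import numpy as np


def transfer_annotations(query_vec, ref_vecs, ref_terms, min_sim=0.5):
    """Copy the terms of the most similar reference protein.

    Returns (terms, similarity). The term set is empty when the best
    cosine similarity falls below min_sim (an assumed threshold), which
    mirrors the limitation noted in the quoted passage.
    """
    ref = np.asarray(ref_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    # Cosine similarity between the query and every reference vector.
    sims = ref @ q / (np.linalg.norm(ref, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))
    if sims[best] < min_sim:
        return set(), float(sims[best])
    return set(ref_terms[best]), float(sims[best])


# Toy reference set: two proteins with made-up vectors and term assignments.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_terms = [{"GO:0003824"}, {"GO:0005515"}]

close_terms, close_sim = transfer_annotations([0.9, 0.1], ref_vecs, ref_terms)
far_terms, far_sim = transfer_annotations([-1.0, -1.0], ref_vecs, ref_terms)
```

The second query has no sufficiently similar neighbour, so no terms are transferred.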
“…Over the last decade or so, multiple techniques have been developed to produce automated function prediction (AFP) of proteins based on the protein primary sequence using machine learning and statistical methods [2-10]. In order to benchmark these methods against one another as well as to spur further innovation in developing new methods for predicting protein function, the Critical Assessment of Protein Function Annotation (CAFA) is regularly held with the fifth iteration (CAFA 5) having been recently concluded on Kaggle [11].…”
Section: Introduction (mentioning)
confidence: 99%
“…Among the best models that have emerged are the various sized ESM2 models produced by Meta and Prot-T5 by the Rost lab that have been trained on billions of protein sequences [18-20]. It has been shown that use of these language models can be an effective way to generate predictions about protein characteristics, including localization, protein-protein interactions, and function [8, 21-23]. We present here PROTGOAT (PROTein Gene Ontology Annotation Tool) that integrates the output of multiple diverse PLMs with literature and taxonomy information about a protein to predict its function.…”
Section: Introduction (mentioning)
confidence: 99%