2019
DOI: 10.1002/prot.25842

Learning a functional grammar of protein domains using natural language word embedding techniques

Abstract: In this paper, using Word2vec, a widely‐used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic “meaning” in the context of their functional contributions to the multi‐domain proteins in which they are found. Word2vec is a group of models which can be used to produce semantically meaningful embeddings of words or tokens in a fixed‐dimension vector space. In this work, we treat multi‐domain proteins as “sentences” where domain identifiers are tokens wh…
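
The "proteins as sentences" idea in the abstract can be illustrated with a minimal sketch of how skip-gram training pairs might be generated from domain architectures. This is not the authors' pipeline: the function name `skipgram_pairs`, the window size, and the example domain lists are assumptions made for illustration only, and a real Word2vec model would go on to learn vectors from pairs like these.

```python
from typing import List, Tuple


def skipgram_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) token pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical multi-domain proteins, each written as an ordered list of
# domain identifiers. These architectures are illustrative placeholders,
# not data from the paper.
proteins = [
    ["PF00017", "PF00018", "PF07714"],
    ["PF00069", "PF07714"],
]

for domains in proteins:
    print(skipgram_pairs(domains, window=1))
```

Domains that tend to share neighbours across many proteins would end up with similar context distributions, which is the property an embedding model exploits.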

Cited by 22 publications (14 citation statements)
References 27 publications
“…For instance, when analysing a new FA-like protein, a potential annotated adhesive domain can infer the existence of a specific stalk domain and vice versa. Domain grammar approaches have shown that a knowledge of the domain grammar of proteins can enhance sequence similarity searching [32, 33]. For a representative analysis of the phylogenetic distribution of FA-like proteins we decided to use the UniProt Reference Proteomes sequences only.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…For instance, when analysing a new FA-like protein, a potential annotated adhesive domain can infer the existence of a specific stalk domain and vice versa. Domain grammar approaches have shown that a knowledge of the domain grammar of proteins can enhance sequence similarity searching [28,29]. For a representative analysis of the phylogenetic distribution of FA-like proteins we decided to use the UniProt Reference Proteomes sequences only.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Comparison with Pfam domain embeddings: From all proposed protein embeddings works, only [22] developed intrinsic quantitative benchmarks. They applied word2vec for Pfam domain annotations for only eukaryotic proteins.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Various methods to create embeddings for proteins were proposed [16,17,18,19,20,21,22]. ProtVec fragmented the protein sequence in 3-mers for all possible starting shifts.…”
Section: Introduction | Citation type: mentioning
confidence: 99%
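
The 3-mer fragmentation mentioned in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not ProtVec's actual code: the function name `shifted_kmers` and the short example sequence are hypothetical, and the sketch assumes that each starting shift yields its own list of non-overlapping k-mers.

```python
from typing import List


def shifted_kmers(sequence: str, k: int = 3) -> List[List[str]]:
    """Return k lists of non-overlapping k-mers, one list per starting shift."""
    return [
        # Step through the sequence in strides of k, starting at `shift`,
        # and keep only complete k-mers.
        [sequence[i:i + k] for i in range(shift, len(sequence) - k + 1, k)]
        for shift in range(k)
    ]


# Hypothetical 8-residue sequence used purely for illustration.
for shift, kmers in enumerate(shifted_kmers("MKTAYIAK")):
    print(shift, kmers)
# 0 ['MKT', 'AYI']
# 1 ['KTA', 'YIA']
# 2 ['TAY', 'IAK']
```

Each of the three lists can then be treated as a separate "sentence" of 3-mer tokens when training an embedding model.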