2018
DOI: 10.1101/345843
Preprint

Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE can be inferred over a large set of protein seque…
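As the abstract notes, PPE is inspired by byte-pair encoding. A minimal sketch of the greedy BPE merge step it builds on (without the paper's sampling framework), on hypothetical toy sequences, might look like the following; this is an illustration of the general idea, not the authors' implementation:

```python
from collections import Counter

def pair_counts(seqs):
    """Count adjacent token pairs across all segmented sequences."""
    counts = Counter()
    for toks in seqs:
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(seqs, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for toks in seqs:
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
                out.append(toks[i] + toks[i + 1])
                i += 2
            else:
                out.append(toks[i])
                i += 1
        merged.append(out)
    return merged

def learn_merges(sequences, n_merges):
    """Greedy BPE-style training: repeatedly merge the most frequent pair."""
    seqs = [list(s) for s in sequences]  # start from single amino acids
    merges = []
    for _ in range(n_merges):
        counts = pair_counts(seqs)
        if not counts:
            break
        best = counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        seqs = merge_pair(seqs, best)
    return merges, seqs

# Toy corpus of hypothetical protein fragments
merges, segmented = learn_merges(["MKTAYIAK", "MKTAYLAK", "MKTQYIAK"], 3)
print(merges)
print(segmented)
```

The paper's contribution is to replace the single deterministic merge sequence above with a sampling framework, so a sequence can be segmented in multiple ways.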


Cited by 10 publications (10 citation statements)
References 55 publications (59 reference statements)
“…High-scoring segment pair (HSP) has been used in previous methods for PPI prediction [27]. One-hot vectors [51,52] and amino acid embedding [5,6,19] have also been empirically explored to represent protein sequences.…”
Section: Introduction
Mentioning confidence: 99%
“…ProNA2020 (3) predicts whether or not a protein interacts with other proteins, RNA, or DNA, and, if so, which residues bind. Per-protein predictions rely on homology and machine learning models employing profile-kernel SVMs (49) and embeddings from an in-house implementation of ProtVec (50). Per-residue predictions are based on simple neural networks due to the lack of experimental high-resolution annotations (51–53).…”
Section: Methods
Mentioning confidence: 99%
“…Since proteins do not have a well-defined vocabulary of words, word-level tokenization is not a well-defined option in the case of proteins. Subword segmentation, on the other hand, does not require any predefined knowledge of words in the target language, making it a potentially interesting approach for discovering “words” or motifs in proteins [107], [8], [12], [53].…”
Section: The Atomic Unit of Information: Tokenization
Mentioning confidence: 99%
“…In proteins, we have only ~20 AAs. While we can embed AAs onto a lower-dimensional space, it is not as clearly beneficial [8]. While dimensionality reduction is of limited use when working on single AAs, it can provide useful compact representations when considering extended AA combinations.…”
Section: Word Embeddings
Mentioning confidence: 99%
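The contrast drawn in the quote above, between the small single-AA alphabet and larger subword vocabularies, can be made concrete: a one-hot vector over the 20 standard amino acids is already compact, while a subword vocabulary can be far larger, making a dense lookup table with fewer dimensions worthwhile. The embedding values below are hypothetical placeholders, not trained vectors:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(aa):
    """20-dimensional one-hot vector for a single amino acid."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec

# For a subword vocabulary (potentially thousands of entries), a dense
# low-dimensional lookup table replaces impractically wide one-hot vectors.
# Hypothetical 4-dimensional embeddings; real ones are learned from data.
embedding = {
    "MKT": [0.1, -0.3, 0.5, 0.0],
    "AK":  [0.4, 0.2, -0.1, 0.3],
}

print(len(one_hot("M")))      # 20 dimensions for a single AA
print(len(embedding["MKT"]))  # 4 dimensions for a subword token
```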