2019
DOI: 10.1101/704874
Preprint
UDSMProt: Universal Deep Sequence Models for Protein Classification

Abstract: Motivation: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification tasks are tailored to single classification tasks and rely on handcrafted features such as position-specific scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and t…
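The abstract describes a two-stage recipe: self-supervised language-model pretraining on unlabeled sequences, followed by task-specific fine-tuning. Below is a minimal PyTorch sketch of that recipe; it is an illustration under stated assumptions, not the authors' implementation (UDSMProt actually uses a regularized AWD-LSTM with pooling layers not shown here). The layer sizes follow the configuration quoted later on this page (3 layers, 1150 hidden units, embedding size 400); the vocabulary size and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

VOCAB = 25  # 20 amino acids plus a few special tokens (assumed tokenization)

class ProteinLM(nn.Module):
    """Stage 1: self-supervised next-token language model over residues."""
    def __init__(self, vocab=VOCAB, emb=400, hidden=1150, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        out, _ = self.rnn(self.embed(tokens))
        return self.head(out)                     # logits for the next residue

class ProteinClassifier(nn.Module):
    """Stage 2: transfer the pretrained encoder, attach a small task head."""
    def __init__(self, lm: ProteinLM, n_classes: int):
        super().__init__()
        self.embed, self.rnn = lm.embed, lm.rnn   # reused pretrained weights
        self.head = nn.Linear(lm.rnn.hidden_size, n_classes)

    def forward(self, tokens):
        out, _ = self.rnn(self.embed(tokens))
        return self.head(out[:, -1])              # pool via last hidden state

# Usage sketch: pretrain lm on unlabeled sequences, then fine-tune clf.
lm = ProteinLM()
clf = ProteinClassifier(lm, n_classes=2)
logits = clf(torch.randint(0, VOCAB, (8, 100)))   # batch of 8, length 100
```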

Cited by 34 publications (40 citation statements) · References 29 publications
“…This is interesting considering that, for alleles with fewer than 1000 training measurements, MHCFlurry was pretrained on an augmented training set with measurements from BLOSUM-similar alleles, USMPep LM ens was pretrained on a large corpus of unlabeled peptides, and USMPep FS ens, in contrast, only saw the training sequences corresponding to one MHC molecule. These results stress that further efforts might be required to truly leverage the potential of unlabeled peptide data in order to observe improvements similar to those seen for proteins [12], in particular for small datasets. Turning to MHC class II binding prediction, we aim to demonstrate the universality of our approach beyond its applicability to different MHC I alleles.…”
Section: IEDB16 Dataset
confidence: 59%
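The augmentation strategy mentioned in the statement above, borrowing measurements from BLOSUM-similar alleles for data-poor targets, can be sketched as follows. This is a hypothetical illustration: blosum_similarity, augment_training_set, and the min_sim threshold are invented for exposition and do not reflect the actual MHCFlurry code; only Biopython's BLOSUM62 matrix is a real API.

```python
# Hypothetical sketch of BLOSUM-based training-set augmentation, assuming
# each allele is represented by an equal-length binding-pocket pseudo-sequence.
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")

def blosum_similarity(seq_a: str, seq_b: str) -> float:
    """Sum of BLOSUM62 scores over aligned positions (equal-length inputs)."""
    return sum(BLOSUM62[a, b] for a, b in zip(seq_a, seq_b))

def augment_training_set(target, pseudo_seqs, measurements, min_sim=200.0):
    """Pool the target allele's measurements with those of similar alleles.
    min_sim is an arbitrary illustrative threshold, not a published value."""
    pool = list(measurements[target])
    for allele, seq in pseudo_seqs.items():
        if allele != target and \
           blosum_similarity(pseudo_seqs[target], seq) >= min_sim:
            pool.extend(measurements[allele])
    return pool
```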
“…Additionally, the model not only has to learn the usual language-modeling task for protein data but also implicitly has to learn to stochastically predict cleavage sites. Second, even when evaluated on protein data, the protein language model only reaches an accuracy of 0.137, which is considerably lower than the accuracy of 0.41 reported in the literature [12]. This effect is a direct consequence of the considerably smaller model size (1 instead of 3 layers; 64 instead of 1150 hidden units; embedding size of 50 instead of 400).…”
Section: Language Modeling on Peptide Data and Its Impact on Downstream Tasks
confidence: 63%
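To make the size gap in the quoted comparison concrete, the sketch below estimates raw parameter counts for the two LSTM configurations (3 layers / 1150 hidden units / embedding 400 versus 1 layer / 64 hidden units / embedding 50). The vocabulary size of 25 and the untied output projection are assumptions; the counting formula is the standard one for a stacked LSTM.

```python
def lstm_param_count(emb: int, hidden: int, layers: int, vocab: int = 25) -> int:
    """Rough parameter count: embedding table + stacked LSTM + output projection.
    Each LSTM layer holds 4*(in*h + h*h + 2*h) weights and biases."""
    params = vocab * emb                      # embedding table
    in_dim = emb
    for _ in range(layers):
        params += 4 * (in_dim * hidden + hidden * hidden + 2 * hidden)
        in_dim = hidden                       # deeper layers take h-dim input
    params += hidden * vocab                  # output projection (untied)
    return params

print(lstm_param_count(400, 1150, 3))  # ~28.4M parameters (cited full LM)
print(lstm_param_count(50, 64, 1))     # ~33k parameters (small peptide LM)
```

The three-orders-of-magnitude difference in capacity supports the quoted explanation for the accuracy gap (0.137 versus 0.41).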