2024
DOI: 10.1101/2024.07.23.604678
Preprint

Beware of Data Leakage from Protein LLM Pretraining

Leon Hermann,
Tobias Fiedler,
Hoang An Nguyen
et al.

Abstract: Pretrained protein language models are becoming increasingly popular as backbones for protein property inference tasks such as structure prediction or function annotation, accelerating biological research. However, related work often does not account for data leakage from pretraining into the downstream task, which can lead to unrealistic performance estimates. Reported generalization may not be reproducible for proteins highly dissimilar from the pretraining set…
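The leakage concern the abstract raises can be illustrated with a minimal sketch: before evaluating a downstream model, compare each test protein against the pretraining corpus and flag sequences that are too similar. This is an illustrative assumption, not the paper's method — the function names, the k-mer Jaccard proxy for sequence similarity, and the threshold are all hypothetical; practical pipelines typically use alignment-based identity or dedicated clustering tools instead.

```python
# Hypothetical sketch of leakage screening: flag test proteins whose
# similarity to any pretraining sequence exceeds a threshold.
# The k-mer Jaccard score is a cheap, alignment-free proxy for identity;
# names and the 0.5 threshold are illustrative assumptions.

def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a, b, k=3):
    """Jaccard similarity of the two sequences' k-mer sets, in [0, 1]."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def flag_leaky(test_seqs, pretrain_seqs, threshold=0.5, k=3):
    """Return test sequences resembling some pretraining sequence above threshold."""
    return [
        t for t in test_seqs
        if any(kmer_similarity(t, p, k) > threshold for p in pretrain_seqs)
    ]
```

Sequences flagged this way would be excluded (or reported separately) so that downstream metrics reflect generalization to genuinely dissimilar proteins rather than pretraining recall.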

Cited by 1 publication · References 24 publications