2024
DOI: 10.1101/2024.03.07.584001
Preprint

Protein language models are biased by unequal sequence sampling across the tree of life

Frances Ding,
Jacob Steinhardt

Abstract: Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In design tasks, the likelihood of a protein sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of protein sequences from certain species are systematically higher, independent of the protein in questi…
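As an illustration of what "systematically higher, independent of the protein" could mean in practice, the following is a minimal sketch and not the authors' method. It assumes one already has a log-likelihood per sequence from some pLM; the records, family names, and species names below are hypothetical toy values, and `species_offsets` is a helper name introduced here. The idea is to subtract each protein family's mean score so that what remains is a per-species offset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy scores: (protein_family, species, log_likelihood).
# In a real analysis these would come from scoring sequences with a pLM.
records = [
    ("familyA", "species1", -100.0),
    ("familyA", "species2", -110.0),
    ("familyB", "species1", -200.0),
    ("familyB", "species2", -212.0),
]


def species_offsets(rows):
    """Return the mean per-species residual after removing family means."""
    # Group scores by protein family to compute each family's mean.
    by_family = defaultdict(list)
    for family, _species, ll in rows:
        by_family[family].append(ll)
    family_mean = {f: mean(v) for f, v in by_family.items()}

    # Residual = score minus the mean score of that protein family.
    residuals = defaultdict(list)
    for family, species, ll in rows:
        residuals[species].append(ll - family_mean[family])

    # A species whose residuals are consistently positive scores higher
    # than average regardless of which protein is considered.
    return {s: mean(v) for s, v in residuals.items()}


offsets = species_offsets(records)
for species, offset in sorted(offsets.items()):
    print(f"{species}: {offset:+.2f}")
```

In this toy data, one species scores above the family mean for both families, which is the kind of protein-independent pattern the abstract describes. A real analysis would need many families and species and a proper statistical test.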

Cited by 4 publications (4 citation statements)
References 60 publications
“…The ability of ZymCTRL to generate functional enzymes in a zero-shot manner made us wonder to what extent the model would benefit from fine-tuning in a saturated sequence space. Recent work highlighted the uneven sampling of the tree of life in public sequence databases and the impact this can have on the performance of pLMs trained on these datasets 33. For the purpose of fine-tuning ZymCTRL with sequences sampled from sections of the tree of life absent from public databases, expanding into sequence space beyond the public databases (Fig.…”
Section: Results (mentioning)
confidence: 99%
“…Seven of the artificial carbonic anhydrases showed activities, with two close to natural ones, with sequence identities in the range of 35–50%. Second, in order to address potential biases in pLMs due to unequal sequence sampling across the tree of life in public databases 33, we fine-tuned ZymCTRL on a diverse set of metagenomic lactate dehydrogenase (LDH) sequences derived from Basecamp Research’s internal graph database. We show that LDH sequences generated after fine-tuning are more likely to pass in silico quality metrics than zero-shot generated sequences.…”
Section: Introduction (mentioning)
confidence: 99%
“…Evolutionary data, consisting of massive collections of naturally evolved protein sequences, captures information relevant to organismal fitness, including protein expression, folding, stability, and biological function. However, the precise selective pressures for each protein are different and largely unknown, and evolutionary patterns can be confounded by historical events, phylogenetic biases, and unequal sequence sampling [38]. In contrast, biophysical simulations allow precise control of the input sequence distribution, even sequences with non-natural amino acids [39, 40], and capture fundamental properties of protein structure and energetics.…”
Section: Discussion (mentioning)
confidence: 99%
“…It further supports the hypothesis that the embeddings computed by the pre-trained pLMs for viral proteins are intrinsically noisy. Possibly, viral proteins are simply too under-represented in the training of pLMs, due to a comparatively small number and low diversity (Ding and Steinhardt 2024; Elnaggar et al 2021; Lin et al 2023; The UniProt Consortium et al 2023). In addition, the pLMs may struggle to capture the inherent peculiarities of viral protein evolution (Koonin, Dolja, and Krupovic 2022).…”
Section: Discussion (mentioning)
confidence: 99%