2023
DOI: 10.1101/2023.04.06.535863
Preprint

The pitfalls of negative data bias for the T-cell epitope specificity challenge

Abstract: Even high-performing machine learning models can fail when deployed in a real-world setting if the data used to train and test them contains biases. TCR–epitope binding prediction for novel epitopes is an important but as yet unsolved problem in immunology. In this article, we describe how the technique used to create negative data for the TCR–epitope interaction prediction task can introduce a strong bias, causing performance to drop to random when the model is tested in a more realistic scenario.


Cited by 8 publications (9 citation statements)
References 16 publications
“…The negative data generation strategy is worth noting. The utilization of a background TCR pool might lead to bias and overestimation of model performance (Dens et al, 2023; Moris et al, 2021), so we used random shuffle strategy to obtain our negative pairs. This strategy might bring false negative samples although the possibility is relatively low.…”
Section: Discussion
confidence: 99%
“…Future progress in AIRR-ML research is required in areas as diverse as evaluating the impact of sample size on prediction accuracy (Pavlović et al 2021; Kanduri et al 2022), negative dataset definition (Montemurro, Jessen, and Nielsen 2022; Deng et al 2023; Dens et al 2023), and unbiased estimation of prediction accuracy (Meysman et al 2023; Moris et al 2021b). For all of these use cases, LIgO simulations may be employed for benchmarking and developing interpretable AIRR-ML methods.…”
Section: Discussion
confidence: 99%
“…For example, (i) if the train and test set overlap or contain highly similar sequences, then the accuracy of the trained ML model may be overly optimistic (a case of data leakage). (ii) If the training data is not comprehensive and representative, it may lead to generalization problems and underperformance on unseen data (Montemurro, Jessen, and Nielsen 2022; Deng et al 2023; Dens et al 2023; Moris et al 2021a; Robert et al 2022; Meysman et al 2023; Walsh, Pollastri, and Tosatto 2016; Petti and Eddy 2022; A. Weber, Born, and Rodriguez Martínez 2021).…”
Section: Comparison Of Simulation Strategies Implemented In the Ligo ...
confidence: 99%
“…However, in the TCR-peptide recognition community, it is well known that the cross-reactivity of TCRs is a significant challenge in the TCR-peptide recognition problem and it can pose difficulties for many models in the pre-processing stage [2]. Experimental methods can have a high false-negative rate [1], resulting in many potentially cross-reactive pairs among existing known binding TCRs, making PU Learning more challenging. In the absence of experimental negatives, two main negative sampling strategies have emerged, including reshuffling based on positive pairs (first strategy) and randomly drawing from background repertoires (second strategy).…”
confidence: 99%
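The two negative sampling strategies described in the excerpts above can be sketched as follows. This is an illustrative sketch only, not code from the preprint or from any citing paper; all function and variable names are hypothetical, and the comments restate the pitfalls the quoted passages attribute to each strategy.

```python
import random


def shuffle_negatives(positive_pairs, n_neg, seed=0):
    """Strategy 1: reshuffle positives.

    Pair each epitope with a TCR drawn from a different positive pair.
    Risk noted in the quoted discussion: a reshuffled pair may in fact
    bind (a false negative), although the probability is relatively low.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    tcrs = [tcr for tcr, _ in positive_pairs]
    negatives = set()
    while len(negatives) < n_neg:
        tcr = rng.choice(tcrs)
        _, epitope = rng.choice(positive_pairs)
        if (tcr, epitope) not in positives:  # skip known positives
            negatives.add((tcr, epitope))
    return sorted(negatives)


def background_negatives(positive_pairs, background_tcrs, n_neg, seed=0):
    """Strategy 2: draw TCRs from a background repertoire.

    Pitfall flagged by the preprint: if the background pool differs
    systematically from the assay TCRs, a model can learn to separate
    the two pools instead of learning binding, which inflates measured
    performance and collapses to random in a more realistic test.
    """
    rng = random.Random(seed)
    epitopes = [epitope for _, epitope in positive_pairs]
    return [(rng.choice(background_tcrs), rng.choice(epitopes))
            for _ in range(n_neg)]
```

A usage sketch: given toy positives such as `[("TCR_A", "EP_1"), ("TCR_B", "EP_2")]`, the first function emits epitope–TCR recombinations absent from the positive set, while the second pairs the same epitopes with unrelated background TCRs; the preprint's point is that only the second construction introduces a pool-level signal a classifier can exploit.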