Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (CHIIR 2020)
DOI: 10.1145/3343413.3378004
Estimating Error and Bias in Offline Evaluation Results

Abstract: Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are missing from the …

Cited by 9 publications (11 citation statements). References 12 publications (19 reference statements).
“…We can use such simulated data to study the distortions in recommender system behavior (and metrics of that behavior or performance) between what would be observed in an experiment with observable data, and what would be observed in an experiment with access to the actual underlying truth through an oracle. One of my students has used this approach to measure biases in evaluation metrics that are induced by data missingness [16], and Cañamares and Castells [3] employed a probabilistic model to better understand popularity bias.…”
Section: Retrospective Simulation: Studying Assumptions (mentioning)
confidence: 99%
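To make the retrospective-simulation idea above concrete, here is a minimal sketch; all of it is an illustrative assumption rather than the protocol used in the cited work. It draws a synthetic "oracle" relevance matrix, hides entries with a popularity-skewed observation process, and compares a hit-rate-style metric computed on the observed data against the same metric computed with oracle access. The gap between the two numbers is the kind of metric bias the quote describes.

```python
# Hypothetical sketch of a retrospective simulation: compare an evaluation
# metric computed on observed (incomplete) data with the same metric
# computed against the simulated ground truth.  The preference model and
# missingness process below are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, k = 500, 200, 10

# "Oracle" relevance: each user likes a random subset of items.
true_rel = rng.random((n_users, n_items)) < 0.05

# Popularity-skewed observation: popular items are more likely to be observed.
item_pop = np.sort(rng.power(0.3, n_items))[::-1]
observed = true_rel & (rng.random((n_users, n_items)) < item_pop)

# A toy "recommender": rank items by global popularity in the observed data.
scores = observed.sum(axis=0)
top_k = np.argsort(-scores)[:k]

def hit_rate(rel, recs):
    """Fraction of users with at least one relevant item among the recommendations."""
    return rel[:, recs].any(axis=1).mean()

observed_metric = hit_rate(observed, top_k)   # what an offline evaluation would report
oracle_metric = hit_rate(true_rel, top_k)     # what the metric "should" be
print(f"observed={observed_metric:.3f}  oracle={oracle_metric:.3f}  "
      f"bias={observed_metric - oracle_metric:+.3f}")
```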
“…One of the major things that higher-degree simulation (anything above static data) affords in both of these, and other, scenarios is the ability to map out response curves for a recommender system or its surrounding experiments. In our study of recommender system metric bias [16], for example, we could extend the simulation to specifically model a variety of known degrees of popularity bias or of data sparsity, and estimate how the evaluation metric bias changes as a function of known changes in data biases. We don't necessarily know the degree of bias that is present in real data, but if we can understand the evaluation process's response curve to that bias, it will produce knowledge that can be combined with future research that may provide a better idea of where in the curve any particular actual system lies.…”
Section: Recommender System Response Curves (mentioning)
confidence: 99%
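The response-curve idea in the preceding quote can be sketched as a simple parameter sweep over the same kind of simulation: vary a knob that controls how strongly observation depends on item popularity and record the observed-minus-oracle metric gap at each setting. The bias knob, the mixing formula, and the helper function below are assumptions for illustration, not the cited experiment.

```python
# Hypothetical response-curve sweep: vary the strength of a simulated
# popularity bias and record how far the observed metric drifts from the
# oracle metric.  The mixing scheme is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(7)
n_users, n_items, k = 500, 200, 10
true_rel = rng.random((n_users, n_items)) < 0.05
item_pop = np.sort(rng.power(0.3, n_items))[::-1]

def metric_bias(bias_strength: float) -> float:
    """Observed-minus-oracle hit rate under a given degree of popularity bias."""
    # Interpolate between uniform observation and fully popularity-driven observation.
    p_obs = (1 - bias_strength) * item_pop.mean() + bias_strength * item_pop
    observed = true_rel & (rng.random((n_users, n_items)) < p_obs)
    top_k = np.argsort(-observed.sum(axis=0))[:k]
    observed_hr = observed[:, top_k].any(axis=1).mean()
    oracle_hr = true_rel[:, top_k].any(axis=1).mean()
    return observed_hr - oracle_hr

for strength in np.linspace(0.0, 1.0, 5):
    print(f"popularity-bias strength {strength:.2f}: metric bias {metric_bias(strength):+.3f}")
```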
“…Statistical biases are another factor that may influence the outcome of significance tests for RecSys evaluation data. It has become well known that biases such as sparsity and popularity biases in RecSys evaluation data considerably distort the evaluation measures [13,14,15,16,4]. Bellogín et al. [17] showed that the long-tailed distribution of RecSys evaluation data has a drastic effect on how recommendation algorithms compare to each other.…”
Section: Gaps for Fixing RecSys Evaluation Practice (mentioning)
confidence: 99%
“…My own research team has used LKPY to study errors in evaluation protocols [36], and to rebuild the experiments from our work on author gender biases [12] for an expanded version currently under review. Narayan [25] used LKPY to study the effect of rating obscure items, and we expect more such projects to use the software in the coming years.…”
Section: Offline Recommender System Research (mentioning)
confidence: 99%
“…• An example experiment, published on GitHub, that demonstrates a more realistic comparison of the effectiveness of recommendation algorithms on public data sets.
• Source code for experiments using LKPY [10,36].…”
Section: Documentation and Examples (mentioning)
confidence: 99%