Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (CHIIR 2020)
DOI: 10.1145/3343413.3378004
Estimating Error and Bias in Offline Evaluation Results

Abstract: Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are missing from the …

Cited by 9 publications (11 citation statements). References 12 publications (19 reference statements).
“…We can use such simulated data to study the distortions in recommender system behavior (and metrics of that behavior or performance) between what would be observed in an experiment with observable data, and what would be observed in an experiment with access to the actual underlying truth through an oracle. One of my students has used this approach to measure biases in evaluation metrics that are induced by data missingness [16], and Cañamares and Castells [3] employed a probabilistic model to better understand popularity bias.…”
Section: Retrospective Simulation: Studying Assumptions (mentioning)
confidence: 99%
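To make the retrospective-simulation idea above concrete, here is a minimal sketch; all of it is an illustrative assumption rather than the protocol used in the cited work. It draws a synthetic "oracle" relevance matrix, hides entries with a popularity-skewed observation process, and compares a hit-rate-style metric computed on the observed data against the same metric computed with oracle access. The gap between the two numbers is the kind of metric bias the quote describes.

```python
# Hypothetical sketch of a retrospective simulation: compare an evaluation
# metric computed on observed (incomplete) data with the same metric
# computed against the simulated ground truth.  The preference model and
# missingness process below are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, k = 500, 200, 10

# "Oracle" relevance: each user likes a random subset of items.
true_rel = rng.random((n_users, n_items)) < 0.05

# Popularity-skewed observation: popular items are more likely to be observed.
item_pop = np.sort(rng.power(0.3, n_items))[::-1]
observed = true_rel & (rng.random((n_users, n_items)) < item_pop)

# A toy "recommender": rank items by global popularity in the observed data.
scores = observed.sum(axis=0)
top_k = np.argsort(-scores)[:k]

def hit_rate(rel, recs):
    """Fraction of users with at least one relevant item among the recommendations."""
    return rel[:, recs].any(axis=1).mean()

observed_metric = hit_rate(observed, top_k)   # what an offline evaluation would report
oracle_metric = hit_rate(true_rel, top_k)     # what the metric "should" be
print(f"observed={observed_metric:.3f}  oracle={oracle_metric:.3f}  "
      f"bias={observed_metric - oracle_metric:+.3f}")
```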
“…One of the major things that higher-degree simulation (anything above static data) affords in both of these, and other, scenarios is the ability to map out response curves for a recommender system or its surrounding experiments. In our study of recommender system metric bias [16], for example, we could extend the simulation to specifically model a variety of known degrees of popularity bias or of data sparsity, and estimate how the evaluation metric bias changes as a function of known changes in data biases. We don't necessarily know the degree of bias that is present in real data, but if we can understand the evaluation process's response curve to that bias, it will produce knowledge that can be combined with future research that may provide a better idea of where in the curve any particular actual system lies.…”
Section: Recommender System Response Curves (mentioning)
confidence: 99%
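The response-curve idea in the preceding quote can be sketched as a simple parameter sweep over the same kind of simulation: vary a knob that controls how strongly observation depends on item popularity and record the observed-minus-oracle metric gap at each setting. The bias knob, the mixing formula, and the helper function below are assumptions for illustration, not the cited experiment.

```python
# Hypothetical response-curve sweep: vary the strength of a simulated
# popularity bias and record how far the observed metric drifts from the
# oracle metric.  The mixing scheme is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(7)
n_users, n_items, k = 500, 200, 10
true_rel = rng.random((n_users, n_items)) < 0.05
item_pop = np.sort(rng.power(0.3, n_items))[::-1]

def metric_bias(bias_strength: float) -> float:
    """Observed-minus-oracle hit rate under a given degree of popularity bias."""
    # Interpolate between uniform observation and fully popularity-driven observation.
    p_obs = (1 - bias_strength) * item_pop.mean() + bias_strength * item_pop
    observed = true_rel & (rng.random((n_users, n_items)) < p_obs)
    top_k = np.argsort(-observed.sum(axis=0))[:k]
    observed_hr = observed[:, top_k].any(axis=1).mean()
    oracle_hr = true_rel[:, top_k].any(axis=1).mean()
    return observed_hr - oracle_hr

for strength in np.linspace(0.0, 1.0, 5):
    print(f"popularity-bias strength {strength:.2f}: metric bias {metric_bias(strength):+.3f}")
```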
“…Statistical biases are another factor that may influence the outcome of significance tests for RecSys evaluation data. It has become well known that biases such as sparsity and popularity biases in RecSys evaluation data considerably distort the evaluation measures [13,14,15,16,4]. Bellogín et al. [17] showed that the long-tailed distribution of RecSys evaluation data has a drastic effect on how recommendation algorithms compare to each other.…”
Section: Gaps for Fixing RecSys Evaluation Practice (mentioning)
confidence: 99%
“…My own research team has used LKPY to study errors in evaluation protocols [36], and to rebuild the experiments from our work on author gender biases [12] for an expanded version currently under review. Narayan [25] used LKPY to study the effect of rating obscure items, and we expect more such projects to use the software in the coming years.…”
Section: Offline Recommender System Research (mentioning)
confidence: 99%
“…• An example experiment, published on GitHub, that demonstrates a more realistic comparison of the effectiveness of recommendation algorithms on public data sets.
• Source code for experiments using LKPY [10,36].…”
Section: Documentation and Examples (mentioning)
confidence: 99%