Fourteenth ACM Conference on Recommender Systems 2020
DOI: 10.1145/3383313.3412489
Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison

Cited by 101 publications (121 citation statements). References 32 publications.
“…However, the lack of a standard framework or implementation of algorithms and evaluation methodologies impedes the progress in the field, as evidenced in other areas. In this context, there are examples showing that, in, e.g., the Information Retrieval area, even when standard datasets are used and a well-established set of baselines is known, no overall improvement over the years is guaranteed (Armstrong et al. 2009; Sun et al. 2020). Analogously, in a series of prior works focusing on the evaluation, replication, and reproducibility of Recommender Systems algorithms and evaluation results, we have identified a set of aspects that need to be taken into consideration when comparing the results of recommender systems from different research papers, software frameworks, or evaluation contexts (Said and Bellogín 2014; Said and Bellogín 2015).…”
Section: Recommender Systems (mentioning)
confidence: 99%
“…DaisyRec (unversioned) is a recent framework developed in PyTorch, focused on benchmarking previous research works, as described by Sun et al. (2020). RiVal (version 0.2) was presented in a previous work (Said and Bellogín 2014) as a framework oriented to the evaluation of external recommender systems.…”
Section: Instantiations Of Accountable Experimental Framework (mentioning)
confidence: 99%
“…We choose to consider as positive only ratings of 4 and 5, which surely reflect a positive preference, and every other rating as a negative preference. In addition, and as is common in experimentation with recommenders, the absence of a rating is also taken as a negative interaction in all datasets [42]. As shown by Cañamares et al [10], although users with few ratings might exist in commercial services, they are usually filtered out in offline experiments because the lack of data leads to unreliable performance measurements.…”
Section: Datasets (mentioning)
confidence: 99%
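
The binarization convention described in the excerpt above can be made concrete with a short sketch. The snippet below is illustrative only and is not code from the cited papers; the pandas DataFrame layout, column names, and function name are assumptions, while the rating threshold (4 and 5 as positive) follows the excerpt.

    import pandas as pd

    def binarize_ratings(ratings: pd.DataFrame, threshold: int = 4) -> pd.DataFrame:
        """Map explicit ratings to binary preferences.

        Ratings at or above `threshold` (here 4 and 5) become positive (1);
        every other observed rating becomes negative (0). Unobserved
        (user, item) pairs are treated as negative interactions downstream,
        per the convention described in the excerpt above.
        """
        out = ratings.copy()
        out["label"] = (out["rating"] >= threshold).astype(int)
        return out

    # Toy example (columns are hypothetical, for illustration only).
    toy = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 10], "rating": [5, 3, 4]})
    print(binarize_ratings(toy))  # ratings 5 and 4 -> label 1, rating 3 -> label 0
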
“…Nevertheless, while it is paramount to constantly develop new algorithms to advance the state-of-the-art, we believe that the evaluation procedure needs to be well-defined and robust so as to guarantee the validity of the obtained results. Very recently, several studies [30,70,132] have pinpointed worrisome problems that appear to undermine years of hard work within the recommendation systems community. Notably, [30] have discovered that many recently proposed methods are in fact not reproducible, and [132] noted that researchers often choose different datasets heuristically and there are in fact many seemingly trivial factors which can influence the recommendation performance.…”
Section: Introduction 1.1 Background (mentioning)
confidence: 99%