Fourteenth ACM Conference on Recommender Systems 2020
DOI: 10.1145/3383313.3412489
Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison

Cited by 101 publications (121 citation statements). References 32 publications.
“…However, the lack of a standard framework or implementation of algorithms and evaluation methodologies impedes the progress in the field, as evidenced in other areas. In this context, there are examples showing that, in, e.g., the Information Retrieval area, even when standard datasets are used and a well-established set of baselines is known, no overall improvement over the years is guaranteed (Armstrong et al. 2009; Sun et al. 2020). Analogously, in a series of prior works focusing on the evaluation, replication, and reproducibility of Recommender Systems algorithms and evaluation results, we have identified a set of aspects that need to be taken into consideration when comparing the results of recommender systems from different research papers, software frameworks, or evaluation contexts (Said and Bellogín 2014; Said and Bellogín 2015).…”
Section: Recommender Systems (mentioning)
confidence: 99%
“…DaisyRec (unversioned) is a recent framework developed in PyTorch, focused on benchmarking previous research works, as described by Sun et al. (2020). RiVal (version 0.2) was presented in a previous work (Said and Bellogín 2014) as a framework oriented to the evaluation of external recommender systems.…”
Section: Instantiations Of Accountable Experimental Framework (mentioning)
confidence: 99%
“…We choose to consider as positive only ratings of 4 and 5, which surely reflect a positive preference, and every other rating as a negative preference. In addition, and as is common in experimentation with recommenders, the absence of a rating is also taken as a negative interaction in all datasets [42]. As shown by Cañamares et al [10], although users with few ratings might exist in commercial services, they are usually filtered out in offline experiments because the lack of data leads to unreliable performance measurements.…”
Section: Datasets (mentioning)
confidence: 99%
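
The binarization convention described in the excerpt above can be made concrete with a short sketch. The snippet below is illustrative only and is not code from the cited papers; the pandas DataFrame layout, column names, and function name are assumptions, while the rating threshold (4 and 5 as positive) follows the excerpt.

    import pandas as pd

    def binarize_ratings(ratings: pd.DataFrame, threshold: int = 4) -> pd.DataFrame:
        """Map explicit ratings to binary preferences.

        Ratings at or above `threshold` (here 4 and 5) become positive (1);
        every other observed rating becomes negative (0). Unobserved
        (user, item) pairs are treated as negative interactions downstream,
        per the convention described in the excerpt above.
        """
        out = ratings.copy()
        out["label"] = (out["rating"] >= threshold).astype(int)
        return out

    # Toy example (columns are hypothetical, for illustration only).
    toy = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 10], "rating": [5, 3, 4]})
    print(binarize_ratings(toy))  # ratings 5 and 4 -> label 1, rating 3 -> label 0
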
“…Nevertheless, while it is paramount to constantly develop new algorithms to advance the state-of-the-art, we believe that the evaluation procedure needs to be well-defined and robust so as to guarantee the validity of the obtained results. Very recently, several studies [30,70,132] have pinpointed worrisome problems that appear to undermine years of hard work within the recommendation systems community. Notably, [30] have discovered that many recently proposed methods are in fact not reproducible, and [132] noted that researchers often choose different datasets heuristically and there are in fact many seemingly trivial factors which can influence the recommendation performance.…”
Section: Introduction 1.1 Background (mentioning)
confidence: 99%