Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests are the bread and butter of modern web platforms. They are conducted continuously to estimate the causal effect of replacing system variant "A" with variant "B" on some metric of interest. These variants can differ in many aspects. In this paper, we focus on the common use-case where they correspond to machine learning models. The online experiment then serves as the final arbiter that decides which model is superior, and should thus be shipped.

The statistical literature on causal effect estimation from RCTs has a substantial history, which deservedly contributes to the level of trust researchers and practitioners place in this "gold standard" of evaluation practices. Nevertheless, in the particular case of machine learning experiments, certain critical issues remain. Specifically, the assumptions required to ascertain that A/B-tests yield unbiased estimates of the causal effect are seldom met in practical applications. We argue that, because variants typically learn from pooled data, an absence of model interference cannot be guaranteed. This undermines the conclusions we can draw from online experiments with machine learning models. We discuss the implications this has for practitioners, and for the research literature.

Randomised Controlled Trials and their Assumptions

Randomised experiments have existed in the scientific literature for close to 140 years, first introduced in psychology [Peirce and Jastrow, 1884]. Since then, they have been a popular topic of study in the statistical literature (a feat often ascribed to the seminal works of Fisher [1925, 1936]) and are generally well-understood [Imbens and Rubin, 2015].
Randomised Controlled Trials (RCTs) form the theoretical basis for the online experiments that modern web platforms run continuously [Gupta et al., 2019], colloquially known as A/B-tests [Kohavi et al., 2020]. Generally speaking, RCTs deal with treatments being applied to units, leading to certain outcomes [Rubin, 1974]. Typical examples from the early literature revolve around agricultural applications, where different types of fertiliser can be applied to plots of land, which has an effect on crop yield. In an RCT, we randomly assign units to treatment/control, and as a result, the difference between the average measured outcomes for units under treatment T and under control C gives a finite-sample estimate of the Average Treatment Effect (ATE).

ACM SIGIR Forum. Working draft.
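The randomised assignment and difference-in-means estimation described above can be sketched as follows. This is a minimal simulation under illustrative assumptions (outcome distributions, a constant +0.02 treatment effect, and a 50/50 assignment probability are all invented for the example, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated units with potential outcomes under control (A) and treatment (B).
# The metric, noise level, and effect size are illustrative assumptions.
n = 10_000
y_control = rng.normal(loc=0.10, scale=0.05, size=n)  # e.g. engagement under A
y_treatment = y_control + 0.02                        # constant effect of +0.02

# Randomised assignment: each unit independently receives treatment with p = 0.5.
assign = rng.random(n) < 0.5
observed = np.where(assign, y_treatment, y_control)   # we only see one outcome

# Difference-in-means estimator of the Average Treatment Effect (ATE).
ate_hat = observed[assign].mean() - observed[~assign].mean()
```

Because assignment is randomised, the two groups are exchangeable in expectation, and `ate_hat` is an unbiased finite-sample estimate of the true effect (here, 0.02).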
Recommender systems are typically evaluated in an offline setting. A subset of the available user-item interactions is sampled to serve as a test set, and a model trained on the remaining data points is then evaluated on its ability to predict which interactions were left out. Alternatively, in an online evaluation setting, multiple versions of the system are deployed and various metrics for those systems are recorded. Systems that score better on these metrics are then typically preferred. Online evaluation is effective, but inefficient for a number of reasons. Offline evaluation is much more efficient, but current methodologies often fail to accurately predict online performance. In this work, we identify three ways to improve and extend current work on offline evaluation methodologies. More specifically, we believe there is much room for improvement in temporal evaluation, off-policy evaluation, and moving beyond using just clicks to evaluate performance.

CCS CONCEPTS: • Information systems → Recommender systems; Evaluation of retrieval results.
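The temporal evaluation the abstract advocates differs from the common random hold-out split in that the test set only contains interactions that occur after the training data. A minimal sketch, where the field names, toy interactions, and global cut-off are illustrative assumptions:

```python
import numpy as np

# Toy log of (user, item, timestamp) interactions; values are invented.
interactions = np.array(
    [(0, 10, 1), (0, 11, 2), (0, 12, 3),
     (1, 10, 1), (1, 13, 4)],
    dtype=[("user", int), ("item", int), ("ts", int)],
)

# Temporal split: everything after a global cut-off goes to the test set,
# so the model never trains on interactions from the "future".
cutoff = 2
train = interactions[interactions["ts"] <= cutoff]
test = interactions[interactions["ts"] > cutoff]
```

A random hold-out split would instead sample test interactions uniformly, leaking future behaviour into training, which is one reason offline results can fail to predict online performance.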
Conventional approaches to recommendation often do not explicitly take into account information on previously shown recommendations and their recorded responses. One reason is that, since we do not know the outcome of actions the system did not take, learning directly from such logs is not a straightforward task. Several methods for off-policy or counterfactual learning have been proposed in recent years, but their efficacy for the recommendation task remains understudied. Due to the limitations of offline datasets and most academic researchers' lack of access to online experiments, studying this is a non-trivial task. Simulation environments can provide a reproducible solution to this problem. In this work, we conduct the first broad empirical study of counterfactual learning methods for recommendation, in a simulated environment. We consider various policy-based methods that make use of the Inverse Propensity Score (IPS) to perform Counterfactual Risk Minimisation (CRM), as well as value-based methods based on Maximum Likelihood Estimation (MLE). We highlight how existing off-policy learning methods fail due to stochastic and sparse rewards, and show how a logarithmic variant of the traditional IPS estimator can solve these issues, whilst convexifying the objective and thus facilitating its optimisation. Additionally, under certain assumptions the value- and policy-based methods have an identical parameterisation, allowing us to propose a new model that combines both the MLE and CRM objectives. Extensive experiments show that this "Dual Bandit" approach achieves state-of-the-art performance in a wide range of scenarios, for varying logging policies, action spaces and training sample sizes.
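The core IPS estimator underlying the policy-based methods mentioned above can be sketched in a few lines. This is a generic illustration, not the paper's implementation: the logging policy, target policy, and Bernoulli reward probabilities below are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative off-policy setup: a logging policy pi0 chose among 3 actions;
# we estimate the value of a different target policy pi from the logged data.
pi0 = np.array([0.5, 0.3, 0.2])          # logging policy (action probabilities)
pi = np.array([0.2, 0.3, 0.5])           # target policy we wish to evaluate
true_reward = np.array([0.1, 0.2, 0.3])  # expected Bernoulli reward per action

n_logs = 100_000
actions = rng.choice(3, size=n_logs, p=pi0)
rewards = (rng.random(n_logs) < true_reward[actions]).astype(float)  # sparse, stochastic

# Inverse Propensity Scoring: re-weight each logged reward by pi / pi0,
# correcting for the mismatch between logging and target action distributions.
weights = pi[actions] / pi0[actions]
ips_estimate = float(np.mean(weights * rewards))

# Ground-truth value of the target policy, for reference.
true_value = float(pi @ true_reward)  # = 0.23
```

The estimator is unbiased as long as `pi0` has full support over the actions `pi` can take, but its variance grows with the importance weights; the stochastic, sparse rewards it re-weights are exactly what the abstract identifies as the failure mode of naive off-policy learning.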
Recent work has shown that, despite their simplicity, item-based models optimised through ridge regression can attain highly competitive results on collaborative filtering tasks. As these models are analytically computable and thus forgo the need for often expensive iterative optimisation procedures, they are an attractive choice for practitioners. We study the applicability of such closed-form models to implicit-feedback collaborative filtering when additional side-information or metadata about items is available. Two complementary extensions to the EASE^R paradigm are proposed, based on collective and additive models. Through an extensive empirical analysis on several large-scale datasets, we show that our methods can effectively exploit side-information whilst retaining a closed-form solution, and improve upon the state-of-the-art without increasing the computational complexity of the original EASE^R approach. Additionally, empirical results demonstrate that the use of side-information leads to more "long tail" items being recommended, benefiting the recommendations' coverage of the item catalogue.
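The closed-form item-based model this abstract builds on (EASE^R [Steck, 2019]) admits an analytic solution: invert the regularised item-item Gram matrix and rescale its columns. A minimal sketch, where the toy interaction matrix and regularisation strength are illustrative assumptions:

```python
import numpy as np

# Toy user-item implicit-feedback matrix (values invented for the example).
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 1, 0]], dtype=float)

lam = 1.0  # L2-regularisation strength (illustrative)

# Closed-form solution: P = (X^T X + lam * I)^{-1}, then
# B_ij = -P_ij / P_jj with a zero diagonal (no self-similarity).
P = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1]))
B = P / (-np.diag(P))          # broadcasts column-wise over P_jj
np.fill_diagonal(B, 0.0)

scores = X @ B                 # recommendation scores per user
```

No iterative optimisation is needed: a single matrix inversion yields the item-item weight matrix, which is what makes side-information extensions attractive as long as they preserve this closed form.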