Abstract: Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, and recommendation. However, existing learning methods for contextual bandit problems have one of two drawbacks: they either do not explore the space of all possible document rankings (i.e., actions) and thus may miss the optimal ranking, or they present suboptimal rankings to a user and thus may harm the user experience. We introduce a new learning method for contextual …
“…In this section, we prove that the relative bounds of GENSPEC are more efficient than SEA bounds [11], when the covariance between the reward estimates of two models is positive:…”
Section: B Efficiency Of Relative Bounding (mentioning)
confidence: 99%
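The variance algebra behind that claim can be checked numerically. The snippet below is a minimal sketch, not GENSPEC's actual estimator: two reward estimates that share the same log data have positive covariance, so a single bound on their difference, whose variance is Var(A) + Var(B) − 2 Cov(A, B), is tighter than the sum of two separate per-model bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated reward estimates for two models evaluated on the same clicks:
# positive covariance arises because both estimates share the same log data.
n = 100_000
shared = rng.normal(size=n)
est_a = 0.52 + 0.10 * shared + 0.05 * rng.normal(size=n)
est_b = 0.50 + 0.10 * shared + 0.05 * rng.normal(size=n)

var_a, var_b = est_a.var(), est_b.var()
cov_ab = np.cov(est_a, est_b)[0, 1]

# Bounding each model separately pays for both standard deviations;
# bounding the difference directly subtracts 2*Cov when Cov > 0.
width_separate = np.sqrt(var_a) + np.sqrt(var_b)        # two separate bounds
width_relative = np.sqrt(var_a + var_b - 2.0 * cov_ab)  # one bound on A - B
```

Here the relative width is roughly a third of the separate widths, which is the efficiency gain the excerpt refers to.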
“…Even though it is known that deployment should be avoided in such cases, to the best of our knowledge, there exists no theoretically principled method for detecting when it is safe to deploy a tabular model. The only existing method that safely chooses between models appears to be the Safe Exploration Algorithm (SEA) [11], which applies high-confidence bounds to the performance of a safe logging policy model and a newly learned ranking model. If these bounds do not overlap, SEA can conclude with high confidence that one model outperforms the other.…”
Section: Related Work (mentioning)
confidence: 99%
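SEA's decision rule amounts to an interval comparison. The sketch below is a hypothetical illustration, not the published algorithm: `hoeffding_radius`, `sea_choice`, and the sample counts are assumed names, and SEA's actual bounds follow the construction cited as [35] in the excerpts.

```python
import math

def hoeffding_radius(n, delta=0.05, reward_range=1.0):
    # Hoeffding-style high-confidence radius for a mean of n bounded rewards.
    return reward_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def sea_choice(mean_safe, n_safe, mean_new, n_new, delta=0.05):
    """Hypothetical sketch of SEA's rule: deploy the newly learned model only
    when its lower bound clears the safe model's upper bound."""
    r_safe = hoeffding_radius(n_safe, delta)
    r_new = hoeffding_radius(n_new, delta)
    if mean_new - r_new > mean_safe + r_safe:
        return "new"        # bounds do not overlap: new model wins w.h.p.
    if mean_safe - r_safe > mean_new + r_new:
        return "safe"       # bounds do not overlap: safe model wins w.h.p.
    return "undecided"      # overlapping bounds: keep logging safely
```

With 10,000 interactions per model, a 0.50 vs. 0.60 gap separates the intervals and the new model is chosen; with only 100 interactions the same rule stays undecided.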
“…It also appears that there is no clear optimal choice between 𝜋 𝜃 and 𝜋 D ; instead, this choice seems to mostly depend on the available data D. We wish to deploy the model that leads to the highest performance; however, we also want to avoid a detrimental user experience due to choosing the wrong model. Recent work by Jagerman et al. [11] introduced the Safe Exploration Algorithm (SEA) for choosing safely between a safe model and a risky learned model. SEA applies high-confidence bounds [35] to the performances of both models.…”
Section: Reliably Choosing Between Models (mentioning)
confidence: 99%
“…In previous work, these LTR methods have been divided into online and counterfactual approaches [4,11,12], where online approaches learn from direct interactions [23,44,45], and counterfactual approaches learn from historical interaction data [15,25,39]. While this division is very interesting [4,12,26], this paper focuses on a different division: between methods that learn feature-based models and those that learn tabular models.…”
Existing work in counterfactual Learning to Rank (LTR) has focused on optimizing feature-based models that predict the optimal ranking based on document features. LTR methods based on bandit algorithms often optimize tabular models that memorize the optimal ranking per query. These types of models have their own advantages and disadvantages. Feature-based models provide robust performance across many queries, including previously unseen ones; however, the available features often limit the rankings the model can predict. In contrast, tabular models can converge on any possible ranking through memorization. However, memorization is extremely prone to noise, which makes tabular models reliable only when large numbers of user interactions are available. Can we develop a robust counterfactual LTR method that pursues memorization-based optimization whenever it is safe to do so?

We introduce the Generalization and Specialization (GENSPEC) algorithm, a robust feature-based counterfactual LTR method that pursues per-query memorization when it is safe to do so. GENSPEC optimizes a single feature-based model for generalization (robust performance across all queries) and many tabular models for specialization (each optimized for high performance on a single query). GENSPEC uses novel relative high-confidence bounds to choose which model to deploy per query. By doing so, GENSPEC combines the high performance of successfully specialized tabular models with the robustness of a generalized feature-based model. Our results show that GENSPEC reaches optimal performance on queries with sufficient click data, while behaving robustly on queries with little or noisy data.
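The per-query deployment logic described in the abstract can be sketched as follows. This is an illustrative reading, not GENSPEC's published procedure: `ConstModel`, `deploy_per_query`, and the Hoeffding-style radius are all assumptions, and the real method uses its novel relative bounds on the reward difference.

```python
import math
from dataclasses import dataclass

@dataclass
class ConstModel:
    # Toy stand-in for a ranker: reward() estimates per-interaction reward.
    value: float

    def reward(self, interaction):
        return self.value

def deploy_per_query(queries, general_model, tabular_models, click_logs,
                     delta=0.05):
    """Deploy the specialized tabular model for a query only when its estimated
    reward beats the generalizing model with high confidence on that query's
    own click data; otherwise fall back to the feature-based model."""
    policy = {}
    for q in queries:
        logs = click_logs.get(q, [])
        tab = tabular_models.get(q)
        if tab is None or not logs:
            policy[q] = general_model          # no data to specialize safely
            continue
        # Bound the per-interaction reward *difference* directly.
        diffs = [tab.reward(x) - general_model.reward(x) for x in logs]
        n = len(diffs)
        mean_diff = sum(diffs) / n
        radius = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Hoeffding-style
        policy[q] = tab if mean_diff - radius > 0 else general_model
    return policy

general = ConstModel(0.5)
tabs = {"q1": ConstModel(0.8), "q2": ConstModel(0.8)}
logs = {"q1": list(range(200)), "q2": list(range(3))}  # many vs. few clicks
policy = deploy_per_query(["q1", "q2"], general, tabs, logs)
```

With 200 logged interactions the bound clears zero and the tabular model is deployed for "q1"; with only 3 interactions the bound is too wide and "q2" keeps the robust feature-based model.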
“…Hence, in order to reduce variance and speed up learning, we simplify the MDP to a contextual bandit [19,1,17] by setting γ = 0. This setting makes REINFORCE choose a_t so as to maximize only the expectation of the immediate reward R(s_t, a_t):…”
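With γ = 0 the return collapses to the immediate reward, so the REINFORCE update reduces to R(s_t, a_t) · ∇ log π(a_t | s_t). The minimal context-free sketch below shows this update on a three-arm bandit; the reward values, learning rate, and step count are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit: 3 actions, immediate rewards only (gamma = 0), so the
# REINFORCE gradient is just R(s_t, a_t) * grad log pi(a_t | s_t).
true_rewards = np.array([0.1, 0.8, 0.3])
logits = np.zeros(3)

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax policy
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + 0.1 * rng.normal()       # noisy immediate reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                          # d log softmax / d logits
    logits += 0.1 * r * grad_log_pi                # REINFORCE step, gamma = 0
```

After training, the policy concentrates its probability mass on the highest-reward action.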
Online learning to rank (OLTR) aims to learn a ranker directly from implicit feedback derived from users' interactions, such as clicks. Clicks, however, are a biased signal: specifically, top-ranked documents are likely to attract more clicks than documents further down the ranking (position bias). In this paper, we propose a novel learning algorithm for OLTR that uses reinforcement learning to optimize rankers: Reinforcement Online Learning to Rank (ROLTR). In ROLTR, the gradients of the ranker are estimated based on the rewards assigned to clicked and unclicked documents. To remove the users' position bias from the reward signals, we introduce unbiased reward shaping functions that exploit inverse propensity scoring for clicked and unclicked documents. The fact that our method can also model unclicked documents provides a further advantage: fewer user interactions are required to effectively train a ranker, yielding gains in efficiency. Empirical evaluation on standard OLTR datasets shows that ROLTR achieves state-of-the-art performance and provides a significantly better user experience than other OLTR approaches. To facilitate the reproducibility of our experiments, we make all experiment code available at https://github.com/ielab/OLTR.
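The inverse propensity scoring idea behind the reward shaping can be illustrated in isolation. The sketch below is not ROLTR's actual shaping function (which also assigns de-biased rewards to unclicked documents); it only demonstrates how 1/propensity weighting removes position bias from click-based estimates, with the propensities and relevances invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Position bias: probability that each rank is examined by the user.
propensity = np.array([1.0, 0.6, 0.3, 0.1])
relevance = np.array([0.2, 0.9, 0.5, 0.4])  # true click prob. if examined

# Simulate sessions: a click requires examination AND relevance.
n = 200_000
examined = rng.random((n, 4)) < propensity
clicks = examined & (rng.random((n, 4)) < relevance)

naive = clicks.mean(axis=0)               # conflates relevance with rank
ips = (clicks / propensity).mean(axis=0)  # inverse propensity scoring
```

The naive click rate at rank 1 is roughly `propensity * relevance` = 0.54 and badly underestimates the document's relevance of 0.9, while the IPS estimate recovers the true relevance at every rank.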