2020
DOI: 10.1145/3385670

Safe Exploration for Optimizing Contextual Bandits

Abstract: Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, and so on. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual …

Cited by 11 publications (16 citation statements)
References 38 publications
“…In this section, we prove that the relative bounds of GENSPEC are more efficient than SEA bounds [11], when the covariance between the reward estimates of two models is positive:…”
Section: B Efficiency of Relative Bounding
confidence: 99%
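One way to read the efficiency claim above is through a standard variance identity. Writing \hat{R}_A and \hat{R}_B for the two models' reward estimates (our shorthand, not the cited papers' notation), a bound placed directly on their difference scales with

```latex
\operatorname{Var}(\hat{R}_A - \hat{R}_B)
  = \operatorname{Var}(\hat{R}_A) + \operatorname{Var}(\hat{R}_B)
  - 2\,\operatorname{Cov}(\hat{R}_A, \hat{R}_B)
```

so a positive covariance between the two estimates tightens a relative bound, whereas two separate per-model bounds, as used for the comparison in SEA, cannot exploit that term.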
“…Even though it is known that deployment should be avoided in such cases, to the best of our knowledge, there exists no theoretically principled method for detecting when it is safe to deploy a tabular model. The only existing method that safely chooses between models appears to be the Safe Exploration Algorithm (SEA) [11], which applies high-confidence bounds to the performance of a safe logging policy model and a newly learned ranking model. If these bounds do not overlap, SEA can conclude with high confidence that one model outperforms the other.…”
Section: Related Work
confidence: 99%
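As a rough illustration of the non-overlap test described in the statement above (not the paper's exact procedure), the following sketch builds a Hoeffding interval around each model's estimated reward and only declares the new model safe to deploy when its lower bound clears the logging policy's upper bound. The function names and the choice of Hoeffding's inequality are our assumptions.

```python
import numpy as np

def hoeffding_bound(rewards, delta=0.05, reward_range=1.0):
    """Two-sided Hoeffding confidence interval for a mean-reward estimate.

    Generic sketch of a high-confidence bound; the cited method may use a
    different concentration inequality.
    """
    n = len(rewards)
    mean = float(np.mean(rewards))
    radius = reward_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return mean - radius, mean + radius

def safe_to_deploy(logging_rewards, new_model_rewards, delta=0.05):
    """Return True only if the new model's lower bound exceeds the logging
    policy's upper bound, i.e. the intervals are disjoint in the new
    model's favour."""
    _, logging_upper = hoeffding_bound(logging_rewards, delta)
    new_lower, _ = hoeffding_bound(new_model_rewards, delta)
    return new_lower > logging_upper
```

Any other valid high-confidence bound could be substituted; the decision rule only requires that the two intervals do not overlap.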
“…Hence, in order to reduce variance and speed up learning, we simplify the MDP to Contextual Bandits [19,1,17] by setting γ = 0. This setting makes REINFORCE choose a_t so as to maximize only the expectation of the immediate reward R(s_t, a_t):…”
Section: Learning With Policy Gradient
confidence: 99%
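Setting γ = 0 collapses the REINFORCE return to the immediate reward, so each update reduces to a contextual-bandit policy-gradient step: the reward times the score function of the chosen action. A minimal sketch, assuming a softmax policy over per-action context features; the parameterisation and function name are illustrative and not taken from the cited paper.

```python
import numpy as np

def reinforce_bandit_update(theta, features, action, reward, lr=0.01):
    """One REINFORCE step with gamma = 0: the return is just the immediate
    reward, so the update is reward * grad log pi(action | context).

    features: array of shape (num_actions, dim) with context features per
    candidate action; theta: policy parameters of shape (dim,).
    """
    # Softmax policy over the candidate actions.
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log softmax probability of the chosen action w.r.t. theta:
    # phi(action) minus the probability-weighted average feature vector.
    grad_log_pi = features[action] - probs @ features
    # Gradient ascent on the expected immediate reward.
    return theta + lr * reward * grad_log_pi
```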