2019
DOI: 10.1007/978-3-030-15712-8_25
Optimizing Ranking Models in an Online Setting

Abstract: Online Learning to Rank (OLTR) methods optimize ranking models by directly interacting with users, which allows them to be very efficient and responsive. All OLTR methods introduced during the past decade have extended the original OLTR method: Dueling Bandit Gradient Descent (DBGD). Recently, a fundamentally different approach was introduced with the Pairwise Differentiable Gradient Descent (PDGD) algorithm. To date, the only comparisons of the two approaches are limited to simulations with cascading click …

Cited by 12 publications (10 citation statements) | References 35 publications
“…The gradient estimation of PDGD is unbiased with respect to users' document-pair preferences [31]. PDGD is empirically found to be significantly better than DBGD in terms of final convergence, learning speed, and user experience during optimization, making PDGD the current state-of-the-art method for OLTR [20,64,32,52]. PDGD has also been adapted to the federated OLTR context [51], again exhibiting state-of-the-art performance.…”
Section: Online Learning To Rank
confidence: 99%
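To make the PDGD mechanism described above concrete, here is a minimal, illustrative Python sketch of one PDGD-style update for a linear ranker. It samples a ranking via the Gumbel-max trick (equivalent to Plackett-Luce sampling over the model's scores), infers pairwise preferences from clicks, and takes a pairwise gradient step. For brevity it omits PDGD's debiasing weight, which reweights each pair by comparing the probability of the observed ranking with that of the pair-swapped ranking, so this is a simplified sketch rather than the authors' exact algorithm; all function names and parameters are illustrative.

```python
import numpy as np

def plackett_luce_sample(scores, rng):
    """Sample a ranking from a Plackett-Luce distribution over scores
    via the Gumbel-max trick (add Gumbel noise, then sort)."""
    gumbel = rng.gumbel(size=len(scores))
    return np.argsort(-(scores + gumbel))

def infer_preferences(ranking, clicks):
    """Infer pairwise preferences from clicks: each clicked document is
    preferred over the unclicked documents ranked above it and the
    first unclicked document directly below it."""
    prefs = []
    clicked = set(np.flatnonzero(clicks))          # clicked positions
    for pos in clicked:
        for other in list(range(pos)) + [pos + 1]:
            if 0 <= other < len(ranking) and other not in clicked:
                prefs.append((ranking[pos], ranking[other]))
    return prefs

def pdgd_update(w, X, ranking, clicks, lr=0.1):
    """One simplified PDGD-style step for a linear ranker w over
    feature matrix X (one row per document)."""
    scores = X @ w
    grad = np.zeros_like(w)
    for d_i, d_j in infer_preferences(ranking, clicks):
        # P(d_i > d_j) under a softmax over the pair's scores.
        p = 1.0 / (1.0 + np.exp(scores[d_j] - scores[d_i]))
        # Gradient of log P(d_i > d_j); PDGD's debiasing weight is omitted.
        grad += (1.0 - p) * (X[d_i] - X[d_j])
    return w + lr * grad

# Illustrative usage: one simulated interaction on 5 random documents.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 documents, 3 features
w = np.zeros(3)
ranking = plackett_luce_sample(X @ w, rng)
clicks = np.zeros(5, dtype=bool)
clicks[1] = True                                   # user clicked rank 2
w = pdgd_update(w, X, ranking, clicks)
```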
“…The earliest method, Dueling Bandit Gradient Descent (DBGD), samples variations of a ranking model and compares them using online evaluation [10]; if an improvement is recognized, the model is updated accordingly. Most online LTR methods have improved the data-efficiency of DBGD [9,24,26]; later work found that DBGD is not effective at optimizing neural models [17] and often fails to find the optimal linear model even in ideal scenarios [18]. In response to these limitations, alternative approaches for online LTR have been proposed.…”
Section: Related Work
confidence: 99%
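As context for the DBGD procedure this statement describes, the following is a minimal Python sketch of one dueling-bandit step with a linear ranker and Team-Draft interleaving. It is an illustrative reconstruction under common simplifying assumptions (a single candidate direction and a click-count win criterion), not the exact implementation from any cited paper; `clicks_fn` is a hypothetical stand-in for user feedback.

```python
import numpy as np

def rank_docs(w, X):
    """Rank documents by descending linear score."""
    return list(np.argsort(-(X @ w)))

def team_draft_interleave(rank_a, rank_b, rng):
    """Team-Draft interleaving: in random order, each ranker picks its
    highest-ranked document not yet in the interleaved list."""
    interleaved, teams, used = [], [], set()
    lists = [list(rank_a), list(rank_b)]
    while len(used) < len(rank_a):
        for t in rng.permutation(2):               # randomize pick order
            while lists[t] and lists[t][0] in used:
                lists[t].pop(0)
            if lists[t]:
                doc = lists[t].pop(0)
                interleaved.append(doc)
                teams.append(t)
                used.add(doc)
    return interleaved, teams

def dbgd_update(w, X, clicks_fn, delta=1.0, eta=0.01, rng=None):
    """One DBGD step: duel the current model against a perturbed
    candidate; move toward the candidate if it wins more clicks."""
    rng = rng or np.random.default_rng()
    u = rng.normal(size=w.shape)
    u /= np.linalg.norm(u)                         # random unit direction
    w_cand = w + delta * u
    inter, teams = team_draft_interleave(rank_docs(w, X),
                                         rank_docs(w_cand, X), rng)
    clicks = clicks_fn(inter)                      # hypothetical user feedback
    wins = [sum(c for c, t in zip(clicks, teams) if t == k) for k in (0, 1)]
    if wins[1] > wins[0]:                          # candidate won the duel
        w = w + eta * u
    return w
```

The data-efficiency improvements cited above largely replace this single duel with multileaved comparisons of many candidate directions per interaction.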
“…Our experimental setup is semi-synthetic: queries, relevance judgements, and documents come from industry datasets, while biased and noisy user interactions are simulated using probabilistic user models. This setup is very common in the counterfactual and online LTR literature [1,15,24]. We make use of the three largest LTR industry datasets: Yahoo!…”
Section: The Semi-synthetic Setup
confidence: 99%
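The simulated interactions referenced in this setup typically come from probabilistic click models applied to the datasets' graded relevance labels. Below is a minimal Python sketch of a cascade-style click simulator; the click and stopping probabilities are illustrative placeholders, not the exact user-model configurations used in the cited experiments.

```python
import numpy as np

# Illustrative click probabilities per relevance grade (0-4); real
# experiments configure these per user model (e.g. perfect vs. noisy).
P_CLICK = {0: 0.05, 1: 0.1, 2: 0.2, 3: 0.4, 4: 0.8}

def simulate_cascade_clicks(ranking, relevance, stop_prob=0.5, rng=None):
    """Cascade user model: scan the ranking top-down, click with a
    relevance-dependent probability, and stop after a click with
    probability stop_prob."""
    rng = rng or np.random.default_rng()
    clicks = np.zeros(len(ranking), dtype=bool)
    for pos, doc in enumerate(ranking):
        if rng.random() < P_CLICK[relevance[doc]]:
            clicks[pos] = True
            if rng.random() < stop_prob:
                break
    return clicks

# Example: simulate clicks on a 5-document ranking with graded labels.
rng = np.random.default_rng(1)
relevance = np.array([4, 0, 2, 1, 3])              # per-document labels
print(simulate_cascade_clicks([0, 2, 4, 1, 3], relevance, rng=rng))
```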