2020
DOI: 10.1145/3385670

Safe Exploration for Optimizing Contextual Bandits

Abstract: Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, and so on. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual …

Cited by 11 publications (16 citation statements)
References 38 publications
“…In this section, we prove that the relative bounds of GENSPEC are more efficient than SEA bounds [11], when the covariance between the reward estimates of two models is positive:…”
Section: B Efficiency of Relative Bounding
confidence: 99%
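One way to read the efficiency claim above is through a standard variance identity. Writing \hat{R}_A and \hat{R}_B for the two models' reward estimates (our shorthand, not the cited papers' notation), a bound placed directly on their difference scales with

```latex
\operatorname{Var}(\hat{R}_A - \hat{R}_B)
  = \operatorname{Var}(\hat{R}_A) + \operatorname{Var}(\hat{R}_B)
  - 2\,\operatorname{Cov}(\hat{R}_A, \hat{R}_B)
```

so a positive covariance between the two estimates tightens a relative bound, whereas two separate per-model bounds, as used for the comparison in SEA, cannot exploit that term.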
“…Even though it is known that deployment should be avoided in such cases, to the best of our knowledge, there exists no theoretically principled method for detecting when it is safe to deploy a tabular model. The only existing method that safely chooses between models appears to be the Safe Exploration Algorithm (SEA) [11], which applies high-confidence bounds to the performance of a safe logging policy model and a newly learned ranking model. If these bounds do not overlap, SEA can conclude with high confidence that one model outperforms the other.…”
Section: Related Work
confidence: 99%
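As a rough illustration of the non-overlap test described in the statement above (not the paper's exact procedure), the following sketch builds a Hoeffding interval around each model's estimated reward and only declares the new model safe to deploy when its lower bound clears the logging policy's upper bound. The function names and the choice of Hoeffding's inequality are our assumptions.

```python
import numpy as np

def hoeffding_bound(rewards, delta=0.05, reward_range=1.0):
    """Two-sided Hoeffding confidence interval for a mean-reward estimate.

    Generic sketch of a high-confidence bound; the cited method may use a
    different concentration inequality.
    """
    n = len(rewards)
    mean = float(np.mean(rewards))
    radius = reward_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return mean - radius, mean + radius

def safe_to_deploy(logging_rewards, new_model_rewards, delta=0.05):
    """Return True only if the new model's lower bound exceeds the logging
    policy's upper bound, i.e. the intervals are disjoint in the new
    model's favour."""
    _, logging_upper = hoeffding_bound(logging_rewards, delta)
    new_lower, _ = hoeffding_bound(new_model_rewards, delta)
    return new_lower > logging_upper
```

Any other valid high-confidence bound could be substituted; the decision rule only requires that the two intervals do not overlap.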
“…Hence, in order to reduce variance and speed up learning, we simplify the MDP to Contextual Bandits [19,1,17] by setting γ = 0. This setting makes REINFORCE choose a_t so as to maximize only the expectation of the immediate reward R(s_t, a_t):…”
Section: Learning With Policy Gradient
confidence: 99%
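Setting γ = 0 collapses the REINFORCE return to the immediate reward, so each update reduces to a contextual-bandit policy-gradient step: the reward times the score function of the chosen action. A minimal sketch, assuming a softmax policy over per-action context features; the parameterisation and function name are illustrative and not taken from the cited paper.

```python
import numpy as np

def reinforce_bandit_update(theta, features, action, reward, lr=0.01):
    """One REINFORCE step with gamma = 0: the return is just the immediate
    reward, so the update is reward * grad log pi(action | context).

    features: array of shape (num_actions, dim) with context features per
    candidate action; theta: policy parameters of shape (dim,).
    """
    # Softmax policy over the candidate actions.
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log softmax probability of the chosen action w.r.t. theta:
    # phi(action) minus the probability-weighted average feature vector.
    grad_log_pi = features[action] - probs @ features
    # Gradient ascent on the expected immediate reward.
    return theta + lr * reward * grad_log_pi
```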