Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining 2021
DOI: 10.1145/3447548.3467456
Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Cited by 9 publications (14 citation statements) · References 11 publications
“…Our policy learning algorithm follows a two-step procedure: 1) choose a policy value estimator Q̂(π) for any fixed π ∈ Π using the collected data; 2) output the policy that achieves the maximum of the estimated value in Π: π̂ = argmax_{π∈Π} Q̂(π). More specifically, the estimator we use is a variant of the family of generalized augmented inverse propensity weighting (GAIPW) estimators considered in Luedtke and Van Der Laan (2016), Hadad et al. (2019), and Zhan et al. (2021), which takes the following form:…”
Section: Our Contributions and Related Work
confidence: 99%
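The two-step procedure quoted above (estimate a value for each candidate policy, then pick the argmax) can be sketched in code. This is a minimal illustration only, assuming a finite candidate policy class and logged bandit data with contexts X, actions A, rewards R, logging propensities E, and a fitted outcome model mu_hat; all function and variable names are hypothetical and not taken from the cited papers.

```python
import numpy as np

def gaipw_value(pi, X, A, R, E, mu_hat, n_actions):
    """AIPW-style value estimate of policy pi from logged bandit data.

    pi(x, a)     -> probability that the evaluated policy plays action a in context x
    E[t]         -> logging propensity of the observed action A[t]
    mu_hat(x, a) -> fitted outcome model (the augmentation term)
    """
    n = len(R)
    scores = np.empty(n)
    for t in range(n):
        # Direct-method term: model-based value of pi in context X[t].
        dm = sum(pi(X[t], a) * mu_hat(X[t], a) for a in range(n_actions))
        # Inverse-propensity correction on the observed action.
        ipw = pi(X[t], A[t]) / E[t] * (R[t] - mu_hat(X[t], A[t]))
        scores[t] = dm + ipw
    # Uniform averaging; the generalized (GAIPW) family replaces this with
    # non-uniform weights, as in the second sketch below.
    return scores.mean()

def learn_policy(policy_class, X, A, R, E, mu_hat, n_actions):
    """Step 2: return the candidate policy with the largest estimated value."""
    return max(policy_class,
               key=lambda pi: gaipw_value(pi, X, A, R, E, mu_hat, n_actions))
```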
“…Depending on whether g_t is known or not, h_t would be chosen differently (see the discussion in Section 4.3). These specific choices of h_t are simple and differ from the variants adopted in Luedtke and Van Der Laan (2016), Hadad et al. (2019), and Zhan et al. (2021), which are concerned with devising h_t to achieve asymptotic normality for inference, whereas we aim for finite-sample regret bounds here. Note that when α = 0, we again recover the minimax optimal regret bound for policy learning under i.i.d. data collection established in Zhou et al. (2018).…”
Section: Our Contributions and Related Work
confidence: 99%
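As a companion to the sketch above, the role of the weights h_t mentioned in this excerpt can be illustrated by reweighting the per-observation AIPW scores before averaging. The weight rule shown is only an illustration, not the specific construction used in the quoted papers.

```python
import numpy as np

def weighted_gaipw(Gamma, h):
    """Weighted average of per-observation AIPW scores Gamma with weights h.

    Uniform h recovers the plain averaged estimator; non-uniform h gives a
    member of the generalized (GAIPW) family discussed in the quotes above.
    """
    h = np.asarray(h, dtype=float)
    Gamma = np.asarray(Gamma, dtype=float)
    return float(np.sum(h * Gamma) / np.sum(h))

# Illustrative weight choice (an assumption, not the papers' exact rule):
# down-weight observations whose logging propensity was small and hence noisy.
def stabilizing_weights(propensities):
    return np.sqrt(np.asarray(propensities, dtype=float))
```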