Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining 2021
DOI: 10.1145/3447548.3467456
Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Cited by 9 publications (14 citation statements) · References 11 publications
“…Our policy learning algorithm follows a two-step procedure: 1) choose a policy value estimator Q̂(π) for any fixed π ∈ Π using the collected data; 2) output the policy that achieves the maximum of the estimated value in Π: π̂ = argmax_{π∈Π} Q̂(π). More specifically, the estimator we use is a variant of the family of generalized augmented inverse propensity weighting (GAIPW) estimators considered in Luedtke and Van Der Laan (2016), Hadad et al. (2019), and Zhan et al. (2021), which takes the following form:…”
Section: Our Contributions and Related Work
confidence: 99%
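The two-step procedure quoted above (estimate a value for each candidate policy, then pick the argmax) can be sketched in code. This is a minimal illustration only, assuming a finite candidate policy class and logged bandit data with contexts X, actions A, rewards R, logging propensities E, and a fitted outcome model mu_hat; all function and variable names are hypothetical and not taken from the cited papers.

```python
import numpy as np

def gaipw_value(pi, X, A, R, E, mu_hat, n_actions):
    """AIPW-style value estimate of policy pi from logged bandit data.

    pi(x, a)     -> probability that the evaluated policy plays action a in context x
    E[t]         -> logging propensity of the observed action A[t]
    mu_hat(x, a) -> fitted outcome model (the augmentation term)
    """
    n = len(R)
    scores = np.empty(n)
    for t in range(n):
        # Direct-method term: model-based value of pi in context X[t].
        dm = sum(pi(X[t], a) * mu_hat(X[t], a) for a in range(n_actions))
        # Inverse-propensity correction on the observed action.
        ipw = pi(X[t], A[t]) / E[t] * (R[t] - mu_hat(X[t], A[t]))
        scores[t] = dm + ipw
    # Uniform averaging; the generalized (GAIPW) family replaces this with
    # non-uniform weights, as in the second sketch below.
    return scores.mean()

def learn_policy(policy_class, X, A, R, E, mu_hat, n_actions):
    """Step 2: return the candidate policy with the largest estimated value."""
    return max(policy_class,
               key=lambda pi: gaipw_value(pi, X, A, R, E, mu_hat, n_actions))
```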
“…Depending on whether g_t is known or not, h_t would be chosen differently (see the discussion in Section 4.3). These specific choices of h_t are simple and differ from the variants adopted in Luedtke and Van Der Laan (2016), Hadad et al. (2019), and Zhan et al. (2021), which are concerned with devising h_t to achieve asymptotic normality for inference, whereas we aim for finite-sample regret bounds here. Note that when α = 0, we again recover the minimax optimal regret bound for policy learning under i.i.d. data collection established in Zhou et al. (2018).…”
Section: Our Contributions and Related Work
confidence: 99%
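As a companion to the sketch above, the role of the weights h_t mentioned in this excerpt can be illustrated by reweighting the per-observation AIPW scores before averaging. The weight rule shown is only an illustration, not the specific construction used in the quoted papers.

```python
import numpy as np

def weighted_gaipw(Gamma, h):
    """Weighted average of per-observation AIPW scores Gamma with weights h.

    Uniform h recovers the plain averaged estimator; non-uniform h gives a
    member of the generalized (GAIPW) family discussed in the quotes above.
    """
    h = np.asarray(h, dtype=float)
    Gamma = np.asarray(Gamma, dtype=float)
    return float(np.sum(h * Gamma) / np.sum(h))

# Illustrative weight choice (an assumption, not the papers' exact rule):
# down-weight observations whose logging propensity was small and hence noisy.
def stabilizing_weights(propensities):
    return np.sqrt(np.asarray(propensities, dtype=float))
```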