Inference for Batched Bandits

Zhang, Kelly W.; Janson, Lucas; Murphy, Susan A.

doi:10.48550/arxiv.2002.03217

Cited by 6 publications

(16 citation statements)

References 0 publications


“…Since the optimal policy is unknown, we estimate the optimal policy from the online data as $\pi_t$, to infer the value of interest. As commonly assumed in the current online inference literature (see e.g., Deshpande et al, 2018; Zhang et al, 2020; Chen et al, 2020) and the bandit literature (see e.g., Chu et al, 2011; Abbasi-Yadkori et al, 2011; Bubeck and Cesa-Bianchi, 2012; Zhou, 2015), we consider the conditional mean outcome function takes a linear form, i.e., $\mu(x, a) = x^\top \beta(a)$, where $\beta(\cdot)$ is a smooth function, which can be estimated via a ridge regression based on $H_{t-1}$ as…”

Section: Framework (mentioning)

confidence: 75%

“…as the number of pulls for action $a$, $y_{t-1}(a)$ is the $N_{t-1}(a) \times 1$ vector of the outcomes received under action $a$ at time $t-1$, and $\omega$ is the regularization term. There are two main reasons to choose the ridge estimator instead of the ordinary least square estimator that is considered in Deshpande et al (2018); Zhang et al (2020); Chen et al (2020). First, the ridge estimator is well defined when $D_{t-1}(a)^\top D_{t-1}(a)$ is singular, and its bias is negligible when the time step is large.…”

Section: Framework (mentioning)

confidence: 99%
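
The point made in the statement above, that a ridge estimator stays well defined when the Gram matrix is singular, can be illustrated with a minimal sketch. The quoted formula is truncated, so the textbook ridge form used here is an assumption, not the citing paper's exact estimator. The function name `ridge_estimate`, the toy data, and the value of `omega` are hypothetical.

```python
import numpy as np

def ridge_estimate(D, y, omega=1.0):
    """Textbook ridge estimate for one action (assumed form, not from the source).

    D     : (N, d) design matrix of contexts observed under the action
    y     : (N,) outcomes received under the action
    omega : positive regularization term (illustrative value)
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    d = D.shape[1]
    # Adding omega * I makes the system solvable even if D^T D is singular.
    gram = D.T @ D + omega * np.eye(d)
    return np.linalg.solve(gram, D.T @ y)

# Toy data: two identical columns, so D^T D has rank 1 and OLS is not unique.
D = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
beta_hat = ridge_estimate(D, y, omega=0.1)
```

With a small `omega` the fitted values `D @ beta_hat` stay close to `y`, which matches the statement's remark that the bias of the ridge estimator becomes negligible as data accumulate.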

“…Assumption 3.1 requires the bandit algorithm to explore all actions sufficiently such that the asymptotic properties for the online conditional mean estimator under different actions hold (see e.g., Deshpande et al, 2018; Hadad et al, 2019; Zhang et al, 2020).…”

Section: Inference For Online Policy Optimization (mentioning)

confidence: 99%

“…The parameter $p_t$ defined in assumption 3.1 characterizes the boundary of the probability of taking one action, known as the clipping rate (Zhang et al, 2020). We establish the relationship between $p_t$ and $\kappa_t$ in the next section and discuss when assumption 3.1 cannot hold.…”

Section: Inference For Online Policy Optimization (mentioning)

confidence: 99%
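
As a rough illustration of what a clipping rate does, the sketch below bounds an action-selection probability away from 0 and 1. It assumes a two-armed setting in which the probability of one arm is kept inside $[p_t, 1 - p_t]$. That interval, the function name `clip_action_prob`, and the example values are assumptions made for illustration and are not taken from the quoted text.

```python
import numpy as np

def clip_action_prob(prob, p_t):
    """Clip the probability of selecting one arm to [p_t, 1 - p_t].

    Assumes two arms, so the other arm is selected with probability 1 - result.
    """
    if not 0.0 < p_t <= 0.5:
        raise ValueError("p_t must lie in (0, 0.5] for a two-armed bandit")
    return float(np.clip(prob, p_t, 1.0 - p_t))

# A nearly greedy algorithm would pick arm 1 with probability 0.99;
# clipping with p_t = 0.1 keeps at least 10% exploration of the other arm.
clipped = clip_action_prob(0.99, p_t=0.1)
```

Keeping both probabilities at least `p_t` guarantees that every arm continues to be sampled, which is the kind of sufficient exploration the statements above refer to.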

“…The second challenge lies in estimating the mean outcome under the optimal policy online. Though there are numerous methods proposed recently to assess the online sample mean for a fixed action (see e.g., Nie et al, 2018; Neel and Roth, 2018; Deshpande et al, 2018; Shin et al, 2019a,b; Waisman et al, 2019; Hadad et al, 2019; Zhang et al, 2020), we note none of these methods are directly applicable to our problem since the sample mean only provides the impact of one particular arm, not the value of the optimal policy in bandits that considers the dynamics of the online environment. For instance, in the contextual bandits, we aim to select an action for each subject based on its context/feature to optimize the overall outcome of interest.…”

Section: Introduction (mentioning)

confidence: 99%


Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning

Cai, Song (2021, preprint)


Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.


Optimal Off-Policy Evaluation from Multiple Logging Policies

Kallus, Saito, Uehara (2020, preprint)


We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent q-estimates. To guard against misspecification of q-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate the benefits of our methods' efficiently leveraging of the stratified sampling of off-policy data from multiple loggers.


From Finite to Countable-Armed Bandits

Kalvit, Zeevi (2021, preprint)


We consider a stochastic bandit problem with countably many arms that belong to a finite set of types, each characterized by a unique mean reward. In addition, there is a fixed distribution over types which sets the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but perfectly knows the total number of types occurring in the population of arms. We propose a fully adaptive online learning algorithm that achieves $O(\log n)$ distribution-dependent expected cumulative regret after any number of plays $n$, and show that this order of regret is best possible. The analysis of our algorithm relies on newly discovered concentration and convergence properties of optimism-based policies like UCB in finite-armed bandit problems with zero gap, which may be of independent interest.
