2021
DOI: 10.48550/arxiv.2105.02344
Preprint

Policy Learning with Adaptively Collected Data

Abstract: Learning optimal policies from historical data enables the gains from personalization to be realized in a wide variety of applications. The growing policy learning literature focuses on a setting where the treatment assignment policy does not adapt to the data. However, adaptive data collection is becoming more common in practice, from two primary sources: 1) data collected from adaptive experiments that are designed to improve inferential efficiency; 2) data collected from production systems that are adaptive…

Cited by 4 publications (13 citation statements)
References 51 publications
“…Without careful causal methods, this can also lead to feedback loops. Recent work has explored building causal mechanisms into SDM algorithms [35,43,33,18,56,39]. But more work is needed to infer causal mechanisms in the face of challenges described above.…”
Section: Causal Inference
confidence: 99%
“…Kato [2021a,b] propose a doubly-robust estimator for off-policy evaluation with dependent samples. Zhan et al [2021] provide regret bounds for learning an optimal policy using adaptively collected data, where the probability of selecting an action is a function of past data. Zhang et al [2021a,b] study statistical inference for OLS and M-estimation with non-i.i.d.…”
Section: Related Work
confidence: 99%
“…Thus we opted for deriving a uniform concentration bound by modifying the classical uniform LLN proof. Zhan et al [2021] also derive a uniform LLN without requiring boundedness of the martingale difference terms, but with structural assumptions on the summands related to their specific application.…”
Section: A3 Proof of Theorem 1 (Regret of OMS-ETC)
confidence: 99%
“…Policy learning with adaptive data. Zhan et al [58] study policy learning from contextual-bandit data by optimizing a doubly robust policy value estimator stabilized by a deterministic lower bound on IS weights. They provide regret guarantees for this algorithm based on invoking the results of Rakhlin et al [45].…”
Section: Example 2 (Classification), in the Same Setting As
confidence: 99%
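To make the quoted technique concrete: a doubly robust policy value estimate combines an outcome model with an importance-sampling (IS) correction, and clipping the propensity from below by a deterministic constant keeps the IS weights bounded. The sketch below is illustrative only, with synthetic data and a hypothetical exact outcome model; it is not the estimator or the data-dependent weighting of Zhan et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logged bandit data (hypothetical setup): contexts x,
# binary actions a, rewards r, and the logging propensities
# recorded at collection time.
n = 1000
x = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-x))          # logging policy P(a=1|x)
a = rng.binomial(1, propensity)
r = a * x + rng.normal(scale=0.1, size=n)      # true mean reward: a*x

# Outcome model mu_hat(x, a): taken to be exact here for simplicity
# (an assumption); in practice it would be fit by regression.
def mu_hat(x, a):
    return a * x

# Deterministic target policy to evaluate: treat when x > 0.
pi = (x > 0).astype(float)

# Doubly robust estimate with IS weights stabilized by a
# deterministic lower bound eps on the propensity (clipping).
eps = 0.05
pscore_a = np.where(a == 1, propensity, 1.0 - propensity)
pi_a = np.where(a == 1, pi, 1.0 - pi)          # target prob. of observed a
w = pi_a / np.maximum(pscore_a, eps)           # bounded by 1/eps
dr = mu_hat(x, 1) * pi + mu_hat(x, 0) * (1 - pi) + w * (r - mu_hat(x, a))
print(round(float(dr.mean()), 3))
```

With a standard normal context, the true value of "treat when x > 0" is E[x · 1{x > 0}] ≈ 0.40, and the estimate should land near that; the clipping bounds every weight by 1/eps, trading a small bias for controlled variance.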
“…These are in general not comparable. Foster and Krishnamurthy [20] and Zhan et al [58] use sequential L∞ and Lp covering numbers, respectively, to obtain maximal inequalities. van de Geer [55, Chapter 8] gives guarantees for ERM over nonparametric classes of controlled sequential bracketing entropy.…”
Section: Example 2 (Classification), in the Same Setting As
confidence: 99%