2019
DOI: 10.48550/arxiv.1905.09751
Preprint

Learning When-to-Treat Policies

Abstract: Many applied decision-making problems have a dynamic component: The policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may choose between postponing treatment (watchful waiting) and prescribing one of several available treatments during the many visits from a patient. We develop an "advantage doubly robust" estimator for learning such dynamic treatment rules using observational data under the assumption of sequential ignorability. We prove …
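
The decision structure described in the abstract can be made concrete with a small toy example. The sketch below is purely illustrative and is not the paper's "advantage doubly robust" method: the severity covariate, the 0.5/0.8 thresholds, and the two treatment names are hypothetical, chosen only to show a rule that postpones treatment (watchful waiting) until a later visit.

```python
# Toy sketch of a when-to-treat rule: at each visit, either keep waiting or start
# one of several treatments. All quantities here (severity covariate, thresholds,
# treatment names) are hypothetical illustrations, not the paper's estimator.
from dataclasses import dataclass
from typing import Optional

WAIT = "watchful_waiting"
TREATMENTS = ["treatment_A", "treatment_B"]

@dataclass
class VisitState:
    visit: int
    severity: float          # covariate observed at this visit (illustrative)
    started: Optional[str]   # treatment already begun at an earlier visit, if any

def when_to_treat_policy(state: VisitState) -> str:
    """Toy dynamic rule: wait while severity is low, otherwise start a treatment."""
    if state.started is not None:
        return state.started                      # continue the treatment already begun
    if state.severity < 0.5:
        return WAIT                               # postpone treatment (watchful waiting)
    return TREATMENTS[0] if state.severity < 0.8 else TREATMENTS[1]

# Apply the rule over a short sequence of visits.
started = None
for t, severity in enumerate([0.2, 0.4, 0.7]):
    action = when_to_treat_policy(VisitState(visit=t, severity=severity, started=started))
    if action != WAIT and started is None:
        started = action
    print(f"visit {t}: severity={severity:.1f} -> {action}")
```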

Cited by 8 publications (12 citation statements)
References: 79 publications (120 reference statements)

Citation statements (ordered by relevance):
“…There is an extensive body of work on off-policy policy evaluation and optimization under this assumption, including doubly robust methods (Jiang and Li, 2015; Thomas and Brunskill, 2016) and recent work that provides semiparametric efficiency bounds (Kallus and Uehara, 2019); often the behavior policy (the conditional distribution of decisions given states) is assumed to be known. Notably, Liu et al. (2018b) highlight how estimation error in the behavior policy can bias value estimates, and Nie et al. (2019) and Hanna et al. (2019) provide OPE estimators based on an estimated behavior policy. When sequential ignorability does not hold, the expected cumulative rewards under an evaluation policy cannot be identified from observable data.…”
Section: Related Work (mentioning)
confidence: 99%
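
The doubly robust construction referenced above is easiest to see in the single-decision (contextual-bandit) special case. The following sketch is a minimal illustration under sequential ignorability, not the estimator from any of the cited papers: the simulated data, the outcome model q_hat, and the deliberately misspecified estimated behavior policy behavior_hat are assumptions made for the example. It shows the general shape of a doubly robust value estimate, in which an importance-weighted residual corrects the outcome model and errors in the estimated propensities enter through the importance weights.

```python
# Minimal sketch of a doubly robust (DR) off-policy value estimate in the
# single-decision (contextual-bandit) case, assuming sequential ignorability.
# q_hat and behavior_hat stand in for fitted outcome and behavior-policy models;
# their functional forms are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                            # observed context
p_treat = 1.0 / (1.0 + np.exp(-x))                # true (unknown) behavior policy P(a=1 | x)
a = rng.binomial(1, p_treat)                      # logged action
r = 0.5 * x * a + rng.normal(scale=0.1, size=n)   # logged reward

def pi_eval(x):                                   # evaluation policy: treat when x > 0
    return (x > 0).astype(int)

def q_hat(x, a):                                  # outcome model (illustrative)
    return 0.5 * x * a

def behavior_hat(x):                              # estimated propensity of a=1 (slightly misspecified)
    return 1.0 / (1.0 + np.exp(-0.9 * x))

a_eval = pi_eval(x)
prop = np.where(a == 1, behavior_hat(x), 1.0 - behavior_hat(x))
match = (a == a_eval).astype(float)

# DR estimate: outcome-model prediction under pi_eval plus an importance-weighted residual.
v_dr = np.mean(q_hat(x, a_eval) + match / prop * (r - q_hat(x, a)))
print(f"DR estimate of V(pi_eval): {v_dr:.3f}")
```

Because the weight divides by the estimated propensity, error in behavior_hat feeds directly into the estimate, which is the bias concern the quoted statement attributes to Liu et al. (2018b).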
“…Off-policy evaluation (OPE) has been studied extensively across a range of different domains, from healthcare (Thapa et al., 2005; Raghu et al., 2018; Nie et al., 2019) to recommender systems (Li et al., 2010; Dudík et al., 2014) and robotics (Kalashnikov et al., 2018). While a full survey of OPE methods is outside the scope of this article, broadly speaking we can categorize OPE methods into groups based on the use of importance sampling (Precup, 2000), value functions (Sutton et al., 2009; Migliavacca et al., 2010; Sutton et al., 2016), and learned transition models (Paduraru, 2007), though a number of methods combine two or more of these components (Jiang & Li, 2015; Thomas & Brunskill, 2016; Munos et al., 2016).…”
Section: Related Work (mentioning)
confidence: 99%
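
Of the categories listed in the statement above, importance sampling is the simplest to write down. The sketch below is a toy trajectory-level importance-sampling estimator in the spirit of Precup (2000); the policies, trajectory format, and discount factor are assumptions made for the illustration.

```python
# Toy ordinary importance-sampling (IS) estimator for off-policy evaluation:
# weight each trajectory's discounted return by the product of per-step
# probability ratios pi_eval / pi_behavior. All inputs here are illustrative.
import numpy as np

def is_estimate(trajectories, pi_eval, pi_behavior, gamma=0.95):
    values = []
    for traj in trajectories:                     # traj: list of (state, action, reward)
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_eval(a, s) / pi_behavior(a, s)
            ret += (gamma ** t) * r
        values.append(weight * ret)
    return float(np.mean(values))

# Toy usage: two actions, both policies ignore the state.
pi_b = lambda a, s: 0.5                           # uniform behavior policy
pi_e = lambda a, s: 0.8 if a == 1 else 0.2        # evaluation policy favors action 1
rng = np.random.default_rng(1)
trajs = [[(0, int(rng.random() < 0.5), rng.normal(loc=1.0)) for _ in range(5)]
         for _ in range(1000)]
print("IS estimate:", is_estimate(trajs, pi_e, pi_b))
```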
“…The goal of this paper is to provide a standardized benchmark for evaluating OPE methods. Although considerable theoretical (Thomas & Brunskill, 2016; Swaminathan & Joachims, 2015; Jiang & Li, 2015; Wang et al., 2017) and practical progress (Gilotte et al., 2018; Nie et al., 2019; Kalashnikov et al., 2018) on OPE algorithms has been made in a range of different domains, there are few broadly accepted evaluation tasks that combine the complex, high-dimensional problems commonly explored by modern deep reinforcement learning algorithms (Bellemare et al., 2013; Brockman et al., 2016) with standardized evaluation protocols and metrics. Our goal is to provide a set of tasks with a range of difficulty, exercise a variety of design properties, and provide policies with different behavioral patterns in order to establish a standardized framework for comparing OPE algorithms.…”
Section: Introduction (mentioning)
confidence: 99%
“…OPE has been studied extensively across many domains [37, 63, 27, 47]. Generally, OPE methods can be grouped into methods that use importance sampling [49] or stationary state distributions [39, 46, 64], value-function methods [60, 42, 61], and learned transition models [69], as well as methods that combine two or more of these approaches [45, 13, 26, 16].…”
Section: Related Work (mentioning)
confidence: 99%
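
The "combine two or more approaches" category mentioned above typically pairs a fitted value function with importance weights. The sketch below is an illustrative backward recursion in the spirit of the sequential doubly robust estimator of Jiang & Li (2015); the trajectory format and the q_hat / v_hat stand-ins are assumptions, not any cited paper's implementation.

```python
# Illustrative sequential doubly robust recursion: a model-based baseline v_hat(s)
# corrected at every step by an importance-weighted TD-style residual.
def dr_trajectory_estimate(traj, pi_eval, pi_behavior, q_hat, v_hat, gamma=0.95):
    estimate = 0.0                                # DR estimate of the return from the following step onward
    for s, a, r in reversed(traj):                # traj: list of (state, action, reward)
        rho = pi_eval(a, s) / pi_behavior(a, s)   # per-step importance ratio
        estimate = v_hat(s) + rho * (r + gamma * estimate - q_hat(s, a))
    return estimate

# Toy usage with constant models and matching uniform policies (purely illustrative):
traj = [(0, 1, 1.0), (1, 0, 0.5), (2, 1, 1.5)]
print(dr_trajectory_estimate(traj,
                             pi_eval=lambda a, s: 0.5, pi_behavior=lambda a, s: 0.5,
                             q_hat=lambda s, a: 1.0, v_hat=lambda s: 1.0))
```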