2018
DOI: 10.48550/arxiv.1805.09044
Preprint

Representation Balancing MDPs for Off-Policy Policy Evaluation

Cited by 1 publication (2 citation statements) | References 0 publications

“…While inverse propensity weighting is a simple and transparent approach to estimating V^π, it has several limitations. In observational studies treatment probabilities need to be estimated from data, and it is known that the variant of (6) with estimated weights γ_t^{(i)}(π) can perform poorly with even mild estimation error (see, e.g., Liu et al., 2018b). Furthermore, for any policy π considered, the IPW value estimator only uses trajectories that match the policy π exactly, which can make policy learning sample-inefficient.…”
Section: Existing Methods
confidence: 99%
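
To make the estimator this excerpt criticizes concrete, below is a minimal sketch of a step-wise importance-weighted (IPW) value estimate for a target policy, in which the running product of probability ratios plays the role of the weights γ_t^{(i)}(π). The callables `target_policy_prob` and `behavior_policy_prob` and the trajectory layout are illustrative assumptions, not an interface from the cited paper.

```python
import numpy as np

def ipw_value_estimate(trajectories, target_policy_prob, behavior_policy_prob):
    """Step-wise importance-weighted estimate of the target policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
                  collected under the behavior policy.
    target_policy_prob(state, action) / behavior_policy_prob(state, action):
        probabilities pi(a|s) and mu(a|s) under the target and behavior policies.
    """
    per_trajectory_returns = []
    for traj in trajectories:
        weight = 1.0            # cumulative ratio: prod_{s<=t} pi(a_s|x_s) / mu(a_s|x_s)
        weighted_return = 0.0
        for state, action, reward in traj:
            weight *= target_policy_prob(state, action) / behavior_policy_prob(state, action)
            weighted_return += weight * reward
        per_trajectory_returns.append(weighted_return)
    return float(np.mean(per_trajectory_returns))
```

Both criticisms in the excerpt are visible in this sketch: in observational data the behavior probabilities in the denominator must themselves be estimated, and a trajectory contributes little (or, for a deterministic target policy, nothing) whenever its actions disagree with π.
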
“…Considerable progress has been made in learning good models for the value functions and combining them with propensity models in doubly robust forms. In reinforcement learning, there has been extensive work focused on learning good models (Farajtabar et al., 2018; Hanna et al., 2017; Liu et al., 2018b). Guo et al. (2017) focus on reducing the mean-squared error in policy evaluation in long-horizon settings.…”
Section: Related Work
confidence: 99%
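
The "doubly robust forms" mentioned in this excerpt can be sketched in the same style as the IPW estimate above: a fitted outcome model supplies a baseline value, and the importance weights only correct its residual one-step errors. The helpers `q_model` and `v_model` stand for hypothetical fitted estimates of Q^π and V^π; this is an illustrative finite-horizon, undiscounted variant, not the specific construction of any paper cited here.

```python
def doubly_robust_value_estimate(trajectories, target_policy_prob, behavior_policy_prob,
                                 q_model, v_model):
    """Doubly robust estimate: model-based baseline plus importance-weighted corrections.

    q_model(state, action) and v_model(state) are assumed fitted estimates of Q^pi and V^pi.
    """
    estimates = []
    for traj in trajectories:
        first_state = traj[0][0]
        estimate = v_model(first_state)   # model-based baseline for V^pi(s_0)
        weight = 1.0
        for t, (state, action, reward) in enumerate(traj):
            weight *= target_policy_prob(state, action) / behavior_policy_prob(state, action)
            next_value = v_model(traj[t + 1][0]) if t + 1 < len(traj) else 0.0
            # The correction term is zero in expectation when q_model is exact;
            # otherwise the importance weights compensate for its error.
            estimate += weight * (reward + next_value - q_model(state, action))
        estimates.append(estimate)
    return sum(estimates) / len(estimates)
```

The appeal of this combination, as the excerpt notes, is that the estimate remains consistent if either the outcome model or the propensity (behavior-policy) model is accurate.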