2018
DOI: 10.48550/arxiv.1805.09044
Preprint

Representation Balancing MDPs for Off-Policy Policy Evaluation

Cited by 1 publication (2 citation statements) | References 0 publications

“…While inverse propensity weighting is a simple and transparent approach to estimating V^π, it has several limitations. In observational studies treatment probabilities need to be estimated from data, and it is known that the variant of (6) with estimated weights γ_t^{(i)}(π) can perform poorly with even mild estimation error (see, e.g., Liu et al., 2018b). Furthermore, for any policy π considered, the IPW value estimator only uses trajectories that match the policy π exactly, which can make policy learning sample-inefficient.…”
Section: Existing Methods
confidence: 99%
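
To make the estimator this excerpt criticizes concrete, below is a minimal sketch of a step-wise importance-weighted (IPW) value estimate for a target policy, in which the running product of probability ratios plays the role of the weights γ_t^{(i)}(π). The callables `target_policy_prob` and `behavior_policy_prob` and the trajectory layout are illustrative assumptions, not an interface from the cited paper.

```python
import numpy as np

def ipw_value_estimate(trajectories, target_policy_prob, behavior_policy_prob):
    """Step-wise importance-weighted estimate of the target policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples
                  collected under the behavior policy.
    target_policy_prob(state, action) / behavior_policy_prob(state, action):
        probabilities pi(a|s) and mu(a|s) under the target and behavior policies.
    """
    per_trajectory_returns = []
    for traj in trajectories:
        weight = 1.0            # cumulative ratio: prod_{s<=t} pi(a_s|x_s) / mu(a_s|x_s)
        weighted_return = 0.0
        for state, action, reward in traj:
            weight *= target_policy_prob(state, action) / behavior_policy_prob(state, action)
            weighted_return += weight * reward
        per_trajectory_returns.append(weighted_return)
    return float(np.mean(per_trajectory_returns))
```

Both criticisms in the excerpt are visible in this sketch: in observational data the behavior probabilities in the denominator must themselves be estimated, and a trajectory contributes little (or, for a deterministic target policy, nothing) whenever its actions disagree with π.
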
“…Considerable progress has been made in learning good models for the value functions and combining them with propensity models in doubly robust forms. In reinforcement learning, there has been extensive work focused on learning good models (Farajtabar et al., 2018; Hanna et al., 2017; Liu et al., 2018b). Guo et al. (2017) focus on reducing the mean-squared error in policy evaluation in long-horizon settings.…”
Section: Related Work
confidence: 99%
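
The "doubly robust forms" mentioned in this excerpt can be sketched in the same style as the IPW estimate above: a fitted outcome model supplies a baseline value, and the importance weights only correct its residual one-step errors. The helpers `q_model` and `v_model` stand for hypothetical fitted estimates of Q^π and V^π; this is an illustrative finite-horizon, undiscounted variant, not the specific construction of any paper cited here.

```python
def doubly_robust_value_estimate(trajectories, target_policy_prob, behavior_policy_prob,
                                 q_model, v_model):
    """Doubly robust estimate: model-based baseline plus importance-weighted corrections.

    q_model(state, action) and v_model(state) are assumed fitted estimates of Q^pi and V^pi.
    """
    estimates = []
    for traj in trajectories:
        first_state = traj[0][0]
        estimate = v_model(first_state)   # model-based baseline for V^pi(s_0)
        weight = 1.0
        for t, (state, action, reward) in enumerate(traj):
            weight *= target_policy_prob(state, action) / behavior_policy_prob(state, action)
            next_value = v_model(traj[t + 1][0]) if t + 1 < len(traj) else 0.0
            # The correction term is zero in expectation when q_model is exact;
            # otherwise the importance weights compensate for its error.
            estimate += weight * (reward + next_value - q_model(state, action))
        estimates.append(estimate)
    return sum(estimates) / len(estimates)
```

The appeal of this combination, as the excerpt notes, is that the estimate remains consistent if either the outcome model or the propensity (behavior-policy) model is accurate.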