2015
DOI: 10.1609/aaai.v29i1.9541

High-Confidence Off-Policy Evaluation

Abstract: Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidence…
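
The abstract describes estimating a new policy's expected return from trajectories generated by other (behavior) policies. A minimal sketch of per-trajectory ordinary importance sampling, the basic building block behind such off-policy estimators, is given below in Python; the function and argument names are illustrative assumptions rather than details from the paper.

import numpy as np

def per_trajectory_is_estimates(trajectories, pi_e, pi_b):
    # Per-trajectory ordinary importance sampling estimates of the
    # evaluation policy's return, computed from behavior-policy data.
    # Each trajectory is a list of (state, action, reward) tuples;
    # pi_e(a, s) and pi_b(a, s) return action probabilities.
    # All names here are illustrative, not taken from the paper.
    estimates = []
    for traj in trajectories:
        weight = 1.0   # cumulative likelihood ratio over the trajectory
        ret = 0.0      # undiscounted return, for simplicity
        for (s, a, r) in traj:
            weight *= pi_e(a, s) / pi_b(a, s)
            ret += r
        estimates.append(weight * ret)
    return np.asarray(estimates)

The mean of these per-trajectory estimates is the ordinary importance sampling point estimate; the paper's focus is on turning such estimates into a high-confidence lower bound on the evaluation policy's performance.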

Cited by 83 publications (82 citation statements); references 12 publications.
“…We use Monte Carlo rollouts to estimate V with MB. We also show results for importance sampling (IS) BCa-bootstrap methods from Thomas et al.…”
Section: Results (mentioning; confidence: 99%)
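
The BCa bootstrap mentioned in this excerpt can be sketched with SciPy's bootstrap routine applied to per-trajectory importance-sampled returns. The placeholder data, the 90% level, and the one-sided reading of the lower endpoint are illustrative assumptions, not details taken from the cited work.

import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
is_returns = rng.uniform(0.0, 1.0, size=200)   # placeholder IS return estimates

# BCa bootstrap interval for the mean IS return; the lower endpoint is
# read as an approximate lower bound on the evaluation policy's value.
res = bootstrap((is_returns,), np.mean, confidence_level=0.90, method='BCa')
print("approximate lower bound:", res.confidence_interval.low)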
“…The method is equally applicable to upper bounds and two-sided intervals. A similar method using weighted importance sampling was proposed by Thomas et al. (2015).…”
Section: Bootstrapping Policy Lower Bounds (mentioning; confidence: 99%)
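
For reference, the weighted importance sampling (WIS) estimator referred to above normalizes the per-trajectory importance weights instead of averaging the weighted returns directly. In standard notation (not specific to the cited paper):

\[
  w_i \;=\; \prod_{t=0}^{T_i-1} \frac{\pi_e(a^i_t \mid s^i_t)}{\pi_b(a^i_t \mid s^i_t)},
  \qquad
  \widehat{V}_{\mathrm{WIS}}(\pi_e) \;=\; \frac{\sum_{i=1}^{n} w_i\, G_i}{\sum_{i=1}^{n} w_i},
\]

where \(G_i\) is the return of the \(i\)-th behavior-policy trajectory. A bootstrap lower bound resamples trajectories with replacement, recomputes \(\widehat{V}_{\mathrm{WIS}}\) on each resample, and takes a low percentile of the resulting distribution.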
“…Additional work on safety in MDPs has focused on obtaining high-confidence bounds on the performance of a policy before that policy is deployed (Thomas, Theocharous, and Ghavamzadeh 2015b; Hanna, Stone, and Niekum 2017), as well as methods for high-confidence policy improvement (Thomas, Theocharous, and Ghavamzadeh 2015a). Our work draws inspiration from these previous approaches; however, we provide bounds on policy performance that are applicable when learning from demonstrations, i.e., when the rewards are not observed.…”
Section: Related Work (mentioning; confidence: 99%)
“…The evaluation policy (target distribution) might be a new treatment policy that is both dangerously worse than the behavior policy and quite different from the behavior policy. To determine whether the evaluation policy should be deployed, we might rely on high-confidence guarantees, as has been suggested for similar problems (Thomas et al. 2015a). That is, we might use Hoeffding's inequality to construct a high-confidence lower bound on the expected value of the WIS estimator, and then require this bound to be not far below the performance of the behavior policy.…”
Section: Should One Use US or WIS in Practice? (mentioning; confidence: 99%)
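
As a concrete illustration of the Hoeffding-style bound described in this excerpt, the sketch below computes a one-sided lower confidence bound on the mean of bounded per-trajectory estimates; the function name and the assumption that the estimates lie in [0, b] are illustrative.

import numpy as np

def hoeffding_lower_bound(samples, b, delta=0.05):
    # One-sided (1 - delta)-confidence lower bound on the mean of
    # i.i.d. samples bounded in [0, b], from Hoeffding's inequality.
    # Illustrative sketch, not the exact procedure of any cited paper.
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    return samples.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

Such a bound concerns the expected value of whatever estimator is fed in; as the next excerpt points out, for the biased WIS estimator that expectation can sit near the behavior policy's performance rather than the evaluation policy's.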
“…This means that the lower bound that we compute will be a lower bound on the performance of the decent behavior policy, rather than the true poor performance of the evaluation policy. Moreover, if one uses Student's t-test or a bootstrap method to construct the confidence interval, as has been suggested when using WIS (Thomas et al. 2015b), we might obtain a very tight confidence interval around the performance of the behavior policy. This exemplifies the problem with using WIS for high-risk problems: the bias of the WIS estimator can cause us to often erroneously conclude that dangerous policies are safe to deploy.…”
Section: Should One Use US or WIS in Practice? (mentioning; confidence: 99%)
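
For completeness, a one-sided Student's t lower bound of the kind this excerpt warns about can be sketched as follows; the confidence level and names are illustrative assumptions.

import numpy as np
from scipy import stats

def t_lower_bound(samples, delta=0.05):
    # One-sided (1 - delta)-confidence lower bound on the mean using
    # Student's t distribution. The over-tightness described above
    # comes from the bias of the WIS estimates passed in, not from
    # this formula itself.
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(n)        # standard error of the mean
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)   # one-sided critical value
    return mean - t_crit * sem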