2021
DOI: 10.48550/arxiv.2106.10251
Preprint

Active Offline Policy Selection

Abstract: This paper addresses the problem of policy selection in domains with abundant logged data but a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and healthcare domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and the full online evalu…

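The OPE techniques mentioned in the abstract are not detailed on this page. As a rough illustration only, the sketch below shows one classic OPE approach, per-trajectory importance sampling, under assumed data structures: the `logged_trajectories` format and the `target_policy_prob` callable are hypothetical placeholders, not from the paper.

```python
# Minimal sketch of a classic OPE estimator: per-trajectory importance sampling.
# Assumptions (not from the paper): each logged trajectory is a list of
# (state, action, reward, behavior_prob) tuples, and target_policy_prob(state, action)
# returns the target policy's probability of taking that action in that state.

import numpy as np

def importance_sampling_ope(logged_trajectories, target_policy_prob, gamma=0.99):
    """Estimate the target policy's value using only logged data."""
    estimates = []
    for trajectory in logged_trajectories:
        weight = 1.0            # cumulative importance ratio pi(a|s) / mu(a|s)
        discounted_return = 0.0
        for t, (state, action, reward, behavior_prob) in enumerate(trajectory):
            weight *= target_policy_prob(state, action) / behavior_prob
            discounted_return += (gamma ** t) * reward
        estimates.append(weight * discounted_return)
    # The mean of the weighted returns is an unbiased (but high-variance) estimate
    # of the target policy's value; this variance is part of the gap between OPE
    # and full online evaluation that the paper discusses.
    return float(np.mean(estimates))
```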
Cited by 1 publication (1 citation statement)
References 31 publications
“…This form of evaluation is closer to a worst-case analysis than the typical average performance reporting. In this case, the conservative estimation is important, since there exists no equivalent to early stopping from supervised learning in the offline RL setting: offline policy evaluation and selection are still open problems (Hans et al., 2011; Paine et al., 2020; Konyushkova et al., 2021; Zhang et al., 2021; Fu et al., 2021), so we need to stop the policy training at some random point and deploy that policy on the real system. The above procedure is meant to simulate this random stopping, and should quantify how well the algorithm performs, at a minimum.…”
Section: Discussion
confidence: 99%
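The random-stopping procedure described in the quoted discussion could be simulated along the following lines. This is a hedged sketch of the general idea, not the cited authors' code: `checkpoints` and `evaluate_return` are hypothetical placeholders for a list of saved policies and a deployment-evaluation routine.

```python
# Sketch of the random-stopping evaluation quoted above: because offline RL has no
# early-stopping signal, pick training checkpoints at random, "deploy" each one,
# and summarize the resulting returns conservatively (worst case and median)
# rather than by the mean.
# `checkpoints` and `evaluate_return` are hypothetical placeholders.

import random

def random_stopping_evaluation(checkpoints, evaluate_return, n_trials=10, seed=0):
    rng = random.Random(seed)
    returns = sorted(evaluate_return(rng.choice(checkpoints)) for _ in range(n_trials))
    # Reporting the worst observed return quantifies how well the algorithm
    # performs at a minimum, as in the quoted discussion.
    return {"worst_return": returns[0], "median_return": returns[len(returns) // 2]}
```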