2020
DOI: 10.48550/arxiv.2006.03647
Preprint

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima,
Hiroki Furuta,
Yutaka Matsuo
et al.

Abstract: Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a …
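Even in truncated form, the abstract pins down the deployment-efficient setting: policy updates are cheap, but switching to a new data-collection policy is allowed only a handful of times. A minimal sketch of that constraint is given below; the helper names (make_env, init_policy, collect_batch, offline_update) are illustrative placeholders, not the paper's API.

```python
# Minimal sketch of the deployment-constrained setting described in the abstract.
# Assumption: `collect_batch` runs the current policy in the environment (the
# costly/risky step), while `offline_update` improves the policy from the fixed
# dataset. None of these names come from the paper.

NUM_DEPLOYMENTS = 5                  # online RL would effectively redeploy after every update
TRANSITIONS_PER_DEPLOYMENT = 200_000
UPDATES_PER_DEPLOYMENT = 10_000

def deployment_efficient_training(make_env, init_policy, collect_batch, offline_update):
    env = make_env()
    policy = init_policy()
    dataset = []
    for _ in range(NUM_DEPLOYMENTS):
        # Each iteration deploys the current policy exactly once for data collection.
        dataset += collect_batch(env, policy, TRANSITIONS_PER_DEPLOYMENT)
        # Between deployments, the policy is improved purely from the stored data.
        for _ in range(UPDATES_PER_DEPLOYMENT):
            policy = offline_update(policy, dataset)
    return policy
```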

Cited by 15 publications (17 citation statements) | References 26 publications
“…Figure 3 shows instability over the set of evaluations, which means the performance of the agent may depend on the specific stopping point chosen for evaluation. This calls into question the empirical effectiveness of offline RL for safety-critical real-world use cases [Mandel et al., 2014, Gottesman et al., 2018, Gauci et al., 2018, Jaques et al., 2019, Matsushima et al., 2020] as well as the current trend of reporting only the mean value of the final policy in offline benchmarking [Fu et al., 2021].…”
Section: Meta Analyses of RL Algorithms (mentioning)
confidence: 99%
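The instability this statement points to can be made concrete with a small evaluation helper that scores every stored checkpoint instead of only the final policy. This is an illustrative sketch; `checkpoints` and `evaluate` are hypothetical stand-ins for artifacts of an offline RL training run, not anything defined in the cited papers.

```python
import numpy as np

def summarize_checkpoints(checkpoints, evaluate, episodes_per_checkpoint=10):
    """Contrast final-policy reporting with performance across stopping points."""
    returns = np.array([
        np.mean([evaluate(policy) for _ in range(episodes_per_checkpoint)])
        for policy in checkpoints
    ])
    return {
        "final_mean": float(returns[-1]),          # what many offline benchmarks report
        "best_checkpoint": float(returns.max()),   # optimistic stopping point
        "worst_checkpoint": float(returns.min()),  # pessimistic stopping point
        "spread": float(returns.max() - returns.min()),  # instability across evaluations
    }
```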
“…While the use of model-based principles in OPE has been relatively rare, it has been more commonly used for policy optimization. The field of model-based RL has matured in recent years to yield impressive results for both online (Nagabandi et al., 2018; Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019) and offline (Matsushima et al., 2020; Kidambi et al., 2020; Yu et al., 2020; Argenson and Dulac-Arnold, 2020) policy optimization. Several of the techniques we employ, such as … For both families of dynamics models, we train 48 models with different hyperparameters.…”
Section: Related Work (mentioning)
confidence: 99%
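The quoted experimental detail (many dynamics models trained with different hyperparameters) amounts to a small grid-search ensemble. The sketch below shows one way to set that up; `train_dynamics_model` is an assumed user-supplied function that fits a model of p(s' | s, a) on the offline dataset, and the grid values are invented for illustration rather than taken from the cited work.

```python
from itertools import product

def train_dynamics_ensemble(dataset, train_dynamics_model):
    # Invented hyperparameter grid; the cited work's actual grid is not reproduced here.
    grid = {
        "hidden_size":   [200, 400],
        "num_layers":    [2, 3, 4],
        "learning_rate": [1e-3, 3e-4],
        "weight_decay":  [0.0, 1e-5],
    }
    ensemble = []
    for values in product(*grid.values()):
        hparams = dict(zip(grid.keys(), values))
        ensemble.append(train_dynamics_model(dataset, **hparams))
    return ensemble  # 2 * 3 * 2 * 2 = 24 models in this sketch
```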
“…The deployment efficiency measure [22] in RL counts the number of changes to the data-collection policy during learning; i.e., an offline RL setup corresponds to a single deployment allowed for learning. Reinforcement learning methods can be classified from data and interaction perspectives [23].…”
Section: B. Offline (Batch) Reinforcement Learning (mentioning)
confidence: 99%
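Because the deployment-efficiency measure is just a count of data-collection policy changes, it can be tracked with a thin wrapper around whatever rollout routine is in use. The class below is a hypothetical sketch, not an implementation from [22]; under this bookkeeping, a fully offline run ends with num_deployments == 1.

```python
class DeploymentCounter:
    """Counts how many distinct data-collection policies have been deployed."""

    def __init__(self):
        self.num_deployments = 0
        self._deployed_policy_id = None

    def collect(self, env, policy, num_steps, rollout_fn):
        # Switching to a new (or updated) policy for data collection counts as
        # one deployment; repeated collection with the same policy object does not.
        if id(policy) != self._deployed_policy_id:
            self.num_deployments += 1
            self._deployed_policy_id = id(policy)
        return rollout_fn(env, policy, num_steps)
```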
“…Recently, fully offline reinforcement learning methods like Random Ensemble Mixture (REM) [12], Deep Q-learning from Demonstrations (DQfD) [24], Bootstrapping Error Accumulation Reduction (BEAR) [25], Batch-Constrained deep Q-learning (BCQ) [13], [26], and Behavior Regularized Actor Critic (BRAC) [27] take different approaches to overcoming those limitations. Behavior-Regularized Model-ENsemble (BREMEN) [22] is a model-based algorithm that can be used in a fully offline setup, but its main goal is to be efficient in the number of changes to the data-collection policy needed during learning (deployment efficiency) and sample-efficient through a mixed (online and offline) approach. Similar to BREMEN, Advantage Weighted Actor Critic (AWAC) [28] focuses on effective fine-tuning with online experience after an offline pre-training period.…”
Section: B. Offline (Batch) Reinforcement Learning (mentioning)
confidence: 99%
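The common thread between BREMEN and BRAC noted above is behavior regularization: the learned policy is kept close to an estimate of the policy that produced the offline data. The loss below is a generic illustration of that idea rather than the exact BREMEN objective, and `policy`, `behavior_policy`, and `estimated_return` are assumed callables supplied by the user.

```python
import torch

def behavior_regularized_loss(policy, behavior_policy, estimated_return, states, alpha=1.0):
    # Assumptions: `policy(states)` and `behavior_policy(states)` return
    # torch.distributions objects; `estimated_return(states, actions)` is a
    # critic or model-based return estimate. None of these come from the paper.
    dist = policy(states)
    behavior_dist = behavior_policy(states)   # e.g. behavior-cloned from the offline dataset
    actions = dist.rsample()                  # reparameterized sample keeps gradients flowing
    kl = torch.distributions.kl_divergence(dist, behavior_dist)
    # Maximize the estimated return while penalizing divergence from the behavior policy.
    return (alpha * kl - estimated_return(states, actions)).mean()
```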