2020
DOI: 10.48550/arxiv.2006.03647
Preprint

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima,
Hiroki Furuta,
Yutaka Matsuo
et al.

Abstract: Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a …
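Even in truncated form, the abstract pins down the deployment-efficient setting: policy updates are cheap, but switching to a new data-collection policy is allowed only a handful of times. A minimal sketch of that constraint is given below; the helper names (make_env, init_policy, collect_batch, offline_update) are illustrative placeholders, not the paper's API.

```python
# Minimal sketch of the deployment-constrained setting described in the abstract.
# Assumption: `collect_batch` runs the current policy in the environment (the
# costly/risky step), while `offline_update` improves the policy from the fixed
# dataset. None of these names come from the paper.

NUM_DEPLOYMENTS = 5                  # online RL would effectively redeploy after every update
TRANSITIONS_PER_DEPLOYMENT = 200_000
UPDATES_PER_DEPLOYMENT = 10_000

def deployment_efficient_training(make_env, init_policy, collect_batch, offline_update):
    env = make_env()
    policy = init_policy()
    dataset = []
    for _ in range(NUM_DEPLOYMENTS):
        # Each iteration deploys the current policy exactly once for data collection.
        dataset += collect_batch(env, policy, TRANSITIONS_PER_DEPLOYMENT)
        # Between deployments, the policy is improved purely from the stored data.
        for _ in range(UPDATES_PER_DEPLOYMENT):
            policy = offline_update(policy, dataset)
    return policy
```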

Cited by 15 publications (17 citation statements) | References 26 publications
“…Figure 3 shows instability over the set of evaluations, which means the performance of the agent may depend on the specific stopping point chosen for evaluation. This calls into question the empirical effectiveness of offline RL for safety-critical real-world use cases [Mandel et al., 2014, Gottesman et al., 2018, Gauci et al., 2018, Jaques et al., 2019, Matsushima et al., 2020] as well as the current trend of reporting only the mean value of the final policy in offline benchmarking [Fu et al., 2021].…”
Section: Meta Analyses of RL Algorithms (mentioning)
confidence: 99%
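The instability this statement points to can be made concrete with a small evaluation helper that scores every stored checkpoint instead of only the final policy. This is an illustrative sketch; `checkpoints` and `evaluate` are hypothetical stand-ins for artifacts of an offline RL training run, not anything defined in the cited papers.

```python
import numpy as np

def summarize_checkpoints(checkpoints, evaluate, episodes_per_checkpoint=10):
    """Contrast final-policy reporting with performance across stopping points."""
    returns = np.array([
        np.mean([evaluate(policy) for _ in range(episodes_per_checkpoint)])
        for policy in checkpoints
    ])
    return {
        "final_mean": float(returns[-1]),          # what many offline benchmarks report
        "best_checkpoint": float(returns.max()),   # optimistic stopping point
        "worst_checkpoint": float(returns.min()),  # pessimistic stopping point
        "spread": float(returns.max() - returns.min()),  # instability across evaluations
    }
```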
“…While the use of model-based principles in OPE has been relatively rare, it has been more commonly used for policy optimization. The field of model-based RL has matured in recent years to yield impressive results for both online (Nagabandi et al., 2018; Chua et al., 2018; Kurutach et al., 2018; Janner et al., 2019) and offline (Matsushima et al., 2020; Kidambi et al., 2020; Yu et al., 2020; Argenson and Dulac-Arnold, 2020) policy optimization. Several of the techniques we employ, such as … For both families of dynamics models, we train 48 models with different hyperparameters.…”
Section: Related Work (mentioning)
confidence: 99%
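The quoted experimental detail (many dynamics models trained with different hyperparameters) amounts to a small grid-search ensemble. The sketch below shows one way to set that up; `train_dynamics_model` is an assumed user-supplied function that fits a model of p(s' | s, a) on the offline dataset, and the grid values are invented for illustration rather than taken from the cited work.

```python
from itertools import product

def train_dynamics_ensemble(dataset, train_dynamics_model):
    # Invented hyperparameter grid; the cited work's actual grid is not reproduced here.
    grid = {
        "hidden_size":   [200, 400],
        "num_layers":    [2, 3, 4],
        "learning_rate": [1e-3, 3e-4],
        "weight_decay":  [0.0, 1e-5],
    }
    ensemble = []
    for values in product(*grid.values()):
        hparams = dict(zip(grid.keys(), values))
        ensemble.append(train_dynamics_model(dataset, **hparams))
    return ensemble  # 2 * 3 * 2 * 2 = 24 models in this sketch
```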
“…The deployment efficiency measure [22] in RL counts the number of changes to the data-collection policy during learning; i.e., an offline RL setup corresponds to a single deployment allowed for learning. Reinforcement learning methods can be classified from data and interaction perspectives [23].…”
Section: B. Offline (Batch) Reinforcement Learning (mentioning)
confidence: 99%
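Because the deployment-efficiency measure is just a count of data-collection policy changes, it can be tracked with a thin wrapper around whatever rollout routine is in use. The class below is a hypothetical sketch, not an implementation from [22]; under this bookkeeping, a fully offline run ends with num_deployments == 1.

```python
class DeploymentCounter:
    """Counts how many distinct data-collection policies have been deployed."""

    def __init__(self):
        self.num_deployments = 0
        self._deployed_policy_id = None

    def collect(self, env, policy, num_steps, rollout_fn):
        # Switching to a new (or updated) policy for data collection counts as
        # one deployment; repeated collection with the same policy object does not.
        if id(policy) != self._deployed_policy_id:
            self.num_deployments += 1
            self._deployed_policy_id = id(policy)
        return rollout_fn(env, policy, num_steps)
```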
“…Recently, fully offline reinforcement learning methods like Random Ensemble Mixture (REM) [12], Deep Q-learning from Demonstrations (DQfD) [24], Bootstrapping Error Accumulation Reduction (BEAR) [25], Batch-Constrained deep Q-learning (BCQ) [13], [26], and Behavior Regularized Actor Critic (BRAC) [27] take different approaches to overcoming those limitations. Behavior-Regularized Model-ENsemble (BREMEN) [22] is a model-based algorithm that can be used in a fully offline setup, but its main goal is to be efficient in the number of changes to the data-collection policy needed during learning (deployment efficiency) and sample-efficient through a mixed (online and offline) approach. Similar to BREMEN, Advantage Weighted Actor Critic (AWAC) [28] focuses on effective fine-tuning with online experience after an offline pre-training period.…”
Section: B. Offline (Batch) Reinforcement Learning (mentioning)
confidence: 99%
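The common thread between BREMEN and BRAC noted above is behavior regularization: the learned policy is kept close to an estimate of the policy that produced the offline data. The loss below is a generic illustration of that idea rather than the exact BREMEN objective, and `policy`, `behavior_policy`, and `estimated_return` are assumed callables supplied by the user.

```python
import torch

def behavior_regularized_loss(policy, behavior_policy, estimated_return, states, alpha=1.0):
    # Assumptions: `policy(states)` and `behavior_policy(states)` return
    # torch.distributions objects; `estimated_return(states, actions)` is a
    # critic or model-based return estimate. None of these come from the paper.
    dist = policy(states)
    behavior_dist = behavior_policy(states)   # e.g. behavior-cloned from the offline dataset
    actions = dist.rsample()                  # reparameterized sample keeps gradients flowing
    kl = torch.distributions.kl_divergence(dist, behavior_dist)
    # Maximize the estimated return while penalizing divergence from the behavior policy.
    return (alpha * kl - estimated_return(states, actions)).mean()
```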