2023
DOI: 10.48550/arxiv.2303.05479
Preprint

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning

Abstract: A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets, which allows efficient fine-tuning with limited amounts of active online interaction. However, several existing offline RL methods tend to exhibit poor online fine-tuning performance. On the other hand, online RL methods can learn effectively through online interaction, but struggle to incorporate offline data, which can make them very slow in settings where exploration is challenging or pre-training is…

Cited by 4 publications (6 citation statements) | References 29 publications
“…The combination of offline and online RL techniques has emerged as a promising research direction. In works like [2,3,8], offline RL has been used to train a policy from a pre-collected dataset of experiences that is then fine-tuned with online RL. These studies have investigated diverse strategies aimed at improving the performance gain of offline pre-training and mitigating the phenomenon known as policy collapse, which causes a performance dip when shifting from offline to online training [3].…”
Section: Combining Offline and Online RL
Citation type: mentioning, confidence: 99%
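To make the pre-train-offline-then-fine-tune recipe referenced in the statement above concrete, here is a minimal sketch using a toy tabular Q-learning agent; the chain environment, dataset size, and hyperparameters are illustrative assumptions rather than details from any of the cited works, and the sketch deliberately omits the conservatism/calibration machinery those papers add to avoid the offline-to-online performance dip.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95

def step(s, a):
    # Toy chain MDP used only for illustration: action 1 moves right, reward at the last state.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

# Offline phase: a static dataset of transitions, here generated by a random behavior policy.
offline_data, s = [], 0
for _ in range(2000):
    a = int(rng.integers(n_actions))
    s_next, r = step(s, a)
    offline_data.append((s, a, r, s_next))
    s = s_next

Q = np.zeros((n_states, n_actions))

def td_update(q, batch, lr=0.1):
    for s, a, r, s_next in batch:
        q[s, a] += lr * (r + gamma * q[s_next].max() - q[s, a])

# 1) Offline RL pre-training: fit Q only on the pre-collected dataset.
for _ in range(50):
    td_update(Q, offline_data)

# 2) Online fine-tuning: keep improving with fresh interaction,
#    mixing new transitions into the (growing) replay buffer.
replay, s = list(offline_data), 0
for t in range(2000):
    a = int(Q[s].argmax()) if rng.random() > 0.1 else int(rng.integers(n_actions))
    s_next, r = step(s, a)
    replay.append((s, a, r, s_next))
    td_update(Q, [replay[i] for i in rng.integers(len(replay), size=64)])
    s = s_next

print("greedy policy after fine-tuning:", Q.argmax(axis=1))
```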
“…These approaches propose measures such as reducing the underestimation during offline stages [8], imposing a conservative improvement to the online stage [3], and weighting policy improvement with the advantage function [2].…”
Section: Combining Offline and Online RL
Citation type: mentioning, confidence: 99%
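As a concrete instance of the last of those measures, below is a hedged sketch of an advantage-weighted (AWAC-style) policy update; the network shape, temperature, and placeholder batch are illustrative assumptions, and in the cited methods the advantages would come from a learned critic rather than random data.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and temperature; not taken from the cited papers.
obs_dim, n_actions, temperature = 8, 4, 1.0

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def advantage_weighted_update(obs, actions, advantages):
    """Advantage-weighted policy improvement: a behavior-cloning loss
    reweighted by exp(A / temperature), so better-than-average actions
    in the data receive a larger gradient weight."""
    log_probs = torch.log_softmax(policy(obs), dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = torch.exp(advantages / temperature).clamp(max=100.0)  # clamp for stability
    loss = -(weights.detach() * chosen_log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random placeholder data standing in for a replay batch.
obs = torch.randn(32, obs_dim)
actions = torch.randint(n_actions, (32,))
advantages = torch.randn(32)  # in practice: Q(s, a) - V(s) from a learned critic
print(advantage_weighted_update(obs, actions, advantages))
```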
“…The data available for RL consists of a combination of on-policy samples and potentially suboptimal near-expert interventions, which necessitates using a suitable off-policy RL algorithm that can incorporate prior (near-expert) data easily but also can efficiently improve with online experience. While a variety of algorithms designed for online RL with offline data could be suitable (Song et al., 2022; Lee et al., 2022; Nakamoto et al., 2023), we adopt the recently proposed RLPD algorithm (Ball et al., 2023), which has shown compelling results on sample-efficient robotic learning. RLPD is an off-policy actor-critic reinforcement learning algorithm that builds on soft actor-critic (Haarnoja et al., 2018), but makes some key modifications to satisfy the desiderata above, such as a high update-to-data ratio, layer-norm regularization during training, and using ensembles of value functions, which make it more suitable for incorporating offline data into online RL.…”
Section: Interactive Imitation Learning As Reinforcement Learning
Citation type: mentioning, confidence: 99%
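A minimal sketch of the ingredients named in that description (LayerNorm-regularized critics, a small ensemble of value functions, and a high update-to-data ratio); the layer sizes, ensemble size, UTD value, and placeholder batches are illustrative assumptions, not RLPD's actual hyperparameters or implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 17, 6          # illustrative dimensions
ensemble_size, utd_ratio = 2, 20  # illustrative; "high UTD" means many critic updates per env step

def make_critic():
    # LayerNorm after each hidden layer regularizes training, as described above.
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, 256), nn.LayerNorm(256), nn.ReLU(),
        nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
        nn.Linear(256, 1),
    )

critics = nn.ModuleList([make_critic() for _ in range(ensemble_size)])
optimizer = torch.optim.Adam(critics.parameters(), lr=3e-4)

def critic_update(batch_obs, batch_act, td_target):
    # One gradient step on every ensemble member toward a shared TD target.
    inputs = torch.cat([batch_obs, batch_act], dim=-1)
    loss = sum(((c(inputs).squeeze(-1) - td_target) ** 2).mean() for c in critics)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# High update-to-data ratio: after each environment step, take utd_ratio critic updates
# on sampled batches (random placeholder data here).
for _ in range(utd_ratio):
    obs = torch.randn(256, obs_dim)
    act = torch.randn(256, act_dim)
    target = torch.randn(256)  # in practice: r + gamma * a pessimistic aggregate over target critics
    critic_update(obs, act, target)
```

In RLPD-style training, batches are typically drawn from both the offline dataset and the online replay buffer, which is how the prior data is folded into online learning.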
“…Consequently, offline datasets alone cannot provide enough information on the safety constraints in the environment, and thus offline training is not sufficient for safe RL. It, therefore, strengthened the necessity of continuing to improve the decision-making policy by an online finetuning process with interactions in task environments [24]-[26].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…Prior work [25], [26] typically uses the offline trained policy network architecture for online finetuning. Unfortunately, DT's transformer-based policy network, with its numerous parameters, can often fall short of meeting computation speed requirements in real-world tasks like autonomous driving.…”
Section: Introduction
Citation type: mentioning, confidence: 99%