2021
DOI: 10.48550/arxiv.2106.04895
Preprint

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Abstract: Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference…

Cited by 4 publications (16 citation statements)
References 39 publications (66 reference statements)
“…up to log factor) for small enough ε (namely, ε ∈ (0, 1/H]). The ε-range that enjoys near-optimality is much larger compared to ε ∈ (0, 1/H^{2.5}] established in Xie et al (2021b) for model-based algorithms.…”
Section: Main Contributions
confidence: 88%
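The containment of the two accuracy ranges quoted above follows from a one-line comparison; a minimal worked step in LaTeX, assuming only that the horizon satisfies H ≥ 1:

\[
  H \ge 1
  \;\Longrightarrow\; \frac{1}{H^{2.5}} \le \frac{1}{H}
  \;\Longrightarrow\; \Bigl(0,\, \tfrac{1}{H^{2.5}}\Bigr] \subseteq \Bigl(0,\, \tfrac{1}{H}\Bigr],
\]

so every accuracy level ε covered by the earlier model-based guarantee is also covered by the new range, and the new range is strictly larger whenever H > 1.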
“…We prove that pessimistic Q-learning finds an ε-optimal policy as soon as the sample size T exceeds the order of (up to log factor) H^6 SC/ε^2, where C denotes the single-policy concentrability coefficient of the batch dataset. In comparison to the minimax lower bound Ω(H^4 SC/ε^2) developed in Xie et al (2021b), the sample complexity of pessimistic Q-learning is at most a factor of H^2 from optimal (modulo some log factor).…”
Section: Main Contributions
confidence: 98%
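The H^2 gap quoted in this statement follows from dividing the upper bound by the lower bound; a minimal worked step, dropping constants and log factors:

\[
  \frac{H^{6} S C / \varepsilon^{2}}{H^{4} S C / \varepsilon^{2}} \;=\; H^{2},
\]

so pessimistic Q-learning is within an H^2 factor of the minimax lower bound of Xie et al (2021b), modulo logarithmic factors.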
“…We believe this is likely most effective when one has a lot of domain knowledge of the task, and when it is applied to tasks that are too difficult for the algorithm to learn initially. For example, the authors of [25] use this approach when controlling Cassie. This approach works well for them because they are able to engineer a reward that leads to the desired behavior.…”
Section: B. Panda Arm Environments
confidence: 99%
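To make the reward-engineering idea in this statement concrete, here is a minimal, hypothetical sketch in Python; the function name, state representation, and weights are illustrative assumptions, not the reward actually used in [25] for Cassie or for the Panda arm environments.

import numpy as np

def shaped_reward(joint_pos, target_pos, action, w_track=1.0, w_effort=0.01):
    """Hypothetical shaped reward: favor tracking a target pose while
    penalizing large control effort. Weights are illustrative, not tuned."""
    tracking_error = np.linalg.norm(joint_pos - target_pos)  # distance to the desired pose
    effort_penalty = float(np.sum(np.square(action)))        # discourage aggressive actions
    # The exponential tracking term keeps that part of the reward bounded in (0, 1].
    return float(np.exp(-w_track * tracking_error)) - w_effort * effort_penalty

# Example usage with arbitrary values:
r = shaped_reward(np.array([0.1, -0.2, 0.3]), np.zeros(3), np.array([0.05, 0.0, -0.1]))

The pattern sketched here, a bounded tracking term plus a small effort penalty, reflects the general idea of encoding the desired behavior directly in the reward rather than relying on a sparse task-success signal.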