Multi-agent reinforcement learning has made substantial empirical progress in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates which learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5 S \max_{i\le m} A_i / \epsilon^2)$ episodes, and an $\epsilon$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6 S \max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexity polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine which minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S \sum_{i\le m} A_i / \epsilon^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $\epsilon$), which depends only linearly on $\sum_{i\le m} A_i$ and significantly improves over existing efficient algorithms in the $\epsilon$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.
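To make the swap-regret component concrete, below is a minimal sketch of the classical Blum–Mansour reduction from swap regret to external regret under bandit feedback, with one Exp3 instance per action. This is illustrative only and is not the paper's algorithm: the paper's subroutine minimizes a *weighted* swap regret with a specific weighting scheme, which this sketch omits. The names `Exp3`, `stationary`, and `swap_regret_play` are hypothetical identifiers introduced here for illustration.

```python
import numpy as np

class Exp3:
    """One external-regret (Exp3) learner; the Blum-Mansour reduction
    keeps one such instance per action."""
    def __init__(self, n_actions, lr):
        self.w = np.zeros(n_actions)  # negative cumulative loss estimates
        self.lr = lr

    def dist(self):
        # Softmax over accumulated (negated) losses.
        w = self.w - self.w.max()
        p = np.exp(w)
        return p / p.sum()

    def update(self, loss_vec):
        self.w -= self.lr * loss_vec

def stationary(Q, iters=200):
    """Power-iterate p = pQ to approximate the stationary distribution
    of the row-stochastic matrix Q (rows have full support under Exp3)."""
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(iters):
        p = p @ Q
    return p / p.sum()

def swap_regret_play(n_actions, loss_fn, T, lr=0.1, seed=0):
    """loss_fn(t, a) -> loss in [0, 1] of the single played action
    (bandit feedback); loss_fn is a hypothetical placeholder."""
    rng = np.random.default_rng(seed)
    experts = [Exp3(n_actions, lr) for _ in range(n_actions)]
    for t in range(T):
        Q = np.stack([e.dist() for e in experts])  # row i: instance i's dist
        p = stationary(Q)              # play the stationary distribution of Q
        a = rng.choice(n_actions, p=p)
        loss = loss_fn(t, a)
        est = np.zeros(n_actions)      # importance-weighted loss estimate
        est[a] = loss / p[a]
        for i, e in enumerate(experts):
            e.update(p[i] * est)       # instance i is charged p[i] * est
    return p
```

The key step is playing the stationary distribution $p = pQ$: it makes the learner's loss decompose across the per-action instances, so each instance's external regret bounds the regret of swapping that action, and summing over instances bounds the total swap regret.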