2021
DOI: 10.48550/arxiv.2102.06234
Preprint

Optimization Issues in KL-Constrained Approximate Policy Iteration

Abstract: Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API). While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy. Popular practical algorithms such as TRPO, MPO, and VMPO replace the regularization with a constraint on the KL-divergence between consecutive policies, arguing that this is easier to implement and tune. In this work, we study this implementation choice in…
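
As a rough sketch (standard formulations assumed here, not equations quoted from the paper), the two kinds of update contrasted in the abstract can be written as follows, where \pi_k is the current policy, d^{\pi_k} the state distribution it induces, A^{\pi_k} its advantage function, \lambda a regularization weight, and \epsilon a trust-region radius; the direction of the KL term varies between algorithms:

KL-regularized update:
\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s,a) \big] \;-\; \lambda \, \mathbb{E}_{s \sim d^{\pi_k}}\big[ \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big) \big]

KL-constrained (TRPO-style) update:
\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s,a) \big] \quad \text{subject to} \quad \mathbb{E}_{s \sim d^{\pi_k}}\big[ \mathrm{KL}\big( \pi_k(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \big] \le \epsilon

Per the abstract, TRPO, MPO, and VMPO adopt the constrained form because it is easier to implement and tune, and the paper studies the consequences of that implementation choice.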

Cited by 2 publications (2 citation statements) | References 8 publications

“…Here we consider simply best selection, and it remains open to understand other practical training tricks, such as proximal update [16,26] and regularization [27] under stochastic settings.…”
Section: Ensemble Methods (mentioning)
Confidence: 99%

“…If the limit of the variance is small enough, then Theorem 3.2 implies that the trust region method converges. In [22], a linear expected regret is proved for TRPO. Actually, this is consistent with our conclusions.…”
Section: Parameterized Policy (mentioning)
Confidence: 99%