As shown in Figure 11, the reward estimation converges and becomes stable after a certain number of iterations, e.g., 4 iterations for k = 3 in environment 0 in Figure 11(a), which indicates that the parameter ζ converges to a local optimum and moves closer to the real distribution as the iterations increase. The episode at which EROT converges, however, shows little correlation with the number of iterations.

In MPG, policy improvement is based on the policy gradient, a family of model-free learning algorithms in which the policy is parameterized explicitly and improved in the direction of the gradient of some scalar performance measure. Examples include REINFORCE [8], deep deterministic policy gradient (DDPG) [9], deterministic policy gradient (DPG) [10], policy gradient and Q-learning (PGQ) [11], trust region policy optimization (TRPO) [12], KQ-Learning [43], and policy gradient with Jordan decomposition [44]. Policy gradient methods have several advantages, such as faster learning, a superior asymptotic policy, and the ability to select actions with arbitrary probabilities [29], [45].
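To make the gradient-ascent update concrete, the following is a minimal REINFORCE-style sketch [8], not the MPG method itself. The linear softmax parameterization, the `env.reset()`/`env.step()` interface, and the step sizes `alpha` and `gamma` are illustrative assumptions, not details taken from this paper.

```python
# Minimal REINFORCE sketch: theta += alpha * G_t * grad log pi(a_t|s_t).
# The environment interface and hyperparameters below are hypothetical,
# chosen only to illustrate the policy-gradient update.
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Linear softmax policy: pi(a|s) proportional to exp(theta[a] @ s)."""
    def __init__(self, n_features, n_actions):
        self.theta = np.zeros((n_actions, n_features))

    def probs(self, s):
        return softmax(self.theta @ s)

    def sample(self, s, rng):
        return rng.choice(len(self.theta), p=self.probs(s))

    def grad_log_prob(self, s, a):
        # For linear softmax: d/d theta_b log pi(a|s) = (1[a=b] - pi(b|s)) * s
        p = self.probs(s)
        grad = -np.outer(p, s)
        grad[a] += s
        return grad

def reinforce_episode(env, policy, alpha=0.01, gamma=0.99, rng=None):
    """Run one episode, then update the policy in the gradient direction."""
    rng = rng or np.random.default_rng()
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = policy.sample(s, rng)
        s_next, r, done = env.step(a)   # assumed environment interface
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # discounted return from time t
        policy.theta += alpha * G * policy.grad_log_prob(states[t], actions[t])
```

The update follows the scalar performance measure described above: each step moves the parameters along the gradient of the log-probability of the taken action, weighted by the observed return.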