2018 Annual American Control Conference (ACC)
DOI: 10.23919/acc.2018.8430925

Nonparametric Stochastic Compositional Gradient Descent for Q-Learning in Continuous Markov Decision Problems

Cited by 16 publications (6 citation statements) · References 17 publications
“…As shown in Figure 11, the reward estimate converges and becomes stable after a certain number of iterations, e.g., 4 iterations for k = 3 and environment 0 in Figure 11(a), which indicates that the parameter ζ converges to a locally optimal solution and moves closer to the true distribution as the iterations increase. The episodes at which EROT converges, however, are only weakly correlated with the number of iterations. […] In MPG, policy improvement is based on the policy gradient, a family of model-free learning algorithms in which the policy is parameterized explicitly and improved in the direction of the gradient of some scalar performance measure, such as REINFORCE [8], deep deterministic policy gradient (DDPG) [9], deterministic policy gradient (DPG) [10], policy gradient and Q-learning (PGQ) [11], trust region policy optimization (TRPO) [12], KQ-Learning [43], and policy gradient with Jordan decomposition [44]. Policy gradient methods have several advantages, such as faster learning, a superior asymptotic policy, and the ability to select actions with arbitrary probabilities [29], [45].…”
Section: B Experimental Analysis
confidence: 99%
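The statement above describes the policy-gradient family only in words; a minimal sketch may help fix the idea. The environment, names, and step sizes below are hypothetical, not taken from any of the cited papers: the policy is an explicit softmax parameterization, and its parameters are moved along a Monte Carlo estimate of the gradient of the expected return (REINFORCE-style).

```python
# Minimal REINFORCE sketch on a hypothetical toy MDP (all names illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # explicit policy parameters


def softmax_policy(state):
    """Action probabilities pi(a | state; theta)."""
    prefs = theta[state]
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()


def step(state, action):
    """Toy dynamics: action 1 in state 1 pays off; otherwise small noise."""
    reward = 1.0 if (state == 1 and action == 1) else 0.1 * rng.standard_normal()
    next_state = rng.integers(n_states)
    return next_state, reward


alpha, gamma, horizon = 0.1, 0.95, 20
for episode in range(500):
    state, trajectory = rng.integers(n_states), []
    for _ in range(horizon):
        probs = softmax_policy(state)
        action = rng.choice(n_actions, p=probs)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state

    # REINFORCE update: grad log pi(a|s) scaled by the return-to-go.
    G = 0.0
    for state, action, reward in reversed(trajectory):
        G = reward + gamma * G
        grad_log = -softmax_policy(state)
        grad_log[action] += 1.0           # d log softmax / d theta[state]
        theta[state] += alpha * G * grad_log
```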
“…In contrast to other kernel-based RL algorithms, such as [14], ours significantly reduces the computational complexity by updating the dictionary only after a sequence of actions. In practice, our algorithm performs cheap actions (as measured by time and computational complexity) in order to perform relatively few computationally intensive learning steps.…”
Section: A Mountain Car
confidence: 99%
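To make the design choice above concrete, here is a minimal sketch, under assumed names and a placeholder TD rule rather than the cited algorithm: transitions are buffered during cheap interaction steps, and the kernel dictionary is grown only once per batch, in a single heavier learning step.

```python
# Sketch of batched dictionary updates for a kernel Q estimate (illustrative only).
import numpy as np


def rbf(x, y, bandwidth=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))


class KernelQ:
    def __init__(self):
        self.dictionary = []   # list of (state, action) prototypes
        self.weights = []      # kernel expansion coefficients

    def value(self, sa):
        return sum(w * rbf(sa, d) for w, d in zip(self.weights, self.dictionary))

    def batch_update(self, transitions, gamma=0.95, lr=0.5):
        """One computationally heavier learning step per batch of transitions."""
        for sa, reward, next_sa in transitions:
            td_error = reward + gamma * self.value(next_sa) - self.value(sa)
            # Grow the dictionary with the visited point and a TD-scaled weight.
            self.dictionary.append(sa)
            self.weights.append(lr * td_error)


# Usage: act for a sequence of steps, then learn once from the buffer.
q, rng, buffer = KernelQ(), np.random.default_rng(1), []
for t in range(20):                            # cheap interaction steps
    sa = rng.standard_normal(3)                # placeholder (state, action) features
    next_sa = rng.standard_normal(3)
    buffer.append((sa, 0.01 * next_sa.sum(), next_sa))
q.batch_update(buffer)                         # single intensive learning step
```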
“…As is the case in many other learning domains, a common approach to sidestep the curse of dimensionality is to assume that either the Q-function or the policy admits a finite parametrization that can be linear [10], rely on a nonlinear basis expansion [11], or be given by a neural network [12]. Alternatively, one can assume that the Q-function [13], [14] or the policy [15] belongs to a reproducing kernel Hilbert space (RKHS), which provides the ability to approximate functions using nonparametric functional representations. Although the structure of the space is determined by the choice of the kernel, the set of functions that can be represented is sufficiently rich to permit a good approximation of a large class of functions.…”
Section: Introduction
confidence: 99%
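The RKHS representation mentioned above amounts to writing the Q-function as a kernel expansion Q(s,a) = Σ_i w_i k((s,a), (s_i,a_i)) over observed points. The sketch below is an assumed, simplified illustration (kernel ridge regression on stand-in targets), not the paper's method:

```python
# Nonparametric RKHS-style Q-function as a kernel expansion (illustrative fit).
import numpy as np


def gaussian_kernel(X, Y, bandwidth=0.5):
    """Pairwise Gaussian kernel matrix between rows of X and Y."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * bandwidth ** 2))


rng = np.random.default_rng(0)
SA = rng.uniform(-1, 1, size=(200, 3))                        # sampled (state, action) pairs
targets = np.sin(SA[:, 0]) + 0.1 * rng.standard_normal(200)   # stand-in Bellman targets

K = gaussian_kernel(SA, SA)
weights = np.linalg.solve(K + 1e-3 * np.eye(len(SA)), targets)   # ridge-regularized fit


def q_value(sa):
    """Evaluate the kernel-expansion Q-function at a new (state, action) pair."""
    return gaussian_kernel(sa[None, :], SA)[0] @ weights


print(q_value(np.array([0.3, 0.0, 0.0])))
```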
“…A common approach to overcome this difficulty is to assume that the Q-function admits a finite parameterization that can be linear [4], rely on a nonlinear basis expansion [5], or be given by a neural network [6]. Alternatively, one can assume that the Q-function [7], [8] belongs to a reproducing kernel Hilbert space. However, in these cases, maximizing the Q-function to select the best possible action is computationally challenging.…”
Section: Introduction
confidence: 99%
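On the last point, a kernel-expansion Q has no closed-form maximizer over a continuous action. A common workaround, sketched below with an assumed toy Q and hypothetical names, is to approximate the greedy action by evaluating Q on a finite set of candidate actions:

```python
# Approximate greedy action selection for a kernel Q over a continuous action (toy example).
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(50, 2))     # (state, action) prototypes
weights = rng.standard_normal(50)              # expansion coefficients


def q_value(state, action, bandwidth=0.3):
    sa = np.array([state, action])
    k = np.exp(-np.sum((centers - sa) ** 2, axis=1) / (2 * bandwidth ** 2))
    return k @ weights


def greedy_action(state, n_candidates=101):
    """Approximate argmax_a Q(state, a) over a grid of candidate actions."""
    candidates = np.linspace(-1.0, 1.0, n_candidates)
    values = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(values))]


print(greedy_action(0.2))
```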