As shown in Figure 11, the reward estimation converges and becomes stable after a certain number of iterations, e.g., 4 iterations for k = 3 in environment 0 in Figure 11(a), which indicates that the parameter ζ converges to a local optimum and moves closer to the real distribution as the iterations increase. The episode at which EROT converges, however, shows little correlation with the number of iterations.

In MPG, policy improvement is based on the policy gradient, a family of model-free learning algorithms in which the policy is parameterized explicitly and improved in the direction of the gradient of some scalar performance measure. Examples include REINFORCE [8], deep deterministic policy gradient (DDPG) [9], deterministic policy gradient (DPG) [10], policy gradient and Q-learning (PGQ) [11], trust region policy optimization (TRPO) [12], KQ-Learning [43], and policy gradient with Jordan decomposition [44]. Policy gradient methods have several advantages, such as faster learning, a superior asymptotic policy, and the ability to select actions with arbitrary probabilities [29], [45].
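To make the gradient-ascent update concrete, the following is a minimal REINFORCE-style sketch [8], not the MPG method itself. The linear softmax parameterization, the `env.reset()`/`env.step()` interface, and the step sizes `alpha` and `gamma` are illustrative assumptions, not details taken from this paper.

```python
# Minimal REINFORCE sketch: theta += alpha * G_t * grad log pi(a_t|s_t).
# The environment interface and hyperparameters below are hypothetical,
# chosen only to illustrate the policy-gradient update.
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """Linear softmax policy: pi(a|s) proportional to exp(theta[a] @ s)."""
    def __init__(self, n_features, n_actions):
        self.theta = np.zeros((n_actions, n_features))

    def probs(self, s):
        return softmax(self.theta @ s)

    def sample(self, s, rng):
        return rng.choice(len(self.theta), p=self.probs(s))

    def grad_log_prob(self, s, a):
        # For linear softmax: d/d theta_b log pi(a|s) = (1[a=b] - pi(b|s)) * s
        p = self.probs(s)
        grad = -np.outer(p, s)
        grad[a] += s
        return grad

def reinforce_episode(env, policy, alpha=0.01, gamma=0.99, rng=None):
    """Run one episode, then update the policy in the gradient direction."""
    rng = rng or np.random.default_rng()
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = policy.sample(s, rng)
        s_next, r, done = env.step(a)   # assumed environment interface
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # discounted return from time t
        policy.theta += alpha * G * policy.grad_log_prob(states[t], actions[t])
```

The update follows the scalar performance measure described above: each step moves the parameters along the gradient of the log-probability of the taken action, weighted by the observed return.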