The agent of Q-VMD-RLG continuously learns the optimal policy π based on the action-value function Q(S(t), a(t)) and continuously updates the action a(t) to maximize the value of Q [72–76], thereby obtaining the optimal action-value function Q*(S(t), a(t)) [77, 78], which is updated as shown in Equation (16).
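Equation (16) itself does not survive in this excerpt. For reference, the canonical tabular Q-learning update it describes takes the following form, where α (learning rate), γ (discount factor), and r(t + 1) (reward received after executing a(t)) are standard symbols assumed here rather than taken from the original equation:

$$Q\big(S(t), a(t)\big) \leftarrow Q\big(S(t), a(t)\big) + \alpha \left[ r(t+1) + \gamma \max_{a} Q\big(S(t+1), a\big) - Q\big(S(t), a(t)\big) \right]$$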
where Q(S(t), a(t)) is the value of the action-value function at time t, and S(t) and a(t) denote the state and the action executed by the agent at time t, respectively.
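The learning loop described above can be sketched as a tabular temporal-difference update. This is a minimal illustration of standard Q-learning, not the authors' Q-VMD-RLG implementation; the function names and the default values of alpha and gamma are assumptions for the sketch.

```python
# Minimal sketch of a tabular Q-learning update (standard algorithm,
# not the Q-VMD-RLG code). Q is an (n_states x n_actions) table;
# alpha and gamma are illustrative hyperparameter choices.
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of Q(s, a) toward the Bellman target."""
    td_target = r + gamma * np.max(Q[s_next])   # reward + discounted best next value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q

def greedy_policy(Q, s):
    """Policy pi derived from Q: choose the action that maximizes Q(s, a)."""
    return int(np.argmax(Q[s]))
```

Iterating `q_update` over experienced transitions (s, a, r, s_next) drives Q toward the optimal action-value function Q*, from which the greedy policy recovers π.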