“…We parameterize the policy π by a neural network F with parameter
θ, that is,
for some function f . A popular choice of f is given by
for some parameter τ, which gives an energy-based policy (see, e.g., Haarnoja et al., 2017;
Wang et al., 2020). The policy parameter θ is updated using the gradient ascent rule given by
where
is an estimate of the policy gradient.…”
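
To make the passage concrete, here is a minimal sketch of an energy-based policy and one gradient ascent update. It rests on several assumptions that are not in the excerpt: the policy is taken to be a softmax of the form π_θ(a | s) ∝ exp(τ · F_θ(s, a)), with τ entering multiplicatively; the update is θ ← θ + η · ĝ, with a hypothetical step size η; a linear score θ · φ(s, a) stands in for the neural network F so the example needs no dependencies; and the gradient estimate ĝ is a single-sample score-function (REINFORCE-style) estimate on a toy two-action bandit. All function names are illustrative.

```python
import math
import random


def energy_policy(scores, tau):
    """Softmax over action scores F_theta(s, a), scaled by tau (assumed form)."""
    scaled = [tau * x for x in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def scores_for(theta, features):
    """Hypothetical linear stand-in for F_theta(s, a): theta . phi(s, a)."""
    return [sum(t * f for t, f in zip(theta, phi)) for phi in features]


def grad_log_pi(theta, features, action, tau):
    """Gradient of log pi_theta(action | s) for the linear stand-in."""
    probs = energy_policy(scores_for(theta, features), tau)
    dim = len(theta)
    mean_phi = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(dim)]
    return [tau * (features[action][i] - mean_phi[i]) for i in range(dim)]


def ascent_step(theta, grad_estimate, eta):
    """One gradient ascent update: theta <- theta + eta * grad_estimate."""
    return [t + eta * g for t, g in zip(theta, grad_estimate)]


def run_bandit(steps=200, tau=1.0, eta=0.5, seed=0):
    """Toy single-state problem: action 0 pays reward 1, action 1 pays 0."""
    rng = random.Random(seed)
    features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot phi(s, a) per action
    rewards = [1.0, 0.0]
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = energy_policy(scores_for(theta, features), tau)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Single-sample score-function estimate of the policy gradient.
        score = grad_log_pi(theta, features, action, tau)
        grad_estimate = [rewards[action] * g for g in score]
        theta = ascent_step(theta, grad_estimate, eta)
    return energy_policy(scores_for(theta, features), tau)


if __name__ == "__main__":
    print(run_bandit())
```

In this toy setting every nonzero update raises the score of the rewarded action, so the returned policy shifts probability toward action 0. A neural network F would replace `scores_for`, and its gradient would typically come from automatic differentiation rather than the closed form used here.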