Improving Exploration in Soft-Actor-Critic with Normalizing Flows Policies

Ward, Patrick; Smofsky, Ariella; Bose, Avishek Joey

doi:10.48550/arxiv.1906.02771

Cited by 9 publications

(11 citation statements)

References 0 publications


“…Dinh et al (2015); Kingma & Dhariwal (2018) propose the coupling method to make the Jacobian triangular and ensure the forward and inverse can be computed with a single pass. The applications of NF include image generation (Ho et al, 2019; Kingma & Dhariwal, 2018), video generation (Kumar et al, 2019) and reinforcement learning (Mazoure et al, 2020; Ward et al, 2019; Touati et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

Flow-based Recurrent Belief State Learning for POMDPs

Chen, Mu, Luo et al. 2022

Preprint


Partially Observable Markov Decision Process (POMDP) provides a principled and generic framework to model real world sequential decision making processes but yet remains unsolved, especially for high dimensional continuous space and unknown models. The main challenge lies in how to accurately obtain the belief state, which is the probability distribution over the unobservable environment states given historical information. Accurately calculating this belief state is a precondition for obtaining an optimal policy of POMDPs. Recent advances in deep learning techniques show great potential to learn good belief states. However, existing methods can only learn approximated distribution with limited flexibility. In this paper, we introduce the FlOw-based Recurrent BElief State model (FORBES), which incorporates normalizing flows into the variational inference to learn general continuous belief states for POMDPs. Furthermore, we show that the learned belief states can be plugged into downstream RL algorithms to improve performance. In experiments, we show that our methods successfully capture the complex belief states that enable multi-modal predictions as well as high quality reconstructions, and results on challenging visual-motor control tasks show that our method achieves superior performance and sample efficiency.



“…al. [43] altered the choice of policy distribution from factored Gaussian in vanilla SAC to Normalizing flow policies for improving exploration. Campo et.…”

Section: Preliminaries and Motivation (mentioning)

confidence: 99%

Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience

Banerjee, Chen, Noman 2021

Preprint


Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm, essentially based on entropy regularization. SAC trains a policy by maximizing the tradeoff between expected return and entropy (randomness in the policy). It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks, outperforming prior on-policy and off-policy methods. SAC works in an off-policy fashion where data are sampled uniformly from past experiences (stored in a buffer) using which parameters of the policy and value function networks are updated. We propose certain crucial modifications for boosting the performance of SAC and make it more sample efficient. In our proposed improved SAC, we firstly introduce a new prioritization scheme for selecting better samples from the experience replay buffer. Secondly we use a mixture of the prioritized off-policy data with the latest on-policy data for training the policy and the value function networks. We compare our approach with the vanilla SAC and some recent variants of SAC and show that our approach outperforms the said algorithmic benchmarks. It is comparatively more stable and sample efficient when tested on a number of continuous control tasks in MuJoCo environments.


“…In Haarnoja et al (2018a), SAC is proposed to mitigate the policy's expressiveness issue while retaining tractable optimization; with the policy modeled with either a Gaussian or a mixture of Gaussian, SAC adopts a maximum entropy RL objective function to encourage exploration. The normalizing flow (Rezende and Mohamed, 2015; Dinh et al, 2016) based techniques have been recently applied to design a flexible policy in both on-policy (Tang and Agrawal, 2018) and off-policy settings (Ward et al, 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Implicit Distributional Reinforcement Learning

Yue, Wang, Zhou 2020

Preprint


To improve the sample efficiency of policy-gradient based reinforcement learning algorithms, we propose implicit distributional actor critic (IDAC) that consists of a distributional critic, built on two deep generator networks (DGNs), and a semi-implicit actor (SIA), powered by a flexible policy distribution. We adopt a distributional perspective on the discounted cumulative return and model it with a state-action-dependent implicit distribution, which is approximated by the DGNs that take state-action pairs and random noises as their input. Moreover, we use the SIA to provide a semi-implicit policy distribution, which mixes the policy parameters with a reparameterizable distribution that is not constrained by an analytic density function. In this way, the policy's marginal distribution is implicit, providing the potential to model complex properties such as covariance structure and skewness, but its parameter and entropy can still be estimated. We incorporate these features with an off-policy algorithm framework to solve problems with continuous action space, and compare IDAC with the state-of-art algorithms on representative OpenAI Gym environments. We observe that IDAC outperforms these baselines for most tasks.

