2020
DOI: 10.48550/arxiv.2010.14498
Preprint

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Abstract: We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that th…
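To make the rank measure in the abstract concrete, here is a minimal Python/NumPy sketch of the effective rank (srank) of a feature matrix, assuming the thresholded singular-value definition used in the paper, srank_δ(Φ) = min{k : (Σ_{i≤k} σ_i) / (Σ_i σ_i) ≥ 1 − δ} with δ = 0.01; the feature matrix and sizes below are illustrative, not the paper's experimental setup.

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Effective rank (srank) of a feature matrix.

    Smallest k such that the top-k singular values account for at
    least a (1 - delta) fraction of the total singular-value mass.
    """
    # Singular values, returned in descending order.
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    # First index where the cumulative ratio crosses 1 - delta.
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Illustrative usage: rows are states in a batch, columns are
# penultimate-layer features of a value network.
phi = np.random.randn(256, 64) @ np.random.randn(64, 64)
print(effective_rank(phi))
```

Tracking this quantity over the course of training is how the drop in expressivity is measured; a shrinking srank means the network is effectively using fewer feature directions.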

Cited by 5 publications (7 citation statements)
References 23 publications (43 reference statements)
“…We perform their experiment, this time with a C51 agent with and without normalised layers, looking to better understand the regularisation effects of normalisation. Figure 21 shows an evolution of the effective rank for the baseline agent that is consistent with the report of Kumar et al. (2020). Interestingly, the baseline agent consistently makes use of fewer and fewer dimensions in the feature space as training progresses, while the normalised agents preserve the rank.…”
Section: C4 Effective Rank (supporting)
confidence: 63%
“…Miyato et al. (2018) contrast SN and WN and argue that the Frobenius norm encourages a loss in the number of usable features of the learned representations. Our experiments support this argument: measuring the effective rank (Kumar et al., 2020) shows a faster loss of feature rank for the baseline agent compared to any SN agent (Fig. 21 in Appendix).…”
Section: Related Work (supporting)
confidence: 60%
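The contrast drawn above between spectral normalisation (SN) and weight normalisation (WN) can be made concrete with a short PyTorch sketch; layer sizes and names are illustrative, not the cited papers' setup.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm, weight_norm

# SN rescales the weight by an estimate of its largest singular value
# (via power iteration), so only the spectral norm is constrained.
sn_layer = spectral_norm(nn.Linear(64, 64))

# WN reparameterises the weight as magnitude * direction (by default
# one magnitude per output unit), so the weight's norm is learned
# explicitly rather than constrained.
wn_layer = weight_norm(nn.Linear(64, 64))
```

The design difference is what the argument above turns on: SN caps a single singular value and leaves the rest of the spectrum free, while WN ties the update to the overall weight norm, which Miyato et al. argue encourages collapsing onto fewer usable feature directions.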
“…The performance of CriticSMC relies heavily on the quality of the critic, and in this work we trained it using a basic TD update from Equation 6. One avenue for future work is devising more efficient and stable algorithms for learning the soft Q function, such as proximal updates [56] or regularization which guards against deterioration [38].…”
Section: Discussion (mentioning)
confidence: 99%
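Equation 6 of the cited paper is not reproduced in the statement above, so the following is only a generic sketch of what a basic one-step TD update for a discrete-action soft Q-function usually looks like, using the standard maximum-entropy Bellman target; q_net, target_net, alpha, and the batch layout are illustrative assumptions, not CriticSMC's actual implementation.

```python
import torch
import torch.nn.functional as F

def soft_td_loss(q_net, target_net, batch, gamma=0.99, alpha=0.1):
    """One-step soft TD loss for a discrete-action soft Q-function.

    Target: r + gamma * alpha * logsumexp(Q_target(s', .) / alpha),
    the usual maximum-entropy Bellman backup (names are illustrative).
    """
    s, a, r, s_next, done = batch          # a: LongTensor of action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(s_next)        # [batch, num_actions]
        soft_v = alpha * torch.logsumexp(next_q / alpha, dim=1)
        target = r + gamma * (1.0 - done) * soft_v
    return F.mse_loss(q_sa, target)
```

The logsumexp term is the soft value α log Σ_a exp(Q(s′, a)/α), which recovers the ordinary max-backup as α → 0; iterating this regression onto a frozen target network is exactly the bootstrapped setup the headline paper studies.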
“…Our evaluation focuses on discrete-action on-policy RL algorithms, since many factors that influence the learning of off-policy methods are still not well understood (Achiam et al., 2019; Kumar et al., 2020; Van Hasselt et al., 2018; Fu et al., 2019). Specifically, we compare three algorithms. … CVS consistently exhibits higher sample efficiency than both PPO and PPOF, showing that dynamic modularity correlates with more efficient transfer.…”
Section: Simple Experiments (mentioning)
confidence: 99%