Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation
Preprint, 2022
DOI: 10.48550/arxiv.2205.09029

Cited by 2 publications (2 citation statements)
References 0 publications
“…In contrast, the learning still succeeds numerically, as any noise will perturb the dynamics off the saddle point, allowing learning to proceed (figure 6(A)). However, the dynamics still slow in the vicinity of the saddle point, providing a theoretical explanation for catastrophic slowing in deep linear networks (Lee et al 2022). We note that the analytical…”
Section: Continual
Citation type: mentioning
Confidence: 81%
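
A minimal sketch (illustrative, not code from the cited papers) of the point made in this statement: for a two-layer linear map w2 * w1 fitted to a target s, the origin is a saddle where every gradient vanishes, so exact gradient descent started there never moves, while any small amount of noise perturbs the weights off the saddle and learning then proceeds. The target s, noise scale, learning rate, and step count below are illustrative assumptions.

import numpy as np

def run(w1, w2, s=1.0, lr=0.05, steps=2000):
    """Plain gradient descent on L(w1, w2) = 0.5 * (s - w2 * w1) ** 2."""
    for _ in range(steps):
        err = s - w2 * w1
        # dL/dw1 = -w2 * err and dL/dw2 = -w1 * err, so both vanish at the origin
        w1, w2 = w1 + lr * w2 * err, w2 + lr * w1 * err
    return w2 * w1

print(run(0.0, 0.0))  # started exactly at the saddle: gradients are zero, output stays 0.0
rng = np.random.default_rng(0)
print(run(*rng.normal(scale=1e-6, size=2)))  # tiny noise: the dynamics escape and approach s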
“…When simulated numerically, the learning dynamics escape the saddle points due to imprecision of floating point arithmetic. However, numerical optimisation still suffers from catastrophic slowing (Lee et al 2022), as escaping the saddle point takes time (figure 6(A)). In contrast, in the case of aligned singular vectors (c = 0), we recover the equation for the temporal dynamics as described in Saxe et al (2014).…”
Section: J Stat Mech (2023) 114004
Citation type: mentioning
Confidence: 99%
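
A further sketch under the same illustrative assumptions: with aligned singular vectors, each mode strength a of a two-layer linear network follows a logistic equation of the form tau * da/dt = 2 * a * (s - a), along the lines of the temporal dynamics in Saxe et al (2014), with a saddle at a = 0. Integrating it numerically shows that the plateau before learning takes off, i.e. the slowing described in the statement above, grows as the initial strength a0 shrinks toward the saddle. The values of s, tau, dt, and a0 are illustrative, not taken from the cited papers.

def time_to_half(a0, s=1.0, tau=1.0, dt=1e-3, t_max=50.0):
    """Euler-integrate tau * da/dt = 2 * a * (s - a); return the time until a >= s / 2."""
    a, t = a0, 0.0
    while a < s / 2 and t < t_max:
        a += dt * 2.0 * a * (s - a) / tau  # one Euler step of the mode dynamics
        t += dt
    return t

for a0 in (1e-2, 1e-4, 1e-6, 1e-8):
    # The escape time grows roughly like (tau / (2 * s)) * log(s / a0),
    # diverging as a0 -> 0, i.e. as learning starts ever closer to the saddle.
    print(f"a0 = {a0:.0e}  ->  time to reach s/2: {time_to_half(a0):.2f}")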