2020
DOI: 10.1088/1742-5468/abc61e

Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup*

Abstract: Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher–student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this descrip…
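
The setup described in the abstract can be illustrated with a small simulation. The sketch below is an assumed illustration, not the authors' code: sizes, learning rate and the ReLU activation are hypothetical choices. It trains an over-parameterised two-layer student by online SGD on Gaussian inputs labelled by a fixed two-layer teacher and tracks the generalisation error, the quantity whose evolution the paper describes with differential equations.

# Minimal teacher–student sketch (assumed setup, not the authors' code):
# a two-layer "student" is trained by online SGD on Gaussian inputs
# labelled by a fixed two-layer "teacher", and the generalisation error
# is monitored along the way.
import numpy as np

rng = np.random.default_rng(0)

N = 500           # input dimension
M, K = 2, 4       # teacher / student hidden units (K > M: over-parameterised)
lr = 0.1          # learning rate (illustrative)
steps = 20000     # online SGD steps, one fresh sample per step

def g(z):
    return np.maximum(z, 0.0)     # ReLU activation (illustrative choice)

# Teacher: fixed weights. Student: trainable, small initialisation.
W_t = rng.standard_normal((M, N)) / np.sqrt(N)
v_t = rng.standard_normal(M)
W_s = 0.1 * rng.standard_normal((K, N)) / np.sqrt(N)
v_s = 0.1 * rng.standard_normal(K)

def gen_error(n_test=2000):
    # Mean squared difference between student and teacher outputs on fresh data.
    X = rng.standard_normal((n_test, N))
    y_t = g(X @ W_t.T) @ v_t
    y_s = g(X @ W_s.T) @ v_s
    return 0.5 * np.mean((y_s - y_t) ** 2)

for t in range(steps):
    x = rng.standard_normal(N)            # online learning: a fresh sample each step
    pre = W_s @ x
    h = g(pre)
    err = v_s @ h - v_t @ g(W_t @ x)      # student output minus teacher label
    grad_v = err * h                      # gradient of 0.5 * err**2 w.r.t. v_s
    grad_W = np.outer(err * v_s * (pre > 0), x)   # and w.r.t. W_s (ReLU derivative)
    v_s -= lr * grad_v
    W_s -= (lr / N) * grad_W              # 1/N scaling, as is common in this literature
    if t % 5000 == 0:
        print(f"step {t:6d}  generalisation error = {gen_error():.4f}")

In the paper's analysis the evolution of this error is tracked not through individual weights but through a closed set of ordinary differential equations for order parameters (overlaps between student and teacher weight vectors), which the sampling process above only mimics in the limit of large input dimension N.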


Cited by 59 publications (99 citation statements).
References 35 publications (64 reference statements).
“…Then, the approach to perfect learning is strikingly different as compared to the realizable case with an exponentially fast convergence to zero generalization error: convergence is of power-law type in the over-realizable case due to the presence of soft modes, which we demonstrate both numerically and analytically. In addition, for the case of a noisy teacher we present numerical evidence that the generalization error is smaller in the over-realizable case than in the realizable one (similar to the case of the fully trained two-layer network studied in [22]).…”
mentioning
confidence: 57%
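
The quoted passage contrasts exponential decay of the generalisation error in the realizable case with power-law decay in the over-realizable case. One simple, assumption-laden way to tell the two regimes apart from a recorded error trace (for example, one logged by a simulation like the sketch after the abstract) is to fit the tail of the curve on semilog and log-log axes and compare residuals. The helper below is a hypothetical illustration, not code from either paper.

# Hypothetical helper: decide whether a recorded generalisation-error trace looks
# closer to exponential decay (straight line on semilog axes) or to power-law
# decay (straight line on log-log axes) by comparing least-squares residuals on
# the tail of the curve. `errors` is assumed to be a 1D array of positive error
# values recorded at regular intervals.
import numpy as np

def classify_decay(errors, tail_fraction=0.5):
    e = np.asarray(errors, dtype=float)
    t = np.arange(1, len(e) + 1)
    start = int(len(e) * (1.0 - tail_fraction))    # keep only the late-time tail
    t, e = t[start:], e[start:]

    # Exponential hypothesis: log e(t) is linear in t.
    res_exp = np.polyfit(t, np.log(e), 1, full=True)[1]
    # Power-law hypothesis: log e(t) is linear in log t.
    res_pow = np.polyfit(np.log(t), np.log(e), 1, full=True)[1]

    return "exponential" if res_exp[0] < res_pow[0] else "power-law"

Comparing residuals this way is only indicative; in practice one would also inspect the fitted slopes and the curves themselves.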
“…The research field of deep learning has recently attracted considerable attention due to significant progress in performing tasks relevant to many different applications [1-5]. Neural networks are learning machines inspired by the structure of the human brain [6], which have been studied with methods from statistical mechanics [5, 7-11], starting with simpler versions such as the perceptron [12-14] and also including two-layer networks [15-26]. Often, learning is studied in the framework of the student-teacher scenario, in which a student has to learn the connection vectors according to which a teacher classifies input patterns [10].…”
mentioning
confidence: 99%
“…However, collaborations between theoretical neuroscientists, physicists and computer scientists have paved the way for a new approach that uses idealised neural network models to understand the mathematical principles by which they learn 64, and deploys the results to predict or explain phenomena in psychology or neuroscience 65. For this endeavour to be tractable, deep network models must be simplified, for example by employing linear activation functions ("deep linear" networks) 66, structured environments 67,68, or by studying limit cases, such as the limits of infinite width or depth, the high-dimensional limit 69,70, or the shallow limit 64 (Fig. 2).…”
Section: Theory and Understanding of Deep Learning Models
mentioning
confidence: 99%
“…2). Paradoxically, these infinite-size networks are often more interpretable than those with fewer units, because their learning trajectory is more stable and not prone to be waylaid by bad local minima 68,69,71. Some network idealisations have offered exact solutions for the learning trajectories that every single synapse will follow 72-74, and answered perplexing questions about network behaviour: for example, why learning often involves transitions between quasi-discrete stages, why deep networks are often slower to train, or why an initial epoch of layer-by-layer statistical learning ("unsupervised pretraining") can accelerate future learning with gradient descent.…”
Section: Theory and Understanding of Deep Learning Models
mentioning
confidence: 99%