Mean Field Analysis of Neural Networks: A Law of Large Numbers

Sirignano, Justin; Spiliopoulos, Konstantinos

doi:10.1137/18m1192184

Cited by 95 publications

(117 citation statements)

References 63 publications

Supporting

Mentioning

114

Contrasting

Order By: Relevance

“…In the N → ∞ limit, the NTK becomes deterministic and constant in time. This result explains why the generalization performance converges as N → ∞, a result previously obtained for single hidden layer neural networks using a different approach [32,33,34,35].…”

Section: Introductionsupporting

confidence: 74%

Scaling description of generalization with number of parameters in deep learning

et al. 2020

View full text Add to dashboard Cite

Supervised deep learning involves the training of neural networks with a large number N of parameters. For large enough N , in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as N grows past a certain threshold N * . Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with N . We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations f N −f N ∼ N −1/4 of the neural net output function f N around its expectationf N . These affect the generalization error N for classification: under natural assumptions, it decays to a plateau value ∞ in a power-law fashion ∼ N −1/2 . This description breaks down at a so-called jamming transition N = N * . At this threshold, we argue that f N diverges. This result leads to a plausible explanation for the cusp in test error known to occur at N * . Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond N * , and averaging their outputs. arXiv:1901.01608v5 [cond-mat.dis-nn]

show abstract

Section: Introductionsupporting

confidence: 74%

Scaling description of generalization with number of parameters in deep learning

et al. 2020

View full text Add to dashboard Cite

show abstract

“…We mathematically analyze neural networks with a single hidden layer in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. A law of large numbers was previously proven in [30], see also [27,29] for related results. This paper rigorously proves a central limit theorem (CLT) for the empirical distribution of the neural network parameters.…”

Section: Introductionmentioning

confidence: 64%

“…[30] proves the mean-field limit µ N p →μ as N → ∞. The convergence theorems of [30] are summarized below.…”

Section: Law Of Large Numbersmentioning

confidence: 99%

“…Weak convergence and mean field analysis has been used in many other disciplines, including interacting particle systems in physics, neural networks in biology and financial modeling, see for example [17], [18], [7], [8], [9], [4], [20], [12], [23], [28], [34], [31] and the references therein for a certainly not-complete list. Recently, [30], [35], [27], and [29] study mean-field limits of machine learning algorithms, including neural networks. In this paper, we rigorously establish a central limit theorem for neural networks trained with stochastic gradient descent.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Mean field analysis of neural networks: A central limit theorem

Sirignano

Spiliopoulos

2020

Stochastic Processes and their Applications

Self Cite

102

113

View full text Add to dashboard Cite

We rigorously prove a central limit theorem for neural network models with a single hidden layer. The central limit theorem is proven in the asymptotic regime of simultaneously (A) large numbers of hidden units and (B) large numbers of stochastic gradient descent training iterations. Our result describes the neural network's fluctuations around its mean-field limit. The fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation. The proof relies upon weak convergence methods from stochastic analysis. In particular, we prove relative compactness for the sequence of processes and uniqueness of the limiting process in a suitable Sobolev space.

show abstract

“…Other recent applications that have motivated this work are global optimization [40], active media [3] and machine learning. Indeed, it has been shown recently [41,43] that "stochastic gradient descent", the optimization algorithm used in the training of neural networks, can be represented as the evolution of a particle system with interactions governed by a potential related to the objective function that is used to train the network. Several of the issues that we study here, such as phase transitions and the effect of nonconvexity, are of great interest in the context of the training of neural networks.…”

Section: Introductionmentioning

confidence: 99%

Dynamics of the Desai-Zwanzig model in multiwell and random energy landscapes

et al. 2019

View full text Add to dashboard Cite

We analyze a variant of the Desai-Zwanzig model [J. Stat. Phys. 19 1-24 (1978)]. In particular, we study stationary states of the mean field limit for a system of weakly interacting diffusions moving in a multi-well potential energy landscape, coupled via a Curie-Weiss type (quadratic) interaction potential. The location and depth of the local minima of the potential are either deterministic or random. We characterize the structure and nature of bifurcations and phase transitions for this system, by means of extensive numerical simulations and of analytical calculations for an explicitly solvable model. Our numerical experiments are based on Monte Carlo simulations, the numerical solution of the time-dependent nonlinear Fokker-Planck (McKean-Vlasov equation), the minimization of the free energy functional and a continuation algorithm for the stationary solutions.

show abstract

Mean Field Analysis of Neural Networks: A Law of Large Numbers

Cited by 95 publications

References 63 publications

Scaling description of generalization with number of parameters in deep learning

Scaling description of generalization with number of parameters in deep learning

Mean field analysis of neural networks: A central limit theorem

Dynamics of the Desai-Zwanzig model in multiwell and random energy landscapes

Contact Info

Product

Resources

About