2019
DOI: 10.1103/PhysRevE.100.012115

Jamming transition as a paradigm to understand the loss landscape of deep neural networks

Abstract: Deep learning has been immensely successful at a variety of tasks, ranging from classification to artificial intelligence. Learning corresponds to fitting training data, which is implemented by descending a very high-dimensional loss function. Understanding under which conditions neural networks do not get stuck in poor minima of the loss, and how the landscape of that loss evolves as depth is increased, remains a challenge. Here we predict, and test empirically, an analogy between this landscape and the energy…


Cited by 94 publications (120 citation statements)
References 52 publications
“…(Example 5 in Section 4 is just a caricature in that direction.) In the empirical double descent papers [37,38], the second descent of the test MSE in the overparameterized regime is strictly better than the first descent in the underparameterized regime. From our concrete understanding of the correctly specified high-dimensional regime, it is clear that this is only possible in the presence of an approximation-theoretic benefit of adding more features into the model.…”
Section: Future Directions (mentioning)
confidence: 99%
“…Most recently, a double-descent curve in the test error (0-1 loss and MSE) as a function of the number of parameters of several parametric models was observed on several common datasets by physicists [37] and machine learning researchers [38], respectively. In these experiments, the minimum ℓ2-norm interpolating solution is used, and several feature families, including kernel approximators [39], were considered.…”
Section: High-dimensional Linear Regression (mentioning)
confidence: 99%
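The minimum ℓ2-norm interpolating solution referred to in this quote is simply the pseudoinverse fit, and the double-descent behavior it describes can be probed in a few lines. The following is a minimal sketch, not code from the cited works; the random-ReLU feature map, data sizes, noise level, and feature counts are illustrative assumptions, and a peak in test MSE typically (though not always) appears near the interpolation threshold p ≈ n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 10, 1000          # training size, input dim, test size (assumed)
w_true = rng.normal(size=d)

def make_data(m):
    # Noisy linear teacher; the teacher and noise level are illustrative choices.
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

X_train, y_train = make_data(n)
X_test, y_test = make_data(n_test)

def relu_features(X, W):
    # Random ReLU features phi(x) = max(0, W x); W is shared by train and test.
    return np.maximum(X @ W.T, 0.0)

for p in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(p, d)) / np.sqrt(d)
    Phi_train = relu_features(X_train, W)
    Phi_test = relu_features(X_test, W)
    # Minimum l2-norm solution among all least-squares fits (pseudoinverse);
    # for p > n it interpolates the training data exactly.
    beta = np.linalg.pinv(Phi_train) @ y_train
    mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:.3f}")
```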
“…HT-MU applies to the analysis of complicated systems, including many physical systems, traditional NNs [23,24], and even models of the dynamics of actual spiking neurons. Indeed, the dynamics of learning in DNNs seems to resemble a system near a phase transition, such as the phase boundary of a spin glass, a system displaying Self-Organized Criticality (SOC), or a Jamming transition [25,26]. Of course, we cannot say which mechanism, if any, is at play.…”
Section: Introduction (mentioning)
confidence: 99%
“…Understanding the nature of such glass transitions and jamming is a fundamental problem in CSPs, since it is intimately related to the efficiency of algorithms for solving CSPs. In the context of DNNs, it is certainly important to understand the characteristics of the free-energy landscape in order to understand the efficiency of various learning algorithms for DNNs [18][19][20].…”
mentioning
confidence: 99%
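To make the constraint-satisfaction picture in this quote concrete: in the jamming analogy, a network (reduced here to a single spherical perceptron for brevity) must satisfy one margin constraint per training pattern, and the loss is a quadratic hinge that penalizes only the violated constraints. The sketch below is an assumption-laden illustration, not code from the cited papers; the pattern count, dimension, margin, learning rate, and step count are arbitrary, and with more constraints than the perceptron can satisfy the dynamics stall at a positive "jammed" energy.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patterns, dim, margin, lr = 300, 100, 0.0, 0.1   # illustrative sizes (alpha = 3)

# Random patterns xi and labels y define one constraint per pattern:
#   y_mu * <w, xi_mu> >= margin
xi = rng.normal(size=(n_patterns, dim)) / np.sqrt(dim)
y = rng.choice([-1.0, 1.0], size=n_patterns)

def hinge_energy(w):
    gaps = y * (xi @ w) - margin              # signed gap of each constraint
    violated = gaps < 0
    # Quadratic hinge: only violated constraints contribute to the energy.
    return 0.5 * np.sum(gaps[violated] ** 2), int(violated.sum())

# Plain gradient descent on the hinge energy. If all constraints can be
# satisfied (SAT phase) the energy reaches zero; otherwise it stops at a
# positive value with a finite number of unsatisfied constraints.
w = rng.normal(size=dim)
for step in range(5000):
    gaps = y * (xi @ w) - margin
    mask = gaps < 0
    grad = (y[mask] * gaps[mask]) @ xi[mask]   # gradient of the quadratic hinge
    w -= lr * grad

energy, n_unsat = hinge_energy(w)
print(f"final energy = {energy:.4f}, unsatisfied constraints = {n_unsat}/{n_patterns}")
```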