Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Preprint, 2016
DOI: 10.48550/arxiv.1611.01838

Cited by 80 publications (137 citation statements)
References 0 publications
“…Furthermore, our numerical experiments verify that the Jacobian matrix of real datasets (such as CIFAR10) indeed exhibits low-rank structure. This is closely related to observations on the Hessian of deep networks, which is empirically observed to be low-rank [15,44]. An equally important question for understanding the convergence behavior of optimization algorithms for overparameterized models is understanding their generalization capabilities.…”
Section: Prior Art (mentioning)
confidence: 68%
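The low-rank claim in the excerpt above is an empirical statement about the spectrum of the network's Jacobian (and Hessian). As a rough, self-contained illustration of what such a check looks like, not the cited authors' experiment, the matrix sizes, the synthetic construction, and the stable-rank measure below are assumptions for this sketch:

import numpy as np

def stable_rank(J):
    # Stable rank ||J||_F^2 / ||J||_2^2: a value far below min(J.shape)
    # means the spectrum is dominated by a handful of directions.
    s = np.linalg.svd(J, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

# Hypothetical "Jacobian": n examples x p parameters, built as a noisy
# rank-r matrix purely to show what a low-rank spectrum looks like.
rng = np.random.default_rng(0)
n, p, r = 512, 2048, 10
J = rng.normal(size=(n, r)) @ rng.normal(size=(r, p)) + 0.01 * rng.normal(size=(n, p))
print(f"stable rank ~ {stable_rank(J):.1f} of a possible {min(n, p)}")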
“…(v) For potentials that do not admit an obvious decomposition like (1.2), we propose using the local entropy approximation [19,20] to extract the large-scale information needed for either the Modified MALA method or the independence sampler.…”
Section: Results On Performance In the Presence Of Roughness (mentioning)
confidence: 99%
“…One option is to use physical intuition about the problem to identify a potential U(x) that has suitable properties. More systematically, we can use the local entropy approach formulated in [19,20] or, equivalently, the Moreau-Yosida approximation to estimate a smoothed version of V(x).…”
Section: Finding Smoothed Landscapes (mentioning)
confidence: 99%
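For reference, the two constructions named in these excerpts can be written out. The notation below is an assumption layered on the excerpts (f is the loss or potential being smoothed, and gamma and lambda are smoothing scales), but the definitions themselves are standard: the negative local entropy of Entropy-SGD is a log-partition ("soft-min") smoothing, and the Moreau-Yosida regularization is its hard-min counterpart.

\[
-F(x;\gamma) \;=\; -\log \int \exp\!\Big( -f(x') - \tfrac{\gamma}{2}\,\lVert x - x' \rVert^2 \Big)\, dx',
\qquad
f_\lambda(x) \;=\; \inf_{x'} \Big( f(x') + \tfrac{1}{2\lambda}\,\lVert x - x' \rVert^2 \Big).
\]

In the zero-temperature (sharp Gibbs) limit, the log-integral becomes an exact minimum over x', and the first expression reduces to the second with lambda = 1/gamma; this is the sense in which the citing work treats the two as equivalent smoothings of the landscape.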
“…In recent years, there have been many efforts to mathematically explain the generalization capability of DNNs using a variety of tools. They range from attributing it to the way the SGD method automatically finds flat local minima (which are stable and thus generalize well) [30,31,32,33] to efforts relating the success of DNNs to the special class of hierarchical functions they generate [34].…”
Section: Information Bottleneck and Stochastic Gradient Descent (mentioning)
confidence: 99%
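The "flat local minima" explanation cited here is what Entropy-SGD operationalizes: instead of following the gradient of the loss, each outer step follows an estimate of the local-entropy gradient, obtained from a short inner Langevin (SGLD) loop around the current iterate. The following is a minimal sketch of that structure, not the paper's implementation; the toy loss, step sizes, loop lengths, noise level, and coupling strength gamma are illustrative assumptions:

import numpy as np

def entropy_sgd_step(x, grad_f, gamma=1e-2, eta=0.1, inner_eta=0.01,
                     inner_steps=20, noise=1e-3, rng=np.random.default_rng(0)):
    # One outer step biased toward wide valleys. The inner SGLD loop draws
    # approximate samples x' from a Gibbs measure coupled to x; their running
    # mean mu estimates the local-entropy gradient direction, and the outer
    # update moves x toward mu, i.e. along -gamma * (x - mu).
    xp = x.copy()
    mu = x.copy()
    for t in range(1, inner_steps + 1):
        g = grad_f(xp) + gamma * (xp - x)          # gradient of the coupled objective
        xp = xp - inner_eta * g + np.sqrt(inner_eta) * noise * rng.normal(size=x.shape)
        mu = (1 - 1.0 / t) * mu + (1.0 / t) * xp   # running average of the samples
    return x - eta * gamma * (x - mu)              # outer update toward the local mean

# Toy usage on a simple nonconvex loss (purely illustrative).
def grad_f(x):
    return np.array([4 * x[0] ** 3 - 2 * x[0], 0.2 * x[1]])

x = np.array([1.5, 1.0])
for _ in range(200):
    x = entropy_sgd_step(x, grad_f)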