2021
DOI: 10.1137/19m1308943
Global Minima of Overparameterized Neural Networks

Cited by 19 publications (23 citation statements)
References 1 publication
“…The Embedding Principle provides a structural mechanism underlying degeneracy as a very common property of critical points (Choromanska et al., 2015; Sagun et al., 2016). It thus complements the understanding that global minima of NNs typically form a high-dimensional manifold due to over-parameterization (Cooper, 2021).…”
Section: Related Work (mentioning)
confidence: 70%
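As a hedged, illustrative sketch of the over-parameterization count referenced above (our own toy setup, not code from Cooper (2021) or the citing papers): at an exact zero-loss solution the residual Jacobian J has only n rows, so the loss Hessian 2 JᵀJ has rank at most n and therefore at least p - n zero eigenvalues when the network has p parameters, matching the picture of global minima forming a roughly (p - n)-dimensional set. The tiny tanh network and all names below are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5, 2, 20                     # n samples, input dim d, m hidden units
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def forward(theta):
    # One hidden tanh layer; theta packs W (m x d), b (m), a (m).
    W = theta[:m * d].reshape(m, d)
    b = theta[m * d:m * d + m]
    a = theta[m * d + m:]
    return np.tanh(X @ W.T + b) @ a

# Build an exact global minimum: fix a random hidden layer and solve the
# underdetermined linear system for the output weights so the residuals vanish.
W0, b0 = rng.normal(size=(m, d)), rng.normal(size=m)
H = np.tanh(X @ W0.T + b0)                     # n x m features, full row rank a.s.
a0 = np.linalg.lstsq(H, y, rcond=None)[0]      # interpolating output weights
theta0 = np.concatenate([W0.ravel(), b0, a0])
print("training loss:", np.sum((forward(theta0) - y) ** 2))   # ~ 0

# Residual Jacobian J (n x p) by finite differences. At zero residual the loss
# Hessian equals 2 J^T J, whose rank is at most n, so its nullity is >= p - n.
p, eps = theta0.size, 1e-6
r0 = forward(theta0) - y
J = np.zeros((n, p))
for j in range(p):
    t = theta0.copy()
    t[j] += eps
    J[:, j] = (forward(t) - y - r0) / eps
eigs = np.linalg.eigvalsh(J.T @ J)
flat = int(np.sum(eigs < 1e-8 * eigs.max()))
print(f"p = {p}, n = {n}, near-zero curvature directions: {flat} (expect >= {p - n})")
```

With p = 80 and n = 5 the script should report roughly 75 near-zero directions, i.e., almost all directions are flat at this global minimum.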
“…We show that the degeneracy of a critical point increases substantially when it is embedded into a wider network, because a critical point can be mapped to a high-dimensional critical submanifold through a class of critical embeddings. This degeneracy arises from the redundancy of neurons in the wide NN when representing certain simple critical functions of narrower NNs, which differs from the over-parameterization-induced degeneracy studied in Cooper (2021). We also study properties of the Hessian at critical points through critical embedding, e.g., the number of its negative eigenvalues, which determines whether the corresponding critical point is a strict saddle that enables easy optimization.…”
Section: Introduction (mentioning)
confidence: 99%
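A minimal sketch of the neuron-splitting mechanism alluded to above (a simplification we assume for illustration, not the cited paper's general embedding): duplicating one hidden neuron and splitting its output weight between the two copies leaves the network function, and hence the loss and its gradient, unchanged for every split ratio, so one parameter vector of the narrow network maps to a whole curve of equivalent parameter vectors in the wider one.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 3, 4, 10
X = rng.normal(size=(n, d))

# Parameters of a narrow one-hidden-layer network: f(x) = a . tanh(W x + b).
W, b, a = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=m)

def narrow(X):
    return np.tanh(X @ W.T + b) @ a

def wide(X, alpha, j=0):
    # Embed into a network with m + 1 hidden units by splitting neuron j:
    # copy its input weights and share its output weight as (alpha, 1 - alpha).
    W2 = np.vstack([W, W[j]])
    b2 = np.append(b, b[j])
    a2 = np.append(a, 0.0)
    a2[j], a2[m] = alpha * a[j], (1.0 - alpha) * a[j]
    return np.tanh(X @ W2.T + b2) @ a2

# The wide network realizes exactly the same function for every alpha.
for alpha in (0.0, 0.3, 1.5):
    print(alpha, np.max(np.abs(wide(X, alpha) - narrow(X))))   # ~ 0 for all alpha
```

Because the output is identical along the whole alpha-line, a critical point of the narrow network corresponds to a degenerate one-parameter family of critical points of the wider network; the cited work studies such critical embeddings in far greater generality.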
“…Due to its importance for understanding the behavior, performance, and limitations of machine learning algorithms, the study of the loss landscape of training problems for artificial neural networks has received considerable attention in recent years. Compare, for instance, the early works [3,6,34] on this topic, the contributions on stationary points and plateau phenomena in [1,9,15,17,50], the results on suboptimal local minima and valleys in [11,19,24,37,41,48,52], and the overview articles [5,45,46]. For fully connected feedforward neural networks involving activation functions with an affine segment, much of the research on landscape properties was initially motivated by the observation of Kawaguchi [30] that networks with linear activation functions give rise to learning problems that do not possess spurious (i.e., not globally optimal) local minima and thus behave, at least as far as the notion of local optimality is concerned, like convex problems.…”
mentioning
confidence: 83%
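To make the cited observation about linear activations concrete, here is a hedged numerical sketch (our own toy setup, not taken from any of the referenced papers): a two-factor linear network trained by plain gradient descent from several random initializations should reach essentially the same loss as the unconstrained least-squares fit, consistent with the absence of spurious local minima.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out, n = 4, 3, 2, 50
X = rng.normal(size=(n, d_in))
Y = X @ rng.normal(size=(d_in, d_out)) + 0.1 * rng.normal(size=(n, d_out))

def loss(W1, W2):
    return 0.5 * np.mean(np.sum((X @ W1 @ W2 - Y) ** 2, axis=1))

# Global optimum: the hidden width d_h >= min(d_in, d_out), so the product
# W1 W2 can realize the unconstrained least-squares map.
W_star = np.linalg.lstsq(X, Y, rcond=None)[0]
best = 0.5 * np.mean(np.sum((X @ W_star - Y) ** 2, axis=1))

# Plain gradient descent on both factors from several random initializations.
for trial in range(5):
    W1 = 0.5 * rng.normal(size=(d_in, d_h))
    W2 = 0.5 * rng.normal(size=(d_h, d_out))
    lr = 0.05
    for _ in range(10000):
        R = (X @ W1 @ W2 - Y) / n          # scaled residuals
        G1 = X.T @ R @ W2.T                # dL/dW1
        G2 = W1.T @ (X.T @ R)              # dL/dW2
        W1 -= lr * G1
        W2 -= lr * G2
    print(f"run {trial}: loss {loss(W1, W2):.5f}   (least-squares optimum {best:.5f})")
```

Every run should end up close to the same value, mirroring the no-spurious-local-minima property of linear networks.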
“…Before we demonstrate that the effects discussed in Theorems 3.1 and 3.2 and Corollary 5.3 can indeed affect the behavior of gradient-based optimization algorithms in practice, we would like to point out that the "space-filling" cases $\mathrm{cl}_Z(\iota(\Psi(D))) = Z$ and $\mathrm{cl}_{L^p_\mu(K)}(\Psi(D)) = L^p_\mu(K)$ in Corollaries 5.1 to 5.3 are not as pathological as one might think at first glance. In fact, in many applications, neural networks are trained in an "overparameterized" regime in which the number of degrees of freedom in $\psi$ exceeds the number of training samples by far and in which $\psi$ is able to fit arbitrary training data with zero error, see [2,8,15,32,39]. In the situation of Lemma 3.3, this means that a measure $\mu$ of the form $\mu = \frac{1}{n}\sum_{k=1}^{n} \delta_{x_k}$ supported on a finite set…”
Section: Further Consequences of the Nonexistence of Supporting Half-... (mentioning)
confidence: 99%
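A small sketch of the overparameterized regime described above (illustrative assumptions only; the random-feature construction below is ours, not the cited paper's): with many more hidden units than training points, the hidden feature matrix has full row rank almost surely, so the output weights can fit arbitrary label vectors with zero error on the empirical measure $\mu = \frac{1}{n}\sum_{k=1}^{n} \delta_{x_k}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 20, 5, 200                      # far more hidden units than samples
X = rng.normal(size=(n, d))

# Random hidden layer; with m >> n the n x m feature matrix H has full row rank
# almost surely, so for *any* labels y there are output weights a with H a = y.
W, b = rng.normal(size=(m, d)), rng.normal(size=m)
H = np.tanh(X @ W.T + b)

for trial in range(3):
    y = rng.normal(size=n)                # arbitrary (here: random) labels
    a = np.linalg.lstsq(H, y, rcond=None)[0]
    print(f"label set {trial}: max |H a - y| = {np.max(np.abs(H @ a - y)):.2e}")
```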
“…The depth of the circuits is used to determine whether the first updated parameter avoids the barren plateau problem during training. Recently, it has been shown that barren plateaus are absent in QNNs and QCNNs with a tree tensor network (TTN) architecture [25,26].…”
Section: Introduction (mentioning)
confidence: 99%