“…Such a phenomenon is referred to as Neural Collapse (NC) [50], which has been shown empirically to persist across a broad range of canonical classification problems, under different loss functions (e.g., cross-entropy (CE) [16,50,87], mean-squared error (MSE) [71,86], and supervised contrastive (SC) losses [21]), different neural network architectures (e.g., VGG [63], ResNet [24], and DenseNet [28]), and a variety of standard datasets (such as MNIST [39], CIFAR [35], and ImageNet [12]). Recently, in independent lines of research, many works have been devoted to learning maximally compact and separated features; see, e.g., [13,42,43,51,52,60,72,73,76].…”
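As a rough illustration (not taken from the cited works), the notion of "maximally compact and separated" features can be quantified by comparing within-class to between-class variability of the last-layer features; Neural Collapse corresponds to this ratio shrinking toward zero during terminal training. The following is a minimal NumPy sketch under those assumptions, with toy random features standing in for penultimate-layer activations; the function name and metric choice (a simple trace ratio rather than any specific measure used in the references) are illustrative only.

```python
# Hypothetical sketch: within-/between-class variability ratio of learned features.
# A value near zero means per-class features have collapsed onto their class means
# (compact) while the class means remain spread apart (separated).
import numpy as np

def variability_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """Return trace(Sigma_W) / trace(Sigma_B) for features grouped by label."""
    d = features.shape[1]
    global_mean = features.mean(axis=0)
    sigma_w = np.zeros((d, d))   # within-class scatter
    sigma_b = np.zeros((d, d))   # between-class scatter
    n = len(features)
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        centered = class_feats - class_mean
        sigma_w += centered.T @ centered / n
        diff = (class_mean - global_mean)[:, None]
        sigma_b += (diff @ diff.T) * len(class_feats) / n
    return float(np.trace(sigma_w) / np.trace(sigma_b))

# Toy usage: 1000 random 64-dim "features" for 10 classes, with class-dependent shifts.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
features = rng.normal(size=(1000, 64)) + 5.0 * np.eye(10, 64)[labels]
print(f"trace(Sigma_W)/trace(Sigma_B) = {variability_ratio(features, labels):.3f}")
```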