2022
DOI: 10.48550/arxiv.2204.11326
Preprint
The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin

Abstract: Local quadratic approximation has been extensively used to study the optimization of neural network loss functions around the minimum. However, it usually holds only in a very small neighborhood of the minimum and cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess…
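For reference, the "local quadratic approximation" the abstract refers to is the standard second-order Taylor expansion of the loss around a minimum; the display below uses conventional notation (theta* for the minimum, H for the Hessian), which is an assumption rather than the paper's own notation:

L(\theta) \approx L(\theta^*) + \tfrac{1}{2}\,(\theta - \theta^*)^{\top} H \,(\theta - \theta^*), \qquad H = \nabla^{2} L(\theta^*),

since the first-order term vanishes at the minimum. The abstract's point is that this approximation is reliable only very close to \theta^*.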

Cited by 5 publications (5 citation statements)
References 9 publications (15 reference statements)
“…The precision, recall, and F1-score metrics were reported for macro, micro, and weighted averaging. Additionally, the accuracy and loss function [16] were plotted against the number of epochs, revealing that the testing and training accuracy improved as the number of epochs increased, and the loss function decreased gradually with increasing epochs.…”
Section: Discussion
confidence: 99%
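The evaluation described in that excerpt can be reproduced with standard tooling; the sketch below is a minimal illustration using scikit-learn and matplotlib, with hypothetical y_true, y_pred, and history values standing in for the cited study's actual data and training framework.

```python
# Minimal sketch (not the cited study's code): macro/micro/weighted
# precision, recall, and F1, plus accuracy/loss curves over epochs.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels and predictions standing in for real model output.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Hypothetical per-epoch history, e.g. as returned by a Keras-style fit().
history = {
    "accuracy":     [0.61, 0.72, 0.80, 0.85, 0.88],
    "val_accuracy": [0.58, 0.69, 0.77, 0.81, 0.83],
    "loss":         [1.05, 0.78, 0.55, 0.42, 0.35],
    "val_loss":     [1.10, 0.84, 0.63, 0.51, 0.46],
}
epochs = range(1, len(history["loss"]) + 1)
plt.plot(epochs, history["accuracy"], label="train accuracy")
plt.plot(epochs, history["val_accuracy"], label="test accuracy")
plt.plot(epochs, history["loss"], label="train loss")
plt.plot(epochs, history["val_loss"], label="test loss")
plt.xlabel("epoch")
plt.legend()
plt.show()
```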
“…The SGD optimization algorithm is used to optimize the biases and weights of the proposed Bi-LSTM-based CSI estimator. In this study, the suggested estimator is trained using one of three loss functions: Hinge [55], MSLE [56], and KLD [57]. The loss function quantifies the difference between the predicted and observed outcomes.…”
Section: Offline Training of the Proposed Bi-LSTM Scheme
confidence: 99%
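As a point of reference only, the three loss functions named in that excerpt can be written down directly; the NumPy sketch below uses their textbook definitions and is not the cited estimator's implementation.

```python
# Minimal NumPy sketch of the three losses named above (textbook forms).
import numpy as np

def hinge_loss(y_true, y_pred):
    # y_true in {-1, +1}; penalizes predictions on the wrong side of the margin.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def msle_loss(y_true, y_pred):
    # Mean squared logarithmic error; assumes non-negative targets and predictions.
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def kld_loss(p_true, q_pred, eps=1e-12):
    # Kullback-Leibler divergence between two discrete probability distributions.
    p = np.clip(p_true, eps, 1.0)
    q = np.clip(q_pred, eps, 1.0)
    return np.sum(p * np.log(p / q))

# Hypothetical example values.
print(hinge_loss(np.array([1, -1, 1]), np.array([0.8, 0.3, -0.2])))
print(msle_loss(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(kld_loss(np.array([0.7, 0.3]), np.array([0.6, 0.4])))
```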
“…Arora et al (2022) prove that the edge of stability result occurs under certain conditions on either the learning rate or the loss function. Ma et al (2022) empirically observe the multi-scale structure of the loss landscape in neural networks and use it to theoretically explain the edge of stability behavior of gradient descent. Chen and Bruna (2022) use low-dimensional theoretical insights around a local minimum to understand the edge of stability behavior.…”
Section: Edge of Stability and the Importance of the Hessian
confidence: 99%
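To make the 2/η threshold underlying the edge-of-stability discussion concrete, the sketch below runs plain gradient descent on a one-dimensional quadratic with curvature just below and just above 2/lr. It illustrates only the classical stability condition that this literature builds on, not the cited papers' neural-network experiments.

```python
# Minimal sketch of the classical GD stability threshold: on a quadratic
# f(x) = 0.5 * sharpness * x**2, gradient descent with step size lr
# converges iff sharpness < 2 / lr, and oscillates/diverges above it.
import numpy as np

def run_gd(sharpness, lr=0.1, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - lr * sharpness * x  # gradient of 0.5 * sharpness * x**2 is sharpness * x
    return x

lr = 0.1
threshold = 2.0 / lr  # = 20.0
for sharpness in (threshold - 1.0, threshold + 1.0):
    final_x = run_gd(sharpness, lr=lr)
    print(f"sharpness={sharpness:5.1f} (2/lr={threshold:.1f}): |x_final|={abs(final_x):.3e}")
```

Running this shows the iterate shrinking toward zero when the curvature sits just below 2/lr and blowing up when it sits just above, which is the Hessian-based stability condition the excerpt refers to.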