2019
DOI: 10.48550/arxiv.1901.06053
Preprint

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate…

Cited by 16 publications (38 citation statements)
References 39 publications
“…For this case, we consider the unsupervised learning method described in [30]. We perform a tail index analysis of STDP in deep SNN, similar to the method adopted in [31].…”
Section: B. Generalizability of STDP (mentioning)
confidence: 99%
“…Similar to the analysis done by Simsekli et al., we evaluate the tail-index of the standard CNN trained with SGD and the spiking convolutional neural networks trained with STDP [31]. The experiment is repeated with varying numbers of layers and evaluated on the MNIST dataset; the tail indices are reported in Table I.…”
Section: B. Generalizability of STDP (mentioning)
confidence: 99%
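The tail-index analysis referenced in these excerpts amounts to estimating the stability parameter α of the stochastic gradient noise. Below is a minimal sketch of a block-sum log-moment estimator of the kind used for α-stable data; the function name estimate_tail_index and the synthetic sanity check are illustrative, and the exact estimator and block sizes used in [31] may differ.

import numpy as np
from scipy.stats import levy_stable

def estimate_tail_index(samples, k1):
    # Block-sum log-moment estimator for the tail index of (assumed) symmetric
    # alpha-stable samples: split the data into k2 blocks of k1 consecutive values,
    # compare the mean of log|block sum| with the mean of log|sample|.
    # In practice `samples` would be the coordinates of the centered stochastic
    # gradients collected over iterations, as in the tail-index analysis above.
    x = np.asarray(samples, dtype=float).ravel()
    k2 = x.size // k1
    x = x[: k1 * k2]                         # drop the remainder so all blocks are full
    blocks = x.reshape(k2, k1).sum(axis=1)   # Y_j = sum of k1 consecutive samples
    inv_alpha = (np.log(np.abs(blocks)).mean()
                 - np.log(np.abs(x)).mean()) / np.log(k1)
    return 1.0 / inv_alpha

# Sanity check on synthetic alpha-stable noise with alpha = 1.5:
x = levy_stable.rvs(alpha=1.5, beta=0.0, size=100_000, random_state=0)
print(estimate_tail_index(x, k1=100))   # should be close to 1.5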
“…The first issue relates to the infiniteness of the second moment of SGN, assumed in [26], [27]. According to (11), the noise covariance is the product of the sampling noise and the gradient matrix…”
Section: A. Issues (mentioning)
confidence: 99%
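For context, a decomposition of this kind typically takes the following form for a minibatch of size B drawn uniformly with replacement from N samples (a sketch with generic symbols; the exact equation (11) of the citing paper is not reproduced here):

\[
\mathrm{Cov}\!\left[g_B(w)\right] \;\approx\; \frac{1}{B}\left(\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(w)\,\nabla f_i(w)^{\top} \;-\; \nabla F(w)\,\nabla F(w)^{\top}\right),
\qquad F(w)=\frac{1}{N}\sum_{i=1}^{N} f_i(w),
\]

i.e. a sampling-noise factor 1/B multiplying a matrix built from the per-sample gradients.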
“…Furthermore, it was recently argued that the critical assumption that SGN follows a Gaussian distribution does not necessarily hold [26], [27]. Instead, it was assumed that the second moment of SGN may not be finite, and it was concluded that the noise follows a Lévy random process by invoking the generalized CLT.…”
Section: Introduction (mentioning)
confidence: 99%
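In the notation of the abstract above, this amounts to replacing the Brownian-motion SDE view of SGD with one driven by an α-stable Lévy process (a sketch with generic symbols f and σ, not taken verbatim from the paper):

\[
dW_t = -\nabla f(W_t)\,dt + \sigma\, dB_t
\qquad\longrightarrow\qquad
dW_t = -\nabla f(W_t)\,dt + \sigma\, dL_t^{\alpha},
\]

where B_t is a Brownian motion (Gaussian GN, α = 2) and L_t^α is an α-stable Lévy motion whose increments have infinite variance for α < 2.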
“…Recent studies [36, 35, 42] show that in several popular problems, such as training BERT [38] on the Wikipedia dataset, the noise in the stochastic gradients is heavy-tailed. Moreover, in [42], the authors justify empirically that in such cases SGD works significantly worse than clipped-SGD [31] and Adam.…”
Section: Introduction (mentioning)
confidence: 99%
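Clipped SGD, mentioned in the last excerpt as being more robust under heavy-tailed noise, rescales each stochastic gradient so that its norm never exceeds a threshold before applying the usual update. A minimal sketch follows; the function clipped_sgd_step and the toy Cauchy-noise example are illustrative, and the exact clipping rule or schedule used in the cited works may differ.

import numpy as np

def clipped_sgd_step(w, grad, lr=0.05, clip_level=1.0):
    # Standard norm clipping: shrink the gradient to norm clip_level if it is
    # larger, then take an ordinary SGD step with learning rate lr.
    g_norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_level / (g_norm + 1e-12))
    return w - lr * clipped

# Toy usage: minimize 0.5 * ||w||^2 whose stochastic gradients carry
# heavy-tailed (Cauchy, infinite-variance) noise.
rng = np.random.default_rng(0)
w = np.ones(10)
for _ in range(1000):
    noise = rng.standard_cauchy(10) * 0.1   # heavy-tailed gradient noise
    w = clipped_sgd_step(w, w + noise)
print(np.linalg.norm(w))   # norm stays small and bounded despite the heavy-tailed noise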