2018
DOI: 10.48550/arxiv.1809.00846
Preprint

Towards Understanding Regularization in Batch Normalization

Abstract: Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work studies these phenomena theoretically. We analyze BN using a basic block of neural networks consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impact of BN in three aspects. First, by viewing BN as an implicit regularizer, it can be decomposed into population normalization (PN) plus gamma decay as an explicit regularizer. …
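Read schematically, the decomposition mentioned in the abstract can be sketched as below. The notation (h a pre-activation, mu_B and sigma_B mini-batch statistics, mu and sigma population statistics) and the quadratic form of the gamma-decay term with coefficient zeta are illustrative assumptions; the paper derives the exact regularizer.

```latex
% Illustrative sketch only: training with BN (mini-batch statistics \mu_B, \sigma_B)
% behaves like training with population normalization (PN) plus an explicit penalty
% on \gamma. The quadratic "gamma decay" form and coefficient \zeta are assumptions.
\[
  \mathrm{BN}(h) = \gamma\,\frac{h - \mu_B}{\sigma_B} + \beta,
  \qquad
  \mathrm{PN}(h) = \gamma\,\frac{h - \mu}{\sigma} + \beta,
\]
\[
  \mathbb{E}_{\text{batches}}\!\left[\ell_{\mathrm{BN}}\right]
  \;\approx\;
  \ell_{\mathrm{PN}} \;+\; \zeta\,\|\gamma\|_2^2
  \quad\text{(``gamma decay'').}
\]
```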

Cited by 41 publications (34 citation statements)
References 33 publications

“…Each decoder block is composed of a first masked self-attention layer followed by a multi-head attention layer and a feed-forward block. Furthermore, all the sub-layers use a residual connection followed by dropout and batch normalization layers, to improve the capacity of generalization of the network [13]. In addition, to model the sequential information of the time series, a positional encoded vector, generated with sine and cosine functions, is added to the input sequences.…”
Section: Attention-based Deep Neural Network
Citation type: mentioning, confidence: 99%
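A minimal sketch of the decoder block described in the statement above, assuming a PyTorch implementation; the dimensions, dropout rate, and post-norm ordering are assumptions, not details from the cited work. It combines masked self-attention, multi-head cross-attention, a feed-forward block, residual connections followed by dropout and batch normalization, and sinusoidal positional encoding added to the inputs.

```python
# Hypothetical sketch of the decoder block described in the citation statement.
# Layer sizes, dropout rate, and sub-layer ordering are assumptions.
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(seq_len, d_model):
    """Standard sine/cosine positional encoding, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class DecoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)
        # Batch normalization after each sub-layer, as described in the statement.
        self.norms = nn.ModuleList([nn.BatchNorm1d(d_model) for _ in range(3)])

    def _residual(self, x, sublayer_out, norm):
        # residual connection -> dropout -> batch normalization (BatchNorm1d wants (B, C, L))
        y = x + self.drop(sublayer_out)
        return norm(y.transpose(1, 2)).transpose(1, 2)

    def forward(self, tgt, memory, causal_mask):
        sa, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
        x = self._residual(tgt, sa, self.norms[0])
        ca, _ = self.cross_attn(x, memory, memory)
        x = self._residual(x, ca, self.norms[1])
        return self._residual(x, self.ff(x), self.norms[2])

# Usage on a toy batch of time-series embeddings.
B, L, D = 8, 16, 64
tgt = torch.randn(B, L, D) + sinusoidal_encoding(L, D)             # add positional encoding
memory = torch.randn(B, L, D)
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # mask future positions
out = DecoderBlock()(tgt, memory, mask)
print(out.shape)  # torch.Size([8, 16, 64])
```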
“…(iii) serves an implicit regularization [41] and enhances the models' generalization [28]; (iv) enables large-batch training [18] and smoothens the loss landscapes [55].…”
Section: Technical Approach
Citation type: mentioning, confidence: 99%
“…Despite the practical success deriving from this foundational principle, the reliance of BN on the mini-batch of data can sometimes be problematic. Most notably, when the minibatch is small or when the dataset is large, the regularisation coming from the noise in the mini-batch statistics µ c , σ c can be excessive or unwanted, leading to degraded performance (Ioffe, 2017;Wu & He, 2018;Masters & Luschi, 2018;Ying et al, 2018;Luo et al, 2018;Kolesnikov et al, 2020;Summers & Dinneen, 2020).…”
Section: Batch-independent Normalization
Citation type: mentioning, confidence: 99%
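A small numerical illustration of the point made above: the per-channel mini-batch statistics mu_c, sigma_c are noisy estimates of the population statistics, and the noise (hence the implicit regularization) grows as the batch shrinks. The data, channel count, and batch sizes below are arbitrary assumptions, not taken from the cited works.

```python
# Illustration (assumed setup): per-channel mini-batch statistics mu_c, sigma_c are
# noisy estimates of the population values, and the noise grows at small batch sizes.
import torch

torch.manual_seed(0)
population = torch.randn(100_000, 32)  # 32 channels of zero-mean, unit-variance data

def batch_stats(x):
    # per-channel mean and standard deviation of a mini-batch
    return x.mean(dim=0), x.std(dim=0, unbiased=False)

for batch_size in (2, 8, 64, 1024):
    idx = torch.randint(0, population.shape[0], (batch_size,))
    mu_c, sigma_c = batch_stats(population[idx])
    # deviation of the batch estimates from the known population values 0 and 1
    print(f"batch={batch_size:5d}  "
          f"mean |mu_c| error={mu_c.abs().mean():.3f}  "
          f"mean |sigma_c - 1| error={(sigma_c - 1).abs().mean():.3f}")
```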