2020
DOI: 10.48550/arxiv.2001.06838
Preprint

Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Junjie Yan,
Ruosi Wan,
Xiangyu Zhang
et al.

Abstract: Batch Normalization (BN) is one of the most widely used techniques in deep learning. However, its performance can degrade severely with an insufficient batch size. This weakness limits the use of BN in many computer vision tasks, such as detection or segmentation, where the batch size is usually small due to memory constraints. Therefore, many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely or have to introduce additional nonlinear op…
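
A minimal illustrative sketch (not from the paper, plain NumPy): it shows why the per-batch statistics that BN relies on become unreliable at small batch sizes, since the batch mean is a noisy estimate of the population mean and the noise grows as the batch shrinks.

```python
import numpy as np

# Sample "channel activations" from a fixed population, then measure how much
# the batch mean fluctuates for different batch sizes.
rng = np.random.default_rng(0)
population = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

for batch_size in (256, 32, 4, 2):
    batches = rng.choice(population, size=(10_000, batch_size))  # 10k random batches
    batch_means = batches.mean(axis=1)
    print(f"batch size {batch_size:>3}: std of batch-mean estimate = {batch_means.std():.3f}")

# The smaller the batch, the noisier the statistics used for normalization,
# which is the small-batch regime where BN's accuracy is reported to degrade.
```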

Cited by 6 publications (15 citation statements) | References 17 publications

“…There have been several works investigating the problems of BN [7,35,55,56]. BatchRenorm [23] and MABN [57] are extensions of BatchNorm which aim at reducing the dependence on batches and therefore the train-test discrepancy. Multiple alternative normalization layers have been proposed which only operate on single samples [3,32,33,43,46,52,55], like GroupNorm [55] or LayerNorm [3].…”
Section: Batch Normalization and Beyond
confidence: 99%
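
As a hedged illustration of the contrast drawn in the quote above, the following NumPy sketch (the function name group_norm and the eps value are ours) normalizes each sample within channel groups only, so its output does not depend on the batch size at all, unlike BN.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: array of shape (N, C, H, W); statistics are computed per sample, per group."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # no reduction over the batch axis
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(2, 8, 4, 4).astype(np.float32)
print(group_norm(x, num_groups=4).shape)  # (2, 8, 4, 4); same behavior for any batch size
```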
“…Batch Renormalization [26] and EvalNorm [43] correct batch statistics during training and inference, while compared to Synced BN they generally perform worse. Moving averaged batch normalization [56] and online normalization [11] adopt similar momentum updating of statistics like our momentum BN during the forward pass. However, they need further correct backpropagation for valid SGD optimization, which requires additional computation and memory resources.…”
Section: Related Work
confidence: 99%
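
A rough sketch of the idea described in the quote, with a hypothetical class name and a simplified fully-connected case: the forward pass normalizes with exponentially moving-averaged statistics instead of the current batch's statistics. Note the quote's caveat that a valid implementation also needs a corrected backward pass, which this sketch omits.

```python
import numpy as np

class MovingAverageNorm:
    """Hypothetical layer: forward pass uses moving-average statistics (backward correction omitted)."""
    def __init__(self, num_features, momentum=0.98, eps=1e-5):
        self.mu = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x):  # x: (batch, num_features)
        batch_mu = x.mean(axis=0)
        batch_var = x.var(axis=0)
        # Update running statistics with the current batch...
        self.mu = self.momentum * self.mu + (1 - self.momentum) * batch_mu
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
        # ...but normalize with the more stable moving averages, not the noisy batch statistics.
        return (x - self.mu) / np.sqrt(self.var + self.eps)

layer = MovingAverageNorm(num_features=16)
out = layer.forward(np.random.randn(4, 16))  # remains stable even for a batch of 4
```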
“…We also add a small diagonal matrix I to the covariance matrix to avoid rank-deficiency. Note that an attractive idea is to include the population statistics, i.e., the running average of the covariance, for stabilization [44,40]. We have tried but do not pursue this option due to an inherent limitation that the smoothing factor needs to be small to avoid transforming the current batch with a mismatched geometry, leading to a training explosion.…”
Section: Acceleration Techniques
confidence: 99%
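
A small NumPy sketch of the regularization step the quote mentions (the eps value is an assumption): when the batch has fewer samples than features, the sample covariance is rank-deficient, and adding a small multiple of the identity restores invertibility so that whitening is well-defined.

```python
import numpy as np

n, d, eps = 8, 32, 1e-3                    # 8 samples of 32 features
x = np.random.randn(n, d)
xc = x - x.mean(axis=0)
cov = xc.T @ xc / n
print(np.linalg.matrix_rank(cov))          # at most n - 1 = 7, far below d = 32
cov_reg = cov + eps * np.eye(d)            # small diagonal term restores full rank
print(np.linalg.matrix_rank(cov_reg))      # d = 32
L = np.linalg.cholesky(cov_reg)            # now positive definite, so this succeeds
whitened = xc @ np.linalg.inv(L).T         # whitening transform is well-defined
```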
“…The wall time can be significantly reduced if we increase the batch size and reduce the number of function calls. At batch size 64, our slowest run takes 11 hours (3 hours faster), reaching Box AP 38.7, Mask AP 35.6, superior to a recent method that trains with 2× more steps [40]. Besides optimizing these operations through low-level changes to the software package, we can also use standardization layers in only a subset of the network.…”
confidence: 99%