2020
DOI: 10.48550/arxiv.2001.06838
Preprint

Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Junjie Yan,
Ruosi Wan,
Xiangyu Zhang
et al.

Abstract: Batch Normalization (BN) is one of the most widely used techniques in deep learning. However, its performance can degrade severely with an insufficient batch size. This weakness limits the use of BN in many computer vision tasks, such as detection or segmentation, where the batch size is usually small due to memory constraints. Therefore, many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely or have to introduce additional nonlinear op…
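
A minimal illustrative sketch (not from the paper, plain NumPy): it shows why the per-batch statistics that BN relies on become unreliable at small batch sizes, since the batch mean is a noisy estimate of the population mean and the noise grows as the batch shrinks.

```python
import numpy as np

# Sample "channel activations" from a fixed population, then measure how much
# the batch mean fluctuates for different batch sizes.
rng = np.random.default_rng(0)
population = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

for batch_size in (256, 32, 4, 2):
    batches = rng.choice(population, size=(10_000, batch_size))  # 10k random batches
    batch_means = batches.mean(axis=1)
    print(f"batch size {batch_size:>3}: std of batch-mean estimate = {batch_means.std():.3f}")

# The smaller the batch, the noisier the statistics used for normalization,
# which is the small-batch regime where BN's accuracy is reported to degrade.
```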

Cited by 6 publications (15 citation statements) | References 17 publications

“…There have been several works investigating the problems of BN [7,35,55,56]. BatchRenorm [23] and MABN [57] are extensions of BatchNorm which aim at reducing the dependence on batches and therefore the train-test discrepancy. Multiple alternative normalization layers have been proposed which only operate on single samples [3,32,33,43,46,52,55], like GroupNorm [55] or LayerNorm [3].…”
Section: Batch Normalization and Beyond
confidence: 99%
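
As a hedged illustration of the contrast drawn in the quote above, the following NumPy sketch (the function name group_norm and the eps value are ours) normalizes each sample within channel groups only, so its output does not depend on the batch size at all, unlike BN.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: array of shape (N, C, H, W); statistics are computed per sample, per group."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # no reduction over the batch axis
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.randn(2, 8, 4, 4).astype(np.float32)
print(group_norm(x, num_groups=4).shape)  # (2, 8, 4, 4); same behavior for any batch size
```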
“…Batch Renormalization [26] and EvalNorm [43] correct batch statistics during training and inference, while compared to Synced BN they generally perform worse. Moving averaged batch normalization [56] and online normalization [11] adopt similar momentum updating of statistics like our momentum BN during the forward pass. However, they need further correct backpropagation for valid SGD optimization, which requires additional computation and memory resources.…”
Section: Related Work
confidence: 99%
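
A rough sketch of the idea described in the quote, with a hypothetical class name and a simplified fully-connected case: the forward pass normalizes with exponentially moving-averaged statistics instead of the current batch's statistics. Note the quote's caveat that a valid implementation also needs a corrected backward pass, which this sketch omits.

```python
import numpy as np

class MovingAverageNorm:
    """Hypothetical layer: forward pass uses moving-average statistics (backward correction omitted)."""
    def __init__(self, num_features, momentum=0.98, eps=1e-5):
        self.mu = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x):  # x: (batch, num_features)
        batch_mu = x.mean(axis=0)
        batch_var = x.var(axis=0)
        # Update running statistics with the current batch...
        self.mu = self.momentum * self.mu + (1 - self.momentum) * batch_mu
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
        # ...but normalize with the more stable moving averages, not the noisy batch statistics.
        return (x - self.mu) / np.sqrt(self.var + self.eps)

layer = MovingAverageNorm(num_features=16)
out = layer.forward(np.random.randn(4, 16))  # remains stable even for a batch of 4
```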
“…We also add a small diagonal matrix I to the covariance matrix to avoid rank-deficiency. Note that an attractive idea is to include the population statistics, i.e., the running average of the covariance, for stabilization [44,40]. We have tried but do not pursue this option due to an inherent limitation that the smoothing factor needs to be small to avoid transforming the current batch with a mismatched geometry, leading to a training explosion.…”
Section: Acceleration Techniques
confidence: 99%
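
A small NumPy sketch of the regularization step the quote mentions (the eps value is an assumption): when the batch has fewer samples than features, the sample covariance is rank-deficient, and adding a small multiple of the identity restores invertibility so that whitening is well-defined.

```python
import numpy as np

n, d, eps = 8, 32, 1e-3                    # 8 samples of 32 features
x = np.random.randn(n, d)
xc = x - x.mean(axis=0)
cov = xc.T @ xc / n
print(np.linalg.matrix_rank(cov))          # at most n - 1 = 7, far below d = 32
cov_reg = cov + eps * np.eye(d)            # small diagonal term restores full rank
print(np.linalg.matrix_rank(cov_reg))      # d = 32
L = np.linalg.cholesky(cov_reg)            # now positive definite, so this succeeds
whitened = xc @ np.linalg.inv(L).T         # whitening transform is well-defined
```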
“…The wall time can be significantly reduced if we increase the batch size and reduce the number of function calls. At batch size 64, our slowest run takes 11 hours (3 hours faster), reaching Box AP 38.7, Mask AP 35.6, superior to a recent method that trains with 2× more steps [40]. Besides optimizing these operations through low-level changes to the software package, we can also use standardization layers in only a subset of the network.…”
confidence: 99%