2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019
DOI: 10.1109/cvpr.2019.00053

SSN: Learning Sparse Switchable Normalization via SparsestMax

Cited by 44 publications (18 citation statements). References 13 publications.
“…An intuitive explanation is that sparseness can effectively prevent the model from overfitting. Similar results are also presented in the recently proposed Sparse Switchable Normalization (SSN) [46]. This implies that we could increase the sparsity of the ratios to reduce the computation of multiple normalizers while maintaining good performance.…”
Section: Ablation Study (supporting)
confidence: 77%
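
To make the role of the ratios concrete, below is a minimal sketch of a switchable normalization layer that mixes IN, LN, and BN statistics with learned importance ratios; the cited SSN work replaces the softmax over those ratios with SparsestMax so that the ratios become completely sparse (one-hot), leaving only one normalizer's statistics to compute. The module name, parameterization, and use of PyTorch are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    # Illustrative sketch for NCHW inputs: IN, LN, and BN statistics are
    # mixed by learned importance ratios. Running BN statistics for
    # inference are omitted for brevity.
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        # importance logits over (IN, LN, BN) for means and variances
        self.mean_logits = nn.Parameter(torch.zeros(3))
        self.var_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        # IN statistics: per sample, per channel
        mean_in = x.mean(dim=(2, 3), keepdim=True)
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # LN statistics: per sample, over all channels and positions
        mean_ln = x.mean(dim=(1, 2, 3), keepdim=True)
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # BN statistics: per channel, over the whole batch
        mean_bn = x.mean(dim=(0, 2, 3), keepdim=True)
        var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

        # Plain switchable normalization uses softmax; SSN would use
        # SparsestMax here so the ratios converge to a one-hot vector.
        wm = F.softmax(self.mean_logits, dim=0)
        wv = F.softmax(self.var_logits, dim=0)
        mean = wm[0] * mean_in + wm[1] * mean_ln + wm[2] * mean_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn

        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias

Once the ratios are one-hot, only the selected normalizer's statistics need to be computed, which is the computational saving the excerpt refers to.
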
“…Moreover, investigating the other normalizers, such as instance normalization (IN) (Ulyanov et al, 2016) and layer normalization (LN) (Ba et al, 2016), is also important. Understanding the characteristics of these normalizers should be the first step toward analyzing some recent best practices such as whitening (Luo, 2017b;a), switchable normalization (Shao et al, 2019), and switchable whitening (Pan et al, 2019).…”
Section: Discussion (mentioning)
confidence: 99%
“…This type of method can restore the performance in small-batch cases to some extent. However, instance-level normalization hardly meets industrial or commercial needs so far, because these methods have to compute instance-level statistics in both training and inference, which introduces additional nonlinear operations into the inference procedure and dramatically increases consumption (Shao et al, 2019). In contrast, vanilla BN uses statistics computed over the whole training data, rather than over a batch of samples, once training has finished.…”
Section: Introduction (mentioning)
confidence: 99%
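
To illustrate the inference-time distinction drawn in that excerpt, the following hypothetical PyTorch snippet contrasts BN, which evaluates with fixed population statistics once training has finished, against IN, which must recompute per-instance statistics for every test input. The tensor shapes and module choices are assumptions made only for illustration.

import torch

x = torch.randn(4, 8, 16, 16)  # a batch of inputs in N, C, H, W layout

# BN at inference: normalization uses the precomputed running_mean and
# running_var, so it reduces to a fixed per-channel affine transform.
bn = torch.nn.BatchNorm2d(8).eval()
with torch.no_grad():
    y_bn = bn(x)

# IN at inference: mean and variance are recomputed over (H, W) for every
# sample and channel, which is the extra test-time cost the excerpt notes.
inn = torch.nn.InstanceNorm2d(8).eval()
with torch.no_grad():
    y_in = inn(x)
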