2018
DOI: 10.48550/arxiv.1809.00846
Preprint

Towards Understanding Regularization in Batch Normalization

Abstract: Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work studies these phenomena theoretically. We analyze BN using a basic block of neural networks consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impact of BN in three aspects. First, by viewing BN as an implicit regularizer, it can be decomposed into population normalization (PN) plus gamma decay as an explicit regularizer. …
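Read schematically, the decomposition mentioned in the abstract can be sketched as below. The notation (h a pre-activation, mu_B and sigma_B mini-batch statistics, mu and sigma population statistics) and the quadratic form of the gamma-decay term with coefficient zeta are illustrative assumptions; the paper derives the exact regularizer.

```latex
% Illustrative sketch only: training with BN (mini-batch statistics \mu_B, \sigma_B)
% behaves like training with population normalization (PN) plus an explicit penalty
% on \gamma. The quadratic "gamma decay" form and coefficient \zeta are assumptions.
\[
  \mathrm{BN}(h) = \gamma\,\frac{h - \mu_B}{\sigma_B} + \beta,
  \qquad
  \mathrm{PN}(h) = \gamma\,\frac{h - \mu}{\sigma} + \beta,
\]
\[
  \mathbb{E}_{\text{batches}}\!\left[\ell_{\mathrm{BN}}\right]
  \;\approx\;
  \ell_{\mathrm{PN}} \;+\; \zeta\,\|\gamma\|_2^2
  \quad\text{(``gamma decay'').}
\]
```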

Cited by 41 publications (34 citation statements)
References 33 publications

“…Each decoder block is composed of a first masked self-attention layer followed by a multi-head attention layer and a feed-forward block. Furthermore, all the sub-layers use a residual connection followed by dropout and batch normalization layers, to improve the capacity of generalization of the network [13]. In addition, to model the sequential information of the time series, a positional encoded vector, generated with sine and cosine functions, is added to the input sequences.…”
Section: Attention-based Deep Neural Network
Citation type: mentioning, confidence: 99%
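A minimal sketch of the decoder block described in the statement above, assuming a PyTorch implementation; the dimensions, dropout rate, and post-norm ordering are assumptions, not details from the cited work. It combines masked self-attention, multi-head cross-attention, a feed-forward block, residual connections followed by dropout and batch normalization, and sinusoidal positional encoding added to the inputs.

```python
# Hypothetical sketch of the decoder block described in the citation statement.
# Layer sizes, dropout rate, and sub-layer ordering are assumptions.
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(seq_len, d_model):
    """Standard sine/cosine positional encoding, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class DecoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)
        # Batch normalization after each sub-layer, as described in the statement.
        self.norms = nn.ModuleList([nn.BatchNorm1d(d_model) for _ in range(3)])

    def _residual(self, x, sublayer_out, norm):
        # residual connection -> dropout -> batch normalization (BatchNorm1d wants (B, C, L))
        y = x + self.drop(sublayer_out)
        return norm(y.transpose(1, 2)).transpose(1, 2)

    def forward(self, tgt, memory, causal_mask):
        sa, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
        x = self._residual(tgt, sa, self.norms[0])
        ca, _ = self.cross_attn(x, memory, memory)
        x = self._residual(x, ca, self.norms[1])
        return self._residual(x, self.ff(x), self.norms[2])

# Usage on a toy batch of time-series embeddings.
B, L, D = 8, 16, 64
tgt = torch.randn(B, L, D) + sinusoidal_encoding(L, D)             # add positional encoding
memory = torch.randn(B, L, D)
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # mask future positions
out = DecoderBlock()(tgt, memory, mask)
print(out.shape)  # torch.Size([8, 16, 64])
```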
“…(iii) serves an implicit regularization [41] and enhances the models' generalization [28]; (iv) enables large-batch training [18] and smoothens the loss landscapes [55].…”
Section: Technical Approach
Citation type: mentioning, confidence: 99%
“…Despite the practical success deriving from this foundational principle, the reliance of BN on the mini-batch of data can sometimes be problematic. Most notably, when the minibatch is small or when the dataset is large, the regularisation coming from the noise in the mini-batch statistics µ c , σ c can be excessive or unwanted, leading to degraded performance (Ioffe, 2017;Wu & He, 2018;Masters & Luschi, 2018;Ying et al, 2018;Luo et al, 2018;Kolesnikov et al, 2020;Summers & Dinneen, 2020).…”
Section: Batch-independent Normalization
Citation type: mentioning, confidence: 99%
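A small numerical illustration of the point made above: the per-channel mini-batch statistics mu_c, sigma_c are noisy estimates of the population statistics, and the noise (hence the implicit regularization) grows as the batch shrinks. The data, channel count, and batch sizes below are arbitrary assumptions, not taken from the cited works.

```python
# Illustration (assumed setup): per-channel mini-batch statistics mu_c, sigma_c are
# noisy estimates of the population values, and the noise grows at small batch sizes.
import torch

torch.manual_seed(0)
population = torch.randn(100_000, 32)  # 32 channels of zero-mean, unit-variance data

def batch_stats(x):
    # per-channel mean and standard deviation of a mini-batch
    return x.mean(dim=0), x.std(dim=0, unbiased=False)

for batch_size in (2, 8, 64, 1024):
    idx = torch.randint(0, population.shape[0], (batch_size,))
    mu_c, sigma_c = batch_stats(population[idx])
    # deviation of the batch estimates from the known population values 0 and 1
    print(f"batch={batch_size:5d}  "
          f"mean |mu_c| error={mu_c.abs().mean():.3f}  "
          f"mean |sigma_c - 1| error={(sigma_c - 1).abs().mean():.3f}")
```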