In this work we investigate the reasons why Batch Normalization (BN) improves the generalization performance of deep networks. We argue that one major reason, distinguishing it from data-independent normalization methods, is the randomness of batch statistics. This randomness appears in the parameters rather than in the activations and admits an interpretation as practical Bayesian learning. We apply this idea to other (deterministic) normalization techniques that are oblivious to the batch size. We show that their generalization performance can be improved significantly by Bayesian learning of the same form. We obtain test performance comparable to BN and, at the same time, better validation losses suitable for subsequent output uncertainty estimation through the approximate Bayesian posterior.

Recent advances in hardware and deep NNs make it possible to use large-capacity networks, so that the training accuracy becomes close to 100% even for rather difficult tasks. At the same time, however, we would like to ensure small generalization gaps, i.e. high validation accuracy and reliable confidence predictions. For this reason, regularization methods become very important.

As the base model for this study we choose the All-CNN network of [23], a network with eight convolutional layers, and train it on the CIFAR-10 dataset. Recent work [7] compares different regularization techniques on this network and reports a test accuracy of 91.87% with their probabilistic network and 90.88% with dropout, but omits BN. Fig. 1 shows how well BN generalizes for this problem when applied to exactly the same network: it easily achieves a validation accuracy of 93%, significantly better than the dedicated regularization techniques proposed in [7]. It appears that BN is a very powerful regularization method. The goal of this work is to understand and exploit the respective mechanism. Towards this end we identify two components: a non-linear reparametrization of the model that preconditions gradient descent, and stochasticity.

The reparametrization may also be achieved by other normalization techniques such as weight normalization [19] and analytic normalization [22], among others [14, 1]. The advantage of these methods is that they are deterministic and thus do not rely on batch statistics, often incur less computational overhead, are continuously differentiable [22], and can be applied more flexibly, e.g. to settings with a small batch size or to recurrent neural networks. Unfortunately, while these methods improve the training loss, they do not generalize as well as BN, as observed experimentally in [8, 22]. We therefore look at further aspects of BN that could explain its regularization effect.
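To make the claim about parameter-level randomness concrete, recall the standard BN transform (shown here as an illustration, with the small stabilizing constant omitted for brevity; this reformulation is elementary algebra rather than a result of this work). For a pre-activation $x$ with mini-batch mean $\mu_B$ and standard deviation $\sigma_B$, BN computes

$$ y \;=\; \gamma\,\frac{x - \mu_B}{\sigma_B} + \beta \;=\; \frac{\gamma}{\sigma_B}\,x \;+\; \Big(\beta - \frac{\gamma\,\mu_B}{\sigma_B}\Big), $$

so the randomness of the batch statistics $(\mu_B, \sigma_B)$ acts as randomness in the effective scale and shift parameters of the layer, rather than as noise injected independently on each activation.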
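The following NumPy sketch is purely illustrative (the function names, tensor shapes, and batch size are our own assumptions, not code from this work): it contrasts the two behaviours by showing that the BN output of a fixed example changes with the mini-batch it happens to be grouped with, whereas a weight-normalized layer, being a deterministic reparametrization of the weights, gives the same output regardless of the batch.

import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature with the statistics of the current mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def weight_norm_linear(x, v, g):
    # Weight normalization: each output unit's weights are reparametrized
    # as w_j = g_j * v_j / ||v_j||; no batch statistics are involved.
    w = g * v / np.linalg.norm(v, axis=0, keepdims=True)
    return x @ w

# A fixed example and a pool of other samples it may be batched with.
x0 = rng.normal(size=(1, 8))
pool = rng.normal(size=(1000, 8))

# The BN output of x0 depends on which samples share its mini-batch,
# so re-sampling the batch injects noise into the forward pass.
for _ in range(3):
    batch = np.vstack([x0, pool[rng.choice(len(pool), size=31, replace=False)]])
    print("BN:", np.round(batch_norm(batch)[0], 3))

# A weight-normalized layer gives an identical output for x0 every time.
v = rng.normal(size=(8, 4))
print("WN:", np.round(weight_norm_linear(x0, v, g=1.0)[0], 3))

Running the sketch prints three different BN outputs for the same example and a single, batch-independent output for the weight-normalized layer, which is exactly the stochastic component that the deterministic normalizations lack.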