In this work we investigate the reasons why Batch Normalization (BN) improves the generalization performance of deep networks. We argue that one major reason, distinguishing it from data-independent normalization methods, is the randomness of batch statistics. This randomness appears in the parameters rather than in the activations and admits an interpretation as practical Bayesian learning. We apply this idea to other (deterministic) normalization techniques that are oblivious to the batch size. We show that their generalization performance can be improved significantly by Bayesian learning of the same form. We obtain test performance comparable to BN and, at the same time, better validation losses suitable for subsequent output uncertainty estimation through the approximate Bayesian posterior.

Recent advances in hardware and deep NNs make it possible to use large-capacity networks, so that the training accuracy becomes close to 100% even for rather difficult tasks. At the same time, however, we would like to ensure small generalization gaps, i.e. high validation accuracy and reliable confidence predictions. For this reason, regularization methods become very important.

As the base model for this study we choose the All-CNN network of [23], a network with eight convolutional layers, and train it on the CIFAR-10 dataset. Recent work [7] compares different regularization techniques on this network and reports a test accuracy of 91.87% with their probabilistic network and 90.88% with dropout, but omits BN. Fig. 1 shows how well BN generalizes for this problem when applied to exactly the same network: it easily achieves a validation accuracy of 93%, significantly better than the dedicated regularization techniques proposed in [7]. It appears that BN is a very powerful regularization method. The goal of this work is to understand and exploit the respective mechanism. Towards this end we identify two components: a non-linear reparametrization of the model that preconditions gradient descent, and stochasticity.

The reparametrization may also be achieved by other normalization techniques such as weight normalization [19] and analytic normalization [22], among others [14, 1]. The advantage of these methods is that they are deterministic and thus do not rely on batch statistics, often incur less computational overhead, are continuously differentiable [22], and can be applied more flexibly, e.g. to settings with a small batch size or to recurrent neural networks. Unfortunately, while these methods improve the training loss, they do not generalize as well as BN, as observed experimentally in [8, 22]. We therefore look at further aspects of BN that could explain its regularization effect.
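To make the claim about parameter-level randomness concrete, recall the standard BN transform (shown here as an illustration, with the small stabilizing constant omitted for brevity; this reformulation is elementary algebra rather than a result of this work). For a pre-activation $x$ with mini-batch mean $\mu_B$ and standard deviation $\sigma_B$, BN computes

$$ y \;=\; \gamma\,\frac{x - \mu_B}{\sigma_B} + \beta \;=\; \frac{\gamma}{\sigma_B}\,x \;+\; \Big(\beta - \frac{\gamma\,\mu_B}{\sigma_B}\Big), $$

so the randomness of the batch statistics $(\mu_B, \sigma_B)$ acts as randomness in the effective scale and shift parameters of the layer, rather than as noise injected independently on each activation.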
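The following NumPy sketch is purely illustrative (the function names, tensor shapes, and batch size are our own assumptions, not code from this work): it contrasts the two behaviours by showing that the BN output of a fixed example changes with the mini-batch it happens to be grouped with, whereas a weight-normalized layer, being a deterministic reparametrization of the weights, gives the same output regardless of the batch.

import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature with the statistics of the current mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def weight_norm_linear(x, v, g):
    # Weight normalization: each output unit's weights are reparametrized
    # as w_j = g_j * v_j / ||v_j||; no batch statistics are involved.
    w = g * v / np.linalg.norm(v, axis=0, keepdims=True)
    return x @ w

# A fixed example and a pool of other samples it may be batched with.
x0 = rng.normal(size=(1, 8))
pool = rng.normal(size=(1000, 8))

# The BN output of x0 depends on which samples share its mini-batch,
# so re-sampling the batch injects noise into the forward pass.
for _ in range(3):
    batch = np.vstack([x0, pool[rng.choice(len(pool), size=31, replace=False)]])
    print("BN:", np.round(batch_norm(batch)[0], 3))

# A weight-normalized layer gives an identical output for x0 every time.
v = rng.normal(size=(8, 4))
print("WN:", np.round(weight_norm_linear(x0, v, g=1.0)[0], 3))

Running the sketch prints three different BN outputs for the same example and a single, batch-independent output for the weight-normalized layer, which is exactly the stochastic component that the deterministic normalizations lack.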