2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00279
Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift

Abstract: This paper first answers the question "why do the two most powerful techniques Dropout and Batch Normalization (BN) often lead to a worse performance when they are combined together?" in both theoretical and statistical aspects. Theoretically, we find that Dropout would shift the variance of a specific neural unit when we transfer the state of that network from train to test. However, BN would maintain its statistical variance, which is accumulated from the entire learning procedure, in the test phase. The inc…
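The variance-shift argument can be reproduced in a few lines. The following is a minimal sketch (not the paper's code; the feature width, the dropout rate p = 0.5, and the synthetic Gaussian inputs are illustrative assumptions) showing that BN's running variance is accumulated on the dropout-scaled training signal, while the signal it receives at test time, with Dropout acting as the identity, has a different variance:

```python
# Minimal sketch of the variance shift described in the abstract (illustrative
# sizes; not the authors' code). With inverted dropout at p = 0.5, the signal
# BN sees during training has roughly twice the variance of the test signal.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10000, 64)        # synthetic pre-dropout activations
drop = nn.Dropout(p=0.5)
bn = nn.BatchNorm1d(64)

# Train phase: BN accumulates running statistics of the dropout-scaled signal.
drop.train(); bn.train()
with torch.no_grad():
    for _ in range(100):
        bn(drop(x))

# Test phase: Dropout is the identity, so the incoming variance differs from
# the running variance BN uses to normalize.
drop.eval(); bn.eval()
train_var = bn.running_var.mean().item()
test_var = drop(x).var(dim=0, unbiased=False).mean().item()
print(f"BN running variance (accumulated in training): {train_var:.2f}")
print(f"variance of the signal BN receives at test:    {test_var:.2f}")
```

With p = 0.5 the running variance settles near 1/(1 − p) = 2 while the test-time variance stays near 1; this mismatch is the train/test inconsistency the paper calls variance shift.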

Cited by 225 publications (139 citation statements)
References 23 publications (37 reference statements)
“…Dropout is a widely-used regularization technique in deep neural networks [38,18]. Given a d-dimensional input vector x, in the training phase, we randomly zero the element x_k, k = 1, 2, …”
Section: -Preserved Dropout
confidence: 99%
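For concreteness, here is a minimal sketch of the operation that snippet describes, assuming the standard inverted-dropout variant (the rescaling by 1/(1 − p) is an assumption; the snippet only specifies the random zeroing):

```python
# Minimal sketch of the dropout operation described above: during training,
# each element x_k of a d-dimensional input is zeroed independently with
# probability p; inverted dropout also rescales the survivors by 1/(1 - p)
# so that no rescaling is needed at test time.
import torch

def dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    if not training or p == 0.0:
        return x                      # test phase: identity mapping
    mask = (torch.rand_like(x) >= p).float()
    return x * mask / (1.0 - p)       # zero x_k with prob. p, rescale the rest

x = torch.randn(4, 8)                 # a batch of d = 8 dimensional inputs
print(dropout(x, p=0.5, training=True))
```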
“…We applied Group norm [57] after all ReLU activations as opposed to Batch norm [58], as batch sizes as small as ours can cause issues due to inaccurate batch statistics degrading the quality of the resulting models [59]. The original U-Net proposed in [49] used Dropout, which we avoided as in some cases the combination of dropout and batch normalisation can cause worse results [60]. He initialisation [61] was used for all layers.…”
Section: U-net Implementation Architecture
confidence: 99%
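A minimal sketch of a block built along the quoted choices (GroupNorm after each ReLU instead of BatchNorm, no dropout, He initialisation); the double-convolution layout, channel counts, and number of groups are illustrative assumptions rather than the cited implementation:

```python
# Sketch of a U-Net style convolution block following the quoted design:
# Group Normalization placed after each ReLU, no dropout, He (Kaiming) init.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, groups: int = 8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.GroupNorm(groups, out_ch),     # GroupNorm after the ReLU
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.GroupNorm(groups, out_ch),
        )
        for m in self.block:
            if isinstance(m, nn.Conv2d):      # He initialisation for conv layers
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

print(ConvBlock(3, 32)(torch.randn(1, 3, 64, 64)).shape)
```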
“…There are several closely related works concurrent with this submission [20,25,2,15]. Work [20] argues that BN improves generalization because it leads to a smoother objective function, the authors of [15] study the question why BN is often found incompatible with dropout, and works [25,2] observe that randomness in batch normalization can be linked to optimizing a lower bound on the expected data likelihood [2] and to variational Bayesian learning [25]. However, these works focus on estimating the uncertainty of outputs in models that have been already trained using BN.…”
Section: Related Work
confidence: 99%
“…This makes sure that derivatives of both log σ and σ are bounded. Note that a simpler parametrization σ = e^u has quickly growing derivatives of the linear terms in σ and that the data evidence, as a composition of log softmax and piecewise-linear layers, is approximately linear in each variance σ as seen from the parametrization (15). Note that using a sampling-based estimate of the KL divergence as in [3] does not circumvent the problem because it contains exactly the same problematic term − log σ in every sample.…”
Section: Normalization With Bayesian Learning
confidence: 99%
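The parametrization referred to as Eq. (15) is not reproduced in the snippet, so as a hedged illustration of the boundedness argument, the sketch below contrasts σ = e^u with a softplus parametrization σ = log(1 + e^u), one choice for which the derivatives of both σ and log σ stay bounded:

```python
# Hedged sketch: Eq. (15)'s parametrization is not shown in the snippet above,
# so softplus is used here as one example of a bounded-derivative choice.
import torch

u = torch.linspace(-8.0, 8.0, 5)

# sigma = e^u: d sigma/du = e^u blows up for large u (the "quickly growing" case),
# even though d log(sigma)/du = 1 is bounded.
d_sigma_exp = torch.exp(u)

# sigma = softplus(u) = log(1 + e^u): d sigma/du = sigmoid(u) <= 1, and
# d log(sigma)/du = sigmoid(u) / softplus(u) also stays bounded.
sp = torch.nn.functional.softplus(u)
d_sigma_sp = torch.sigmoid(u)
d_log_sigma_sp = torch.sigmoid(u) / sp

print("u                  :", u.tolist())
print("d(e^u)/du          :", d_sigma_exp.tolist())
print("d(softplus)/du     :", d_sigma_sp.tolist())
print("d(log softplus)/du :", d_log_sigma_sp.tolist())
```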