2021
DOI: 10.48550/arxiv.2103.01499
Preprint

Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization

Abstract: Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weig…
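To make the setting concrete, here is a minimal sketch (not code from the paper) of the kind of architecture the abstract refers to: a two-layer ReLU network with batch normalization trained with weight decay. The layer sizes, data, and hyperparameters below are illustrative assumptions only.

```python
# Minimal sketch of the non-convex training problem the paper analyzes:
# a two-layer ReLU network with Batch Normalization and weight decay.
# All sizes and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn

class TwoLayerBNReLU(nn.Module):
    def __init__(self, d_in, m_hidden):
        super().__init__()
        self.fc1 = nn.Linear(d_in, m_hidden, bias=False)
        self.bn = nn.BatchNorm1d(m_hidden)   # BN applied before the ReLU
        self.fc2 = nn.Linear(m_hidden, 1, bias=False)

    def forward(self, x):
        return self.fc2(torch.relu(self.bn(self.fc1(x))))

model = TwoLayerBNReLU(d_in=10, m_hidden=50)
# weight_decay plays the role of the regularizer in the convex-duality analysis
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)
x, y = torch.randn(64, 10), torch.randn(64, 1)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```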

Cited by 4 publications (4 citation statements)
References: 14 publications
“…A similar analysis procedure for implicit regularization is also presented in (Ergen et al. 2021; Neyshabur, Tomioka, and Srebro 2014; Savarese et al. 2019).…”
Section: Training Stabilization (mentioning)
confidence: 95%
“…Therefore, to ensure that β is not too large, we decay β by a factor γ₁ ∈ (0, 1). This also appears in (Ergen et al., 2021). On the other hand, if β is too small, making the relaxed dual problem (4) infeasible, we increase β by multiplying by γ₂⁻¹, where γ₂ ∈ (0, 1).…”
Section: Optimal Neural Network Approximation of Wasserstein Gradient (mentioning)
confidence: 99%
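The quoted passage describes a simple feasibility-driven adjustment of the regularization parameter β. A hedged sketch of that rule is below; `dual_is_feasible`, the factor values, and the function name are illustrative stand-ins, not code from the cited work.

```python
# Sketch of the beta-adjustment heuristic described in the quote above.
# `dual_is_feasible` stands in for a check of the relaxed dual problem (4)
# in the citing paper; gamma_1 and gamma_2 are illustrative values in (0, 1).
def adjust_beta(beta, dual_is_feasible, gamma_1=0.9, gamma_2=0.9):
    if dual_is_feasible(beta):
        return gamma_1 * beta   # decay beta so it does not grow too large
    return beta / gamma_2       # multiply by gamma_2^{-1} to restore feasibility
```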
“…The work in [24] studies convex duality of divergence measures, where the insights motivate regularizing the discriminator's Lipschitz constant for improved GAN performance. For supervised two-layer networks, a recent line of work has established zero duality gap and thus equivalent convex networks with ReLU activation that can be solved in polynomial time for global optimality; see e.g., [25][26][27][28][29][30]. These works focus on single-player networks for supervised learning.…”
Section: Related Work (mentioning)
confidence: 99%
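The "equivalent convex networks" mentioned in this quote refer to exact convex reformulations of two-layer ReLU training problems. The sketch below is a hedged illustration of that reformulation for a squared loss with a sampled subset of ReLU activation patterns; the variable names, the pattern-sampling step, and all sizes are assumptions for illustration, not code from the cited papers.

```python
# Hedged illustration of the convex reformulation of a two-layer ReLU network
# (squared loss, group-L2 regularization) over a sampled set of activation
# patterns D_i = diag(1[X u >= 0]). Names and sizes are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, P = 40, 5, 30                       # samples, input dim, sampled patterns
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
beta = 1e-2                               # weight-decay strength

patterns = {tuple((X @ rng.standard_normal(d) >= 0).astype(int)) for _ in range(P)}
Ds = [np.diag(np.array(p)) for p in patterns]

V = [cp.Variable(d) for _ in Ds]          # "positive" neurons
W = [cp.Variable(d) for _ in Ds]          # "negative" neurons
pred = sum(D @ X @ (v - w) for D, v, w in zip(Ds, V, W))
reg = sum(cp.norm(v, 2) + cp.norm(w, 2) for v, w in zip(V, W))
constraints = []
for D, v, w in zip(Ds, V, W):
    A = (2 * D - np.eye(n)) @ X           # keep each neuron on its activation pattern
    constraints += [A @ v >= 0, A @ w >= 0]

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(pred - y) + beta * reg), constraints)
prob.solve()                              # a second-order cone program: polynomial time
```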