2021
DOI: 10.48550/arxiv.2103.01499
Preprint

Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization

Abstract: Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weig…
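To make the setting concrete, here is a minimal sketch (not code from the paper) of the kind of architecture the abstract refers to: a two-layer ReLU network with batch normalization trained with weight decay. The layer sizes, data, and hyperparameters below are illustrative assumptions only.

```python
# Minimal sketch of the non-convex training problem the paper analyzes:
# a two-layer ReLU network with Batch Normalization and weight decay.
# All sizes and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn

class TwoLayerBNReLU(nn.Module):
    def __init__(self, d_in, m_hidden):
        super().__init__()
        self.fc1 = nn.Linear(d_in, m_hidden, bias=False)
        self.bn = nn.BatchNorm1d(m_hidden)   # BN applied before the ReLU
        self.fc2 = nn.Linear(m_hidden, 1, bias=False)

    def forward(self, x):
        return self.fc2(torch.relu(self.bn(self.fc1(x))))

model = TwoLayerBNReLU(d_in=10, m_hidden=50)
# weight_decay plays the role of the regularizer in the convex-duality analysis
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)
x, y = torch.randn(64, 10), torch.randn(64, 1)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```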

Cited by 4 publications (4 citation statements)
References: 14 publications
“…A similar analysis procedure for implicit regularization is also presented in (Ergen et al. 2021; Neyshabur, Tomioka, and Srebro 2014; Savarese et al. 2019).…”
Section: Training Stabilization (mentioning)
confidence: 95%
“…Therefore, to ensure that β is not too large, we decay β by a factor γ₁ ∈ (0, 1). This also appears in (Ergen et al., 2021). On the other hand, if β is too small, making the relaxed dual problem (4) infeasible, we increase β by multiplying by γ₂⁻¹, where γ₂ ∈ (0, 1).…”
Section: Optimal Neural Network Approximation of Wasserstein Gradient (mentioning)
confidence: 99%
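The quoted passage describes a simple feasibility-driven adjustment of the regularization parameter β. A hedged sketch of that rule is below; `dual_is_feasible`, the factor values, and the function name are illustrative stand-ins, not code from the cited work.

```python
# Sketch of the beta-adjustment heuristic described in the quote above.
# `dual_is_feasible` stands in for a check of the relaxed dual problem (4)
# in the citing paper; gamma_1 and gamma_2 are illustrative values in (0, 1).
def adjust_beta(beta, dual_is_feasible, gamma_1=0.9, gamma_2=0.9):
    if dual_is_feasible(beta):
        return gamma_1 * beta   # decay beta so it does not grow too large
    return beta / gamma_2       # multiply by gamma_2^{-1} to restore feasibility
```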
“…The work in [24] studies convex duality of divergence measures, where the insights motivate regularizing the discriminator's Lipschitz constant for improved GAN performance. For supervised two-layer networks, a recent line of work has established zero duality gap and thus equivalent convex networks with ReLU activation that can be solved in polynomial time for global optimality; see e.g., [25][26][27][28][29][30]. These works focus on single-player networks for supervised learning.…”
Section: Related Work (mentioning)
confidence: 99%
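The "equivalent convex networks" mentioned in this quote refer to exact convex reformulations of two-layer ReLU training problems. The sketch below is a hedged illustration of that reformulation for a squared loss with a sampled subset of ReLU activation patterns; the variable names, the pattern-sampling step, and all sizes are assumptions for illustration, not code from the cited papers.

```python
# Hedged illustration of the convex reformulation of a two-layer ReLU network
# (squared loss, group-L2 regularization) over a sampled set of activation
# patterns D_i = diag(1[X u >= 0]). Names and sizes are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, P = 40, 5, 30                       # samples, input dim, sampled patterns
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
beta = 1e-2                               # weight-decay strength

patterns = {tuple((X @ rng.standard_normal(d) >= 0).astype(int)) for _ in range(P)}
Ds = [np.diag(np.array(p)) for p in patterns]

V = [cp.Variable(d) for _ in Ds]          # "positive" neurons
W = [cp.Variable(d) for _ in Ds]          # "negative" neurons
pred = sum(D @ X @ (v - w) for D, v, w in zip(Ds, V, W))
reg = sum(cp.norm(v, 2) + cp.norm(w, 2) for v, w in zip(V, W))
constraints = []
for D, v, w in zip(Ds, V, W):
    A = (2 * D - np.eye(n)) @ X           # keep each neuron on its activation pattern
    constraints += [A @ v >= 0, A @ w >= 0]

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(pred - y) + beta * reg), constraints)
prob.solve()                              # a second-order cone program: polynomial time
```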