Implicit Regularization in Over-parameterized Neural Networks

Kubo, Masayoshi; Banno, Ryotaro; Manabe, Hidetaka; Minoji, Masataka

doi:10.48550/arxiv.1903.01997

Cited by 6 publications

(7 citation statements)

References 40 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…More recently, often motivated by neural networks, there has been work on implicit regularization that typically considered SGD-based optimization algorithms. See, e.g., theoretical results on simplified models (Neyshabur et al, 2014;Neyshabur, 2017;Soudry et al, 2018;Gunasekar et al, 2017;Arora et al, 2019;Kubo et al, 2019) as well as extensive empirical and phenomenological results on stateof-the-art neural network models Mahoney, 2018, 2019). The implicit regularization observed by us is different in that it is not caused by an inexact approximation algorithm (such as SGD) but rather by the selection of one out of many exact solutions (e.g., the minimum norm solution).…”

Section: Related Workmentioning

confidence: 99%

Exact expressions for double descent and implicit regularization via surrogate random design

Dereziński,

Liang,

Mahoney

2019

Preprint

View full text Add to dashboard Cite

Double descent refers to the phase transition that is exhibited by the generalization error of unregularized learning models when varying the ratio between the number of parameters and the number of training samples. The recent success of highly over-parameterized machine learning models such as deep neural networks has motivated a theoretical analysis of the double descent phenomenon in classical models such as linear regression which can also generalize well in the over-parameterized regime. We build on recent advances in Randomized Numerical Linear Algebra (RandNLA) to provide the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator. Our approach involves constructing what we call a surrogate random design to replace the standard i.i.d. design of the training sample. This surrogate design admits exact expressions for the mean squared error of the estimator while preserving the key properties of the standard design. We also establish an exact implicit regularization result for over-parameterized training samples. In particular, we show that, for the surrogate design, the implicit bias of the unregularized minimum norm estimator precisely corresponds to solving a ridge-regularized least squares problem on the population distribution.

show abstract

Section: Related Workmentioning

confidence: 99%

Exact expressions for double descent and implicit regularization via surrogate random design

Dereziński,

Liang,

Mahoney

2019

Preprint

View full text Add to dashboard Cite

show abstract

“…Remark 4.2. The concept of (implicit) regularization has been adopted in many recent studies on nonconvex optimization, including training neural networks (Allen-Zhu et al, 2018;Kubo et al, 2019), phase retrieval (Chen and Candes, 2015;Ma et al, 2017), matrix completion (Chen and Wainwright, 2015;Zheng and Lafferty, 2016), and blind deconvolution (Li et al, 2019), referring to any scheme that biases the search direction of gradient-based algorithms. Implicit regularization has been advocated as an important feature of (stochastic) gradient descent methods for solving these problems, which, as the name suggests, means that the algorithms without regularization may behave as if they are regularized.…”

Section: Implicit Regularizationmentioning

confidence: 99%

Policy Optimization for $\mathcal{H}_2$ Linear Control with $\mathcal{H}_\infty$ Robustness Guarantee: Implicit Regularization and Global Convergence

Zhang¹,

Hu²,

Başar³

2019

Preprint

View full text Add to dashboard Cite

Policy optimization (PO) is a key ingredient for modern reinforcement learning (RL), an instance of adaptive optimal control (Sutton et al., 1992). For control design, certain constraints are usually enforced on the policies to optimize, accounting for either the stability, robustness, or safety concerns on the system. Hence, PO is by nature a constrained (nonconvex) optimization in most cases, whose global convergence is challenging to analyze in general. More importantly, some constraints that are safety-critical, e.g., the closed-loop stability, or the H ∞ -norm constraint that guarantees the system robustness, can be difficult to enforce on the controller being learned as the PO methods proceed. Recently, policy gradient methods have been shown to converge to the global optimum of linear quadratic regulator (LQR), a classical optimal control problem, without regularizing/projecting the control iterates onto the stabilizing set (Fazel et al., 2018;Bu et al., 2019a), the (implicit) feasible set of the problem. This striking result is built upon the property that the cost function is coercive, ensuring that the iterates remain feasible and strictly separated from the infeasible set as the cost decreases. In this paper, we study the convergence theory of PO for H 2 linear control with H ∞ -norm robustness guarantee, for both discrete-and continuous-time settings. This general framework includes risk-sensitive linear control as a special case. One significant new feature of this problem is the lack of coercivity, i.e., the cost may have finite value around the boundary of the robustness constraint set, breaking the existing analyses for LQR. Interestingly, among the three proposed PO methods motivated by (Fazel et al., 2018;Bu et al., 2019a), two of them enjoy the implicit regularization property, i.e., the iterates preserve the H ∞ robustness constraint as if they are regularized by the algorithms. Furthermore, convergence to the globally optimal policies with globally sublinear and locally (super-)linear rates are provided under certain conditions, despite the nonconvexity of the problem. To the best of our knowledge, our work offers the first results on the implicit regularization property and global convergence of PO methods for robust/risk-sensitive control. Our proof techniques for implicit regularization are of independent interest, and may be used for analyzing other PO methods under H ∞ robustness constraints and non-coercive costs.

show abstract

“…These techniques improve the generalization ability by preventing fine-tuning in a pre-determined DNN architecture [7]. Regularization techniques have been studied from different perspectives [10][11][12]. Accordingly, regularization techniques have two main effects: explicit and implicit.…”

Section: Related Work On Regularizationmentioning

confidence: 99%

Skipout: An Adaptive Layer-Level Regularization Framework for Deep Neural Networks

Moayed

Mansoori

2022

IEEE Access

View full text Add to dashboard Cite

Regularization methods can surprisingly improve the generalization ability of deep neural networks. Among numerous methods, the branch of Dropout regularization is very popular in practice. However, Dropout-like regularization variants have some deficiencies: blindness of regularization, dependency of regularization on layer types, and how to determine the parameters of regularization. In this paper, we propose Skipout framework to tackle these drawbacks by breaking the chain of co-adapted units in consecutive layers. Skipout is a layer-level regularization method that categorizes the layers of a deep network adaptively during training. It divides them into robust and critical layers based on the network architecture and the given task. Instead of turning off some units as in Dropout methods, Skipout recognizes the robust layers and skips them out untrained while sustaining to train the critical layers. The trick of Skipout is in backpropagating the cumulated residual errors via the robust layers, where the activation functions of their units are temporarily set to identity. The units of robust layers are dually activated: by their own functions in the forward pass and by identity in the backward pass. Therefore, Skipout takes advantage of vanishing gradient solutions via robust layers. Moreover, its implementation is simple and applicable to both convolutional and fully-connected layers. Experiments on diverse benchmark datasets and different deep models confirm that Skipout framework can improve the generalization performance of deep neural networks.

show abstract

Implicit Regularization in Over-parameterized Neural Networks

Cited by 6 publications

References 40 publications

Exact expressions for double descent and implicit regularization via surrogate random design

Exact expressions for double descent and implicit regularization via surrogate random design

Policy Optimization for $\mathcal{H}_2$ Linear Control with $\mathcal{H}_\infty$ Robustness Guarantee: Implicit Regularization and Global Convergence

Skipout: An Adaptive Layer-Level Regularization Framework for Deep Neural Networks

Contact Info

Product

Resources

About