2022
DOI: 10.1162/neco_a_01457
Nonconvex Sparse Regularization for Deep Neural Networks and Its Optimality

Abstract: Recent theoretical studies proved that deep neural network (DNN) estimators obtained by minimizing empirical risk with a certain sparsity constraint can attain optimal convergence rates for regression and classification problems. However, the sparsity constraint requires knowing certain properties of the true model, which are not available in practice. Moreover, computation is difficult due to the discrete nature of the sparsity constraint. In this letter, we propose a novel penalized estimation method for spa…

Cited by 9 publications (20 citation statements). References 15 publications.
“…where $g(x) = x(1-x)$. Applying (4) and (7), it is then shown in [10], Lemma A.2, that there is a network from $F(m+4, p)$ with $\|p\|_\infty = 6$, that for a given input $(x, y)$ approximates the product $xy$ with error $2^{-m}$. As it follows from (4), (5) and (6), the set of parameters in the construction of that network that does not belong to $\{0, \pm\tfrac{1}{2}, \pm 1\}$ consists of • shift coordinates $\pm 2^{-k}$, $k = 2, \ldots, 2m+1$;…”
Section: Proofs (mentioning)
confidence: 99%
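To make the quoted construction concrete, here is a minimal NumPy sketch of the standard sawtooth-series device behind such product-approximating networks (an illustration, not the exact network of [10], Lemma A.2; the names `hat`, `g_approx` and `mult_approx` are ours): the truncated series $\sum_{s=1}^{m} g_s(x)/4^s$, with $g_s$ the $s$-fold composition of the hat function, approximates $g(x) = x(1-x)$, and the identity $xy = 2\,[g(x/2) + g(y/2) - g((x+y)/2)]$ turns it into an approximate multiplier on $[0,1]^2$ whose error shrinks geometrically in $m$.

```python
import numpy as np

def hat(x):
    # triangle ("hat") function on [0, 1]: the building block of the sawtooth series
    return np.where(x < 0.5, 2.0 * x, 2.0 * (1.0 - x))

def g_approx(x, m):
    # truncated series sum_{s=1}^m g_s(x) / 4**s approximating g(x) = x * (1 - x);
    # each additional term corresponds to roughly one more layer of the network
    gs = np.asarray(x, dtype=float)
    out = np.zeros_like(gs)
    for s in range(1, m + 1):
        gs = hat(gs)
        out += gs / 4.0 ** s
    return out

def mult_approx(x, y, m):
    # xy = 2 * [g(x/2) + g(y/2) - g((x+y)/2)] with g(u) = u * (1 - u)
    return 2.0 * (g_approx(x / 2, m) + g_approx(y / 2, m) - g_approx((x + y) / 2, m))

x, y = np.random.rand(1000), np.random.rand(1000)
for m in (4, 8, 12):
    print(m, np.max(np.abs(mult_approx(x, y, m) - x * y)))  # error decays geometrically in m
```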
“…Using this entropy bound, it is then shown in [10] that if the regression function is a composition of Hölder smooth functions, then sparse neural networks with depth $\log_2 n$, width $n^{\frac{t}{2\beta+t}}$ and number of non-zero parameters $\sim n^{\frac{t}{2\beta+t}} \log_2 n$, where $\beta > 0$ and $t \geq 1$ depend on the structure and the smoothness of the regression function, attain the minimax optimal prediction error rate $n^{-\frac{2\beta}{2\beta+t}}$ (up to a logarithmic factor). Entropy bounds for the spaces of neural networks with certain $\ell_1$-related regularizations are provided in [7] and [11], and their derivation is also based on the sparsity induced by the imposed constraints. In particular, in [7] the above $\ell_0$ regularization is replaced by the clipped $\ell_1$ norm regularization with a sufficiently small clipping threshold.…”
Section: Introduction (mentioning)
confidence: 99%
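As a rough illustration of the clipped $\ell_1$ penalty mentioned above (a sketch under the usual definition $\sum_j \min(|\theta_j|, \tau)/\tau$; the exact penalty and threshold used in [7] may differ, and the names below are ours): coordinates at or above the clipping threshold $\tau$ each contribute 1, mimicking an $\ell_0$ count, while smaller coordinates contribute $|\theta_j|/\tau$, an $\ell_1$-like term.

```python
import numpy as np

def clipped_l1(theta, tau):
    # clipped l1 penalty: sum_j min(|theta_j|, tau) / tau
    # |theta_j| >= tau contributes exactly 1 (l0-like count);
    # |theta_j| < tau contributes |theta_j| / tau (l1-like term)
    theta = np.asarray(theta, dtype=float)
    return float(np.sum(np.minimum(np.abs(theta), tau)) / tau)

print(clipped_l1([0.0, 0.003, -0.8, 1.5], tau=0.01))  # 0 + 0.3 + 1 + 1 = 2.3
```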
“…In nonparametric regression estimation we aim to recover an unknown $d$-variate function $g_0$ based on $n$ observed input-output pairs $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \ldots, n$. Various classes of regression estimators, including wavelets, polynomials, splines and kernel estimates, have been studied in the literature (see, e.g., [2], [5], [6], [7] and references therein). Along with the development of practical and theoretical applications of neural networks, regression estimation with neural networks has become popular in the recent literature (see, e.g., [1], [8], [9], [10], [13], [15], [18], [19], [21] and references therein). Usually a class of neural networks with properly chosen architecture and with weight vectors belonging to some regularized set $W_n$ is determined, and the estimator $\hat{g}_n$ of $g_0$ is selected to be either the regularized empirical risk minimizer…”
Section: Introduction (mentioning)
confidence: 99%
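For concreteness, an empirical risk minimizer of the kind referred to as (1) in the quoted passages can be written, under the least-squares setup above, as (a generic formulation, not a verbatim copy of the citing paper's display (1)):

$$\hat g_n \in \operatorname*{arg\,min}_{\,w \in W_n} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i - g_w(X_i) \bigr)^2,$$

where $g_w$ denotes the network with weight vector $w$ and $W_n$ is the regularized weight set.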
“…(i) deriving prediction rates of the empirical risk minimizers (1) or (2); (ii) finding an optimization algorithm that identifies the corresponding empirical risk minimizers. Convergence rates of empirical risk minimizers (ERM) over the classes of deep ReLU networks are studied in [4], [13], [15] and [18]. In [4] it is shown that the ERM of the form (1), with $W_n$ being the set of weight vectors with coordinates in $\{0, \pm 1/2, \pm 1, 2\}$, attains, up to logarithmic factors, the minimax rates of prediction of $\beta$-smooth functions.…”
Section: Introduction (mentioning)
confidence: 99%
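Purely as an illustration of what restricting weight coordinates to $\{0, \pm 1/2, \pm 1, 2\}$ means (this rounding helper is ours and is not the estimator analysed in [4]):

```python
import numpy as np

# admissible coordinate values from the construction quoted above
GRID = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

def project_to_grid(w):
    # map each weight coordinate to the nearest admissible value
    w = np.asarray(w, dtype=float)
    return GRID[np.argmin(np.abs(w[..., None] - GRID), axis=-1)]

print(project_to_grid([0.31, -0.74, 1.6, 0.04]))  # -> [ 0.5 -0.5  2.   0. ]
```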