2021
DOI: 10.48550/arxiv.2110.03187
Preprint

On the Optimal Memorization Power of ReLU Neural Networks

Abstract: We study the memorization power of feedforward ReLU neural networks. We show that such networks can memorize any N points that satisfy a mild separability assumption using Õ(√N) parameters. Known VC-dimension upper bounds imply that memorizing N samples requires Ω(√N) parameters, and hence our construction is optimal up to logarithmic factors. We also give a generalized construction for networks with depth bounded by 1 ≤ L ≤ √N, for memorizing N samples using Õ(N/L) parameters. This bound is also optimal …
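To make the notion of memorization concrete, below is a minimal NumPy sketch of the classic one-hidden-layer baseline that fits N labeled points exactly with O(N) parameters: project the points to a line, then interpolate the labels with a ReLU basis. The helper name memorize_relu and the random-projection step are illustrative choices of ours; this is not the paper's Õ(√N) construction, which relies on bit-extraction techniques and is substantially more involved.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def memorize_relu(X, y, seed=0):
    """One-hidden-layer ReLU net fitting (X, y) exactly with O(N) parameters.

    Illustrative sketch only: this is the classic baseline (in the spirit of
    Yun et al., 2019 / Zhang et al., 2021), not the paper's bit-extraction
    construction. It assumes a generic random projection assigns distinct
    scalar codes to the N points, which a separability assumption makes easy
    to satisfy.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    u = rng.standard_normal(d)                 # random projection direction
    t = X @ u                                  # scalar code for each point
    order = np.argsort(t)
    t, y = t[order], np.asarray(y, dtype=float)[order]
    assert np.all(np.diff(t) > 0), "projection did not separate the points"

    # Piecewise-linear interpolation of (t_i, y_i) with a ReLU basis:
    #   f(s) = y_1 + sum_i (slope_i - slope_{i-1}) * relu(s - t_i)
    slopes = np.diff(y) / np.diff(t)
    coeffs = np.diff(np.concatenate(([0.0], slopes)))   # slope change at each knot
    knots = t[:-1]

    def net(Xq):
        s = Xq @ u
        return y[0] + relu(s[:, None] - knots[None, :]) @ coeffs

    return net

# Usage: exact fit on random data, up to floating-point error.
X = np.random.randn(50, 10)
labels = np.random.randn(50)
net = memorize_relu(X, labels)
print(np.max(np.abs(net(X) - labels)))   # close to 0 (floating-point error)

The hidden layer here has N − 1 units, so the network uses O(N + d) parameters; the paper shows that, under the separability assumption, Õ(√N) parameters already suffice and that this is optimal up to logarithmic factors.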

Cited by 3 publications (5 citation statements) · References 29 publications (53 reference statements)
“…Then, Memorization power of neural networks. Our work is related to another line of work (e.g., Baum, 1988; Yun et al., 2019; Bubeck et al., 2020; Zhang et al., 2021; Rajput et al., 2021; Vardi et al., 2021) on the memorization power of neural networks. Among these works, Yun et al. (2019) show that a neural network with O(N) parameters can memorize the data set with zero error, where N is the size of the data set.…”
Section: Related Work (mentioning)
confidence: 72%
“…Among these works, Yun et al. (2019) show that a neural network with O(N) parameters can memorize the data set with zero error, where N is the size of the data set. Under an additional separability assumption, Vardi et al. (2021) derive an improved upper bound of O(√N), which is shown to be optimal. In this work, we show that O(Nd) parameters are sufficient for achieving low robust training error.…”
Section: Related Work (mentioning)
confidence: 99%
“…That natural signals/images with low intrinsic dimension can be approximately represented by neural networks has been empirically verified in [16,27,43]. Next, we verify assumption (33) by constructing a generator G with properly chosen depth and width, based on recent approximation results for deep neural networks [45,51,47,21] that utilize the bit-extraction technique [3,2]. To this end, we recall the definition of the Minkowski dimension, which is used to measure the intrinsic dimension of target signals living in a high ambient dimension.…”
Section: Analysis of the Least Square Decoder (mentioning)
confidence: 89%
“…However, deeper networks require far fewer neurons to reach the same expressive power, yielding a potential theoretical explanation for the dominance of deep networks in practice [7,29,42,44,53,62,65,68,79,80,83]. Other related work includes counting and bounding the number of linear regions [43,59,60,64,65,74], classifying the set of functions exactly representable by different architectures [7,23,46,47,61,86], and analyzing the memorization capacity of ReLU networks [82,84,85].”
Section: Neural Network (mentioning)
confidence: 99%