We analyze approximation rates of deep ReLU neural networks for Sobolev-regular functions when the approximation error is measured with respect to weaker Sobolev norms. First, based on a calculus of ReLU networks, we explicitly construct artificial neural networks with ReLU activation functions that achieve these approximation rates. Second, we establish lower bounds for the approximation of classes of Sobolev-regular functions by ReLU neural networks. Our results extend recent advances in the approximation theory of ReLU networks to the regime that is most relevant for applications in the numerical analysis of partial differential equations.
, which encourages the network to encode information about the derivatives of $f$ in its weights. The authors of [16] call this method Sobolev training and report reduced generalization errors and better data efficiency in a network compression task (see [31]) and in an application to synthetic gradients (see [34]). In the case of network compression, the approximated function $f$ is a function realized by a possibly very large neural network $N_{\mathrm{large}}(\cdot \mid w)$ that has been trained for some supervised learning task and is to be learned by a smaller network $N_{\mathrm{small}}$. In contrast to the usual supervised learning setting, the approximated function $f(\cdot) = N_{\mathrm{large}}(\cdot \mid w)$ is known and its derivatives can be computed. A minimal sketch of such a derivative-matching loss is given at the end of this section.

• Motivated by the performance of deep learning-based solutions in classical machine learning tasks and, in particular, by their ability to overcome the curse of dimension, neural networks are now also applied to the approximate solution of partial differential equations (PDEs) (see [26, 36, 54, 59]). In [54] the authors present their deep Galerkin method for approximating solutions of high-dimensional quasilinear parabolic PDEs. For this, a functional $J(f)$ encoding the differential operator, the boundary conditions, and the initial conditions is introduced. A neural network $N_{\mathrm{PDE}}$ with weights $w$ is then trained to minimize the functional $J(N_{\mathrm{PDE}}(w))$. This is done by discretizing the functional and randomly sampling spatial points; a corresponding sketch also appears at the end of this section.

The theoretical foundation for approximating a function together with its higher-order derivatives by a neural network was already given in a less well-known version of the universal approximation theorem by Hornik in [32, Theorem 3]. In particular, it was shown there that if the activation function $\varrho$ is $k$-times continuously differentiable, non-constant, and bounded, then any $k$-times continuously differentiable function $f$ and its derivatives up to order $k$ can be uniformly approximated by a shallow neural network on compact sets. Note, though, that the conditions on the activation function are very restrictive and that, for example, the ReLU is not covered by this result. However, in [16] it was shown that the theorem also holds for shallow ReLU networks if $k = 1$. Theorem 3 in [32] was also used in [54] to show the existence of a shallow network approximating solutions of the PDEs considered in that paper. An important aspect that is untouched by the previous approximation results is how the complexity of a network and, in particular, its depth...
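To make the derivative-matching idea behind Sobolev training more concrete, the following Python/PyTorch sketch shows a loss that penalizes errors in both function values and gradients. It is a minimal illustration under simplifying assumptions (a scalar target $f$ with known gradient, squared-error penalties, a toy two-dimensional example); it is not the exact loss or setting used in [16], and all names below are illustrative.

```python
import torch

def sobolev_loss(net, f, grad_f, x):
    """Squared-error loss on function values plus gradients (a sketch).

    In the network-compression setting described above, f and grad_f would be
    N_large and its automatically computed input gradients.
    """
    x = x.detach().requires_grad_(True)              # enable derivatives w.r.t. x
    y = net(x)                                       # shape (batch, 1)
    # Per-sample gradient of the network output with respect to its input.
    dy_dx, = torch.autograd.grad(y.sum(), x, create_graph=True)
    value_term = ((y - f(x)) ** 2).mean()
    grad_term = ((dy_dx - grad_f(x)) ** 2).sum(dim=1).mean()
    return value_term + grad_term

# Toy usage: learn f(x) = sin(x1) + x2^2 on [-1, 1]^2, whose gradient is known.
net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
f = lambda x: torch.sin(x[:, :1]) + x[:, 1:] ** 2
grad_f = lambda x: torch.cat([torch.cos(x[:, :1]), 2 * x[:, 1:]], dim=1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    x = torch.rand(64, 2) * 2 - 1                    # sample training points
    loss = sobolev_loss(net, f, grad_f, x)
    opt.zero_grad(); loss.backward(); opt.step()
```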
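In the same spirit, the following sketch shows the basic structure of a deep-Galerkin-style training loop: a discretized functional $J$ built from the PDE residual and the boundary conditions is minimized over the network weights by Monte Carlo sampling of spatial points. The concrete problem here (a one-dimensional model problem $-u'' = 1$ on $(0,1)$ with zero boundary values) and all names are assumptions made for illustration, not the setting of [54]; a smooth activation is used because the residual involves second derivatives, which vanish almost everywhere for ReLU networks.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def residual(net, x):
    """PDE residual -u''(x) - 1 of the illustrative problem -u'' = 1."""
    x = x.requires_grad_(True)
    u = net(x)
    du, = torch.autograd.grad(u.sum(), x, create_graph=True)
    d2u, = torch.autograd.grad(du.sum(), x, create_graph=True)
    return -d2u - 1.0

for _ in range(2000):
    x_int = torch.rand(128, 1)                       # interior points sampled in (0, 1)
    x_bdy = torch.tensor([[0.0], [1.0]])             # boundary points
    # Discretized functional J: mean squared PDE residual plus boundary penalty.
    loss = (residual(net, x_int) ** 2).mean() + (net(x_bdy) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```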