This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P * \mathcal{N}_\sigma$, for $\mathcal{N}_\sigma \triangleq \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$, by $\hat{P}_n * \mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at the rate $e^{O(d)} n^{-1/2}$, in remarkable contrast to a (typical) $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_1$ (and $d \geq 3$). Similarly, for the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)} n^{-1}$, but only provided that $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise (AWGN) channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $\chi^2$-divergence becomes infinite, a curious dichotomy. As a main application we consider estimating the differential entropy $h(S+Z)$, where $S \sim P$ and $Z \sim \mathcal{N}_\sigma$ are independent $d$-dimensional random variables. The distribution $P$ is unknown and belongs to some nonparametric class, but $n$ independent and identically distributed (i.i.d.) samples from it are available. Despite the regularizing effect of noise, we first show that any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in $d$. We then leverage the empirical approximation results to show that the absolute-error risk of the plug-in estimator converges as $e^{O(d)} n^{-1/2}$, thus attaining the parametric rate. This establishes the plug-in estimator as minimax rate-optimal for the considered problem, with sharp dependence of the convergence rate on both $n$ and $d$.
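The plug-in estimator above amounts to computing the differential entropy of the Gaussian mixture $\hat{P}_n * \mathcal{N}_\sigma$, whose centers are the observed samples. A minimal Monte Carlo sketch of this idea follows; the function name, sample sizes, and use of importance-free mixture sampling are our illustrative choices, not code from the paper:

```python
import numpy as np
from scipy.special import logsumexp

def plugin_entropy(s, sigma, m=5_000, rng=None):
    """Monte Carlo estimate (in nats) of the differential entropy of the
    Gaussian mixture P_hat_n * N_sigma, where P_hat_n is the empirical
    measure of the rows of s (shape (n, d))."""
    rng = np.random.default_rng(rng)
    n, d = s.shape
    # Sample from the mixture: pick a center uniformly, add N(0, sigma^2 I_d) noise.
    y = s[rng.integers(n, size=m)] + sigma * rng.standard_normal((m, d))
    # Mixture log-density at each Monte Carlo sample (shape (m, n) pairwise sq. dists).
    sq = ((y[:, None, :] - s[None, :, :]) ** 2).sum(-1)
    log_q = (logsumexp(-sq / (2 * sigma**2), axis=1) - np.log(n)
             - (d / 2) * np.log(2 * np.pi * sigma**2))
    # h = -E[log q(Y)] for Y drawn from the mixture q.
    return -log_q.mean()
```

With $P = \mathcal{N}(0,1)$, $\sigma = 1$, and $d = 1$, the estimate should approach $h(\mathcal{N}(0,2)) = \tfrac{1}{2}\log(2\pi e \cdot 2)$ as $n$ grows, consistent with the $e^{O(d)} n^{-1/2}$ risk bound.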
We provide numerical results comparing the performance of the plug-in estimator to that of general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or $k$ nearest neighbors (kNN) techniques) applied to samples of $S+Z$. These results reveal a significant empirical superiority of the plug-in estimator over state-of-the-art KDE and kNN methods. As a motivating utilization of the plug-in approach, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others. Here $\hat{P}_n \triangleq \frac{1}{n}\sum_{i=1}^n \delta_{S_i}$ is the empirical measure. Due to the popularity of the additive Gaussian noise model, we start by exploring this smoothed empirical approximation problem in detail, under several additional statistical distances.
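As a concrete instance of the unstructured kNN baselines referenced above, the classical Kozachenko-Leonenko construction estimates entropy from nearest-neighbor distances. The sketch below is a standard textbook implementation under our own naming, not code taken from the paper:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(x, k=3):
    """Kozachenko-Leonenko kNN differential entropy estimate (in nats)
    from samples x of shape (n, d)."""
    n, d = x.shape
    tree = cKDTree(x)
    # Query k+1 neighbors: the nearest point to each sample is itself (distance 0),
    # so column k holds the distance to the k-th genuine neighbor.
    eps = tree.query(x, k=k + 1)[0][:, k]
    # log volume of the unit d-ball: c_d = pi^{d/2} / Gamma(d/2 + 1).
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(eps))
```

Such estimators are agnostic to the known Gaussian-mixture structure of $S+Z$, which is precisely what the plug-in approach exploits to its empirical advantage.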
A. Convergence of Empirical Measures Smoothed by a Gaussian Kernel

Consider the empirical approximation error $\mathbb{E}\,\delta(\hat{P}_{S^n} * \mathcal{N}_\sigma, P * \mathcal{N}_\sigma)$ under some statistical distance $\delta$. Various choices of $\delta$ are considered, such as the 1-Wasserstein and (squared) 2-Wasserstein distances, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that, when $P$ is subgaussian, the approximation error under the 1-Was...