In this paper, we use tools from rate-distortion theory to establish new upper bounds on the generalization error of statistical distributed learning algorithms. Specifically, there are K clients whose individually chosen models are aggregated by a central server. The bounds depend on the compressibility of each client's algorithm while keeping the other clients' algorithms uncompressed, and they leverage the fact that small changes in each local model change the aggregated model by a factor of only 1/K. Adopting a recently proposed approach of Sefidgaran et al. and extending it suitably to the distributed setting, we obtain smaller rate-distortion terms, which are shown to translate into tighter generalization bounds. The bounds are then applied to distributed support vector machines (SVM), suggesting that the generalization error of the distributed setting decays faster than that of the centralized one by a factor of O(log(K)/√K). This finding is also validated experimentally. A similar conclusion is obtained for a multiple-round federated learning setup in which each client uses stochastic gradient Langevin dynamics (SGLD).

Introduction

A key performance indicator of any stochastic learning algorithm that uses a given finite set of data points is how well it performs on points outside that set, i.e., unseen data. This is often captured through the so-called generalization error. However, the questions of what really controls the generalization error of a given stochastic algorithm, and how to make it sufficiently small, are still not well understood. For example, while classic approaches [1] suggest that algorithms with over-parameterized models are likely to overfit, it is now known that some such algorithms do generalize well [2]. Common approaches to studying the generalization error of a statistical learning algorithm consider the effective hypothesis space induced by the algorithm, rather than the entire hypothesis space, or the information leakage about the training dataset. Examples include information-theoretic (mutual information) approaches [3, 4, 5, 6, 7, 8], compression-based approaches [9, 10, 11, 12, 13], and intrinsic-dimension or fractal-based approaches [14, 15, 16]. Recently, a novel approach [17] that generalizes the notion of algorithm compressibility by using lossy covering from source coding was used to show that the compression error rate of an algorithm is strongly connected to its generalization error, both in expectation and with high probability, and, consequently, to establish new rate-distortion-based bounds on the generalization error. The bounds of [17] were shown to possibly improve strictly upon those of [4, 18] and [8]. The approach also has the advantage of offering a unifying perspective on mutual-information, compressibility, and fractal-based frameworks.

Another major focus of machine learning research in recent years has been the study of statistical learning algorithms applied in distributed (network or graph) settings. In part, th...
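As a minimal illustration of the sensitivity fact cited in the abstract, the sketch below (a hypothetical NumPy example, not the paper's code) checks that, under uniform server-side averaging of K local models, perturbing a single client's model moves the aggregate by only a 1/K fraction of the perturbation.

```python
import numpy as np

# Hypothetical illustration: with uniform averaging over K clients,
# perturbing one local model by delta moves the aggregate by delta / K.
K, d = 10, 5                               # number of clients, model dimension
rng = np.random.default_rng(0)
local_models = rng.normal(size=(K, d))     # one weight vector per client

aggregate = local_models.mean(axis=0)      # server-side uniform aggregation

delta = rng.normal(size=d)                 # perturbation of client 0's model
perturbed = local_models.copy()
perturbed[0] += delta
perturbed_aggregate = perturbed.mean(axis=0)

# The aggregate moves by exactly delta / K.
print(np.allclose(perturbed_aggregate - aggregate, delta / K))  # True
```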