Machine Learning (ML) solutions are nowadays distributed, following the so-called server/worker architecture: one server holds the model parameters while several workers train the model. Clearly, such an architecture is prone to various types of component failures, which can all be encompassed within the spectrum of Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers, yet all of them require trusting a central parameter server. We initiate in this paper the study of the "general" Byzantine-resilient distributed machine learning problem, where no individual component is trusted. In particular, we distribute the parameter server computation across several nodes. We show that this problem can be solved in an asynchronous system, despite the presence of up to 1/3 Byzantine parameter servers and up to 1/3 Byzantine workers.
This paper addresses the problem of combining Byzantine resilience with privacy in machine learning (ML). Specifically, we study whether a distributed implementation of the renowned Stochastic Gradient Descent (SGD) learning algorithm is feasible with both differential privacy (DP) and (𝛼, 𝑓)-Byzantine resilience. To the best of our knowledge, this is the first work to tackle this problem from a theoretical point of view. A key finding of our analyses is that the classical approaches to these two (seemingly) orthogonal issues are incompatible. More precisely, we show that a direct composition of these techniques makes the guarantees of the resulting SGD algorithm depend unfavourably upon the number of parameters in the ML model, making the training of large models practically infeasible. We validate our theoretical results through numerical experiments on publicly available datasets, showing that it is impractical to ensure DP and Byzantine resilience simultaneously.

CCS Concepts: • Security and privacy → Privacy-preserving protocols; • Mathematics of computing → Continuous optimization.

Introduction

The massive amounts of data generated daily call for distributed machine learning (ML). Essentially, different nodes collaborate to train a joint model on a collective dataset. Clearly, such an aggregate model is more accurate than individual models trained locally on small subsets of the data. However, two reasons prevent the explicit sharing of personal data. Firstly, in many classification tasks the training data can be sensitive and should remain private, e.g., in the financial, political, and medical fields. Secondly, datasets can be quite large (e.g., Open Images [21], ImageNet [13]), and sharing them is computationally expensive.

The most popular scheme for training an ML model in a distributed setting is Stochastic Gradient Descent (SGD) [8]: it enables training the aggregate model by simply exchanging gradients of the loss function, instead of the training data itself. SGD is an iterative method that optimizes an objective function 𝑄(𝑤) by stochastically estimating its gradient ∇𝑄(𝑤) and then taking a descent step on 𝑤, i.e., 𝑤 ← 𝑤 − 𝛾𝑔, where 𝑔 is a stochastic estimate of ∇𝑄(𝑤) and 𝛾 the learning rate. There are several system models for distributed SGD training, such as the parameter server [23] and ring all-reduce [27] models. The parameter server model is one of the most widely adopted distributed learning topologies (Fig. 1(a)): nodes send their gradients to a central trusted entity, the parameter server, which is responsible for updating the model parameters by aggregating the received gradients. The parameter server model is also the backbone of Federated Learning [20], the most popular distributed learning setting today. The parameter server typically aggregates the received gradients by averaging them [28], assuming that the nodes correctly compute unbiased estimates of the true gradient. However, releasing gradients in a distributed framework gives rise to two orthogonal threats: Byzantine gradients and data leakage.

Byzantine Gradients

The learning can be critically influenced by Byzantine gradients (i.e., vectors that are not unbiased estimates of the true gradient) sent by the Byzantine workers.
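To see intuitively why the composition degrades with model size, consider the following back-of-the-envelope sketch. It is our illustration under simplified assumptions (per-coordinate Gaussian DP noise of standard deviation 𝜎, and the classical cone condition defining (𝛼, 𝑓)-Byzantine resilience), not the paper's formal analysis:

```latex
% Back-of-the-envelope sketch (our illustration, not the paper's proof).
% Assumption: each honest gradient is privatized with Gaussian noise
% xi ~ N(0, sigma^2 I_d), as in DP-SGD.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
For $\xi \sim \mathcal{N}(0, \sigma^{2} I_{d})$, the noise norm concentrates
around $\mathbb{E}\|\xi\|_{2} \approx \sigma\sqrt{d}$. Recall that a gradient
aggregation rule $F$ is $(\alpha, f)$-Byzantine resilient only if its output
stays inside a cone of angle $\alpha$ around the true gradient:
\[
  \langle \mathbb{E}[F], \nabla Q(w) \rangle
  \;\geq\; (1 - \sin\alpha)\,\|\nabla Q(w)\|^{2}.
\]
Injected DP noise perturbs $F$ by roughly $\sigma\sqrt{d}$, so the cone
condition can persist only while
\[
  \sigma\sqrt{d} \;\lesssim\; \sin\alpha \cdot \|\nabla Q(w)\|,
\]
i.e., for a fixed privacy budget (fixed $\sigma$), the guarantee degrades as
the number of parameters $d$ grows.
\end{document}
```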
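The sketch below complements the calculation above with a minimal toy simulation of the "direct composition" the abstract critiques: each worker privatizes its gradient with the Gaussian mechanism of DP-SGD, and the server replaces plain averaging with a Byzantine-resilient rule (coordinate-wise median here). All function names and constants are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_gradient(grad, clip_norm=1.0, noise_multiplier=1.0):
    """Gaussian mechanism: clip the gradient to bound its sensitivity,
    then add noise calibrated to the clipping norm."""
    clipped = grad * min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))
    return clipped + rng.normal(0.0, noise_multiplier * clip_norm, grad.shape)

def coordinate_wise_median(grads):
    """A classic Byzantine-resilient aggregation rule: in each coordinate,
    the median ignores a minority of arbitrary (Byzantine) values."""
    return np.median(np.stack(grads), axis=0)

def step(w, worker_grads, lr=0.5):
    """One parameter-server round: privatize, aggregate robustly, descend."""
    noisy = [dp_gradient(g) for g in worker_grads]
    return w - lr * coordinate_wise_median(noisy)

# Toy objective Q(w) = ||w||^2 / 2, whose true gradient is w itself.
d = 10_000          # model dimension; try d = 10 for comparison
w = np.ones(d)
for _ in range(200):
    honest = [w + rng.normal(0, 0.1, d) for _ in range(4)]  # unbiased estimates
    byzantine = [rng.normal(0, 100.0, d)]                   # arbitrary vector
    w = step(w, honest + byzantine)

# The per-coordinate DP noise is dimension-independent, but its aggregate
# effect (and hence the residual error below) grows roughly like sqrt(d).
print(f"d = {d}, distance to optimum after training: {np.linalg.norm(w):.1f}")
```

Rerunning with d = 10 versus d = 10,000 makes the dimension dependence visible: the per-coordinate privacy noise is identical in both runs, but the residual error of the robustly aggregated, privatized updates grows with √d.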