We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding has shown how to add redundancy in distributed gradient descent to guarantee convergence even if some workers are stragglers, that is, slow or non-responsive. In this work we propose an approximate gradient coding scheme called Stochastic Gradient Coding (SGC), which works when the stragglers are random. SGC distributes data points redundantly to workers according to a pair-wise balanced design, and then simply ignores the stragglers. We prove that the convergence rate of SGC mirrors that of batched Stochastic Gradient Descent (SGD) for the ℓ₂ loss function, and show how the convergence rate can improve with the redundancy. We also provide convergence bounds for more general convex loss functions. We show empirically that SGC requires only a small amount of redundancy to handle a large number of stragglers, and that it can outperform existing approximate gradient codes when the number of stragglers is large.
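To make the scheme concrete, the following is a minimal simulation sketch of the SGC idea for a least-squares (ℓ₂) loss: each data point is replicated on d workers (here via uniformly random placement, used as a simple stand-in for a pair-wise balanced design), each round's stragglers are simply ignored, and the master takes a plain gradient step on the sum of the returned partial gradients. All names and parameter values (n_workers, d, straggler_prob, the step size, and so on) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of Stochastic Gradient Coding (SGC) for least-squares loss.
# Random d-fold replication stands in for the pair-wise balanced design.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize L(A, beta) = ||A beta - y||_2^2.
m, dim = 200, 10                      # number of data points, dimension
A = rng.normal(size=(m, dim))
beta_star = rng.normal(size=dim)
y = A @ beta_star

n_workers, d = 20, 3                  # workers, replication factor (assumed values)
straggler_prob = 0.3                  # each worker straggles independently per round
step_size = 0.25 / m

# Assign each data point to d distinct workers (redundant placement).
assignment = [rng.choice(n_workers, size=d, replace=False) for _ in range(m)]
worker_data = [[] for _ in range(n_workers)]
for i, workers in enumerate(assignment):
    for w in workers:
        worker_data[w].append(i)

beta = np.zeros(dim)
for t in range(500):
    # Each responsive worker returns the sum of gradients of its local data,
    # rescaled by 1/d so that the full (straggler-free) sum equals the true
    # gradient of L.  Stragglers are simply ignored by the master.
    responsive = rng.random(n_workers) >= straggler_prob
    grad = np.zeros(dim)
    for w in range(n_workers):
        if not responsive[w]:
            continue
        idx = worker_data[w]
        if idx:
            residual = A[idx] @ beta - y[idx]
            grad += (2.0 / d) * A[idx].T @ residual
    beta -= step_size * grad          # plain gradient step on the partial sum

print("final error ||beta - beta*||_2 =", np.linalg.norm(beta - beta_star))
```

The 1/d rescaling makes the straggler-free sum equal the exact gradient, so ignoring a random subset of workers yields a stochastic, scaled version of it; this is what makes the scheme behave like batched SGD in the analysis described above.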
We focus on the setting where some of the workers may be stragglers, i.e., slow or unresponsive. This setting has been studied before in the systems community [1]-[4], and recently in the coding theory community [5]-[7]. A typical approach is to introduce some redundancy: for example, the same piece of data a_i might be held by several workers. There are several things that one might care about in such a scheme, and in this paper we focus on the following four desiderata: (A) Convergence speed. We would like the error ‖β_t − β*‖₂ to shrink as quickly as possible. (B) Redundancy. We would like to minimize the storage and computation overhead needed across the workers. (C) Communication. We would like to minimize the amount of communication between the master and the workers. (D) Flexibility. In practice, there is a great deal of variability in the number of stragglers over time. We would like an algorithm that degrades gracefully if more stragglers than expected occur.

Much existing work has focused on simulating gradient descent exactly, even in the presence of worst-case stragglers, for example [5]-[8]. In that model, at each round an arbitrary set of s workers (for a fixed s) may not respond to the master. The goal is for the master to obtain the same update β_t at round t that gradient descent would obtain. For this to happen, the master should be able to obtain an exact value of the gradient ∇L(A, β_t). This has given rise to (exact) gradient coding [5], which focuses on optimizing desiderata (A) and (C) above. However, these schemes (and necessarily, any scheme in this model) do not do so well on (B) and (D). First, it is not hard to see that in the presence of s worst-case stragglers, any n − s workers must together be able to recover all of the data, which necessitates a certain amount of overhead: namely, every data vector must be replicated on s + 1 different workers. Second, the gradient coding schemes of, for example, [5], [6] are brittle in the sense that they...