We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding has shown how to add redundancy in distributed gradient descent to guarantee convergence even if some workers are stragglers, that is, slow or non-responsive. In this work we propose an approximate gradient coding scheme called Stochastic Gradient Coding (SGC), which works when the stragglers are random. SGC distributes data points redundantly to workers according to a pair-wise balanced design, and then simply ignores the stragglers. We prove that the convergence rate of SGC mirrors that of batched Stochastic Gradient Descent (SGD) for the ℓ₂ loss function, and show how the convergence rate can improve with the redundancy. We also provide convergence bounds for more general convex loss functions. We show empirically that SGC requires only a small amount of redundancy to handle a large number of stragglers, and that it can outperform existing approximate gradient codes when the number of stragglers is large.
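To make the scheme concrete, the following is a minimal simulation sketch of the SGC idea for a least-squares (ℓ₂) loss: each data point is replicated on d workers (here via uniformly random placement, used as a simple stand-in for a pair-wise balanced design), each round's stragglers are simply ignored, and the master takes a plain gradient step on the sum of the returned partial gradients. All names and parameter values (n_workers, d, straggler_prob, the step size, and so on) are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of Stochastic Gradient Coding (SGC) for least-squares loss.
# Random d-fold replication stands in for the pair-wise balanced design.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize L(A, beta) = ||A beta - y||_2^2.
m, dim = 200, 10                      # number of data points, dimension
A = rng.normal(size=(m, dim))
beta_star = rng.normal(size=dim)
y = A @ beta_star

n_workers, d = 20, 3                  # workers, replication factor (assumed values)
straggler_prob = 0.3                  # each worker straggles independently per round
step_size = 0.25 / m

# Assign each data point to d distinct workers (redundant placement).
assignment = [rng.choice(n_workers, size=d, replace=False) for _ in range(m)]
worker_data = [[] for _ in range(n_workers)]
for i, workers in enumerate(assignment):
    for w in workers:
        worker_data[w].append(i)

beta = np.zeros(dim)
for t in range(500):
    # Each responsive worker returns the sum of gradients of its local data,
    # rescaled by 1/d so that the full (straggler-free) sum equals the true
    # gradient of L.  Stragglers are simply ignored by the master.
    responsive = rng.random(n_workers) >= straggler_prob
    grad = np.zeros(dim)
    for w in range(n_workers):
        if not responsive[w]:
            continue
        idx = worker_data[w]
        if idx:
            residual = A[idx] @ beta - y[idx]
            grad += (2.0 / d) * A[idx].T @ residual
    beta -= step_size * grad          # plain gradient step on the partial sum

print("final error ||beta - beta*||_2 =", np.linalg.norm(beta - beta_star))
```

The 1/d rescaling makes the straggler-free sum equal the exact gradient, so ignoring a random subset of workers yields a stochastic, scaled version of it; this is what makes the scheme behave like batched SGD in the analysis described above.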
We focus on the setting where some of the workers may be stragglers, i.e., slow or unresponsive. This setting has been studied before in the systems community [1]-[4], and recently in the coding theory community [5]-[7]. A typical approach is to introduce some redundancy: for example, the same piece of data a_i might be held by several workers. There are several things that one might care about in such a scheme, and in this paper we focus on the following four desiderata: (A) Convergence speed. We would like the error ‖β_t − β*‖₂ to shrink as quickly as possible. (B) Redundancy. We would like to minimize the storage and computation overhead needed across the workers. (C) Communication. We would like to minimize the amount of communication between the master and the workers. (D) Flexibility. In practice, there is a great deal of variability in the number of stragglers over time. We would like an algorithm that degrades gracefully if more stragglers than expected occur.

Much existing work has focused on simulating gradient descent exactly, even in the presence of worst-case stragglers, for example [5]-[8]. In that model, at each round an arbitrary set of s workers (for a fixed s) may not respond to the master. The goal is for the master to obtain the same update β_t at round t that gradient descent would obtain. For this to happen, the master should be able to obtain an exact value of the gradient ∇L(A, β_t). This has given rise to (exact) gradient coding [5], which focuses on optimizing desiderata (A) and (C) above. However, these schemes (and necessarily, any scheme in this model) do not do so well on (B) and (D). First, it is not hard to see that in the presence of s worst-case stragglers, any n − s workers must together be able to recover all of the data, which necessitates a certain amount of overhead: namely, every data vector must be replicated on s + 1 different workers. Second, the gradient coding schemes of, for example, [5], [6] are brittle in the sense that they...