Gradient-based distributed learning in Parameter Server (PS) computing architectures is subject to random delays due to straggling worker nodes, as well as to possible communication bottlenecks between the PS and the workers. Solutions have recently been proposed to separately address these impairments based on the ideas of gradient coding, worker grouping, and adaptive worker selection. This paper provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of gradient coding and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named Lazily Aggregated Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.

Gradient-based distributed learning in a PS architecture is subject to two key impairments. First, random computing times at the workers can cause significant slowdowns in the wall-clock run-time per iteration due to straggling workers [6]. Second, the communication overhead resulting from intensive two-way communications between the PS and the workers may require significant networking resources to be available if it is not to dominate the overall run-time [7].

Recently, solutions have been developed that aim either at improving robustness to stragglers, namely Gradient Coding (GC) and grouping [8], [9], or at reducing the communication load, namely adaptive selection [10] (see Table I for a summary).

TABLE I: Qualitative comparisons with respect to standard (distributed) Gradient Descent (GD)

                            Coding    Grouping    Adaptive selection
  Robustness to stragglers  better    better      same
  Communication load        same      same        better
  Computation load          worse     worse       better

GC, introduced in [8], increases robustness to stragglers by leveraging storage and computation redundancy at the worker nodes as compared to standard (distributed) Gradient Descent (GD) [11]. With a redundancy factor r > 1, each worker stores, and computes on, r times more data than with GD. In return, up to r − 1 stragglers can be tolerated, while still allowing the PS to exactly compute the gradient at every iteration. GC requires coding the computed gradients prior to communication from the workers to the PS, as well as decoding at the PS (see the first sketch at the end of this section).

As a special case of GC, when the redundancy factor r equals the number M of workers, each worker can store the entire dataset, and the gradient can hence be obtained from any single worker without requiring any coding or decoding operation. In the typical case in which r is smaller than M, the same simple procedure can be applied to groups of workers. In particular, given a redundancy factor r, the dataset can be partitioned so that each partition is available to all nodes of a group of r workers. The PS can then recover the gradient upon receiving the computation of any one worker from each group (see the second sketch at the end of this section). The outlined group...
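
To make the GC encoding and decoding concrete, the following is a minimal numerical sketch for M = 3 workers with redundancy r = 2, tolerating r − 1 = 1 straggler. The encoding matrix follows the well-known example construction of [8]; the function names, the toy gradient dimension, and the use of a least-squares solve at the PS are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Encoding matrix B: row i gives the linear combination of the k = 3
# partial gradients transmitted by worker i. Each row has r = 2
# non-zeros, so each worker stores and computes on 2 of the 3 partitions.
B = np.array([[0.5, 1.0,  0.0],   # worker 0 sends g1/2 + g2
              [0.0, 1.0, -1.0],   # worker 1 sends g2 - g3
              [0.5, 0.0,  1.0]])  # worker 2 sends g1/2 + g3

def worker_encode(i, partial_grads):
    """Coded message of worker i: a linear combination of partial gradients."""
    return B[i] @ partial_grads

def ps_decode(received):
    """Recover the full gradient from any M - (r - 1) = 2 workers.

    `received` maps worker index -> coded message. The PS finds
    coefficients a with B[idx].T @ a = [1, 1, 1], so that the same
    combination of the received messages equals g1 + g2 + g3.
    """
    idx = sorted(received)
    a, *_ = np.linalg.lstsq(B[idx].T, np.ones(B.shape[1]), rcond=None)
    return a @ np.stack([received[i] for i in idx])

# Toy partial gradients g1, g2, g3 for a model of dimension d = 4.
rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))

# Worker 1 straggles: the PS decodes from workers 0 and 2 alone.
messages = {i: worker_encode(i, g) for i in (0, 2)}
assert np.allclose(ps_decode(messages), g.sum(axis=0))
```

The same decoding succeeds for any pattern with at most one straggler, since every 2-row submatrix of B can combine to the all-ones vector; this is exactly the redundancy-for-robustness trade-off described above.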
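
The grouping special case admits an even simpler sketch, shown below for M = 6 workers and r = 2, i.e., three groups of two workers that each replicate one data partition. The exponential model for the computing times and all names are illustrative assumptions used only to show why no coding or decoding is needed.

```python
import numpy as np

M, r = 6, 2                 # 6 workers with redundancy r = 2 -> 3 groups
k = M // r                  # one data partition per group
rng = np.random.default_rng(1)

partial_grads = rng.standard_normal((k, 4))  # partial gradient of each partition
delay = rng.exponential(1.0, size=M)         # random computing time per worker

full_grad = np.zeros(4)
group_times = []
for g in range(k):
    members = range(g * r, (g + 1) * r)
    # All members of a group hold the same partition and would return the
    # same partial gradient, so the PS keeps only the fastest reply:
    # r - 1 stragglers per group are tolerated with no coding or decoding.
    group_times.append(min(delay[i] for i in members))
    full_grad += partial_grads[g]

wall_clock = max(group_times)  # the PS must hear from each group once
assert np.allclose(full_grad, partial_grads.sum(axis=0))
print(f"Per-iteration wall-clock time: {wall_clock:.3f}")
```

Note that the per-iteration wall-clock time is the maximum over groups of the minimum delay within each group, which illustrates how grouping trades storage redundancy for robustness to slow workers.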