2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw.2018.00137

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Abstract: Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a full gradient and the learning model is updated. However, a major performance bottleneck that arises is that some of th…
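To make the master/worker pattern described in the abstract concrete, the following is a minimal simulation sketch, not the paper's actual scheme: n workers each hold a data shard and compute partial least-squares gradients, and the master aggregates only the fastest k responses per iteration to sidestep stragglers. All names and parameters here (n, k, the exponential delay model, the learning rate) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_gradient(w, X_shard, y_shard):
    # Least-squares gradient contribution from one worker's local shard.
    return X_shard.T @ (X_shard @ w - y_shard)

def simulate_round(w, shards, k):
    # One iteration: draw random compute times and keep only the fastest k workers.
    times = rng.exponential(1.0, size=len(shards))   # hypothetical straggling model
    fastest = np.argsort(times)[:k]                  # indices of non-straggling workers
    grads = [partial_gradient(w, *shards[i]) for i in fastest]
    return sum(grads) / k                            # approximate full-gradient aggregate

# Toy data split row-wise across n workers.
n, k, d = 10, 8, 5
X = rng.normal(size=(1000, d))
y = rng.normal(size=1000)
shards = [(X[i::n], y[i::n]) for i in range(n)]

w = np.zeros(d)
for _ in range(200):
    w -= 1e-3 * simulate_round(w, shards, k)
```

Waiting for only k of n responses trades some gradient accuracy (or, in coded schemes, extra redundancy) for a shorter iteration time when a few workers run slow, which is the bottleneck the abstract points to.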

Cited by 88 publications (81 citation statements)
References 22 publications
“…3) The heterogeneity of computation, storage and communication capabilities across different devices brings unique system challenges to tame latency for on-device distributed training, e.g., the stragglers (i.e., devices that run slow) may cause significant delays [8], [17]. 4) The arbitrarily adversarial behaviors of the devices (e.g., …”
Section: Introduction (mentioning; confidence: 99%)
“…Since err_F(E) does not depend on the specific set of stragglers, but only on its size, we get (5) from (17)…”
Section: Appendix A: Matrix Inversion Lemma (mentioning; confidence: 99%)
“…Uncoded distributed computation with MMC (UC-MMC) is introduced in [5,13,14], and is shown to outperform coded computation in terms of average completion time, concluding that coded computation is more effective against persistent stragglers, particularly when the full gradient is required at each iteration. Coded GD strategies are mainly designed for full gradient computation; hence, the master needs to wait until all the gradients can be recovered.…”
Section: Introduction (mentioning; confidence: 99%)
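The excerpt above contrasts uncoded computation with coded gradient descent, where the master recovers the full gradient from encoded worker messages. As an illustration only, using the textbook 3-worker, 1-straggler construction rather than the scheme of this paper or of the works cited in the excerpt, the sketch below has each worker send a fixed linear combination of partial gradients so that any two responses suffice to decode the full gradient.

```python
import numpy as np

# Toy partial gradients g1, g2, g3 (one per data partition); the target is their sum.
g1, g2, g3 = np.random.default_rng(1).normal(size=(3, 4))
full = g1 + g2 + g3

# Encoded messages: fixed linear combinations sent by the three workers.
w1 = 0.5 * g1 + g2
w2 = g2 - g3
w3 = 0.5 * g1 + g3

# Decoding coefficients for each pair of workers that might respond first.
decoders = {
    (1, 2): (2.0, -1.0),  # 2*w1 - 1*w2 = g1 + g2 + g3
    (1, 3): (1.0, 1.0),   # 1*w1 + 1*w3 = g1 + g2 + g3
    (2, 3): (1.0, 2.0),   # 1*w2 + 2*w3 = g1 + g2 + g3
}
msgs = {1: w1, 2: w2, 3: w3}
for (i, j), (a, b) in decoders.items():
    recovered = a * msgs[i] + b * msgs[j]
    assert np.allclose(recovered, full)  # full gradient recovered despite one straggler
```

This makes the excerpt's point visible: decoding targets the exact full sum g1 + g2 + g3, so the master must wait until enough encoded messages arrive to recover it, which is why coded schemes pay off mainly when the full gradient is required and stragglers are persistent.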