2018 IEEE International Symposium on Information Theory (ISIT)
DOI: 10.1109/isit.2018.8437549
Straggler-Proofing Massive-Scale Distributed Matrix Multiplication with D-Dimensional Product Codes

Cited by 72 publications (56 citation statements). References 9 publications.
“…The scheme in [13] requires an additional decoding phase and assumes the existence of a powerful master that can store the entire product C in memory and decode the missing blocks using the redundant chunks. The same is true of the other schemes in [14]-[16]. Moreover, these schemes fail when the number of stragglers exceeds the provisioned redundancy, whereas OverSketch degrades gracefully: one can ignore more workers than provisioned at the cost of accuracy in the result.…”
Section: Comparison With Existing Straggler Mitigation Schemes
confidence: 98%
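The decoding phase this excerpt refers to is easiest to see with a single parity chunk. Below is a minimal sketch, assuming a (k+1, k) single-parity code on row blocks of A (an illustrative simplification, not the exact scheme in [13]): each worker computes one block product, a parity worker computes the sum-block product, and the master recovers a straggler's block by subtraction, which is why it must hold the entire product C.

```python
import numpy as np

# Minimal sketch, assuming a (k+1, k) single-parity code on row blocks
# of A (a simplification for illustration, not the paper's construction).
# Worker i computes A_i @ B; the parity worker computes
# (A_1 + ... + A_k) @ B. If one worker straggles, the master recovers
# its block by subtraction -- the extra "decoding phase" that requires
# the master to hold the full product C.

rng = np.random.default_rng(0)
k = 4                                  # number of data blocks
A_blocks = [rng.standard_normal((2, 3)) for _ in range(k)]
B = rng.standard_normal((3, 5))

parity = sum(A_blocks)                 # encoding: one redundant chunk
tasks = A_blocks + [parity]            # k + 1 workers in total

results = [Ai @ B for Ai in tasks]     # each worker's local product
straggler = 2                          # pretend worker 2 never responds
results[straggler] = None

# Master-side decode: missing block = parity result - received blocks.
received = [r for i, r in enumerate(results[:k]) if i != straggler]
recovered = results[k] - sum(received)
assert np.allclose(recovered, A_blocks[straggler] @ B)
```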
“…A simpler version of this has been known in the HPC community as Algorithm-Based Fault Tolerance (ABFT) [18]. The authors in [14] generalize the results in [13] to a d-dimensional product code with only one parity in each dimension. In [15], the authors develop polynomial codes for matrix multiplication, which improve over [13] in terms of the recovery threshold, that is, the minimum number of workers required to recover the product C.…”
Section: B Related Work
confidence: 99%
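The d-dimensional product code with one parity per dimension can be sketched for d = 2. In the sketch below, the sum parities on the row blocks of A and column blocks of B, the grid layout, and the peeling step are illustrative assumptions: worker (i, j) computes A_i @ B_j, so the output grid is itself a product code, and a missing block is recovered from the parity along its row or column.

```python
import numpy as np

# Minimal sketch of a 2-D product code with one parity per dimension
# (the d = 2 case of the construction the excerpt describes; block
# sizes and the decoding step here are illustrative assumptions).
# A is split into k row blocks plus a row-sum parity; B into k column
# blocks plus a column-sum parity.

rng = np.random.default_rng(1)
k = 3
A_blocks = [rng.standard_normal((2, 4)) for _ in range(k)]
B_blocks = [rng.standard_normal((4, 2)) for _ in range(k)]
A_blocks.append(sum(A_blocks))         # parity block in the row dimension
B_blocks.append(sum(B_blocks))         # parity block in the column dimension

# (k+1) x (k+1) grid of worker results; mark one straggler.
C = [[Ai @ Bj for Bj in B_blocks] for Ai in A_blocks]
si, sj = 1, 2
C[si][sj] = None

# Peel along the row: the row's parity result minus the other data blocks.
C[si][sj] = C[si][k] - sum(C[si][j] for j in range(k) if j != sj)
assert np.allclose(C[si][sj], A_blocks[si] @ B_blocks[sj])
```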
“…MDS codes, however, have the disadvantage of high encoding and decoding complexity, which can be restrictive in setups with a large number of workers. [2] attacks this problem by presenting a coded computation scheme based on d-dimensional product codes. [14] presents a scheme referred to as polynomial codes for coded matrix multiplication with input matrices over a large finite field.…”
Section: B Related Work
confidence: 99%
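The polynomial codes mentioned in these excerpts admit a short sketch as well. The version below works over the reals for readability, whereas the actual construction operates over a large finite field; the block counts, evaluation points, and choice of responding workers are illustrative assumptions. Any m*n responses suffice to interpolate the product, which is the recovery threshold discussed above.

```python
import numpy as np

# Minimal sketch of polynomial codes over the reals (illustrative; the
# real scheme uses a large finite field to avoid numerical issues).
# A is split into m row blocks, B into n column blocks. Worker w returns
# C(x_w) = A(x_w) @ B(x_w), where A(x) = sum_i A_i x^i and
# B(x) = sum_j B_j x^(j*m). C(x) has degree m*n - 1, so any m*n results
# suffice to interpolate every block C_ij = A_i @ B_j.

rng = np.random.default_rng(2)
m, n = 2, 2
A_blocks = [rng.standard_normal((2, 3)) for _ in range(m)]
B_blocks = [rng.standard_normal((3, 2)) for _ in range(n)]

def A_poly(x):
    return sum(Ai * x**i for i, Ai in enumerate(A_blocks))

def B_poly(x):
    return sum(Bj * x**(j * m) for j, Bj in enumerate(B_blocks))

# 6 workers provisioned; any m*n = 4 of them are enough to decode.
points = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
results = {x: A_poly(x) @ B_poly(x) for x in points}
fastest = points[[0, 2, 3, 5]]          # pretend these responded first

# Interpolate each entry of the degree m*n - 1 matrix polynomial C(x).
V = np.vander(fastest, m * n, increasing=True)
stacked = np.stack([results[x] for x in fastest])       # (m*n, 2, 2)
coeffs = np.linalg.solve(V, stacked.reshape(m * n, -1))

# The coefficient of x^(i + j*m) is the product block A_i @ B_j.
C_01 = coeffs[0 + 1 * m].reshape(2, 2)
assert np.allclose(C_01, A_blocks[0] @ B_blocks[1])
```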
“…Coding has been applied to distributed fog computing and machine learning to deal with the problem of stragglers [4] and to reduce the usage of computation and communication resources [5]. Within coded distributed machine learning [2], matrix multiplication [6], [7] and gradient descent [8], [9] have attracted considerable attention.…”
Section: Introduction
confidence: 99%
“…However, in a large-scale network with thousands of nodes, these MDS-based codes become impractical [1], [7] because of the high computation and communication costs associated with encoding and decoding [13]. In addition, most previous works consider the master-worker pattern.…”
Section: Introduction
confidence: 99%