Songze Li scite author profile

Fundamental tradeoff between computation and communication in distributed computing

How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing, i.e., the two are inversely proportional to each other. More specifically, a general distributed computing framework, motivated by commonly used structures like MapReduce, is considered, where the overall computation is decomposed into computing a set of "Map" and "Reduce" functions distributedly across multiple computing nodes. A coded scheme, named "Coded Distributed Computing" (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of r (i.e., evaluating each function at r carefully chosen nodes) can create novel coding opportunities that reduce the communication load by the same factor. An information-theoretic lower bound on the communication load is also provided, which matches the communication load achieved by the CDC scheme. As a result, the optimal computation-communication tradeoff in distributed computing is exactly characterized. Finally, the coding techniques of CDC are applied to the Hadoop TeraSort benchmark to develop a novel CodedTeraSort algorithm, which is empirically demonstrated to speed up the overall job execution by 1.97× to 3.39× for typical settings of interest.
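The inverse proportionality described above can be made concrete with a small calculation. This is a minimal sketch that assumes the normalized load expressions associated with the CDC line of work: with K nodes and each Map function evaluated at r of them (1 <= r <= K), an uncoded shuffle has load 1 - r/K, and coded multicasting divides that by r. The function names are illustrative, not taken from the paper.

```python
from fractions import Fraction

def uncoded_load(r: int, K: int) -> Fraction:
    """Normalized shuffle load when each Map task runs on r of K nodes but
    intermediate values are sent uncoded: every node still needs the values
    from the fraction 1 - r/K of files it did not map (assumed model)."""
    if not 1 <= r <= K:
        raise ValueError("computation load r must satisfy 1 <= r <= K")
    return 1 - Fraction(r, K)

def cdc_load(r: int, K: int) -> Fraction:
    """Normalized shuffle load with coded multicasting: each coded packet is
    useful to r nodes at once, so the uncoded load is divided by r."""
    return uncoded_load(r, K) / r

K = 10
for r in range(1, K + 1):
    print(f"r={r:2d}  uncoded={float(uncoded_load(r, K)):.3f}  "
          f"CDC={float(cdc_load(r, K)):.3f}")
```

Exact fractions are used so that the factor-of-r relationship between the two columns can be checked without rounding error.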


A Unified Coding Framework for Distributed Computing with Straggling Servers

Maddah-Ali

Avestimehr

2016



We propose a unified coded framework for distributed computing with straggling servers by introducing a tradeoff between "latency of computation" and "load of communication" for some linear computation tasks. We show that the coded scheme of [1]-[3], which repeats the intermediate computations to create coded multicasting opportunities that reduce the communication load, and the coded scheme of [4], [5], which generates redundant intermediate computations to combat straggling servers, can be viewed as special instances of the proposed framework, obtained by considering the two extremes of this tradeoff: minimizing either the load of communication or the latency of computation individually. Furthermore, the latency-load tradeoff achieved by the proposed coded framework allows one to operate systematically at any point on that tradeoff when performing distributed computing tasks. We also prove an information-theoretic lower bound on the latency-load tradeoff, which is shown to be within a constant multiplicative gap of the achieved tradeoff at the two end points.
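The latency-minimizing extreme, generating redundant computations so that slow servers can be ignored, can be illustrated with the simplest MDS-coded matrix-vector multiplication. This is a generic (3, 2) parity example chosen for illustration, not the unified scheme of the paper: A is split into two row blocks, a third worker computes on their sum, and the results of any two of the three workers recover Ax. The matrix, vector, and helper names are hypothetical.

```python
def matvec(rows, x):
    """Multiply a list-of-rows matrix by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

def add_blocks(P, Q):
    """Entry-wise sum of two equally shaped row blocks."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def encode(A):
    """Split A into two row blocks plus a parity block (assumes an even row count)."""
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    return {0: A1, 1: A2, 2: add_blocks(A1, A2)}

def decode(results):
    """Recover Ax from any two worker outputs (worker id -> partial product)."""
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:  # worker 1 straggled: A2 x = (A1 + A2) x - A1 x
        y1 = results[0]
        y2 = [s - a for s, a in zip(results[2], y1)]
    else:               # worker 0 straggled: A1 x = (A1 + A2) x - A2 x
        y2 = results[1]
        y1 = [s - b for s, b in zip(results[2], y2)]
    return y1 + y2

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
tasks = encode(A)
# Suppose worker 0 is a straggler: only workers 1 and 2 report back.
finished = {w: matvec(tasks[w], x) for w in (1, 2)}
print(decode(finished))  # [-1, -1, -1, -1]
```

Each worker does half of the total work instead of a third, which is the extra computation paid to tolerate one straggler; the framework in the abstract interpolates between this kind of redundancy and the communication-saving repetition of [1]-[3].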


Coded MapReduce

2015


MapReduce is a commonly used framework for executing data-intensive tasks on distributed server clusters. We present "Coded MapReduce", a new framework that enables and exploits a particular form of coding to significantly reduce the inter-server communication load of MapReduce. In particular, Coded MapReduce exploits the repetitive mapping of data blocks at different servers to create coded multicasting opportunities in the shuffling phase, cutting down the total communication load by a multiplicative factor that grows linearly with the number of servers in the cluster. We also analyze the tradeoff between the "computation load" and the "communication load" of the Coded MapReduce.
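The coded multicasting opportunity created by repetitive mapping can be seen in a toy shuffle with K = 3 servers, where each file is mapped at r = 2 of them. This sketch assumes that server k maps every file except file k and needs a single intermediate value v[k] computed from that file; splitting each value into halves and the particular sender assignment are illustrative choices, not details taken from the paper. Each server multicasts one XOR of two half-values, and each receiver cancels the half it can compute locally.

```python
import os

K = 3  # servers; file k is mapped by every server except server k (r = 2)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

def halves(value: bytes):
    mid = len(value) // 2
    return value[:mid], value[mid:]

def others(k):
    return [t for t in range(K) if t != k]

def shuffle(v):
    """Server s multicasts one XOR of a half of v[i] and a half of v[j],
    where i and j are the two other servers (s has mapped files i and j)."""
    msgs = {}
    for s in range(K):
        i, j = others(s)
        hi = others(i).index(s)  # which half of v[i] this sender covers
        hj = others(j).index(s)  # which half of v[j] this sender covers
        msgs[s] = (i, hi, j, hj, xor(halves(v[i])[hi], halves(v[j])[hj]))
    return msgs

def decode(k, known, msgs):
    """Server k recovers v[k] using the values it computed locally (known)."""
    got = {}
    for i, hi, j, hj, payload in msgs.values():
        if k == i:
            got[hi] = xor(payload, halves(known[j])[hj])
        elif k == j:
            got[hj] = xor(payload, halves(known[i])[hi])
    return got[0] + got[1]

v = {k: os.urandom(8) for k in range(K)}  # v[k] is needed by server k
msgs = shuffle(v)
for k in range(K):
    known = {t: v[t] for t in others(k)}  # server k mapped every file but file k
    assert decode(k, known, msgs) == v[k]
sent = sum(len(m[-1]) for m in msgs.values())
print(f"coded bytes sent: {sent}, uncoded would need: {K * 8}")
# coded bytes sent: 12, uncoded would need: 24
```

Every transmitted byte is useful to two servers at once, so the shuffle carries half the data of an uncoded exchange, matching the factor r = 2.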


Coded Merkle Tree: Solving Data Availability Attacks in Blockchains

Sahraei

Li

et al. 2020



A Fundamental Tradeoff Between Computation and Communication in Distributed Computing

Maddah-Ali

et al. 2018

IEEE Trans. Inform. Theory



Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy

Yu

Li

Raviv

et al. 2018

Preprint


We consider a scenario involving computations over a massive dataset stored distributedly across multiple workers, which is at the core of distributed learning algorithms. We propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide (1) resiliency against stragglers that may prolong computations; (2) security against Byzantine (or malicious) workers that deliberately modify the computation for their benefit; and (3) (information-theoretic) privacy of the dataset amidst possible collusion of workers. LCC, which leverages the well-known Lagrange polynomial to create computation redundancy in a novel coded form across workers, can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in machine learning. LCC significantly generalizes prior works to go beyond linear computations. It also enables secure and private computing in distributed settings, improving on the computation and communication efficiency of the state of the art. Furthermore, we prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy, i.e., in terms of tolerating the maximum number of stragglers and adversaries, and providing data privacy against the maximum number of colluding workers. Finally, we show via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43×, and also achieves a 2.36× to 12.65× speedup over the state-of-the-art straggler mitigation strategies.
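The Lagrange encoding idea behind straggler resiliency can be sketched in a few lines. LCC works over a finite field; for readability this sketch instead uses exact rationals (`fractions.Fraction`), scalar data, a hypothetical degree-2 function f(x) = x², and evaluation points chosen arbitrarily. Only the straggler-resilience part is shown: with k = 2 data points, f(u(z)) has degree deg(f)·(k - 1) = 2, so any 3 of the 5 workers suffice. The security and privacy properties are not modeled here.

```python
from fractions import Fraction

def interpolate_at(xs, ys, z):
    """Evaluate at z the unique polynomial of degree < len(xs) through (xs, ys)."""
    total = Fraction(0)
    for j, xj in enumerate(xs):
        term = Fraction(ys[j])
        for m, xm in enumerate(xs):
            if m != j:
                term = term * (z - xm) / (xj - xm)
        total += term
    return total

def f(x):
    return x * x  # hypothetical polynomial workload, deg f = 2

data = [3, 7]              # X_1, X_2 (k = 2)
betas = [0, 1]             # u(beta_j) = X_j
alphas = [2, 3, 4, 5, 6]   # worker i stores u(alpha_i) (N = 5)

# Encoding: each worker receives one evaluation of the Lagrange polynomial u.
coded = {a: interpolate_at(betas, data, a) for a in alphas}
# Each worker applies f to its coded input, i.e. evaluates f(u(alpha_i)).
results = {a: f(c) for a, c in coded.items()}

# deg f(u(z)) = deg(f) * (k - 1) = 2, so any 3 results determine it.
needed = 2 * (len(data) - 1) + 1
survivors = [2, 5, 6]      # workers at alpha = 3 and 4 straggle
assert len(survivors) >= needed
ys = [results[a] for a in survivors]
decoded = [interpolate_at(survivors, ys, b) for b in betas]
print(decoded)  # [Fraction(9, 1), Fraction(49, 1)], i.e. f(3) and f(7)
```

Evaluating the interpolated polynomial back at the points beta_j recovers f(X_j) for every data point, even though no worker ever saw an uncoded X_j.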


A Scalable Framework for Wireless Distributed Computing

Maddah-Ali

et al. 2017

IEEE/ACM Trans. Networking



Near-Optimal Straggler Mitigation for Distributed Gradient Methods

Kalan

Avestimehr

et al. 2018


Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes, in which distributed worker nodes compute partial gradients based on their partial and local data sets and send the results to a master node, where all the computations are aggregated into a full gradient and the learning model is updated. However, a major performance bottleneck is that some of the worker nodes may run slow. These slow nodes, a.k.a. stragglers, can significantly delay the computation, since the slowest node may dictate the overall computation time. We propose a distributed computing scheme, called Batched Coupon's Collector (BCC), to alleviate the effect of stragglers in gradient methods. We prove that our BCC scheme is robust to a near-optimal number of random stragglers. We also empirically demonstrate that our proposed BCC scheme reduces the run-time by up to 85.4% over Amazon EC2 clusters when compared with other straggler mitigation strategies. We also generalize the proposed BCC scheme to minimize the completion time when implementing gradient descent-based algorithms over heterogeneous worker nodes.
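The abstract does not spell out the mechanism, but the scheme's name points to the classic coupon-collector argument. This sketch assumes that reading: the data set is split into B batches, each worker independently picks one batch uniformly at random and computes its partial gradient, and the master stops waiting once every batch has been covered by at least one finished worker. The simulation, its parameters, and the function names are hypothetical; under this model the expected number of finished workers needed is B·H_B, where H_B is the B-th harmonic number.

```python
import random
from fractions import Fraction

def workers_until_all_batches(B: int, rng: random.Random) -> int:
    """Count finished workers, each holding one uniformly random batch,
    until all B batches are covered (assumes enough workers are available)."""
    covered = set()
    finished = 0
    while len(covered) < B:
        covered.add(rng.randrange(B))  # batch held by the next worker to finish
        finished += 1
    return finished

def expected_workers(B: int) -> Fraction:
    """Coupon-collector expectation B * H_B under the model above."""
    return B * sum(Fraction(1, i) for i in range(1, B + 1))

rng = random.Random(0)  # fixed seed for reproducibility
B = 10
trials = [workers_until_all_batches(B, rng) for _ in range(10_000)]
print(f"simulated mean: {sum(trials) / len(trials):.2f}")
print(f"B * H_B:        {float(expected_workers(B)):.2f}")
```

Because the master needs only one result per batch rather than one per worker, the slowest workers can be ignored once coverage is complete, which is where the straggler robustness comes from in this model.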


