2020
DOI: 10.48550/arxiv.2002.12410
Preprint

On Biased Compression for Distributed Learning

Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stocha…

Cited by 49 publications (103 citation statements)
References 12 publications
“…However, such a simple strategy does not converge to the exact solution due to the compression error, and may even lead to divergence as the compression error accumulates. Examples have been provided in [11], [12] to illustrate this. Therefore, communication compression in decentralized algorithms has gained considerable attention recently.…”
Section: A. Related Work and Motivation (mentioning)
confidence: 99%
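To make the accumulation argument in the excerpt above concrete, here is a minimal NumPy sketch (hypothetical code, not taken from either cited paper) of the naive scheme: every worker transmits a greedily compressed gradient, the server averages what it receives, and nothing ever corrects the per-step gap between the true average gradient and the compressed one, so that gap can build up over iterations.

import numpy as np

def top_k(v, k):
    """Biased greedy compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def naive_compressed_step(x, local_grads, lr, k):
    """One step of naively compressed distributed GD: the server averages the
    compressed gradients C(g_i). The discrepancy between mean(g_i) and
    mean(C(g_i)) is simply discarded here, which is the error that can
    accumulate across iterations."""
    avg_compressed = np.mean([top_k(g, k) for g in local_grads], axis=0)
    return x - lr * avg_compressed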
“…[33]-[37], and also includes biased and non-contractive compressors, such as the norm-sign compressor. Moreover, it is straightforward to check that the class of compressors satisfying Assumption 1 also covers the three classes of biased compressors considered in [12]. In other words, Assumption 1 is weaker than various commonly used assumptions for compressors in the literature.…”
Section: A. Compressors (mentioning)
confidence: 99%
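For illustration, a small NumPy sketch of one norm-sign variant together with the relative compression error that typical compressor assumptions bound; the choice of the infinity-norm scaling and the function names are my own assumptions, since the exact definition of the norm-sign compressor varies across the cited works.

import numpy as np

def norm_sign(x):
    """A norm-sign style compressor: transmit sign(x) plus one scalar norm.
    The l-infinity scaling used here is an assumed variant for illustration."""
    return np.linalg.norm(x, ord=np.inf) * np.sign(x)

def relative_error(compress, x):
    """Empirical ratio ||C(x) - x||^2 / ||x||^2, the quantity that contractive
    compressor assumptions bound by a constant strictly below 1."""
    return np.linalg.norm(compress(x) - x) ** 2 / np.linalg.norm(x) ** 2

x = np.random.default_rng(0).standard_normal(1000)
print(relative_error(norm_sign, x))  # typically exceeds 1 here: biased and non-contractive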
“…To remove the synchronization barrier, asynchronous update methods have been proposed [52,64,68,76,120,123]. There are also approaches that combine multiple strategies listed above [15,17,44,56,80]. On the other hand, research on model parallelism studies how to allocate model parameters and training computation across compute units in a cluster to maximize training throughput and minimize communication overheads.…”
Section: Distributed Deep Learning (mentioning)
confidence: 99%
“…A common choice of C(•) is top-k (Basu et al., 2019) or the sign operation (which leads to signSGD (Bernstein et al., 2018)). Although this naive compression method is intuitive, it can diverge in practice, even on simple quadratic problems (Beznosikov et al., 2020) or constrained linear problems (Karimireddy et al., 2019). Intuitively, one of the major drawbacks of naive compression is that the compression error accumulates during the training process.…”
Section: Existing Solutions and Drawbacks (mentioning)
confidence: 99%
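As a concrete instance of the sign choice for C(•) mentioned above, the sketch below (single-worker case only, with my own naming) shows the one-bit-per-coordinate compressor and the resulting signSGD-style update; the multi-worker majority-vote variant studied by Bernstein et al. is not shown.

import numpy as np

def sign_compress(g):
    """One bit per coordinate: only the signs of the gradient are kept."""
    return np.sign(g)

def signsgd_step(x, grad, lr):
    """signSGD-style update x <- x - lr * sign(g); applying this naively,
    with no error correction, is exactly the scheme criticised above."""
    return x - lr * sign_compress(grad)

# usage sketch on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x = np.array([1.0, -2.0, 0.5])
x = signsgd_step(x, grad=x, lr=0.1)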
“…It directly compresses the fresh local gradient on each worker before uploading it to the server. However, the compression can slow down convergence or even cause divergence (Beznosikov et al., 2020) due to the loss of information at each compression step. Later, the error feedback strategy (Stich et al., 2018; Karimireddy et al., 2019) was proposed to alleviate this problem and reduce the information loss by maintaining a compensating error sequence.…”
Section: Introduction (mentioning)
confidence: 99%
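A minimal single-worker sketch of that compensating error sequence (the function name and signature are mine, not from the cited papers): the residual dropped by the compressor is carried over and added back before the next compression.

import numpy as np

def error_feedback_step(x, e, grad, compress, lr):
    """One error-feedback step in the spirit of Stich et al. (2018) and
    Karimireddy et al. (2019): compress the step plus the accumulated
    residual, apply the compressed part, and keep what was dropped."""
    p = lr * grad + e            # add back what compression previously lost
    delta = compress(p)          # only this compressed update is transmitted
    return x - delta, p - delta  # new iterate, new residual

# usage sketch with a scaled-sign compressor
scaled_sign = lambda v: (np.linalg.norm(v, 1) / v.size) * np.sign(v)
x, e = np.zeros(3), np.zeros(3)
x, e = error_feedback_step(x, e, grad=np.array([0.5, -2.0, 0.1]), compress=scaled_sign, lr=0.1)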