2019
DOI: 10.48550/arxiv.1905.10988
Preprint

Natural Compression for Distributed Deep Learning

Abstract: Due to their hunger for big data, modern deep learning models are trained in parallel, often in distributed environments, where communication of model updates is the bottleneck. Various update compression (e.g., quantization, sparsification, dithering) techniques [2,47,48,22] have been proposed in recent years as a successful tool to alleviate this problem. In this work, we introduce a new, remarkably simple and theoretically and practically effective compression technique, which we call natural compression (C…
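A minimal NumPy sketch of the kind of compressor the abstract refers to: unbiased stochastic rounding of each coordinate to a nearby power of two. The function name and vectorized layout are my own reading of the technique, not the authors' reference implementation.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Unbiased stochastic rounding of each entry of x to a power of two.

    Sketch only: for t != 0, let a = 2^floor(log2 |t|) and b = 2a. Return
    sign(t)*a with probability (b - |t|)/a and sign(t)*b otherwise, so that
    E[C(t)] = t (the compressor is unbiased).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    t = np.abs(x[nz])
    a = 2.0 ** np.floor(np.log2(t))        # nearest power of two below |t|
    b = 2.0 * a                            # nearest power of two above |t|
    p_down = (b - t) / a                   # probability of rounding down
    rounded = np.where(rng.random(t.shape) < p_down, a, b)
    out[nz] = np.sign(x[nz]) * rounded
    return out

# Example: entries become signed powers of two, equal to the input in expectation.
g = np.array([0.3, -1.7, 5.0, 0.0])
print(natural_compression(g))   # e.g. [ 0.25 -2.    4.    0.  ]
```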

Cited by 38 publications (59 citation statements)
References 17 publications

“…The purpose of the experiment is to understand whether the MASHA1 and MASHA2 methods are superior to those in the literature. As a comparison, we take QGD with natural dithering (Horvath et al., 2019), classical error feedback with Top-30% compression, as well as an extra-step method in which each step uses natural rounding. In MASHA1 (Algorithm 1) we also used natural dithering, and in MASHA2 (Algorithm 2), Top-30%.…”
Section: Bilinear Saddle Point Problem (mentioning)
confidence: 99%
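The "Top-30% compression" used as a baseline in this excerpt is the standard Top-K sparsifier: keep the fraction of coordinates with the largest magnitudes and zero out the rest. A short sketch under that assumption (the function name and NumPy layout are my own):

```python
import numpy as np

def top_k(x, fraction=0.3):
    """Keep the `fraction` of coordinates with largest magnitude, zero the rest."""
    x = np.asarray(x, dtype=np.float64).ravel()   # assume a flat parameter/gradient vector
    k = max(1, int(fraction * x.size))
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]     # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Classical error feedback keeps the discarded part and adds it back next round:
#   e_next = (g + e) - top_k(g + e)
```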
“…The operator is allowed to be randomized, and typically operates on models (Khaled & Richtárik, 2019) or on gradients (Alistarh et al., 2017; Beznosikov et al., 2020), both of which can be described as vectors in R^d. Besides sparsification (Alistarh et al., 2018), typical examples of useful compression mechanisms include quantization (Alistarh et al., 2017; Horváth et al., 2019a) and low-rank approximation (Vogels et al., 2019; Safaryan et al., 2021).…”
Section: EF21 With Bells and Whistles (mentioning)
confidence: 99%
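For the low-rank approximation mentioned in this excerpt, the basic idea is to communicate a rank-r factorization of a gradient matrix instead of the full matrix. A hedged sketch using a truncated SVD for clarity; practical schemes such as PowerSGD use cheaper power iterations, and the function name here is my own:

```python
import numpy as np

def low_rank_compress(grad_matrix, r=2):
    """Return rank-r factors (P, Q) with grad_matrix ≈ P @ Q.T.

    Only (n + m) * r numbers need to be communicated instead of n * m.
    """
    U, s, Vt = np.linalg.svd(grad_matrix, full_matrices=False)
    P = U[:, :r] * s[:r]          # n x r, columns scaled by singular values
    Q = Vt[:r].T                  # m x r
    return P, Q

G = np.random.randn(256, 128)     # stand-in for a layer's gradient matrix
P, Q = low_rank_compress(G, r=4)
print(P.shape, Q.shape, np.linalg.norm(G - P @ Q.T) / np.linalg.norm(G))
```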
“…It is well known that small floating-point error does not dramatically affect the convergence and final accuracy of ML models [16,20,24,72]. This observation has motivated extensive prior research on training with low- or mixed-precision FP operations [20,26,47,51,80,120] and on compression or quantization [36,40,45,72].…”
Section: Characteristics of Training Gradients (mentioning)
confidence: 99%
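As a toy illustration of the point that small floating-point error is well tolerated, one can cast a gradient-sized vector to float16 and measure the round-trip error; this demo is not taken from any of the cited works.

```python
import numpy as np

g = np.random.randn(10_000).astype(np.float32)       # stand-in for a gradient vector
g16 = g.astype(np.float16)                            # low-precision copy
rel_err = np.linalg.norm(g - g16.astype(np.float32)) / np.linalg.norm(g)
print(f"relative round-trip error: {rel_err:.2e}")    # on the order of 1e-4 for float16
```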