2019
DOI: 10.48550/arxiv.1905.10988
Preprint

Natural Compression for Distributed Deep Learning

Abstract: Due to their hunger for big data, modern deep learning models are trained in parallel, often in distributed environments, where communication of model updates is the bottleneck. Various update compression (e.g., quantization, sparsification, dithering) techniques [2,47,48,22] have been proposed in recent years as a successful tool to alleviate this problem. In this work, we introduce a new, remarkably simple and theoretically and practically effective compression technique, which we call natural compression (C…
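A minimal NumPy sketch of the kind of compressor the abstract refers to: unbiased stochastic rounding of each coordinate to a nearby power of two. The function name and vectorized layout are my own reading of the technique, not the authors' reference implementation.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Unbiased stochastic rounding of each entry of x to a power of two.

    Sketch only: for t != 0, let a = 2^floor(log2 |t|) and b = 2a. Return
    sign(t)*a with probability (b - |t|)/a and sign(t)*b otherwise, so that
    E[C(t)] = t (the compressor is unbiased).
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    t = np.abs(x[nz])
    a = 2.0 ** np.floor(np.log2(t))        # nearest power of two below |t|
    b = 2.0 * a                            # nearest power of two above |t|
    p_down = (b - t) / a                   # probability of rounding down
    rounded = np.where(rng.random(t.shape) < p_down, a, b)
    out[nz] = np.sign(x[nz]) * rounded
    return out

# Example: entries become signed powers of two, equal to the input in expectation.
g = np.array([0.3, -1.7, 5.0, 0.0])
print(natural_compression(g))   # e.g. [ 0.25 -2.    4.    0.  ]
```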

Cited by 38 publications (59 citation statements)
References 17 publications

“…The purpose of the experiment is to understand whether the MASHA1 and MASHA2 methods are superior to those in the literature. As a comparison, we take QGD with natural dithering (Horvath et al., 2019), classical error feedback with Top-30% compression, as well as an extra-step method in which each step uses natural rounding. In MASHA1 (Algorithm 1) we also used natural dithering, and in MASHA2 (Algorithm 2), Top-30%.…”
Section: Bilinear Saddle Point Problem (mentioning)
confidence: 99%
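The "Top-30% compression" used as a baseline in this excerpt is the standard Top-K sparsifier: keep the fraction of coordinates with the largest magnitudes and zero out the rest. A short sketch under that assumption (the function name and NumPy layout are my own):

```python
import numpy as np

def top_k(x, fraction=0.3):
    """Keep the `fraction` of coordinates with largest magnitude, zero the rest."""
    x = np.asarray(x, dtype=np.float64).ravel()   # assume a flat parameter/gradient vector
    k = max(1, int(fraction * x.size))
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]     # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Classical error feedback keeps the discarded part and adds it back next round:
#   e_next = (g + e) - top_k(g + e)
```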
“…The operator is allowed to be randomized, and typically operates on models (Khaled & Richtárik, 2019) or on gradients (Alistarh et al., 2017; Beznosikov et al., 2020), both of which can be described as vectors in R^d. Besides sparsification (Alistarh et al., 2018), typical examples of useful compression mechanisms include quantization (Alistarh et al., 2017; Horváth et al., 2019a) and low-rank approximation (Vogels et al., 2019; Safaryan et al., 2021).…”
Section: EF21 With Bells and Whistles (mentioning)
confidence: 99%
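For the low-rank approximation mentioned in this excerpt, the basic idea is to communicate a rank-r factorization of a gradient matrix instead of the full matrix. A hedged sketch using a truncated SVD for clarity; practical schemes such as PowerSGD use cheaper power iterations, and the function name here is my own:

```python
import numpy as np

def low_rank_compress(grad_matrix, r=2):
    """Return rank-r factors (P, Q) with grad_matrix ≈ P @ Q.T.

    Only (n + m) * r numbers need to be communicated instead of n * m.
    """
    U, s, Vt = np.linalg.svd(grad_matrix, full_matrices=False)
    P = U[:, :r] * s[:r]          # n x r, columns scaled by singular values
    Q = Vt[:r].T                  # m x r
    return P, Q

G = np.random.randn(256, 128)     # stand-in for a layer's gradient matrix
P, Q = low_rank_compress(G, r=4)
print(P.shape, Q.shape, np.linalg.norm(G - P @ Q.T) / np.linalg.norm(G))
```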
“…It is well known that small floating-point error does not dramatically affect the convergence and final accuracy of ML models [16,20,24,72]. This observation has motivated extensive prior research on training with low- or mixed-precision FP operations [20,26,47,51,80,120] and on compression or quantization [36,40,45,72].…”
Section: Characteristics of Training Gradients (mentioning)
confidence: 99%
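As a toy illustration of the point that small floating-point error is well tolerated, one can cast a gradient-sized vector to float16 and measure the round-trip error; this demo is not taken from any of the cited works.

```python
import numpy as np

g = np.random.randn(10_000).astype(np.float32)       # stand-in for a gradient vector
g16 = g.astype(np.float16)                            # low-precision copy
rel_err = np.linalg.norm(g - g16.astype(np.float32)) / np.linalg.norm(g)
print(f"relative round-trip error: {rel_err:.2e}")    # on the order of 1e-4 for float16
```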