Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication

Sattler, Felix; Wiedemann, Simon; Müller, Klaus‐Robert; Samek, Wojciech

doi:10.1109/ijcnn.2019.8852172

Cited by 152 publications

(108 citation statements)

References 23 publications

(47 reference statements)

Supporting

Mentioning

107

Contrasting

Order By: Relevance

“…We adopt the D-DSGD scheme proposed in [25, Section III], which is an extension of the one proposed in [18], for digital transmission. With the D-DSGD scheme, gradient estimate g m (θ t ),…”

Section: Digital Dsgdmentioning

confidence: 99%

Federated Learning Over Wireless Fading Channels

Amiri

Gündüz

2020

IEEE Trans. Wireless Commun.

471

294

View full text Add to dashboard Cite

We study federated machine learning at the wireless network edge, where limited power wireless devices, each with its own dataset, build a joint model with the help of a remote parameter server (PS).We consider a bandwidth-limited fading multiple access channel (MAC) from the wireless devices to the PS, and implement distributed stochastic gradient descent (DSGD) over-the-air. We first propose a digital DSGD (D-DSGD) scheme, in which one device is selected opportunistically for transmission at each iteration based on the channel conditions; the scheduled device quantizes its gradient estimate to a finite number of bits imposed by the channel condition, and transmits these bits to the PS in a reliable manner. Next, motivated by the additive nature of the wireless MAC, we propose a novel analog communication scheme, referred to as the compressed analog DSGD (CA-DSGD), where the devices first sparsify their gradient estimates while accumulating error from previous iterations, and project the resultant sparse vector into a low-dimensional vector. We also design a power allocation scheme to align the received gradient vectors at the PS in an efficient manner. Numerical results show that the proposed CA-DSGD algorithm converges much faster than the D-DSGD scheme and other schemes in the literature, while providing a significantly higher accuracy.where M denotes the number of wireless devices, and g m (θ t )In FL, each device participating in the training can also carry out model updates as in (3) locally, and share the overall difference with respect to the previous model parameters with the PS [1].What distinguishes FL from conventional ML is the large number of devices that participate in the training, and the low-capacity and unreliable links that connect these devices to the PS. Therefore, there have been significant research efforts to reduce the communication requirements in FL [1]- [24]. However, these and follow-up studies consider orthogonal channels from the participating devices to the PS, and ignore the physical layer aspects of wireless connections, even though FL has been mainly motivated for mobile devices.

show abstract

Section: Digital Dsgdmentioning

confidence: 99%

Federated Learning Over Wireless Fading Channels

Amiri

Gündüz

2020

IEEE Trans. Wireless Commun.

471

294

View full text Add to dashboard Cite

show abstract

“…They can send more information bits at the beginning of the DSGD algorithm when the gradient estimates have higher variances, and reduce the number of transmitted bits over time as the variance decreases. We observed empirically that this improves the performance compared to the standard approach in the literature, where the same compression scheme is applied at each iteration [28].…”

Section: Digital Dsgd (D-dsgd)mentioning

confidence: 76%

“…The optimal solution for this scheme will require carefully allocating channel resources across the workers and the available power of each worker across iterations, together with an efficient gradient quantization scheme. For gradient compression, we will consider state-of-the-art quantization approaches together with local error accumulation [28].…”

Section: B Our Contributionsmentioning

confidence: 99%

“…We will adopt the scheme proposed in [28] for gradient compression at each iteration of the DSGD scheme, as it provides the state-of-the-art in convergence speed with the minimum number of bits transmitted by each worker at each iteration. However, we modify this scheme by allowing different numbers of bits to be transmitted by the workers at each iteration.…”

Section: Digital Dsgd (D-dsgd)mentioning

confidence: 99%

See 1 more Smart Citation

Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air

Amiri

Gündüz

2019

2019 IEEE International Symposium on Information Theory (ISIT)

186

347

View full text Add to dashboard Cite

We study collaborative machine learning (ML) at the wireless edge, where power and bandwidth-limited wireless devices with local datasets carry out distributed stochastic gradient descent (DSGD) with the help of a remote parameter server (PS). Standard approaches assume separate computation and communication, where local gradient estimates are compressed and communicated to the PS over orthogonal links. Following this digital approach, we introduce D-DSGD, in which the wireless terminals, referred to as the workers, employ gradient quantization and error accumulation, and transmit their gradient estimates to the PS over the underlying wireless multiple access channel (MAC).We then introduce an analog scheme, called A-DSGD, which exploits the additive nature of the wireless MAC for over-the-air gradient computation. In A-DSGD, the workers first sparsify their gradient estimates, and then project them to a lower dimensional space imposed by the available channel bandwidth. These projections are transmitted directly over the MAC without employing any digital code. Numerical results show that A-DSGD converges much faster than D-DSGD thanks to its more efficient use of the limited bandwidth and the natural alignment of the gradient estimates over the channel. The improvement is particularly compelling at low power and low bandwidth regimes. We also observe that the performance of A-DSGD improves with the number of workers (keeping the total size of the dataset constant), while D-DSGD deteriorates, limiting the ability of the latter in harnessing the computation power of edge devices. The lack of quantization and channel encoding/decoding in A-DSGD further speeds up communication, making it very attractive for low-latency ML applications at the wireless network edge. 1 |Bm,t| u n ∈Bm,t ∇f (θ t , u n ) is the stochastic gradient of the current model computed at worker m, m ∈ [M ], using the

show abstract

“…However, with the peculiarity that now D(w, q) (approximately) measures the distortion of w and q in the space of output distributions instead the Euclidian space. The advantage of the rate-distortion objective (9) is that, after the FIM has been calculated, it can be solved by applying common techniques from the source coding literature, such as the scalar Lloyd algorithm.…”

Section: T E X I T S H a 1 _ B A S E 6 4 = " K G Y 4 + Y H Q J B E H mentioning

confidence: 99%

DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks

Wiedemann

Kirchhoffer

Matlage

et al. 2020

IEEE J. Sel. Top. Signal Process.

Self Cite

View full text Add to dashboard Cite

The field of video compression has developed some of the most sophisticated and efficient compression algorithms known in the literature, enabling very high compressibility for little loss of information. Whilst some of these techniques are domain specific, many of their underlying principles are universal in that they can be adapted and applied for compressing different types of data. In this work we present DeepCABAC, a compression algorithm for deep neural networks that is based on one of the state-of-the-art video coding techniques. Concretely, it applies a Context-based Adaptive Binary Arithmetic Coder (CABAC) to the network's parameters, which was originally designed for the H.264/AVC video coding standard and became the state-of-the-art for lossless compression. Moreover, DeepCABAC employs a novel quantization scheme that minimizes the rate-distortion function while simultaneously taking the impact of quantization onto the accuracy of the network into account. Experimental results show that DeepCABAC consistently attains higher compression rates than previously proposed coding techniques for neural network compression. For instance, it is able to compress the VGG16 ImageNet model by x63.6 with no loss of accuracy, thus being able to represent the entire network with merely 8.7MB. The source code for encoding and decoding can be found at https://github.com/fraunhoferhhi/DeepCABAC.

show abstract

Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication

Cited by 152 publications

References 23 publications

Federated Learning Over Wireless Fading Channels

Federated Learning Over Wireless Fading Channels

Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air

DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks

Contact Info

Product

Resources

About