2019
DOI: 10.1007/978-3-030-10925-7_24
Efficient Decentralized Deep Learning by Dynamic Model Averaging

Abstract: We propose an efficient protocol for decentralized training of deep neural networks from distributed data sources. The proposed protocol makes it possible to handle different phases of model training equally well and to quickly adapt to concept drifts. This leads to a reduction of communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. Moreover, we derive a communication bound that scales well with the hardness of the serialized learning problem. The reduction in communi…
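To make the protocol's central idea concrete, below is a minimal sketch of decentralized training with dynamic model averaging, assuming a simple divergence-threshold synchronization criterion; the names (average, divergence, local_step, delta) are illustrative and not taken from the paper, and the divergence check is centralized here purely for readability.

```python
import numpy as np

def average(models):
    """Coordinate-wise average of a list of flat parameter vectors."""
    return np.mean(models, axis=0)

def divergence(models, reference):
    """Average squared distance of the local models from a reference model."""
    return float(np.mean([np.sum((m - reference) ** 2) for m in models]))

def train_dynamic_averaging(models, local_step, delta, rounds):
    """Decentralized training loop: workers update locally and are averaged
    only when their divergence from the last synchronized model exceeds delta."""
    reference = average(models)
    for _ in range(rounds):
        # One local update per worker on its own data shard.
        models = [local_step(m) for m in models]
        # Synchronize (communicate and average) only when the models have drifted apart.
        if divergence(models, reference) > delta:
            reference = average(models)
            models = [reference.copy() for _ in models]
    return reference
```

The threshold delta directly trades communication against synchrony: a large delta approaches independent local training, while delta = 0 degenerates to averaging after every local step.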


Cited by 83 publications (68 citation statements)
References 19 publications
“…The superior training speed-up performance of model averaging has been empirically observed in various deep learning scenarios, e.g., CNN for MNIST in (Zhang et al. 2016; Kamp et al. 2018; McMahan et al. 2017); VGG for CIFAR10 in (Zhou and Cong 2017); DNN-GMM for speech recognition in (Chen and Huo 2016; Su, Chen, and Xu 2018); and LSTM for language modeling in (McMahan et al. 2017). A thorough empirical study of ResNet over CIFAR and ImageNet is also available in the recent work (Lin, Stich, and Jaggi 2018).…”
Section: Methods (mentioning)
confidence: 97%
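For readers unfamiliar with the term, the "model averaging" these works study is plain coordinate-wise averaging of worker parameters, as in the hedged sketch below; the representation of a model as a list of layer arrays and the function name average_models are illustrative assumptions.

```python
import numpy as np

def average_models(worker_models):
    """Average a list of models, each given as a list of layer parameter arrays.

    The result is a single model whose every layer is the coordinate-wise mean
    of the corresponding layers across workers (periodic, FedAvg-style averaging).
    """
    num_workers = len(worker_models)
    num_layers = len(worker_models[0])
    return [
        sum(model[layer] for model in worker_models) / num_workers
        for layer in range(num_layers)
    ]

# Example: three workers, each holding a small two-layer model.
workers = [[np.random.randn(4, 4), np.random.randn(4)] for _ in range(3)]
global_model = average_models(workers)
```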
“…Later, the design was enhanced by coding the updates such that the full update-averaging can still be realized using only a portion of coded updates [9]. The second approach also aims at reducing the number of transmitting devices, but the scheduling criterion is update significance instead of computation speed [10], [11]. If FEEL is implemented based on model averaging, the update significance is measured by the model variance which indicates the divergence of a particular local model from the average across all local models [10].…”
Section: A. Federated Edge Learning and Multi-access (mentioning)
confidence: 99%
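A rough sketch of the variance-based significance criterion described in this statement, assuming each local model is a flat parameter vector: a worker's update significance is its model's squared distance from the average of all local models, and only the most divergent workers are scheduled. The top-k selection and the name schedule_by_significance are illustrative, not the cited papers' exact rule.

```python
import numpy as np

def update_significance(local_models):
    """Divergence of each local model from the average across all local models."""
    mean_model = np.mean(local_models, axis=0)
    return np.array([float(np.sum((m - mean_model) ** 2)) for m in local_models])

def schedule_by_significance(local_models, budget):
    """Select the `budget` workers whose models diverge most from the average."""
    significance = update_significance(local_models)
    return np.argsort(significance)[::-1][:budget]

# Example: 8 workers, channel budget for 3 uploads per round.
models = [np.random.randn(10) for _ in range(8)]
selected_workers = schedule_by_significance(models, budget=3)
```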
“…Kamp et al [52] proposed to average models dynamically depending on the utility of the communication, which leads to a reduction of communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. This facet is well suited for massively distributed systems with limited communication infrastructure.…”
Section: Updates Reduction (mentioning)
confidence: 99%
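To illustrate the "communicate only when it pays off" facet highlighted above, here is a minimal per-worker trigger, sketched under the assumption that each worker compares its current model with the last synchronized reference and requests an averaging round only when its drift exceeds a threshold; this is a simplified stand-in for the utility-based criterion, not the authors' exact condition.

```python
import numpy as np

def should_communicate(local_model, reference_model, delta):
    """Local trigger: ask for an averaging round only if this worker's model
    has drifted more than delta from the last synchronized reference."""
    drift = float(np.sum((local_model - reference_model) ** 2))
    return drift > delta
```

In this sketch the check uses only locally available quantities, so no communication is spent while a worker stays close to the reference; this is the kind of behavior that yields the order-of-magnitude savings over fixed-period averaging reported above.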