2019
DOI: 10.48550/arxiv.1909.05020
Preprint

Distributed Deep Learning with Event-Triggered Communication

Abstract: We develop a Distributed Event-Triggered Stochastic GRAdient Descent (DETSGRAD) algorithm for solving non-convex optimization problems typically encountered in distributed deep learning. We propose a novel communication triggering mechanism that would allow the networked agents to update their model parameters aperiodically and provide sufficient conditions on the algorithm step-sizes that guarantee the asymptotic mean-square convergence. The algorithm is applied to a distributed supervised-learning problem, i…
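
Below is a minimal sketch of how an event-triggered local update of this kind can look in code. It is an illustration under assumed choices (a decaying step size, an ℓ1 model-difference trigger, and hypothetical constants `lr0`, `c0`, `lr_decay`, `thr_decay`), not the authors' DETSGRAD implementation.

```python
import numpy as np

def event_triggered_sgd_step(theta, theta_last_sent, grad, step,
                             lr0=0.1, lr_decay=0.6, c0=1.0, thr_decay=0.7):
    """One local iteration of an event-triggered SGD scheme (illustrative sketch).

    theta           : current local parameters (1-D array)
    theta_last_sent : copy of the parameters most recently broadcast to neighbors
    grad            : stochastic gradient evaluated at theta
    step            : iteration counter
    Returns (updated parameters, broadcast flag).
    """
    # Decaying step size of the kind typically assumed for asymptotic convergence.
    lr = lr0 / (step + 1) ** lr_decay

    # Local stochastic gradient step.
    theta = theta - lr * grad

    # Event trigger: communicate only when the local model has drifted far enough
    # from the last broadcast copy, measured against a decaying threshold.
    threshold = c0 / (step + 1) ** thr_decay
    broadcast = np.linalg.norm(theta - theta_last_sent, ord=1) > threshold
    return theta, broadcast
```

In a full distributed loop each agent would, in addition, mix in the most recently received copies of its neighbors' parameters; only the trigger logic is shown here.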

Cited by 5 publications (7 citation statements) | References 21 publications

“…The event-triggered threshold plays an important role in this algorithm, since it determines both the communication cost among clients and the convergence performance. While the work in [28] measures the difference between the current local model and the new one as a threshold for broadcasting new models, other works [29,30] use a gradient-based metric at the clients to decide when an SGD update should be sent to the other neighbors.…”
Section: Distributed Deep Learning (mentioning; confidence: 99%)
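
To make the two trigger families in the quoted statement concrete, here is a small illustrative sketch; the function names and the specific norms are assumptions for exposition, not the exact rules used in [28]-[30].

```python
import numpy as np

def model_difference_trigger(theta, theta_last_sent, threshold):
    # Broadcast when the local model has moved sufficiently far from the copy
    # last sent to the neighbors (the style the statement attributes to [28]).
    return np.linalg.norm(theta - theta_last_sent) > threshold

def gradient_metric_trigger(grad, threshold):
    # Broadcast when a gradient-based metric at the client is large enough,
    # i.e. the local update is still considered informative (in the spirit of [29,30]).
    return np.linalg.norm(grad) > threshold
```

Either rule gates the same broadcast step; they differ only in which quantity is compared against the threshold.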
“…Notable works include using average consensus [7] and Bayesian methods [8], [9], which trade convergence time for resilience to individual failures. The approach of George et al [10] offers convergence speed comparable to a server-based approach, at the cost of assuming that the communication topology of the clients is fixed and predetermined.…”
Section: Related Work (mentioning; confidence: 99%)
“…The paper closest to ours appears to be [53], where the authors considered a federated learning scenario, proposed an event-triggered communication scheme for the model parameters based on thresholds that depend on the learning rate, and showed a reduction in communication for distributed training. Compared to that work, we consider an adaptive threshold rather than selecting the same threshold across all parameters.…”
Section: Related Work (mentioning; confidence: 99%)
“…Hence the adaptive threshold makes our algorithm robust to different neural network models and different datasets. Our theoretical results rest on a generic bound on the threshold, unlike [53], which provides a bound only for a specific threshold form that depends on the learning rate. Further, we highlight the implementation challenges of event-triggered communication in an HPC environment, which differs from the federated learning setting considered in [53], which usually involves wireless communication.…”
Section: Related Work (mentioning; confidence: 99%)
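
The contrast drawn here, a single learning-rate-dependent threshold as in [53] versus an adaptive, per-tensor threshold, can be sketched as follows; both threshold forms and the constants `c` and `rho` are assumptions for illustration, not the formulas from either paper.

```python
import numpy as np

def lr_dependent_threshold(lr, c=1.0):
    # One threshold shared by all parameters, tied to the learning rate
    # (the form the statement attributes to [53]).
    return c * lr

def adaptive_threshold(delta, rho=1e-2):
    # Per-tensor threshold scaled by the size of the parameter change itself,
    # so the trigger adapts across layers, models, and datasets.
    return rho * np.linalg.norm(delta)

def should_broadcast(delta, threshold):
    # Communicate only when the accumulated parameter change exceeds the threshold.
    return np.linalg.norm(delta) > threshold
```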