Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD, based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we characterize the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we derive the relevant SDDE's damping coefficient and its delay statistics as functions of the number of activated clients, the staleness threshold, the eigenvalues of the Hessian matrix of the objective function, and the overall computational/communication delay. The formulated SDDE allows us to obtain both the convergence condition and the convergence speed of distributed SGD by calculating its characteristic roots, thereby optimizing the scheduling policies for asynchronous/event-triggered SGD. Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness. Moreover, a small degree of staleness does not necessarily slow down the convergence, whereas a large degree of staleness results in the divergence of distributed SGD. Numerical results demonstrate the potential of our SDDE framework, even in complex learning tasks with non-convex objective functions.
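As a hedged illustration of the characteristic-root argument (simplified notation chosen here, not the paper's), linearize SGD around a minimum with Hessian eigenvalue $h$, learning rate $\eta$, and an effective gradient staleness $\tau$; the deterministic part of the resulting SDDE is then a scalar delay differential equation with the characteristic equation

$$\dot{\theta}(t) = -\eta h\,\theta(t-\tau), \qquad \lambda + \eta h\, e^{-\lambda\tau} = 0.$$

By the classical Hayes criterion, every characteristic root $\lambda$ has negative real part if and only if $0 < \eta h \tau < \pi/2$, so a moderate staleness can still converge (at a speed set by the dominant root), while a sufficiently large staleness pushes a root into the right half-plane and the iteration diverges, consistent with the observations summarized above.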
Communication overhead is a key challenge in distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combined with parallel communication mechanisms such as pipelining, gradient compression can greatly alleviate the impact of communication overhead. However, two problems of gradient compression remain to be solved. First, gradient compression introduces extra computation cost, which delays the next training iteration. Second, gradient compression usually degrades convergence accuracy. In this paper, we combine a parallel mechanism with gradient quantization and delayed full-gradient compensation, and propose a new distributed optimization method named CD-SGD, which hides the overhead of gradient compression, overlaps part of the communication, and attains high convergence accuracy. The local update operation in CD-SGD allows the next iteration to be launched quickly without waiting for the completion of gradient compression and the current communication process. Moreover, the accuracy loss caused by gradient compression is addressed by the k-step correction method introduced in CD-SGD. We prove that CD-SGD has a convergence guarantee and achieves at least an $O(1/\sqrt{K} + 1/K)$ convergence rate. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CD-SGD. Experimental results on a 16-GPU cluster show that the convergence accuracy of CD-SGD is close to, or even slightly better than, that of S-SGD, and that its end-to-end time is 30% less than that of 2-bit gradient compression under a 56 Gbps bandwidth environment. A minimal sketch of this local-update-with-delayed-compensation idea is given after the CCS concepts below.
CCS CONCEPTS: • Computing methodologies → Machine learning; Distributed algorithms; • Networks → Network algorithms.
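The sketch below is an illustrative toy version of the CD-SGD idea as described in the abstract, not the authors' implementation: the next iteration is launched from a cheap compressed local update without waiting for communication, and a delayed full-gradient correction is folded back in every k steps. Names such as quantize, k_correction, and cd_sgd_worker are assumptions made for illustration.

```python
import numpy as np

def quantize(g, bits=2):
    """Toy uniform quantizer standing in for 2-bit gradient compression."""
    scale = np.abs(g).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(g / scale) * scale

def cd_sgd_worker(grad_fn, w0, lr=0.05, k_correction=4, steps=100):
    """Single-worker sketch: local updates proceed immediately; every
    k_correction steps, the compression error of an earlier full gradient
    (whose simulated compression/communication has completed) is applied."""
    w = np.array(w0, dtype=float)
    pending = None                      # gradient handed to the background comm pipeline
    for t in range(1, steps + 1):
        g = grad_fn(w)
        w = w - lr * quantize(g)        # compressed local update; do not wait for comms
        if pending is not None and t % k_correction == 0:
            # delayed full-gradient compensation: add back the compression error
            # of the earlier gradient, mimicking the k-step correction idea
            w = w - lr * (pending - quantize(pending))
        pending = g
    return w

# Usage on a toy quadratic objective f(w) = 0.5 * ||w||^2, whose gradient is w
if __name__ == "__main__":
    print(cd_sgd_worker(lambda w: w, w0=np.ones(4)))
```

In a real multi-worker setting, the quantized gradients would be exchanged (e.g., via all-reduce) in a background pipeline while the local update runs, which is what hides the compression and communication overhead.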