2022
DOI: 10.48550/arxiv.2202.06009
Preprint

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Abstract: 1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD. Its benefits, however, remain an open question on Adam-based model training (e.g. BERT and GPT). In this paper, we propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel designs: (1) adaptive variance state freezing, which eliminates the requirement of running expensive full-precision communication at early stage of training; (2) 1-bit sync, which allows skipping communic…
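The two designs named in the abstract can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of the general idea (freeze Adam's variance state after a warmup threshold, then exchange only 1-bit-compressed momentum with an error-feedback buffer); it is not the paper's implementation, and the class name, the `freeze_step` threshold, and the `allreduce` helper are all assumptions made for illustration.

```python
# Hypothetical sketch of the two ideas described in the abstract:
# (1) freeze Adam's variance (second-moment) state after a warmup phase, and
# (2) afterwards communicate only 1-bit-compressed momentum, keeping the
#     compression residual in an error-feedback buffer.
# All names and thresholds are illustrative, not the authors' code.
import numpy as np

class ToyCompressedAdam:
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, freeze_step=100):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)          # first moment (momentum)
        self.v = np.zeros(dim)          # second moment (variance)
        self.error = np.zeros(dim)      # error-feedback buffer for compression
        self.freeze_step = freeze_step  # step after which v is no longer updated
        self.t = 0

    def _compress_1bit(self, x):
        """Sign + mean-magnitude compression; the residual feeds the error buffer."""
        scale = np.mean(np.abs(x))
        compressed = scale * np.sign(x)
        self.error = x - compressed
        return compressed

    def step(self, grad, allreduce):
        """`allreduce` is an assumed helper that averages an array across workers."""
        self.t += 1
        if self.t <= self.freeze_step:
            # Warmup: full-precision communication, variance still updating.
            g = allreduce(grad)
            self.m = self.beta1 * self.m + (1 - self.beta1) * g
            self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        else:
            # Variance frozen: only 1-bit-compressed momentum is exchanged.
            self.m = self.beta1 * self.m + (1 - self.beta1) * grad
            self.m = allreduce(self._compress_1bit(self.m + self.error))
        return -self.lr * self.m / (np.sqrt(self.v) + self.eps)
```

On a single process, `allreduce` can simply be `lambda x: x`; in a real data-parallel run it would average the array across workers.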

Cited by 1 publication (1 citation statement)
References 33 publications
“…In addition, the zero redundancy optimizer (ZeRO) facilitates memory efficiency by partitioning model states and gradients. In addition, 1-bit Adam [292], 0/1 Adam [293], and 1-bit LAMB [294] optimizers reduce the communication resource demand in DeepSpeed. Sparse attention kernels support long sequence input and sparse structures with faster execution and comparable performance.…”
Section: Software Framework For Large-scale Distributed Training
Citation type: mentioning; confidence: 99%
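The quoted passage mentions that ZeRO partitions model states across workers instead of replicating them. A minimal sketch of that general idea follows; it is not DeepSpeed's actual implementation, and the function name and contiguous sharding scheme are assumptions.

```python
# Hypothetical sketch of ZeRO-style optimizer-state partitioning: each worker
# allocates the Adam states (m, v) only for its own shard of the parameters,
# rather than keeping a full replica. Names are illustrative only.
import numpy as np

def shard_bounds(num_params, world_size, rank):
    """Contiguous parameter range owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = rank * per_rank
    return start, min(start + per_rank, num_params)

num_params, world_size = 10, 4
for rank in range(world_size):
    lo, hi = shard_bounds(num_params, world_size, rank)
    # Only the owned shard gets optimizer state allocated on this worker.
    m_shard = np.zeros(hi - lo)
    v_shard = np.zeros(hi - lo)
    print(f"rank {rank}: owns params [{lo}, {hi}), state size {hi - lo}")
```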