Ammar Ahmad Awan scite author profile

Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to 5×, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to 3.3× higher throughput for BERT-Large pre-training and up to 2.9× higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
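The error-compensated compression mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, scaling the signs by the mean absolute value is an assumed choice, and the sketch leaves out the Adam-specific part (freezing the variance term after warmup). It only shows why carrying the compression error forward keeps the transmitted values faithful to the true gradients over time.

```python
import numpy as np

def one_bit_compress(x: np.ndarray) -> np.ndarray:
    """Keep one sign per entry plus a single shared scale (assumed: mean magnitude)."""
    scale = np.mean(np.abs(x))
    signs = np.where(x >= 0, 1.0, -1.0)  # treat zero as positive so every entry is one bit
    return scale * signs

def compress_with_error_feedback(x: np.ndarray, error: np.ndarray):
    """Add the residual from the previous step, compress, and return the new residual."""
    corrected = x + error
    compressed = one_bit_compress(corrected)
    return compressed, corrected - compressed

# Toy run with random vectors standing in for gradients (hypothetical data).
rng = np.random.default_rng(0)
error = np.zeros(8)
sent_total = np.zeros(8)
true_total = np.zeros(8)
for _ in range(100):
    grad = rng.normal(size=8)
    sent, error = compress_with_error_feedback(grad, error)
    sent_total += sent
    true_total += grad

# Everything sent, plus the residual still held locally, equals the true sum.
assert np.allclose(sent_total + error, true_total)
```

The final assertion is the point of the technique: the compression error is never discarded, only delayed, so the accumulated update differs from the uncompressed one by at most the current residual.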

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Awan

Subramoni

Panda

2017

Privacy-aware searching with oblivious term matching for cloud storage

Pervez¹,

Awan²,

Khattak³

et al. 2012

J Supercomput

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

Awan

Hamidouche

Venkatesh

et al. 2016

GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training

Jain

Awan

Aljuhani

et al. 2020

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Rajbhandari¹,

Li²,

Yao³

et al. 2022

Preprint

As the training of giant dense models hits the boundary on the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. This training cost saving has been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Chu

Hamidouche

Venkatesh

et al. 2016

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
