2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS)
DOI: 10.1109/icdcs54860.2022.00087
AIACC-Training: Optimizing Distributed Deep Learning Training through Multi-streamed and Concurrent Gradient Communications

Abstract: There is a growing interest in training deep neural networks (DNNs) in a GPU cloud environment. This is typically achieved by running parallel training workers on multiple GPUs across computing nodes. Under such a setup, the communication overhead is often responsible for long training time and poor scalability. This paper presents AIACC-Training, a unified communication framework designed for the distributed training of DNNs in a GPU cloud environment. AIACC-Training permits a training worker to participate i…
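Only the opening of the abstract is reproduced here, but the title and abstract describe reducing communication overhead by running several gradient collectives concurrently. The sketch below illustrates that general idea in PyTorch; it is an assumption-based illustration, not the AIACC-Training API (the function name and structure are invented for illustration, and a torch.distributed process group is assumed to be initialized already).

```python
# Minimal sketch of concurrent gradient communication (NOT the AIACC-Training
# API): launch one asynchronous all-reduce per gradient tensor so that many
# collectives are in flight at once, then wait for all of them and average.
# Assumes torch.distributed has already been initialized (e.g. with NCCL).
import torch
import torch.distributed as dist

def allreduce_gradients_concurrently(model: torch.nn.Module) -> None:
    params = [p for p in model.parameters() if p.grad is not None]

    # Issue all collectives without blocking; the backend can schedule them
    # on its communication streams and let them overlap with one another.
    handles = [dist.all_reduce(p.grad, async_op=True) for p in params]

    # Block only once, after everything has been launched.
    for h in handles:
        h.wait()

    world_size = dist.get_world_size()
    for p in params:
        p.grad.div_(world_size)  # sum -> average across workers
```

In a full framework the collectives would typically be launched from backward hooks so that communication also overlaps with the remaining backward computation, but the sketch above keeps to the post-backward case for brevity.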

Cited by 4 publications (3 citation statements). References 21 publications.
“…Inferencing frameworks like TensorRT [40] are not suitable for this scenario because they do not produce activations of intermediate layers. As can be seen from Figure 13, [41], [38], [42]. The former partitions the training data across multiple GPUs [5], [43], and the latter splits the model layers vertically and then distributes different layers onto GPUs to reduce the memory pressure of the model states on a single GPU [6].…”
Section: Further Analysis, 1) Training Efficiency (mentioning)
confidence: 99%
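The statement above contrasts data parallelism (partitioning the training batch across GPUs) with model parallelism (placing different layers on different GPUs to relieve per-GPU memory pressure). A minimal PyTorch sketch of the two layouts follows; it assumes a machine with two GPUs ("cuda:0"/"cuda:1") and is not taken from any of the cited papers.

```python
# Conceptual contrast of data vs. model parallelism (illustrative only).
import copy
import torch
import torch.nn as nn

# --- Data parallelism: replicate the model, split each batch across GPUs ---
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
replica0 = model.to("cuda:0")
replica1 = copy.deepcopy(model).to("cuda:1")
batch = torch.randn(64, 1024)
half0, half1 = batch.chunk(2)              # each replica sees half the data
out0 = replica0(half0.to("cuda:0"))
out1 = replica1(half1.to("cuda:1"))
# after backward, the replicas' gradients are averaged (e.g. via all-reduce)

# --- Model parallelism: split the layers themselves across GPUs ------------
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(4096, 10).to("cuda:1")
x = torch.randn(64, 1024, device="cuda:0")
h = stage0(x)                              # early layers live on GPU 0
y = stage1(h.to("cuda:1"))                 # later layers live on GPU 1
```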
“…The asymmetry multiplier µ = 8. We also apply gradient compression [27] to the per-sample gradient to keep the top 90% values only. Our DPAF is configured to be C2-C1-× for MNIST/FMNIST, C2-C2-C1 for CelebA, and C3-C1-× for FFHQ, where the notation Cx1-Cx2-Cx3 means that the D uses x1 layers as conv1, x2 layers as conv2*, and x3 layers as conv3*.…”
Section: Experiments Setup (mentioning)
confidence: 99%
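The quoted setup keeps only the top 90% of per-sample gradient values. A small sketch of that top-fraction sparsification step is given below; it follows the general idea of gradient compression cited as [27] and is not the DPAF authors' code (the function name and tensor shapes are assumptions).

```python
# Illustrative top-k gradient sparsification: keep only the largest-magnitude
# 90% of gradient entries and zero out the rest.
import torch

def keep_top_fraction(grad: torch.Tensor, fraction: float = 0.9) -> torch.Tensor:
    """Zero out all but the top `fraction` of entries by absolute value."""
    flat = grad.flatten()
    k = max(1, int(fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)          # indices of the largest entries
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return (flat * mask).view_as(grad)

# Example: a batch of per-sample gradients of shape (batch, params)
g = torch.randn(8, 1000)
g_sparse = keep_top_fraction(g, fraction=0.9)
print((g_sparse != 0).float().mean())           # roughly 0.9 of entries survive
```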
“…The Impact of Gradient Compression. Gradient compression (GC) [27] was originally proposed to reduce the communication cost in federated learning. The rationale behind gradient compression is that most of the values in the gradient contribute nearly no information to the update.…”
Section: Pre-training the Model With Public Data (mentioning)
confidence: 99%
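The rationale quoted above (that most gradient entries contribute almost nothing to the update) can be checked numerically: when gradient magnitudes are heavy-tailed, a small fraction of entries accounts for most of the total magnitude. The snippet below is a purely illustrative measurement on synthetic data, not an experiment or result from [27].

```python
# Illustrative check of the sparsity rationale on a synthetic "gradient":
# with heavy-tailed magnitudes, the largest few entries dominate the total.
import torch

torch.manual_seed(0)
g = torch.randn(100_000) * torch.rand(100_000) ** 4   # heavy-tailed magnitudes
total = g.abs().sum()
for frac in (0.01, 0.10, 0.50):
    k = int(frac * g.numel())
    top_mass = torch.topk(g.abs(), k).values.sum()
    share = (top_mass / total).item()
    print(f"top {frac:.0%} of entries hold {share:.1%} of the total |g| mass")
```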