HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training

Miao, Xupeng; Shi, Yu; Zhang, Hailin; Zhang, Xin; Nie, Xiaonan; Yang, Zhi; Cui, Bin

doi:10.1145/3514221.3517902

Cited by 13 publications

(3 citation statements)

References 31 publications

“…Angel-PTM offers a comprehensive solution for efficient deep learning model training in industrial settings. It leverages some key techniques [30,33,38] from Hetu [31], gets implemented over PyTorch [40], and features the Page abstraction for memory efficiency and a unified scheduling method for resource utilization. Furthermore, Angel-PTM has undergone extensive optimization on A100 servers, enabling it to take full advantage of hardware capabilities for deep learning tasks.…”

Section: Methods (mentioning)

confidence: 99%

Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent

Nie, Liu, Fu et al. 2023

Preprint

Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services at Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have adopted pre-trained models. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can efficiently train extremely large-scale models with hierarchical memory. The key designs of Angel-PTM are fine-grained memory management via the Page abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address SSD I/O bandwidth bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale and by up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models using hundreds of GPUs verify the strong scalability of Angel-PTM.

“…At the same time, a single clicking log may contain only hundreds of non-zero entries. As a result, when we create the embedding for each feature, the whole embedding layer can be extremely large, and the parameters of the CTR prediction model are dominated (e.g., 99.9%) by the embedding part instead of the deep network part (Miao et al 2021; Ginart et al 2021). Table 1 shows the case under our experimental setting.…”

Section: Related Work (mentioning)

confidence: 99%

“…Datasets. We evaluate our algorithms on the following public datasets which are widely adopted by the community (Cheng et al 2016; Li et al 2019; Deng et al 2021; Wang et al 2021; Miao et al 2021). Criteo (Labs 2014) is a real-world CTR prediction dataset.…”

Section: Experiments, Experimental Setting (mentioning)

confidence: 99%

CowClip: Reducing CTR Prediction Model Training Time from 12 Hours to 10 Minutes on 1 GPU

Zheng, Zou et al. 2023

AAAI

The click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item. As enormous amounts of data are produced online daily, accelerating CTR prediction model training is critical to keeping the model up to date and reducing the training cost. One approach to increasing the training speed is to apply large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch easily suffers a loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first theoretically show that different frequencies of ids make it challenging to scale hyperparameters when scaling the batch size. To stabilize the training process in a large batch size setting, we develop the adaptive Column-wise Clipping (CowClip). It enables an easy and effective scaling rule for the embeddings, which keeps the learning rate unchanged and scales the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times its original value without accuracy loss. In particular, for training the CTR prediction model DeepFM on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU. Our code is available at github.com/bytedance/LargeBatchCTR.
