Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Mudigere, Dheevatsa; Hao, Yajiang; Huang, Jianyu; Jia, Zhihao; Tulloch, A. J.; Sridharan, Srinivas; Liu, Xing; Özdal, Mustafa; Nie, Jade; Park, Jongsoo; Luo, Lei; Yang, Jie; Gao, Leon; Ivchenko, Dmytro; Basant, Aarti; Hu, Yuanman; Yang, Jiyan; Ardestani, Ehsan K.; Wang, Xiaodong; Komuravelli, Rakesh; Chu, C. Y. Cyrus; Yılmaz, Serhat; Li, Huayu; Qian, Jiyuan; Feng, Zhuobo; Ma, Yinbin; Yang, Junjie; Wen, Ellie; Li, Hong; Yang, Lin; Sun, Cuicui; Zhao, Whitney; Melts, Dimitry; Dhulipala, Krishna; Kishore, K R; Graf, Tyler N.; Eisenman, Assaf; Matam, Kiran Kumar; Gangidi, Adi; Chen, Guoqiang Jerry; Krishnan, Manoj N.; Nayak, Avinash P.; Nair, Krishnakumar; Muthiah, Bharath; Khorashadi, Mahmoud; Bhattacharya, P.; Lapukhov, Petr; Naumov, Maxim; Mathews, Ajit; Lin, Qiao; Smelyanskiy, Mikhail; Jia, Bill; Rao, Vijay M.

doi:10.48550/arxiv.2104.05158

Cited by 4 publications

(7 citation statements)

References 0 publications

“…The large volume of communication from fetching remote high-dimensional embedding features as well as the frequent parameter exchange from high-order cross features cause severe communication overhead in distributed WDL workloads. We take CAN [8] for instance, which is recently derived from DIN [4] and DLRM [23]. CAN contains a combination of feature interaction modules over a substantial number of feature fields, and therefore it brings up an extensive communication overhead by around 60% in MP mode and 70% in PS mode as shown in Fig.…”

Section: Characterization of WDL Workload (mentioning)

confidence: 99%

“…Testing Models and Datasets. DLRM [23] is a benchmarking model proposed by Facebook and adopted by MLPerf; DeepFM [3], derived from Wide&Deep model, is widely applied in industrial recommender systems; DIN [4] and DIEN [5] are two models training multi-field categorical data with complicated feature interaction modules. We also utilize the three representative models discussed in §II for a system-design evaluation.…”

Section: A. Experimental Setup (mentioning)

confidence: 99%

“…HugeCTR/Merlin is a customized framework running on NVIDIA's DGX-1/DGX-2 supernodes equipped with high-end interconnects named NV-Switch. Zion [12] and RecSpeed [54] customize their node specification for DLRM [23] and its variants by adding more NICs and RoCEs to alleviate the I/O bottleneck. Nevertheless, hardware customization is still expensive and a waste of resources when facing rapid shifts in WDL designs.…”

Section: Related Work (mentioning)

confidence: 99%

PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems

Zhang, Chen, Yang et al. 2022

Preprint

The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Recently, two trends have emerged: 1) recommender systems must be trained in a timely manner to cope with ever-growing new products and ever-changing user interests from online marketing and social networks; 2) state-of-the-art recommendation models introduce deep neural network (DNN) modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot meet these two trends, and GPU-centric training has become a trending approach. However, we observe that GPU devices in training recommender systems are underutilized, and they cannot attain the throughput improvement they have achieved in the Computer Vision (CV) and Natural Language Processing (NLP) areas. This issue can be explained by two characteristics of these recommendation models: first, they contain up to a thousand input feature fields, introducing fragmentary and memory-intensive operations; second, the multiple constituent feature interaction submodules introduce a substantial number of small-sized compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We leverage the model structure and data distribution to unleash the potential of hardware through our packing, interleaving, and caching optimization. Experiments show that PICASSO increases the hardware utilization by an order of magnitude over state-of-the-art baselines and brings up to 6× throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the walltime of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.

“…All models are trained with hundreds of sparse (categorical) features and thousands of dense (numerical) features. The full-sync training scheme ensures both model performance and training throughput can be reproduced [26]. We use Normalized Entropy loss to evaluate the CTR prediction accuracy [14].…”

Section: Experiments Setup (mentioning)

confidence: 99%

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Zhang, Luo, Liu et al. 2022

Preprint

Self Cite

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information. Motivated by this observation, we propose DHEN, a deep and hierarchical ensemble architecture that can leverage the strengths of heterogeneous interaction modules and learn a hierarchy of the interactions under different orders. To overcome the challenge brought by DHEN's deeper and multi-layer structure in training, we propose a novel co-designed training system that can further improve the training efficiency of DHEN. Experiments with DHEN on a large-scale dataset from CTR prediction tasks attained a 0.27% improvement in the Normalized Entropy (NE) of prediction and 1.2× better training throughput than the state-of-the-art baseline, demonstrating their effectiveness in practice.

“…First, they enable important components and services across a wide breadth of domains, seeing widespread adoption at Facebook [8, 19, 20, 21, 34], Google [12, 15, 23], Microsoft [18], Baidu [50], and many other hyperscale companies [41, 51]. Secondly, training these models, which often consist of trillions of parameters [32, 37], places enormous demands on the end-to-end training and data ingestion pipeline. Training a production recommendation system takes weeks, requiring numerous training jobs each using hundreds of distributed GPUs.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Zhao, Agarwal, Basant et al. 2021

Preprint

Self Cite
