2021
DOI: 10.48550/arxiv.2104.05158
Preprint

Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Cited by 4 publications (7 citation statements)
References: 0 publications
“…The large volume of communication from fetching remote high-dimensional embedding features as well as the frequent parameter exchange from high-order cross features cause severe communication overhead in distributed WDL workloads. We take CAN [8] for instance, which is recently derived from DIN [4] and DLRM [23]. CAN contains a combination of feature interaction modules over a substantial number of feature fields, and therefore it brings up an extensive communication overhead by around 60% in MP mode and 70% in PS mode as shown in Fig.…”
Section: Characterization of WDL Workload
confidence: 99%
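The overhead described above is dominated by moving embedding rows between workers in model-parallel (MP) mode, or between workers and parameter servers in PS mode. A back-of-the-envelope sketch of that traffic, with purely illustrative sizes rather than the cited papers' measurements:

```python
# Back-of-the-envelope estimate of per-iteration embedding traffic when
# tables are sharded across workers (model-parallel, MP). All sizes here
# are illustrative assumptions, not measurements from the cited papers.

def embedding_traffic_bytes(batch_size, num_fields, lookups_per_field,
                            emb_dim, bytes_per_elem=4, remote_fraction=0.9):
    """Bytes a single worker pulls from remote shards in one forward pass."""
    rows = batch_size * num_fields * lookups_per_field
    return rows * emb_dim * bytes_per_elem * remote_fraction

# Example: 8K batch, 100 categorical fields, one lookup each, 128-dim
# float32 embeddings, with ~90% of rows living on other workers.
traffic = embedding_traffic_bytes(8192, 100, 1, 128)
print(f"{traffic / 2**20:.0f} MiB per iteration per worker")  # ~360 MiB
```

At hundreds of MiB per worker per iteration, it is plausible that communication, not compute, becomes the bottleneck the excerpt measures.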
“…Testing Models and Datasets. DLRM [23] is a benchmarking model proposed by Facebook and adopted by MLPerf; DeepFM [3], derived from Wide&Deep model, is widely applied in industrial recommender systems; DIN [4] and DIEN [5] are two models training multi-field categorical data with complicated feature interaction modules. We also utilize the three representative models discussed in §II for a systemdesign evaluation.…”
Section: A. Experimental Setup
confidence: 99%
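The benchmarked models share a common shape: embedding lookups for categorical fields, an MLP for dense features, and an explicit feature-interaction stage. A minimal sketch of that DLRM-style layout (layer sizes and field cardinalities are placeholders, not the benchmark configurations):

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Minimal DLRM-style model: dense-feature MLP + per-field embeddings
    + pairwise dot-product feature interaction. Sizes are placeholders."""
    def __init__(self, num_dense=13, field_sizes=(1000, 1000, 1000), dim=16):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(n, dim) for n in field_sizes])
        self.bottom = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        n_feat = len(field_sizes) + 1                   # sparse fields + dense
        n_pairs = n_feat * (n_feat - 1) // 2            # dot-product pairs
        self.top = nn.Linear(dim + n_pairs, 1)

    def forward(self, dense, sparse):                   # sparse: (B, n_fields)
        d = self.bottom(dense)                          # (B, dim)
        feats = [d] + [e(sparse[:, i]) for i, e in enumerate(self.embs)]
        x = torch.stack(feats, dim=1)                   # (B, n_feat, dim)
        inter = torch.bmm(x, x.transpose(1, 2))         # all pairwise dots
        iu, ju = torch.triu_indices(x.size(1), x.size(1), offset=1)
        z = torch.cat([d, inter[:, iu, ju]], dim=1)     # dense + interactions
        return torch.sigmoid(self.top(z)).squeeze(1)    # predicted CTR

model = TinyDLRM()
p = model(torch.randn(4, 13), torch.randint(0, 1000, (4, 3)))  # (4,) probs
```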
“…All models are trained with hundreds of sparse (categorical) features and thousands of dense (numerical) features. The full-sync training scheme ensures both model performance and training throughput can be reproduced [26]. We use Normalized Entropy loss to evaluate the CTR prediction accuracy [14].…”
Section: Experiments Setup
confidence: 99%
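Normalized Entropy, per the excerpt's reference [14], is the average log loss normalized by the entropy of the background CTR, so values below 1 mean the model beats always predicting the base rate. A small sketch of the computation:

```python
import numpy as np

def normalized_entropy(y_true, p_pred, eps=1e-12):
    """Normalized Entropy (NE): average log loss divided by the entropy
    of the background CTR (the empirical positive rate)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    logloss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    ctr = np.clip(y.mean(), eps, 1 - eps)               # background CTR
    base = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return logloss / base

# Sanity check: predicting the base rate everywhere gives NE == 1.0.
y = np.array([0, 0, 0, 1])
print(normalized_entropy(y, np.full(4, y.mean())))      # 1.0
```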
“…First, they enable important components and services across a wide breadth of domains, seeing widespread adoption at Facebook [8, 19–21, 34], Google [12, 15, 23], Microsoft [18], Baidu [50], and many other hyperscale companies [41, 51]. Secondly, training these models, which often consist of trillions of parameters [32, 37], places enormous demands on the end-to-end training and data ingestion pipeline. Training a production recommendation system takes weeks, requiring numerous training jobs each using hundreds of distributed GPUs.…”
Section: Introduction
confidence: 99%