Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
DOI: 10.1145/3503222.3507777

RecShard: statistical feature-based memory optimization for industry-scale neural recommendation

Abstract: We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor are all rows within an EMB equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intelligent EMB partitioning and placement across a tiered memory hierarchy. Second, in modern DLRMs, EMBs function as hash…
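
The abstract's first observation, that access frequency is highly skewed both across EMBs and across rows within an EMB, is what makes tiered placement pay off. As a purely illustrative sketch (not RecShard's actual policy; the function, capacities, and Zipf-distributed access counts are all assumptions), one could rank rows by observed access frequency and fill the fastest tier first:

```python
import numpy as np

def partition_rows_by_access(access_counts, hbm_capacity, dram_capacity):
    """Hypothetical tiering helper: hottest rows -> HBM, next -> DRAM,
    remainder -> SSD. Capacities are in rows; `access_counts` holds
    per-row access frequencies gathered from training data."""
    # Row indices sorted from most- to least-frequently accessed.
    order = np.argsort(access_counts)[::-1]
    hbm_rows = order[:hbm_capacity]
    dram_rows = order[hbm_capacity:hbm_capacity + dram_capacity]
    ssd_rows = order[hbm_capacity + dram_capacity:]
    return hbm_rows, dram_rows, ssd_rows

# Skewed (Zipf-like) access counts, typical of DLRM embedding tables
# where a small fraction of rows absorbs most lookups.
rng = np.random.default_rng(0)
counts = rng.zipf(1.5, size=10_000)
hot, warm, cold = partition_rows_by_access(counts, hbm_capacity=500, dram_capacity=2_000)
print(len(hot), len(warm), len(cold))  # 500 2000 7500
```

With a Zipf-like distribution, the 500 rows placed in HBM capture a disproportionate share of all lookups, which is exactly the skew the paper exploits.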

Cited by 22 publications (9 citation statements)
References 34 publications
“…Baselines. We compare DreamShard against human expert strategies from previous work [27,8,28], including size-based, dim-based, lookup-based, and size-lookup-based greedy balancing strategies. We also include an RNN-based RL algorithm [13], which uses an RNN architecture to map operators to devices.…”
Section: Methods
Confidence: 99%
“…Starting from the table with the highest cost, the greedy algorithm assigns tables one by one to the device with the lowest sum of costs so far, so that each device ends up with roughly an equal total cost. We consider heuristics with the following cost functions, which have been shown to perform strongly in prior work [26]: the size of the table, i.e., the product of its dimension and hash size (size-greedy); the dimension of the table (dim-greedy); and the product of the table's dimension and mean pooling factor (lookup-greedy). We further include a random sharding baseline (rand).…”
Section: Experimental Settings
Confidence: 99%
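
For concreteness, here is a minimal Python sketch of the greedy balancing heuristic the quote describes, alongside the three cost functions it names. The table names, (dimension, hash size, mean pooling factor) values, and two-device setup are illustrative assumptions, not figures from the cited work [26]:

```python
import heapq

def greedy_shard(costs, num_devices):
    """Visit tables in decreasing cost order and assign each to the
    device with the lowest running cost total (sketch of the cited
    heuristic). Returns a {table: device} mapping."""
    assignment = {}
    heap = [(0.0, d) for d in range(num_devices)]  # (total cost, device id)
    heapq.heapify(heap)
    for table in sorted(costs, key=costs.get, reverse=True):
        total, device = heapq.heappop(heap)
        assignment[table] = device
        heapq.heappush(heap, (total + costs[table], device))
    return assignment

# Tables described by (dimension, hash size, mean pooling factor).
tables = {"t0": (128, 10**6, 40), "t1": (64, 10**7, 5), "t2": (256, 10**5, 80)}
size_cost   = {t: dim * rows for t, (dim, rows, _) in tables.items()}   # size-greedy
dim_cost    = {t: dim for t, (dim, _, _) in tables.items()}             # dim-greedy
lookup_cost = {t: dim * pool for t, (dim, _, pool) in tables.items()}   # lookup-greedy

print(greedy_shard(size_cost, num_devices=2))  # {'t1': 0, 't0': 1, 't2': 1}
```

Under size-greedy, the single largest table (t1) lands alone on one device while the other two share the second, keeping the summed cost per device roughly balanced.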
“…DeepRecSys [35] and RecSSD [79] optimized inference requests across CPUs/GPUs and SSDs, respectively. Acun et al. [18] characterized DLRM architectures on GPU trainers, Sethi et al. [70] presented an optimized embedding sharding strategy for DLRM training, Maeng et al. [51] explored checkpointing trainer state, and AIBox [86] optimized training using hierarchical memory for parameters. However, these prior works do not discuss the DSI pipeline, a critical part of ML training.…”
Section: Related Work
Confidence: 99%