Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
DOI: 10.1145/3503222.3507777

RecShard: statistical feature-based memory optimization for industry-scale neural recommendation

Abstract: We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor are all rows within an EMB equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intelligent EMB partitioning and placement across a tiered memory hierarchy. Second, in modern DLRMs, EMBs function as hash…
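
The abstract's first observation, that access frequency is highly skewed both across EMBs and across rows within an EMB, is what makes tiered placement pay off. As a purely illustrative sketch (not RecShard's actual policy; the function, capacities, and Zipf-distributed access counts are all assumptions), one could rank rows by observed access frequency and fill the fastest tier first:

```python
import numpy as np

def partition_rows_by_access(access_counts, hbm_capacity, dram_capacity):
    """Hypothetical tiering helper: hottest rows -> HBM, next -> DRAM,
    remainder -> SSD. Capacities are in rows; `access_counts` holds
    per-row access frequencies gathered from training data."""
    # Row indices sorted from most- to least-frequently accessed.
    order = np.argsort(access_counts)[::-1]
    hbm_rows = order[:hbm_capacity]
    dram_rows = order[hbm_capacity:hbm_capacity + dram_capacity]
    ssd_rows = order[hbm_capacity + dram_capacity:]
    return hbm_rows, dram_rows, ssd_rows

# Skewed (Zipf-like) access counts, typical of DLRM embedding tables
# where a small fraction of rows absorbs most lookups.
rng = np.random.default_rng(0)
counts = rng.zipf(1.5, size=10_000)
hot, warm, cold = partition_rows_by_access(counts, hbm_capacity=500, dram_capacity=2_000)
print(len(hot), len(warm), len(cold))  # 500 2000 7500
```

With a Zipf-like distribution, the 500 rows placed in HBM capture a disproportionate share of all lookups, which is exactly the skew the paper exploits.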

Cited by 22 publications (9 citation statements)
References 34 publications
“…Baselines. We compare DreamShard against human expert strategies from previous work [27,8,28], including size-based, dim-based, lookup-based, and size-lookup-based greedy balancing strategies. We also include an RNN-based RL algorithm [13], which uses an RNN architecture to map operators to devices.…”
Section: Methods
Confidence: 99%
“…Starting from the table with the highest cost, the greedy algorithm assigns tables one by one to the device with the lowest sum of costs so far, so that each device ends up with roughly an equal total cost. We consider heuristics with the following cost functions, which have been shown to perform strongly in prior work [26]: the size of the table, i.e., the product of its dimension and hash size (size-greedy); the dimension of the table (dim-greedy); and the product of the table's dimension and mean pooling factor (lookup-greedy). We further include a random sharding baseline (rand).…”
Section: Experimental Settings
Confidence: 99%
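
For concreteness, here is a minimal Python sketch of the greedy balancing heuristic the quote describes, alongside the three cost functions it names. The table names, (dimension, hash size, mean pooling factor) values, and two-device setup are illustrative assumptions, not figures from the cited work [26]:

```python
import heapq

def greedy_shard(costs, num_devices):
    """Visit tables in decreasing cost order and assign each to the
    device with the lowest running cost total (sketch of the cited
    heuristic). Returns a {table: device} mapping."""
    assignment = {}
    heap = [(0.0, d) for d in range(num_devices)]  # (total cost, device id)
    heapq.heapify(heap)
    for table in sorted(costs, key=costs.get, reverse=True):
        total, device = heapq.heappop(heap)
        assignment[table] = device
        heapq.heappush(heap, (total + costs[table], device))
    return assignment

# Tables described by (dimension, hash size, mean pooling factor).
tables = {"t0": (128, 10**6, 40), "t1": (64, 10**7, 5), "t2": (256, 10**5, 80)}
size_cost   = {t: dim * rows for t, (dim, rows, _) in tables.items()}   # size-greedy
dim_cost    = {t: dim for t, (dim, _, _) in tables.items()}             # dim-greedy
lookup_cost = {t: dim * pool for t, (dim, _, pool) in tables.items()}   # lookup-greedy

print(greedy_shard(size_cost, num_devices=2))  # {'t1': 0, 't0': 1, 't2': 1}
```

Under size-greedy, the single largest table (t1) lands alone on one device while the other two share the second, keeping the summed cost per device roughly balanced.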
“…DeepRecSys [35] and RecSSD [79] optimized inference requests across CPUs/GPUs and SSDs, respectively. Acun et al. [18] characterized DLRM architectures on GPU trainers, Sethi et al. [70] presented an optimized embedding sharding strategy for DLRM training, Maeng et al. [51] explored checkpointing trainer state, and AIBox [86] optimized training using hierarchical memory for parameters. However, these prior works do not discuss the DSI pipeline, a critical part of ML training.…”
Section: Related Work
Confidence: 99%