Deep learning recommendation models (DLRMs) are used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, Neo enables extremely high-performance and memory-efficient embedding computations through a variety of critical systems optimizations, including hybrid kernel fusion, software-managed caching, and quality-preserving compression. Finally, Neo is paired with ZionEX, a new hardware platform co-designed with Neo's 4D parallelism to optimize communications for large-scale DLRM training. Our evaluation on 128 GPUs across 16 ZionEX nodes shows that Neo outperforms existing systems by up to 40× when training 12-trillion-parameter DLRMs deployed in production.
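To make the 4D parallelism idea concrete, below is a minimal sketch of how embedding tables might be assigned a parallelism mode by shape. This is not Neo's implementation: all names (Table, Shard, plan_sharding) and the thresholds are hypothetical illustrations, and a real planner would also balance memory and communication cost across devices.

```python
# Illustrative sketch of sharding embedding tables under a 4D scheme
# (table-, row-, column-, and data parallelism). Hypothetical names;
# the placement heuristic is deliberately simplistic.
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    num_rows: int
    dim: int

@dataclass
class Shard:
    table: str
    device: int
    row_range: tuple  # (start, end) rows held on this device
    col_range: tuple  # (start, end) embedding dimensions on this device

def plan_sharding(tables, num_devices, row_threshold=1_000_000, col_threshold=256):
    """Assign each table a parallelism mode based on its shape.

    - Tall tables are split across devices by rows (row-wise).
    - Wide tables are split across devices by columns (column-wise).
    - Small tables are placed whole on one device (table-wise).
    Dense (MLP) parameters, not shown, would be replicated and trained
    data-parallel -- the fourth dimension.
    """
    plan, next_dev = [], 0
    for t in tables:
        if t.num_rows >= row_threshold:                   # row-wise
            rows_per_dev = -(-t.num_rows // num_devices)  # ceiling division
            for d in range(num_devices):
                lo = d * rows_per_dev
                hi = min(lo + rows_per_dev, t.num_rows)
                plan.append(Shard(t.name, d, (lo, hi), (0, t.dim)))
        elif t.dim >= col_threshold:                      # column-wise
            cols_per_dev = -(-t.dim // num_devices)
            for d in range(num_devices):
                lo = d * cols_per_dev
                hi = min(lo + cols_per_dev, t.dim)
                plan.append(Shard(t.name, d, (0, t.num_rows), (lo, hi)))
        else:                                             # table-wise
            plan.append(Shard(t.name, next_dev, (0, t.num_rows), (0, t.dim)))
            next_dev = (next_dev + 1) % num_devices
    return plan

if __name__ == "__main__":
    tables = [Table("clicks", 50_000_000, 128),   # tall -> row-wise
              Table("country", 200, 512),         # wide -> column-wise
              Table("page_type", 300, 64)]        # small -> table-wise
    for shard in plan_sharding(tables, num_devices=4):
        print(shard)
```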
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSAs) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples that are reused across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways from our production infrastructure characterization, including identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.

CCS Concepts: • Software and its engineering → Distributed systems organizing principles; • Information systems → Database management system engines; • Computing methodologies → Machine learning.
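The read-filter-preprocess flow the abstract describes can be sketched as a simple staged pipeline. This is only a single-process illustration under assumed names (read_warehouse_rows, filter_and_project, preprocess, WANTED_FEATURES); the production service is a distributed, scaled-out tier, not a generator chain.

```python
# Minimal sketch of a per-job read -> filter -> preprocess pipeline.
# All names and the toy schema are hypothetical illustrations.
import random

WANTED_FEATURES = {"user_id", "item_id", "dwell_time"}  # assumed job schema

def read_warehouse_rows(num_rows):
    """Stand-in for scanning a warehouse partition; yields raw rows."""
    for i in range(num_rows):
        yield {"user_id": i, "item_id": i % 97, "dwell_time": random.random(),
               "unused_feature": "dropped downstream", "label": i % 2}

def filter_and_project(rows):
    """Heavy filtering: drop rows the job does not train on and project
    away unused features -- the source of the read amplification the
    paper characterizes."""
    for row in rows:
        if row["dwell_time"] < 0.5:  # job-specific sampling predicate
            continue
        yield {k: v for k, v in row.items() if k in WANTED_FEATURES | {"label"}}

def preprocess(rows):
    """Per-sample transforms (bucketizing sparse IDs, normalizing dense
    features); this CPU- and memory-intensive step is what the
    preprocessing tier scales out to avoid data stalls."""
    for row in rows:
        row["item_id"] = hash(row["item_id"]) % 10_000          # bucketize
        row["dwell_time"] = min(row["dwell_time"] / 10.0, 1.0)  # normalize
        yield row

if __name__ == "__main__":
    batch = list(preprocess(filter_and_project(read_warehouse_rows(1000))))
    print(f"{len(batch)} samples survived filtering out of 1000 read")
```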