2018
DOI: 10.48550/arxiv.1811.09886
Preprint

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

Abstract: The application of deep learning techniques has resulted in remarkable improvements in machine learning models. In this paper we provide detailed characterizations of deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. Also, we highlight the need for better co-design …

Cited by 41 publications (58 citation statements)
References 47 publications
“…Simulations: N-body, ray tracing, and Monte Carlo [4,97,112,107]; and 7. Machine learning: various supervised and unsupervised learning algorithms are implemented using GEMM kernels. Deep learning utilizes GEMM kernels for convolution layers [78,66,22,75,102,8,89,103,108]. This thesis's motivation lies in improving the performance of SpGEMM kernels, which will have a significant impact on many important applications.…”
Section: Applications (mentioning)
confidence: 99%
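The claim that convolution layers reduce to GEMM kernels is worth making concrete. Below is a minimal NumPy sketch of the standard im2col lowering, in which one convolution becomes a single matrix multiply; the function names and shapes are illustrative, not taken from the cited works.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix
    so that a valid convolution becomes one matrix multiply."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w))
    idx = 0
    for ci in range(c):           # row order matches the flattened filter layout
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                idx += 1
    return cols

def conv2d_as_gemm(x, weights):
    """weights: (num_filters, C, kh, kw); returns (num_filters, out_h, out_w)."""
    f, c, kh, kw = weights.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    w_mat = weights.reshape(f, c * kh * kw)   # GEMM operand A
    x_mat = im2col(x, kh, kw)                 # GEMM operand B
    return (w_mat @ x_mat).reshape(f, out_h, out_w)

x = np.random.rand(3, 8, 8)      # one 3-channel 8x8 input
w = np.random.rand(4, 3, 3, 3)   # four 3x3 filters
print(conv2d_as_gemm(x, w).shape)  # (4, 6, 6)
```

Because the heavy lifting is a single dense matrix product, the layer inherits whatever GEMM performance the underlying BLAS library provides, which is why GEMM (and its sparse variant, SpGEMM) is the optimization target.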
“…While DNNs have demonstrated their effectiveness in various internet application domains, the cost of using DNNs for web-scale real-time online inference has become the major burden preventing most companies from adopting these techniques [11,17]. On the one hand, the time consumption (e.g., latency) of the online service is critical for user experience [5] and can influence the long-term retention rate [4]. On the other hand, the resource consumption (e.g., hardware and energy usage) of supporting DNNs requires significant serving-infrastructure investment (e.g., high-performance clusters) with higher power consumption, which sometimes pushes system design, implementation, and operation over budget [29].…”
Section: Introduction (mentioning)
confidence: 99%
“…With the number of categories as large as tens of millions for each feature, embedding tables can take up over 99.9% of the total memory; that is, the memory footprint can be multiple gigabytes or even terabytes [6,7,8]. In practice, deploying these large models often requires the model to be decomposed and distributed across different machines due to memory capacity restrictions [9].…”
Section: Introduction (mentioning)
confidence: 99%
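A quick back-of-envelope calculation shows how embedding tables come to dominate a recommendation model's footprint. The sketch below uses hypothetical table sizes (not figures from the cited papers) to reproduce the "over 99.9%" pattern.

```python
# Footprint of a recommendation model's embedding tables versus its
# MLP weights, with illustrative (hypothetical) sizes and fp32 parameters.
BYTES_PER_PARAM = 4  # fp32

# Hypothetical sparse features: (number of categories, embedding dimension)
tables = {
    "user_id":  (100_000_000, 64),
    "item_id":  (10_000_000,  64),
    "category": (100_000,     64),
}

embedding_bytes = sum(rows * dim for rows, dim in tables.values()) * BYTES_PER_PARAM
mlp_params = 5_000_000  # the dense layers are comparatively tiny
mlp_bytes = mlp_params * BYTES_PER_PARAM

total = embedding_bytes + mlp_bytes
print(f"embeddings: {embedding_bytes / 2**30:.1f} GiB "
      f"({100 * embedding_bytes / total:.2f}% of model)")
print(f"MLP:        {mlp_bytes / 2**30:.3f} GiB")
```

With these numbers the embeddings alone exceed 26 GiB and account for over 99.9% of the parameters, which is why such models must be sharded across machines once they outgrow a single node's memory.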
“…7 shows we only need around 100K samples for the Criteo dataset out of 4.5M samples. Below we discuss a few considerations in relation to LMA applied to the DLRM model.…”
(mentioning)
confidence: 99%