2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca51647.2021.00057

Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning

Cited by 28 publications (14 citation statements) · References 25 publications

Citation statements from citing publications (ordered by relevance):
“…We observe, however, that such parallel processing of multiple mini-batches invoke complex data hazards so ScratchPipe employs a novel hazard resolution mechanism to guarantee that the algorithmic nature of RecSys training is not altered. While not specifically focusing on recommendation models, there is a rich set of prior literature exploring heterogeneous memory systems for training large-scale ML algorithms [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49]. In general, the key contribution of our ScratchPipe is orthogonal to these prior studies.…”
Section: Related Work
confidence: 99%
“…The second approach uses compression techniques such as using low or mixed precision [16] for model training, saving on both model states and activations. The third approach uses an external memory such as the CPU memory as an extension of GPU memory to increase memory capacity during training [8,9,11,17,23,24,33].…”
Section: Background and Related Work
confidence: 99%
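The mixed-precision ("compression") approach summarized in the statement above can be sketched in a few lines. This is a minimal illustration assuming PyTorch's torch.cuda.amp utilities; the model, optimizer, loss_fn, and data names are placeholders, not code from the cited papers.

```python
import torch

# Sketch of one mixed-precision training step (illustrative, not the cited papers' code).
# The forward pass runs under autocast so activations are kept in reduced precision,
# shrinking activation memory; GradScaler guards against FP16 gradient underflow.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then applies the update
    scaler.update()                 # adjusts the loss scale for the next step
    return loss.item()
```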
“…Heterogeneous DL training is a promising approach to reduce GPU memory requirement by exploiting CPU memory. Many efforts have been made in this direction [8,9,11,17,23,24,32,33,34]. Nearly all of them target CNN based models, where activation memory is the memory bottleneck, and model size is fairly small (less than 500M).…”
Section: Introduction
confidence: 99%
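As a concrete illustration of the heterogeneous (CPU + GPU) memory idea these works share, the sketch below evicts a GPU tensor to pinned host memory and prefetches it back on a side CUDA stream. It is a minimal example built on plain PyTorch copies, with illustrative function names; it does not reproduce the mechanism of any specific cited system.

```python
import torch

copy_stream = torch.cuda.Stream()   # side stream so transfers can overlap compute

def offload_to_host(t: torch.Tensor) -> torch.Tensor:
    """Evict a GPU tensor to pinned CPU memory (asynchronous copy)."""
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        host.copy_(t, non_blocking=True)
    return host

def prefetch_to_device(host: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring an evicted tensor back to GPU memory ahead of its next use."""
    with torch.cuda.stream(copy_stream):
        return host.to(device, non_blocking=True)

# Before reading a prefetched tensor, the compute stream must wait on the copies:
# torch.cuda.current_stream().wait_stream(copy_stream)
```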
“…Memory over-commitment in NN training. Prior work studies using storage or slow memory (e.g., host memory) as an extension of fast memory (e.g., GPU memory) to increase memory capacity for NN training (Rhu et al, 2016;Hildebrand et al, 2020;Huang et al, 2020;Peng et al, 2020;Jin et al, 2018;Ren et al, 2021). However, most of these works target at optimizing the conventional offline learning scenarios by swapping optimizer states, activations, or model weights between the fast memory and slow memory (or storage), whereas we focus on swapping samples in between episodic memory and storage to tackle the forgetting problem in the context of continual learning.…”
Section: Related Work
confidence: 99%
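To make the swapping idea concrete, here is a minimal sketch of a fixed-capacity fast-memory cache that spills tensors to slow storage and faults them back in on access. The SwappedBuffer class, its eviction policy, and the file layout are assumptions for illustration only, not the design of the cited offloading or continual-learning systems.

```python
import os
import torch

class SwappedBuffer:
    """Keep at most `capacity` tensors resident in fast memory; spill the rest to disk."""

    def __init__(self, spill_dir: str, capacity: int):
        self.spill_dir = spill_dir
        self.capacity = capacity      # max tensors kept resident in RAM
        self.resident = {}            # key -> tensor currently in fast memory
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, key: str, tensor: torch.Tensor):
        if len(self.resident) >= self.capacity:
            # Evict one resident tensor to slow storage (LIFO here, for simplicity).
            victim, value = self.resident.popitem()
            torch.save(value, os.path.join(self.spill_dir, f"{victim}.pt"))
        self.resident[key] = tensor

    def get(self, key: str) -> torch.Tensor:
        if key not in self.resident:
            # Fault the tensor back in from slow storage on demand.
            path = os.path.join(self.spill_dir, f"{key}.pt")
            self.put(key, torch.load(path))
        return self.resident[key]
```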