2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00021

Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning

Abstract: As the models and the datasets to train deep learning (DL) models scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithm that can be studied. We propose a memory-centric deep learning system that can transparently expand the memory capacity available to the accelerators while also providing fast inter-device communication for parallel training. Our proposal aggregates a pool…

Cited by 49 publications (23 citation statements)
References 64 publications (87 reference statements)

“…In relation to the latter, new solutions are being developed on specialized servers. They use a high-speed network for communication between decoupled memory modules, expanding the memory capacity available for training across multiple accelerators without going through the system bus, as Kwon proposes [14].…”
Section: Discussion
confidence: 99%
“…This type of architecture places memory modules decoupled from PCIe locally within the device interconnect, using NVLink for example. This maximizes the communication bandwidth between devices while expanding the total memory capacity of the system [14].…”
Section: Related Work
confidence: 99%
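The citing works above describe the pooled-memory idea at the architecture level. As a purely illustrative aside (not the paper's system), the following Python/PyTorch sketch shows the inter-device transfer primitive such designs build on: a direct GPU-to-GPU copy that uses peer access (e.g. over NVLink) when the driver supports it, rather than staging data through host memory over PCIe. The tensor size and device IDs are arbitrary placeholders.

# Illustrative sketch only: direct GPU-to-GPU tensor transfer in PyTorch.
# When peer access (e.g. over NVLink) is available between the two devices,
# the copy goes device-to-device instead of through host memory over PCIe.
import torch

if torch.cuda.device_count() >= 2:
    # report whether device 0 can directly access device 1's memory
    print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    src = torch.randn(1024, 1024, device="cuda:0")
    # .to() on another CUDA device issues a device-to-device copy when supported
    dst = src.to("cuda:1", non_blocking=True)
    torch.cuda.synchronize()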
“…Several follow-up works offer improvements over this first attempt. To reduce the overhead incurred by communication, some authors [35] recommend adding compression to decrease the communication time, while others [36] design a memory-centric architecture to help with data transfers. In [37,38], the authors implement memory virtualization by manipulating the computational graphs and inserting special operations, called swap in and swap out, that move activations in and out of GPU memory.…”
Section: Memory Offloading
confidence: 99%
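To make the swap-in/swap-out idea above concrete, here is a minimal, simplified sketch that offloads the activations saved for backward to host memory during the forward pass and fetches them back on demand during backward. It uses PyTorch's torch.autograd.graph.saved_tensors_hooks; the model and tensor sizes are arbitrary, and the cited systems [37,38] additionally rewrite the computational graph and overlap these copies with computation, which this sketch omits.

# A minimal sketch of activation offloading ("swap out" / "swap in").
# Assumes a CUDA device is available; prefetching and copy/compute overlap are omitted.
import torch
import torch.nn as nn

def swap_out(t):
    # called when autograd saves an activation for backward: move it to host memory
    return t.to("cpu", non_blocking=True)

def swap_in(t):
    # called when backward needs the activation: bring it back onto the GPU
    return t.to("cuda", non_blocking=True)

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(64, 4096, device="cuda")

with torch.autograd.graph.saved_tensors_hooks(swap_out, swap_in):
    loss = model(x).sum()   # saved activations are offloaded as the forward pass runs
loss.backward()             # activations are swapped back in on demand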
“…Second, adding more stacks horizontally on the silicon interposer is limited by the wiring complexity of the interposer and the pin counts of the chips [57]. Third, as GPGPU application working sets continue to grow [50,51], application developers will still need to take the size of GPU memory into account despite the increased capacity. Alternatively, dividing work across multiple GPUs or into smaller kernels with smaller memory footprints [30,53] requires non-trivial programming effort to break a complex GPU kernel apart.…”
Section: An Application-transparent Framework
confidence: 99%