2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2018.00021

Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning

Abstract: As the models and the datasets to train deep learning (DL) models scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithm that can be studied. We propose a memory-centric deep learning system that can transparently expand the memory capacity available to the accelerators while also providing fast inter-device communication for parallel training. Our proposal aggregates a pool…

Cited by 49 publications (23 citation statements)
References 64 publications (87 reference statements)

“…In relation to the latter, new solutions are being developed on specialized servers. They use a high-speed network for communication between decoupled memory modules, expanding the memory capacity available for training across multiple accelerators without going through the system bus, as Kwon proposes [14].…”
Section: Discussion
confidence: 99%
“…This type of architecture places memory modules decoupled from PCIe locally within the device interconnect, using NVLink for example. This maximizes the communication bandwidth between devices while expanding the total memory capacity of the system [14].…”
Section: Related Work
confidence: 99%
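The citing works above describe the pooled-memory idea at the architecture level. As a purely illustrative aside (not the paper's system), the following Python/PyTorch sketch shows the inter-device transfer primitive such designs build on: a direct GPU-to-GPU copy that uses peer access (e.g. over NVLink) when the driver supports it, rather than staging data through host memory over PCIe. The tensor size and device IDs are arbitrary placeholders.

# Illustrative sketch only: direct GPU-to-GPU tensor transfer in PyTorch.
# When peer access (e.g. over NVLink) is available between the two devices,
# the copy goes device-to-device instead of through host memory over PCIe.
import torch

if torch.cuda.device_count() >= 2:
    # report whether device 0 can directly access device 1's memory
    print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    src = torch.randn(1024, 1024, device="cuda:0")
    # .to() on another CUDA device issues a device-to-device copy when supported
    dst = src.to("cuda:1", non_blocking=True)
    torch.cuda.synchronize()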
“…Several follow-up works offer improvements over this first attempt. To reduce the overhead incurred by communication, some authors [35] recommend adding compression to decrease the communication time, while others [36] design a memory-centric architecture to help with data transfers. In [37,38], the authors implement memory virtualization by manipulating the computational graphs and inserting special operations, called swap in and swap out, that move activations in and out of GPU memory.…”
Section: Memory Offloading
confidence: 99%
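To make the swap-in/swap-out idea above concrete, here is a minimal, simplified sketch that offloads the activations saved for backward to host memory during the forward pass and fetches them back on demand during backward. It uses PyTorch's torch.autograd.graph.saved_tensors_hooks; the model and tensor sizes are arbitrary, and the cited systems [37,38] additionally rewrite the computational graph and overlap these copies with computation, which this sketch omits.

# A minimal sketch of activation offloading ("swap out" / "swap in").
# Assumes a CUDA device is available; prefetching and copy/compute overlap are omitted.
import torch
import torch.nn as nn

def swap_out(t):
    # called when autograd saves an activation for backward: move it to host memory
    return t.to("cpu", non_blocking=True)

def swap_in(t):
    # called when backward needs the activation: bring it back onto the GPU
    return t.to("cuda", non_blocking=True)

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(64, 4096, device="cuda")

with torch.autograd.graph.saved_tensors_hooks(swap_out, swap_in):
    loss = model(x).sum()   # saved activations are offloaded as the forward pass runs
loss.backward()             # activations are swapped back in on demand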
“…Second, adding more stacks horizontally on the silicon interposer is limited by the wiring complexity of the interposer and the pin counts of the chips [57]. Third, as GPGPU application working sets continue to grow [50,51], application developers will still need to take the size of GPU memory into account despite the increased capacity. Alternatively, dividing work across multiple GPUs or into smaller kernels with smaller memory footprints [30,53] requires non-trivial programming effort to break a complex GPU kernel apart.…”
Section: An Application-transparent Framework
confidence: 99%