SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00023

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Cited by 11 publications (6 citation statements). References 20 publications.
“…However, these single GPU systems are not designed to cope with challenges stemming from pipeline parallelism (§2). A recent study [54] partially offloads the recompute overhead to the CPU processors. This work is complementary to VPIPE and can be integrated into VPIPE to further reduce the recompute overhead.…”
Section: Related Work (mentioning)
confidence: 99%
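The "recompute" referred to in this statement is activation checkpointing: intermediate activations are dropped during the forward pass and recomputed during backward. Below is a minimal PyTorch sketch of that single-GPU baseline; the model, tensor shapes, and the use_reentrant flag are illustrative assumptions, and the CPU offloading proposed by the cited study is not shown.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative two-stage model; layer sizes and batch shape are arbitrary assumptions.
stage1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
stage2 = torch.nn.Linear(1024, 10).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# checkpoint() discards stage1's intermediate activations during the forward
# pass and recomputes them on the GPU during backward; this recomputation is
# the overhead that the cited study [54] partially shifts onto the CPU.
h = checkpoint(stage1, x, use_reentrant=False)
loss = stage2(h).sum()
loss.backward()
```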
“…These algorithms move data back and forth between the CPU and the GPU to free up space on the GPU. KARMA [47] is a framework built over PyTorch that extends this out-of-core approach to data parallelism on multiple GPUs. They design an efficient algorithm for automatic offloading and prefetching of activations and parameters of the neural network to and from the CPU DRAM.…”
Section: Data Parallelism (mentioning)
confidence: 99%
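To make the offload-and-prefetch pattern described above concrete, the following is a minimal PyTorch sketch of moving one activation into pinned CPU DRAM after the forward pass and copying it back before it is reused. It illustrates only the general out-of-core idea, not KARMA's interface; the function names are hypothetical, and a real system overlaps these copies with compute and schedules them automatically, which this sketch does not.

```python
import torch

def offload_to_cpu(act: torch.Tensor) -> torch.Tensor:
    """Copy an activation into pinned host memory so the GPU copy can be freed."""
    cpu_buf = torch.empty(act.shape, dtype=act.dtype, pin_memory=True)
    cpu_buf.copy_(act)            # device-to-host copy; pinned memory keeps it fast
    return cpu_buf

def prefetch_to_gpu(cpu_buf: torch.Tensor) -> torch.Tensor:
    """Copy an offloaded activation back to the GPU before it is needed."""
    return cpu_buf.to("cuda", non_blocking=True)

# Illustrative usage around one layer: offload after forward, prefetch before
# the activation is consumed again (e.g. in the backward pass).
act = torch.relu(torch.randn(64, 4096, device="cuda"))
saved = offload_to_cpu(act)
del act                           # releases the GPU copy for later layers
restored = prefetch_to_gpu(saved)
torch.cuda.synchronize()          # ensure the host-to-device copy finished before use
```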
“…These nodes include NVIDIA Tesla A100, Google TPU, and Intel GAUDI accelerators. On such nodes, training efficiency depends on model parallelization (Horovod [24] and KARMA [28]) and on effective communication between accelerators over a specialized network such as NVIDIA NVLink [16]. Despite these innovative designs, increasingly deep and dense network topologies still strain the resources available for training, particularly memory capacity.…”
Section: Related Work (mentioning)
confidence: 99%
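As a concrete example of the data-parallel side of that stack, a minimal Horovod-over-PyTorch setup looks roughly like the following. The model, optimizer, and random data are placeholders; the collective gradient averaging then runs over whatever interconnect the node provides (e.g. NCCL over NVLink).

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across workers at each step.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make every worker start from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):                       # toy loop with random data
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```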