2020
DOI: 10.48550/arxiv.2008.11421
Preprint

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training be…
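One family of out-of-core methods keeps activations in host memory while the accelerator works on other layers. Below is a minimal illustrative sketch (not the paper's KARMA implementation) that uses PyTorch's saved_tensors_hooks to offload saved activations to the CPU during the forward pass and restore them for the backward pass; the toy model and tensor sizes are assumptions.

import torch
import torch.nn as nn

# Illustrative assumption: whatever accelerator is available, else fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

def pack_to_cpu(tensor):
    # Move each activation that autograd wants to save off the accelerator.
    return tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(tensor):
    # Bring the activation back to the compute device when backward needs it.
    return tensor.to(device, non_blocking=True)

# Toy model and batch; sizes are arbitrary assumptions for the sketch.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
x = torch.randn(32, 1024, device=device)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()   # saved activations live in host memory from here on
loss.backward()             # activations are copied back on demand during backward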

Cited by 3 publications (6 citation statements). References 28 publications.
“…Reduce required memory by using lower precision (quantization) [43], gradient checkpointing [6], out-of-core methods [49],…”
Section: Notation
confidence: 99%
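Gradient checkpointing (rematerialization), one of the techniques listed in the statement above, drops selected activations during the forward pass and recomputes them during backward, trading compute for memory. Below is a minimal sketch using PyTorch's built-in checkpoint utility; the toy block and tensor sizes are assumptions.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy module whose intermediate activations we choose not to store.
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 10)

x = torch.randn(16, 512)
# Activations inside `block` are discarded after the forward pass and
# recomputed when backward reaches this segment.
h = checkpoint(block, x, use_reentrant=False)
loss = head(h).sum()
loss.backward()   # parameter gradients are still produced despite the dropped activations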
“…Larger DL models, those using the ADAM optimizer in particular, may need a higher computation time for the weight update. Notably, large Transformer-based models report spending up to 45% of their time on the weight update and more than 60% extra memory, since ADAM requires four variables per weight [49]. One alternative to address this is to shard the weight update among GPUs across iterations, and Allgather the weights before the forward/backward passes [52].…”
Section: 3.3
confidence: 99%
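The sharded weight update with a subsequent Allgather mentioned above can be sketched in a single process by simulating the ranks. This is an illustrative ZeRO-style sketch, not the scheme of the cited papers; the world size, parameter count, and hyperparameters are assumptions, and no real communication takes place.

import numpy as np

world_size = 4                                        # number of simulated ranks (assumption)
params = np.random.randn(1024).astype(np.float32)     # full parameter vector (toy size)
grads = np.random.randn(1024).astype(np.float32)      # gradients, assumed already reduced

# Each rank owns one shard of the parameters and the matching Adam moments (m, v),
# so the four-variables-per-weight cost is split across ranks.
shards = np.array_split(params, world_size)
grad_shards = np.array_split(grads, world_size)
m = [np.zeros_like(s) for s in shards]                # first moment, one shard per rank
v = [np.zeros_like(s) for s in shards]                # second moment, one shard per rank

lr, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 1
for r in range(world_size):                           # each simulated rank updates only its shard
    m[r] = b1 * m[r] + (1 - b1) * grad_shards[r]
    v[r] = b2 * v[r] + (1 - b2) * grad_shards[r] ** 2
    m_hat = m[r] / (1 - b1 ** t)
    v_hat = v[r] / (1 - b2 ** t)
    shards[r] = shards[r] - lr * m_hat / (np.sqrt(v_hat) + eps)

# "Allgather": every rank reassembles the full, updated weights before the next forward pass.
params = np.concatenate(shards)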
“…In this scenario, our work combines tensor partition and rematerialization for memory management on DCG. Although these two optimization techniques can both reduce the memory footprint, they were developed separately [28,29] based on different application requirements. In fact, tensor partition was designed for training a large model on multiple machines, aiming at load balance to maximize parallelism and at locality to minimize network communication.…”
Section: Introduction
confidence: 99%
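Tensor partition can be illustrated for a single linear layer by splitting its weight matrix column-wise across two simulated devices; the sizes are assumptions, and the final concatenation stands in for the collective that real multi-machine training would perform over the network.

import numpy as np

x = np.random.randn(8, 256).astype(np.float32)        # activations, replicated on both devices
W = np.random.randn(256, 512).astype(np.float32)      # full weight matrix, kept only for reference

W0, W1 = np.split(W, 2, axis=1)                        # column partition: one half per device
y0 = x @ W0                                            # computed on simulated "device 0"
y1 = x @ W1                                            # computed on simulated "device 1"
y = np.concatenate([y0, y1], axis=1)                   # gather the partial outputs

# Each output column is the same dot product as in the unpartitioned layer,
# so the gathered result matches the reference computation.
assert np.allclose(y, x @ W, atol=1e-4)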
“…However, three complications make the combination non-trivial. (1) Since recent tensor partition plans [15,19,23,28,29,33] are usually defined before execution and fixed during training, these plans do not take into account the runtime information that essentially determines the tensor rematerialization. Hence, the first issue is how to achieve dynamic adjustment of tensor partition at runtime.…”
Section: Introduction
confidence: 99%
“…Reduce required memory by using lower precision (quantization) [71], gradient checkpointing [89], out-of-core methods [90],…”
confidence: 99%