2021
DOI: 10.48550/arxiv.2111.05972
Preprint

Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

Cited by 4 publications (6 citation statements)
References 11 publications
“…To shorten training time, researchers distribute training across devices at scale using particular parallel strategies. Uniting data parallelism, pipeline model parallelism, and tensor model parallelism, 3D parallelism [4,18,19,26,29] leverages their merits and has become the SOTA distributed training method for large models. Data parallelism (DP).…”
Section: Background and Related Work
confidence: 99%
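
As a point of reference for the data-parallel axis named in the statement above, the following is a minimal single-process simulation of DP gradient averaging; the worker count, the toy linear model, and all variable names are illustrative assumptions, not taken from the SageMaker paper.

# Minimal single-process simulation of data parallelism (DP), one of the three
# axes combined in 3D parallelism. The model is a toy linear layer y = x @ W;
# each "worker" holds a full replica of W, computes gradients on its own shard
# of the global batch, and the gradients are averaged (the all-reduce step).
# All names here (num_workers, batch, ...) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
num_workers, batch, d_in, d_out = 4, 32, 16, 8

W = rng.normal(size=(d_in, d_out))                 # replicated parameters
x = rng.normal(size=(num_workers * batch, d_in))   # global batch of inputs
y = rng.normal(size=(num_workers * batch, d_out))  # targets

# Each worker computes a local gradient of the squared loss on its shard.
local_grads = []
for r in range(num_workers):
    xs = x[r * batch:(r + 1) * batch]
    ys = y[r * batch:(r + 1) * batch]
    err = xs @ W - ys
    local_grads.append(xs.T @ err / batch)         # dL/dW on this shard

# "All-reduce": average the local gradients so every replica takes the
# same optimizer step and the replicas stay in sync.
avg_grad = sum(local_grads) / num_workers

# Sanity check: the averaged gradient equals the full-batch gradient.
full_grad = x.T @ (x @ W - y) / (num_workers * batch)
assert np.allclose(avg_grad, full_grad)
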
“…Megatron-LM [41] analyses the architecture of transformer-based models [43] and divides weight matrices along the row or column dimension with additional AllReduce operations. SageMaker [19] implements a more memory-efficient solution by adopting Reduce-Scatter. A line of work [5,44,46] further extends TMP to more dimensions of the weight parameters and input tensors, reducing both activation redundancy and communication overheads.…”
Section: System
confidence: 99%
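
The statement above contrasts Megatron-LM's AllReduce with SageMaker's Reduce-Scatter for tensor model parallelism. The sketch below simulates a row-parallel linear layer in a single NumPy process to show what each rank holds after either collective; the split dimensions, sizes, and names are assumptions made for illustration, not the SageMaker library's actual implementation.

# Single-process NumPy simulation of tensor model parallelism (TMP) for one
# linear layer y = x @ W. W is split along its row (input) dimension across
# `tp` ranks, so each rank produces a partial sum of the full output.
# Megatron-style TMP AllReduces these partial sums (every rank materializes
# the full output), whereas a Reduce-Scatter leaves each rank with only a
# 1/tp slice of the reduced output, a smaller activation to keep in memory.
import numpy as np

rng = np.random.default_rng(1)
tp, batch, d_in, d_out = 4, 8, 16, 12   # d_in and batch divisible by tp

W = rng.normal(size=(d_in, d_out))
x = rng.normal(size=(batch, d_in))

# Row-parallel split: rank r holds a contiguous block of W's rows and the
# matching slice of the input activation.
shard = d_in // tp
partial = [x[:, r*shard:(r+1)*shard] @ W[r*shard:(r+1)*shard] for r in range(tp)]

# AllReduce (Megatron-style): every rank ends up with the full (batch, d_out)
# output -- here simply the elementwise sum of the partial results.
y_allreduce = sum(partial)

# Reduce-Scatter: the reduced output is scattered, here along the batch
# dimension, so rank r keeps only its own (batch // tp) rows of the sum.
rows = batch // tp
y_scattered = [y_allreduce[r*rows:(r+1)*rows] for r in range(tp)]

# Both collectives reconstruct the same mathematical result x @ W.
assert np.allclose(y_allreduce, x @ W)
assert np.allclose(np.concatenate(y_scattered), x @ W)
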
“…Popular options include TensorFlow [11], PyTorch [8], MXNet [22], PaddlePaddle [9], MindSpore [6], etc. Extensions and modifications have been built on top of these general-purpose learning systems for efficient distributed learning (e.g., Horovod [73], BytePS [41], Bagua [29], Megatron [75], ZeRO [69], SageMaker [42], etc.). However, even with these extensions, current general-purpose deep learning systems do not address the challenges of handling heterogeneity over a hybrid infrastructure.…”
Section: Distributed Deep Learning
confidence: 99%