2021
DOI: 10.48550/arxiv.2111.05972
Preprint

Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

Cited by 4 publications (6 citation statements)
References 11 publications
“…To shorten training time, researchers distribute training across devices at scale using particular parallel strategies. Uniting data parallelism, pipeline model parallelism, and tensor model parallelism, 3D parallelism [4,18,19,26,29] leverages their merits and has become the SOTA distributed training method for large models. Data parallelism (DP).…”
Section: Background and Related Work
confidence: 99%
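
As a point of reference for the data-parallel axis named in the statement above, the following is a minimal single-process simulation of DP gradient averaging; the worker count, the toy linear model, and all variable names are illustrative assumptions, not taken from the SageMaker paper.

# Minimal single-process simulation of data parallelism (DP), one of the three
# axes combined in 3D parallelism. The model is a toy linear layer y = x @ W;
# each "worker" holds a full replica of W, computes gradients on its own shard
# of the global batch, and the gradients are averaged (the all-reduce step).
# All names here (num_workers, batch, ...) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
num_workers, batch, d_in, d_out = 4, 32, 16, 8

W = rng.normal(size=(d_in, d_out))                 # replicated parameters
x = rng.normal(size=(num_workers * batch, d_in))   # global batch of inputs
y = rng.normal(size=(num_workers * batch, d_out))  # targets

# Each worker computes a local gradient of the squared loss on its shard.
local_grads = []
for r in range(num_workers):
    xs = x[r * batch:(r + 1) * batch]
    ys = y[r * batch:(r + 1) * batch]
    err = xs @ W - ys
    local_grads.append(xs.T @ err / batch)         # dL/dW on this shard

# "All-reduce": average the local gradients so every replica takes the
# same optimizer step and the replicas stay in sync.
avg_grad = sum(local_grads) / num_workers

# Sanity check: the averaged gradient equals the full-batch gradient.
full_grad = x.T @ (x @ W - y) / (num_workers * batch)
assert np.allclose(avg_grad, full_grad)
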
“…Megatron-LM [41] analyses the architecture of transformer-based models [43] and divides weight matrices along the row or column dimension with additional AllReduce operations. SageMaker [19] implements a more memory-efficient solution by adopting Reduce-Scatter. A line of work [5,44,46] further extends TMP to more dimensions of the weight parameters and input tensors, reducing both activation redundancy and communication overheads.…”
Section: System
confidence: 99%
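
The statement above contrasts Megatron-LM's AllReduce with SageMaker's Reduce-Scatter for tensor model parallelism. The sketch below simulates a row-parallel linear layer in a single NumPy process to show what each rank holds after either collective; the split dimensions, sizes, and names are assumptions made for illustration, not the SageMaker library's actual implementation.

# Single-process NumPy simulation of tensor model parallelism (TMP) for one
# linear layer y = x @ W. W is split along its row (input) dimension across
# `tp` ranks, so each rank produces a partial sum of the full output.
# Megatron-style TMP AllReduces these partial sums (every rank materializes
# the full output), whereas a Reduce-Scatter leaves each rank with only a
# 1/tp slice of the reduced output, a smaller activation to keep in memory.
import numpy as np

rng = np.random.default_rng(1)
tp, batch, d_in, d_out = 4, 8, 16, 12   # d_in and batch divisible by tp

W = rng.normal(size=(d_in, d_out))
x = rng.normal(size=(batch, d_in))

# Row-parallel split: rank r holds a contiguous block of W's rows and the
# matching slice of the input activation.
shard = d_in // tp
partial = [x[:, r*shard:(r+1)*shard] @ W[r*shard:(r+1)*shard] for r in range(tp)]

# AllReduce (Megatron-style): every rank ends up with the full (batch, d_out)
# output -- here simply the elementwise sum of the partial results.
y_allreduce = sum(partial)

# Reduce-Scatter: the reduced output is scattered, here along the batch
# dimension, so rank r keeps only its own (batch // tp) rows of the sum.
rows = batch // tp
y_scattered = [y_allreduce[r*rows:(r+1)*rows] for r in range(tp)]

# Both collectives reconstruct the same mathematical result x @ W.
assert np.allclose(y_allreduce, x @ W)
assert np.allclose(np.concatenate(y_scattered), x @ W)
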
“…Popular options include TensorFlow [11], PyTorch [8], MXNet [22], PaddlePaddle [9], MindSpore [6], etc. Extensions and modifications have been built on top of these general-purpose learning systems for efficient distributed learning (e.g., Horovod [73], BytePS [41], Bagua [29], Megatron [75], ZeRO [69], SageMaker [42], etc.). However, even with these extensions, current general-purpose deep learning systems do not address the challenges of handling heterogeneity over a hybrid infrastructure.…”
Section: Distributed Deep Learning
confidence: 99%