Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020
DOI: 10.1145/3394486.3406703
DeepSpeed

Cited by 324 publications (115 citation statements)
References 1 publication
“…DeepSpeed, developed by Microsoft, is built on PyTorch. It allows three-way parallelism (model, data, and pipeline), which facilitates memory and communication efficiency [290]. DeepSpeed enables high throughput and low latency for large DL models (with a trillion or more parameters) by utilizing distributed computing resources.…”
Section: Software Framework for Large-Scale Distributed Training
confidence: 99%
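The usage pattern described in this excerpt can be made concrete with a minimal sketch (not taken from the cited paper): a PyTorch model is handed to DeepSpeed together with a configuration that enables data parallelism, ZeRO memory optimization, and mixed precision. The model and the config values below are illustrative placeholders.

```python
# Minimal sketch, assuming a DeepSpeed-launched multi-GPU job; model and
# config values are illustrative, not recommended settings.
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 256,            # global batch across all data-parallel ranks
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 1},  # partition optimizer states across ranks
}

# deepspeed.initialize returns an engine that owns the optimizer, gradient
# all-reduce, and loss scaling; training then calls engine.backward()/engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```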
“…Note that DP and MP are orthogonal, so one can use both simultaneously to train larger models with higher computation and memory capacity. For example, Megatron-LM* [65] and DeepSpeed [73] … [12] stores low-precision approximate copies of activations while computing the forward pass exactly, which helps to reduce the overall memory consumption during training. The saved activations are then dequantized to the original precision in the backward pass to calculate gradients.…”
Section: Memory Efficiency
confidence: 99%
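The low-precision activation idea mentioned in this excerpt can be illustrated with a hedged sketch (not the exact method of the cited reference [12]): a custom autograd function computes the forward pass in full precision but saves the tensor needed for backward in fp16, casting it back when gradients are computed.

```python
# Illustrative sketch of activation compression, not the cited method:
# forward is exact, the saved activation is kept in fp16, and it is
# dequantized in backward before the gradient is formed.
import torch

class MemoryEfficientReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = torch.relu(x)
        # save a low-precision copy of the activation instead of the fp32 tensor
        ctx.save_for_backward(out.to(torch.float16))
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (saved,) = ctx.saved_tensors
        mask = saved.to(grad_out.dtype) > 0   # cast back before use
        return grad_out * mask

x = torch.randn(8, 16, requires_grad=True)
y = MemoryEfficientReLU.apply(x)
y.sum().backward()
```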
“…There are several widely adopted prototypes for training large Transformer models at scale, among which Microsoft DeepSpeed, HPC-AI Tech Colossal-AI, and Nvidia Megatron-LM are the pioneering ones. Specifically, DeepSpeed is implemented mainly based on [73] and the ZeRO series of works [72,74], Colossal-AI is built upon [8], and Megatron-LM implements [65]. All of these support data and model parallelism in mixed precision, along with other general practices such as offloading and rematerialization.…”
Section: Memory Efficiency
confidence: 99%
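As a hedged illustration of the kind of configuration these frameworks expose, the DeepSpeed-style config dict below combines mixed precision, ZeRO-3 offloading of parameters and optimizer states to CPU, and activation rematerialization (checkpointing); all values are illustrative rather than recommended settings.

```python
# Sketch of a DeepSpeed-style configuration; values are placeholders.
ds_config = {
    "train_batch_size": 512,
    "bf16": {"enabled": True},                    # mixed precision
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, optimizer states
        "offload_param": {"device": "cpu"},       # offload parameters to host memory
        "offload_optimizer": {"device": "cpu"},   # offload optimizer states to host memory
    },
    "activation_checkpointing": {                 # rematerialize activations in backward
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}
```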
“…We note that the implementation of both algorithms is problem agnostic and does not incorporate any prior information on the solutions to be approximated, which makes the performance of these algorithms dependent on the data size and model parameters. In the literature, data- and model-parallel approaches are primarily implemented for classification and natural language processing problems (Goyal et al., 2017; Rasley et al., 2020), which rely on large amounts of training data. Therefore, the efficiency of data- and model-parallel approaches for scientific machine learning, which is dominated by high-dimensional and sparse data sets, remains largely unexplored.…”
Section: Physics-Informed Neural Network
confidence: 99%
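For context, the data-parallel baseline this excerpt refers to can be sketched hypothetically as a small fully connected network wrapped in PyTorch DistributedDataParallel, with each rank training on its own shard of points; the network size, data, and hyperparameters are placeholders, not the authors' setup.

```python
# Hypothetical data-parallel training sketch (one process per GPU,
# launched with torchrun); sizes and data are placeholders.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1)).cuda(rank)
    net = DDP(net, device_ids=[rank])              # gradients are all-reduced automatically
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    # each rank trains on its own shard of sample points
    x = torch.rand(1024, 2, device=rank)
    target = torch.zeros(1024, 1, device=rank)
    for _ in range(100):
        opt.zero_grad()
        loss = ((net(x) - target) ** 2).mean()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    main()
```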