Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020
DOI: 10.1145/3394486.3406703
DeepSpeed

Cited by 324 publications (115 citation statements)
References 1 publication
“…DeepSpeed, developed by Microsoft, is built on PyTorch. It allows three-way parallelism (model, data, and pipeline), which facilitates memory and communication efficiency [290]. DeepSpeed enables high throughput and low latency for large DL models (with a trillion or more parameters) by utilizing distributed computing resources.…”
Section: Software Framework for Large-Scale Distributed Training
confidence: 99%
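The usage pattern described in this excerpt can be made concrete with a minimal sketch (not taken from the cited paper): a PyTorch model is handed to DeepSpeed together with a configuration that enables data parallelism, ZeRO memory optimization, and mixed precision. The model and the config values below are illustrative placeholders.

```python
# Minimal sketch, assuming a DeepSpeed-launched multi-GPU job; model and
# config values are illustrative, not recommended settings.
import torch.nn as nn
import deepspeed

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_batch_size": 256,            # global batch across all data-parallel ranks
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 1},  # partition optimizer states across ranks
}

# deepspeed.initialize returns an engine that owns the optimizer, gradient
# all-reduce, and loss scaling; training then calls engine.backward()/engine.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```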
“…Note that DP and MP are orthogonal, so one can use both simultaneously to train larger models with higher computation and memory capacity. For example, Megatron-LM* [65] and DeepSpeed [73] … [12] stores low-precision approximate copies of activations while computing the forward pass exactly, which helps to reduce the overall memory consumption during training. The saved activations are then dequantized to the original precision in the backward pass to calculate gradients.…”
Section: Memory Efficiency
confidence: 99%
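The low-precision activation idea mentioned in this excerpt can be illustrated with a hedged sketch (not the exact method of the cited reference [12]): a custom autograd function computes the forward pass in full precision but saves the tensor needed for backward in fp16, casting it back when gradients are computed.

```python
# Illustrative sketch of activation compression, not the cited method:
# forward is exact, the saved activation is kept in fp16, and it is
# dequantized in backward before the gradient is formed.
import torch

class MemoryEfficientReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = torch.relu(x)
        # save a low-precision copy of the activation instead of the fp32 tensor
        ctx.save_for_backward(out.to(torch.float16))
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (saved,) = ctx.saved_tensors
        mask = saved.to(grad_out.dtype) > 0   # cast back before use
        return grad_out * mask

x = torch.randn(8, 16, requires_grad=True)
y = MemoryEfficientReLU.apply(x)
y.sum().backward()
```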
“…There are several widely adopted prototypes for training large Transformer models at scale, among which Microsoft DeepSpeed, HPC-AI Tech Colossal-AI, and Nvidia Megatron-LM are the pioneering ones. Specifically, DeepSpeed is implemented mainly based on [73] and the ZeRO series of works [72,74], Colossal-AI is built upon [8], and Megatron-LM implements [65]. All of these support data and model parallelism in mixed precision, along with other general practices such as offloading and rematerialization.…”
Section: Memory Efficiency
confidence: 99%
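As a hedged illustration of the kind of configuration these frameworks expose, the DeepSpeed-style config dict below combines mixed precision, ZeRO-3 offloading of parameters and optimizer states to CPU, and activation rematerialization (checkpointing); all values are illustrative rather than recommended settings.

```python
# Sketch of a DeepSpeed-style configuration; values are placeholders.
ds_config = {
    "train_batch_size": 512,
    "bf16": {"enabled": True},                    # mixed precision
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, optimizer states
        "offload_param": {"device": "cpu"},       # offload parameters to host memory
        "offload_optimizer": {"device": "cpu"},   # offload optimizer states to host memory
    },
    "activation_checkpointing": {                 # rematerialize activations in backward
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}
```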
“…We note that the implementation of both algorithms is problem agnostic and does not incorporate any prior information on the solutions to be approximated, which makes the performance of these algorithms dependent on the data size and model parameters. In the literature, data- and model-parallel approaches are primarily implemented for classification and natural language processing problems (Goyal et al., 2017; Rasley et al., 2020), which rely on large amounts of training data. Therefore, the efficiency of data- and model-parallel approaches for scientific machine learning, which is dominated by high-dimensional and sparse data sets, remains largely unexplored.…”
Section: Physics-Informed Neural Network
confidence: 99%
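For context, the data-parallel baseline this excerpt refers to can be sketched hypothetically as a small fully connected network wrapped in PyTorch DistributedDataParallel, with each rank training on its own shard of points; the network size, data, and hyperparameters are placeholders, not the authors' setup.

```python
# Hypothetical data-parallel training sketch (one process per GPU,
# launched with torchrun); sizes and data are placeholders.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1)).cuda(rank)
    net = DDP(net, device_ids=[rank])              # gradients are all-reduced automatically
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    # each rank trains on its own shard of sample points
    x = torch.rand(1024, 2, device=rank)
    target = torch.zeros(1024, 1, device=rank)
    for _ in range(100):
        opt.zero_grad()
        loss = ((net(x) - target) ** 2).mean()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    main()
```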