2022
DOI: 10.1109/tpds.2021.3094364

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

Abstract: DNNs of increasing computational complexity have achieved unprecedented successes in areas such as machine vision and natural language processing (NLP); for example, recent advanced Transformers have billions of parameters. However, because large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunat…
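
To make the partitioning idea from the abstract concrete, below is a minimal PyTorch-style sketch of splitting one model into two pipeline stages on separate devices. The layer sizes, split point, and device names are illustrative assumptions, not vPipe's actual partitioner; the sketch falls back to CPU when two GPUs are not available.

```python
# Minimal sketch of pipeline-parallel partitioning: a model is split into
# consecutive stages (subnets), each placed on its own device, and the
# activations flow from stage to stage.
import torch
import torch.nn as nn

def make_devices():
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")  # CPU fallback for the sketch

dev0, dev1 = make_devices()

# Hypothetical model, partitioned at an arbitrary layer boundary.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(4096, 10)).to(dev1)

x = torch.randn(32, 1024, device=dev0)
h = stage0(x)            # runs on the first device
y = stage1(h.to(dev1))   # activation is shipped to the second device
print(y.shape)           # torch.Size([32, 10])
```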

Cited by 23 publications (17 citation statements). References 36 publications.
“…GPipe [26], Pipedream [38], and Narayanan et al [39] proposed pipeline training to improve model parallelism, by dividing the forward and backward pass into several mini-batches, which are then pipelined across devices. vPipe [53] improves these works by providing higher GPU utilization. CoCoNet improves on these works by overlapping inter and intra-node communication operations.…”
Section: Related Work (mentioning)
confidence: 99%
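
The micro-batch pipelining attributed above to GPipe and PipeDream can be sketched as follows, assuming a plain PyTorch training loop. The sketch only illustrates splitting one mini-batch and accumulating gradients before a single optimizer step; it runs the micro-batches serially, whereas a real pipeline schedule overlaps them across stages.

```python
# GPipe-style micro-batching: one mini-batch is split into micro-batches,
# each flows through the stages, and gradients accumulate until one update.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
stage1 = nn.Linear(128, 10)
opt = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 64)              # one mini-batch
t = torch.randint(0, 10, (32,))
num_micro = 4

opt.zero_grad()
for xm, tm in zip(x.chunk(num_micro), t.chunk(num_micro)):
    loss = loss_fn(stage1(stage0(xm)), tm) / num_micro
    loss.backward()                  # gradients accumulate across micro-batches
opt.step()                           # a single update per mini-batch
```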
“…However, efficiently switching subnets between GPU and CPU memory is very challenging because the training of each subnet is usually very fast (e.g., 256 [9] samples), and the exploration schedule is generated by the exploration algorithm at runtime. Existing optimizations [11,22,25,30,48] towards DNN training memory reduction or GPU-CPU memory switching are all not designed for NAS supernet to capture correlations between subnets.…”
Section: Motivations (mentioning)
confidence: 99%
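
A minimal sketch of the GPU-CPU parameter switching the citing paper describes as challenging, assuming a PyTorch module stands in for one subnet. This is a generic offload/prefetch pattern, not vPipe's swap manager; the helper names prefetch_to_gpu and offload_to_cpu are hypothetical.

```python
# Generic offload/prefetch of a subnet's parameters between CPU and GPU memory.
# A real system overlaps these copies with compute; here they are issued plainly.
import torch
import torch.nn as nn

subnet = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

def prefetch_to_gpu(module: nn.Module, device: str = "cuda:0") -> None:
    # Bring the subnet's parameters back before its next scheduled run.
    module.to(device, non_blocking=True)

def offload_to_cpu(module: nn.Module) -> None:
    # Release GPU memory held by this subnet's parameters after it finishes.
    module.to("cpu")

if torch.cuda.is_available():
    prefetch_to_gpu(subnet)
    # ... the subnet runs its micro-batches here ...
    offload_to_cpu(subnet)
```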
“…However, considering optimal (balanced) partitions for all subnet execution, an operator often belongs to different stages (GPUs). One approach [48] is to on-demand migrate an operator between stages when it is needed by another subnet's best partition. However, as the subnet switching of a NAS supernet training is often at second-level frequency, this design inevitably incurs high initialization and synchronization costs.…”
Section: Motivations (mentioning)
confidence: 99%
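
The on-demand operator migration attributed to [48] can be illustrated with a small PyTorch sketch. The two-stage layout, device choices, and the migrate_last_op helper are hypothetical; the comments mark where the initialization and synchronization costs mentioned above arise.

```python
# Migrating one operator (layer) from stage 0 to stage 1 when a new partition
# assigns it to a different GPU. Falls back to CPU if two GPUs are absent.
import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpus >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpus >= 2 else torch.device("cpu")

stage0 = nn.ModuleList([nn.Linear(256, 256), nn.Linear(256, 256)]).to(dev0)
stage1 = nn.ModuleList([nn.Linear(256, 10)]).to(dev1)

def migrate_last_op(src: nn.ModuleList, dst: nn.ModuleList, device: torch.device) -> None:
    """Move the boundary operator of `src` onto `dst`'s device (hypothetical helper)."""
    op = src[-1]
    del src[-1]
    op.to(device)          # parameters are copied to the destination device
    dst.insert(0, op)
    # In a real system, the operator's optimizer state must also be moved or
    # rebuilt on the destination device (initialization cost), and both stages
    # must pause until the copy completes (synchronization cost).
    if torch.cuda.is_available():
        torch.cuda.synchronize()

migrate_last_op(stage0, stage1, dev1)
```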