2022
DOI: 10.1109/tpds.2021.3094364

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

Abstract: DNNs of increasing computational complexity have achieved unprecedented successes in areas such as machine vision and natural language processing (NLP); for example, recent advanced Transformers have billions of parameters. However, because large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunat…
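
To make the partitioning idea from the abstract concrete, below is a minimal PyTorch-style sketch of splitting one model into two pipeline stages on separate devices. The layer sizes, split point, and device names are illustrative assumptions, not vPipe's actual partitioner; the sketch falls back to CPU when two GPUs are not available.

```python
# Minimal sketch of pipeline-parallel partitioning: a model is split into
# consecutive stages (subnets), each placed on its own device, and the
# activations flow from stage to stage.
import torch
import torch.nn as nn

def make_devices():
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")  # CPU fallback for the sketch

dev0, dev1 = make_devices()

# Hypothetical model, partitioned at an arbitrary layer boundary.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
stage1 = nn.Sequential(nn.Linear(4096, 10)).to(dev1)

x = torch.randn(32, 1024, device=dev0)
h = stage0(x)            # runs on the first device
y = stage1(h.to(dev1))   # activation is shipped to the second device
print(y.shape)           # torch.Size([32, 10])
```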

Cited by 23 publications (17 citation statements). References 36 publications.
“…GPipe [26], Pipedream [38], and Narayanan et al [39] proposed pipeline training to improve model parallelism, by dividing the forward and backward pass into several mini-batches, which are then pipelined across devices. vPipe [53] improves these works by providing higher GPU utilization. CoCoNet improves on these works by overlapping inter and intra-node communication operations.…”
Section: Related Work (mentioning)
confidence: 99%
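
The micro-batch pipelining attributed above to GPipe and PipeDream can be sketched as follows, assuming a plain PyTorch training loop. The sketch only illustrates splitting one mini-batch and accumulating gradients before a single optimizer step; it runs the micro-batches serially, whereas a real pipeline schedule overlaps them across stages.

```python
# GPipe-style micro-batching: one mini-batch is split into micro-batches,
# each flows through the stages, and gradients accumulate until one update.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
stage1 = nn.Linear(128, 10)
opt = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 64)              # one mini-batch
t = torch.randint(0, 10, (32,))
num_micro = 4

opt.zero_grad()
for xm, tm in zip(x.chunk(num_micro), t.chunk(num_micro)):
    loss = loss_fn(stage1(stage0(xm)), tm) / num_micro
    loss.backward()                  # gradients accumulate across micro-batches
opt.step()                           # a single update per mini-batch
```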
“…However, efficiently switching subnets between GPU and CPU memory is very challenging because the training of each subnet is usually very fast (e.g., 256 [9] samples), and the exploration schedule is generated by the exploration algorithm at runtime. Existing optimizations [11,22,25,30,48] towards DNN training memory reduction or GPU-CPU memory switching are all not designed for NAS supernet to capture correlations between subnets.…”
Section: Motivations (mentioning)
confidence: 99%
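
A minimal sketch of the GPU-CPU parameter switching the citing paper describes as challenging, assuming a PyTorch module stands in for one subnet. This is a generic offload/prefetch pattern, not vPipe's swap manager; the helper names prefetch_to_gpu and offload_to_cpu are hypothetical.

```python
# Generic offload/prefetch of a subnet's parameters between CPU and GPU memory.
# A real system overlaps these copies with compute; here they are issued plainly.
import torch
import torch.nn as nn

subnet = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

def prefetch_to_gpu(module: nn.Module, device: str = "cuda:0") -> None:
    # Bring the subnet's parameters back before its next scheduled run.
    module.to(device, non_blocking=True)

def offload_to_cpu(module: nn.Module) -> None:
    # Release GPU memory held by this subnet's parameters after it finishes.
    module.to("cpu")

if torch.cuda.is_available():
    prefetch_to_gpu(subnet)
    # ... the subnet runs its micro-batches here ...
    offload_to_cpu(subnet)
```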
“…However, considering optimal (balanced) partitions for all subnet execution, an operator often belongs to different stages (GPUs). One approach [48] is to on-demand migrate an operator between stages when it is needed by another subnet's best partition. However, as the subnet switching of a NAS supernet training is often at second-level frequency, this design inevitably incurs high initialization and synchronization costs.…”
Section: Motivations (mentioning)
confidence: 99%
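
The on-demand operator migration attributed to [48] can be illustrated with a small PyTorch sketch. The two-stage layout, device choices, and the migrate_last_op helper are hypothetical; the comments mark where the initialization and synchronization costs mentioned above arise.

```python
# Migrating one operator (layer) from stage 0 to stage 1 when a new partition
# assigns it to a different GPU. Falls back to CPU if two GPUs are absent.
import torch
import torch.nn as nn

n_gpus = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpus >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpus >= 2 else torch.device("cpu")

stage0 = nn.ModuleList([nn.Linear(256, 256), nn.Linear(256, 256)]).to(dev0)
stage1 = nn.ModuleList([nn.Linear(256, 10)]).to(dev1)

def migrate_last_op(src: nn.ModuleList, dst: nn.ModuleList, device: torch.device) -> None:
    """Move the boundary operator of `src` onto `dst`'s device (hypothetical helper)."""
    op = src[-1]
    del src[-1]
    op.to(device)          # parameters are copied to the destination device
    dst.insert(0, op)
    # In a real system, the operator's optimizer state must also be moved or
    # rebuilt on the destination device (initialization cost), and both stages
    # must pause until the copy completes (synchronization cost).
    if torch.cuda.is_available():
        torch.cuda.synchronize()

migrate_last_op(stage0, stage1, dev1)
```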