2017 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata.2017.8257926
ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity

Cited by 16 publications (11 citation statements); references 9 publications.
“…Many approaches, such as pipelining [38]-[41] and micro-batching [42], are orthogonal to our 3D spatial partitioning. Others directly target memory pressure during training but perform additional computation, including gradient accumulation [43], out-of-core algorithms [44]-[46], and recomputation [47], [48].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…vDNN [51] is a memory manager that virtualizes GPU memory in DNN training. ooc_cuDNN [25] extends cuDNN and applies cuDNN-compatible operators even when a layer exceeds GPU memory capacity, by swapping at the granularity of individual tensor dimensions. Gradient checkpointing [10] reduces the memory needed to store the intermediate outputs and gradients at the cost of doubling the forward-pass computation [10,26].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
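The dimension-wise swapping mentioned in the statement above can be illustrated with a small sketch. This is not the ooc_cuDNN API; it is a hedged approximation using PyTorch as a stand-in for cuDNN, assuming a bias-free convolution split along the input-channel dimension, so that only one channel slice and the partial output are GPU-resident at a time (the function name `ooc_conv2d` and the `chunk` parameter are illustrative).

```python
# Hedged sketch (not the ooc_cuDNN API) of out-of-core convolution that
# swaps data in and out along one tensor dimension. Splitting the
# input-channel dimension lets partial outputs be summed, so only one
# channel slice of the input and filters is on the GPU at a time.
import torch
import torch.nn.functional as F

def ooc_conv2d(x_cpu, w_cpu, chunk=64, device="cuda"):
    """x_cpu: (N, C, H, W) input, w_cpu: (K, C, R, S) filters, both on the
    host; peak GPU memory scales with `chunk`, not with C."""
    out = None
    for c0 in range(0, x_cpu.shape[1], chunk):
        c1 = min(c0 + chunk, x_cpu.shape[1])
        x_part = x_cpu[:, c0:c1].to(device)   # swap one input slice in
        w_part = w_cpu[:, c0:c1].to(device)   # matching filter slice
        partial = F.conv2d(x_part, w_part)    # partial sum over channels
        out = partial if out is None else out + partial
        del x_part, w_part                    # let the slice be freed
    return out.cpu()                          # swap the result back out
```

Splitting along the channel dimension keeps the partial results summable; splitting along the batch or spatial dimensions would instead require concatenating outputs, which is another granularity at which a layer can be tiled when it exceeds GPU memory.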
“…Several approaches to alleviating memory pressure on GPUs have been used. If at least one sample fits in GPU memory, an out-of-core "micro-batching" approach can be used, where mini-batches are split into micro-batches and updates are accumulated, although this can increase training time [43]. Other approaches use recomputation to avoid keeping intermediate values [44].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
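A minimal sketch of the micro-batching scheme described above, assuming a PyTorch-style training loop (the function `train_step` and its arguments are illustrative, not from the cited work): the mini-batch is split into micro-batches, gradients are accumulated across them, and a single optimizer update is applied, trading extra iterations (and thus training time) for lower peak memory.

```python
# Hedged sketch of micro-batching with gradient accumulation.
# Gradients accumulate over micro-batches; the optimizer steps once
# per mini-batch, so peak activation memory scales with micro_size.
import torch

def train_step(model, optimizer, loss_fn, mini_batch, targets, micro_size):
    optimizer.zero_grad()
    n = mini_batch.shape[0]
    for i in range(0, n, micro_size):
        xb = mini_batch[i:i + micro_size].cuda()
        yb = targets[i:i + micro_size].cuda()
        # scale so the accumulated gradient equals the full mini-batch mean
        loss = loss_fn(model(xb), yb) * (xb.shape[0] / n)
        loss.backward()              # gradients accumulate in .grad buffers
    optimizer.step()                 # one parameter update per mini-batch
```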