2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw52791.2021.00110
A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks

Cited by 6 publications (7 citation statements)
References 13 publications
“…A second article on PyDTNN [3] provided practical evidence that distributed training on GPUs using PyDTNN attains accuracy and parallel performance similar to those achieved by TensorFlow+Horovod on GPUs. In that case, the GPU backend of PyDTNN was used, which internally calls the NVIDIA cuDNN library to perform the operations related to the model's layers.…”
Section: Comparison With TensorFlow+Horovod (citation type: mentioning)
Confidence: 99%
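The comparison quoted above rests on the data-parallel scheme shared by PyDTNN and Horovod: each process trains on its own mini-batch shard, and the gradients are averaged across processes before every weight update. Below is a minimal, illustrative sketch of that gradient-averaging step using mpi4py; the function names and the plain SGD update are assumptions for illustration and do not reflect PyDTNN's actual API.

```python
# Minimal sketch (not PyDTNN's actual API) of data-parallel gradient
# averaging: each MPI rank computes local gradients and an Allreduce
# produces the global average used for the weight update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

def allreduce_average(local_grads):
    """Average a list of gradient arrays across all MPI ranks."""
    averaged = []
    for g in local_grads:
        buf = np.empty_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)   # sum the gradients of every rank
        averaged.append(buf / nprocs)        # divide by the rank count to average
    return averaged

def sgd_step(weights, grads, lr=0.01):
    """Plain SGD update applied with the globally averaged gradients."""
    for w, g in zip(weights, grads):
        w -= lr * g
```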
“…The use of the "reshape" operator A ≡ Reshape(F) there re-arranges the input 4D filter tensor F as the 2D matrix A. In addition, the reshape followed by a transpose, O ≡ Reshape(C)^T(1,2,0,3), where the superindex (1, 2, 0, 3) specifies the permutation applied to the dimensions of Reshape(C), re-organizes the resulting matrix C back into the 4D output tensor O.…”
Section: Convolution Operators Via GEMM: The Im2col In FP (citation type: mentioning)
Confidence: 99%
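For context on the quoted passage, the sketch below illustrates the im2col approach it describes: the 4D filter tensor F is reshaped into the 2D matrix A, the input patches are unrolled into a second matrix, a single GEMM produces C, and a reshape plus transpose turns C back into the 4D output tensor O. This is an illustrative reconstruction assuming an NCHW layout with stride 1 and no padding, so the final permutation here is (1, 0, 2, 3) rather than the (1, 2, 0, 3) of the quoted text, which corresponds to a different tensor layout; it is not PyDTNN's implementation.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll (n, c, h, w) input patches into a (c*kh*kw, n*oh*ow) matrix
    (stride 1, no padding, for brevity)."""
    n, c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c, kh, kw, n, oh, ow), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # every output position sees input offset (i, j) inside its patch
            cols[:, i, j] = x[:, :, i:i + oh, j:j + ow].transpose(1, 0, 2, 3)
    return cols.reshape(c * kh * kw, n * oh * ow), oh, ow

def conv2d_gemm(x, f):
    """Convolution as one GEMM: A = Reshape(F), C = A @ im2col(x), and a
    reshape plus transpose maps C back to the 4D output tensor O."""
    co, c, kh, kw = f.shape
    b, oh, ow = im2col(x, kh, kw)
    a = f.reshape(co, c * kh * kw)                          # A = Reshape(F)
    cmat = a @ b                                            # C has shape (co, n*oh*ow)
    n = x.shape[0]
    o = cmat.reshape(co, n, oh, ow).transpose(1, 0, 2, 3)   # O has shape (n, co, oh, ow)
    return o
```

As a quick sanity check, conv2d_gemm(np.random.rand(2, 3, 8, 8), np.random.rand(4, 3, 3, 3)) returns an output tensor of shape (2, 4, 6, 6).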