2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
DOI: 10.1109/sbac-pad49847.2020.00023

High Performance and Portable Convolution Operators for Multicore Processors

Abstract: The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high-performance algorithms for the convolution operator present in this type of network. One of these approaches leverages the im2col transform followed by a general matrix multiplication (gemm) in order to take advantage of the highly optimized realizations of the gemm kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace…
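To make the im2col-plus-gemm approach the abstract refers to concrete, here is a minimal NumPy sketch (stride 1, no padding; the function names and simplifications are mine, not the paper's):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a (C*kh*kw, Ho*Wo) matrix of patches."""
    c, h, w = x.shape
    ho, wo = h - kh + 1, w - kw + 1           # stride 1, no padding
    cols = np.empty((c * kh * kw, ho * wo), dtype=x.dtype)
    idx = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Each row holds one (channel, kernel-offset) slice of the input,
                # flattened over all output positions.
                cols[idx] = x[ci, i:i + ho, j:j + wo].reshape(-1)
                idx += 1
    return cols

def conv2d_im2col(x, weights):
    """weights: (K, C, kh, kw) filters -> output (K, Ho, Wo)."""
    k, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw)                  # the large intermediate workspace
    w_mat = weights.reshape(k, c * kh * kw)   # filters as a (K, C*kh*kw) matrix
    out = w_mat @ cols                        # the single gemm call
    ho, wo = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(k, ho, wo)
```

Note that cols replicates (almost) every input element kh*kw times, which is exactly the large-workspace problem the abstract points to.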

Cited by 19 publications (19 citation statements), published between 2021 and 2023.
References 21 publications (27 reference statements).
“…We will focus on four algorithms in particular: im2col, blocking, Winograd convolutions, and FFT convolutions. im2col [14], Winograd [13], and FFT techniques [17] for performing convolutions are all well documented in the literature. We will focus on designing improved blocking algorithms.…”
Section: Attainability (mentioning)
confidence: 99%
“…In this paper, we extend our previous work in [26] to obtain an efficient integration of the convolution operators in a framework for distributed training of DNNs on clusters of computers equipped with multicore processors. In particular, this work makes the following contributions:…”
Section: Introduction (mentioning)
confidence: 97%
“…Unfortunately, there are two major problems with this approach: 1) a large memory workspace is required to host the intermediate matrix generated by the im2col transform; and, especially for training, 2) the time to apply this transform is not negligible for complex CNNs. In [26], we presented a portable high performance convolution algorithm based on the BLIS [33] realization of gemm, named convgemm, that practically eliminates the memory and time cost of the im2col transform, while maintaining the portability and performance of the underlying realization of the BLIS gemm for multicore processors.…”
Section: Introduction (mentioning)
confidence: 99%
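For intuition about how an approach like convgemm can avoid that workspace, here is a rough, hypothetical sketch of the on-the-fly packing idea: only one panel of the virtual im2col matrix is materialized per gemm block, so the buffer is bounded by a blocking parameter rather than by the output size. This is an illustrative toy in NumPy, not the authors' BLIS-integrated implementation, and all names are mine:

```python
import numpy as np

def pack_im2col_panel(x, kh, kw, col_start, col_end):
    """Materialize only columns [col_start, col_end) of the virtual
    im2col matrix, i.e. one gemm packing panel, never the whole matrix."""
    c, h, w = x.shape
    ho, wo = h - kh + 1, w - kw + 1
    panel = np.empty((c * kh * kw, col_end - col_start), dtype=x.dtype)
    for p, col in enumerate(range(col_start, col_end)):
        oi, oj = divmod(col, wo)              # output pixel for this column
        patch = x[:, oi:oi + kh, oj:oj + kw]  # its receptive field
        panel[:, p] = patch.reshape(-1)
    return panel

def convgemm_like(x, weights, nc=256):
    """Blocked gemm over the virtual im2col matrix, nc columns at a time."""
    k, c, kh, kw = weights.shape
    ho, wo = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    w_mat = weights.reshape(k, c * kh * kw)
    out = np.empty((k, ho * wo), dtype=x.dtype)
    for col in range(0, ho * wo, nc):
        end = min(col + nc, ho * wo)
        # The packing buffer is (C*kh*kw) x nc, independent of Ho*Wo.
        panel = pack_im2col_panel(x, kh, kw, col, end)
        out[:, col:end] = w_mat @ panel
    return out.reshape(k, ho, wo)
```

In the real algorithm this packing happens inside the gemm's own panel-packing routines, so the transform adds essentially no extra memory traffic; the toy version above only conveys the blocking structure.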