2023
DOI: 10.1016/j.sysarc.2022.102806
Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Cited by 17 publications (21 citation statements)
References 8 publications
“…In previous work [19], we combined the blocking strategy in [4] for the direct convolution algorithm with the packing schemes employed in the high-performance formulation of gemm [20]. The result was a new blocked version of the direct convolution, referred to as ConvDirect and illustrated by the algorithm in Listing 2, with the following properties:…”
Section: Blocked Algorithm for Direct Convolution
Citation type: mentioning (confidence: 99%)
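To make the structure described in this excerpt concrete, the following is a minimal sketch of a blocked direct convolution over the NHWC layout. It is not the ConvDirect code from [19]: the loop order, the blocking and micro-tile factors (KC, MR, NR), and all identifiers are illustrative assumptions, and the remainder loops are omitted.

```c
/* Minimal sketch of a blocked direct convolution over NHWC data.
 * NOT the ConvDirect code from [19]: blocking factors, loop order and
 * identifiers are illustrative. Stride 1, no padding, fp32, and the
 * remainder loops (when MR, NR or KC do not divide the dims) are omitted. */
#include <stddef.h>

enum { MR = 4,   /* micro-tile of output pixels   (assumed)               */
       NR = 4,   /* micro-tile of output channels (assumed)               */
       KC = 64   /* cache-blocking factor on the input channels (assumed) */ };

/* Tiny strided "micro-kernel": accumulates an MR x NR output tile for one
 * (kh, kw) filter offset over kc input channels. A real implementation is an
 * architecture-specific SIMD kernel operating on packed operands. */
static void micro_kernel(int kc, const float *in, int in_stride,
                         const float *flt, int flt_stride,
                         float *out, int out_stride)
{
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j) {
            float acc = out[i * out_stride + j];
            for (int c = 0; c < kc; ++c)
                acc += in[i * in_stride + c] * flt[c * flt_stride + j];
            out[i * out_stride + j] = acc;
        }
}

/* in : N x H  x W  x C   (NHWC)
 * flt: KH x KW x C x K   (filter)
 * out: N x HO x WO x K   (NHWC), assumed zero-initialised */
void conv_direct_blocked(int N, int H, int W, int C, int K, int KH, int KW,
                         const float *in, const float *flt, float *out)
{
    int HO = H - KH + 1, WO = W - KW + 1;
    for (int n = 0; n < N; ++n)
      for (int ho = 0; ho < HO; ++ho)
        for (int wo = 0; wo + MR <= WO; wo += MR)      /* tile of output pixels   */
          for (int k = 0; k + NR <= K; k += NR)        /* tile of output channels */
            for (int cb = 0; cb < C; cb += KC) {       /* channel blocking        */
              int kc = (C - cb < KC) ? (C - cb) : KC;
              for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw) {
                  const float *ip = in  + (((size_t)n * H + ho + kh) * W + wo + kw) * C + cb;
                  const float *fp = flt + (((size_t)kh * KW + kw) * C + cb) * K + k;
                  float       *op = out + (((size_t)n * HO + ho) * WO + wo) * K + k;
                  micro_kernel(kc, ip, C, fp, K, op, K);
                }
            }
}
```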
“…A significant key to attaining high performance in the blocked direct convolution lies in the utilisation of an architecture-specific micro-kernel. The decoupling of the micro-tile dimensions from the cache blocking parameters combined with the packing of the input tensor facilitates leveraging existing high-performance micro-kernels, specifically tuned for a concrete processor architecture [19]. The advantage of our approach is to directly handle the well-adopted NHWC data layout, avoiding the tensor transformation overhead of previous algorithm designs [4].…”
Section: Blocked Algorithm for Direct Convolution
Citation type: mentioning (confidence: 99%)
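The packing this excerpt refers to can be pictured with the sketch below, which copies an NHWC input block into MR-wide micro-panels so that a pre-existing, architecture-tuned micro-kernel can read contiguous data. The panel layout, the MR value, and the function name are assumptions made for illustration, not the paper's code.

```c
/* Sketch of packing an NHWC input block into MR-wide micro-panels so that a
 * pre-existing, architecture-tuned gemm micro-kernel can consume it directly.
 * Names and the exact panel layout are assumptions, not the paper's code. */
#include <stddef.h>

#define MR 4  /* micro-tile height used by the (hypothetical) micro-kernel */

/* Pack `mc` output pixels (along the W dimension) by `kc` input channels,
 * starting at base pointer `in` with a stride of `rs` floats between
 * consecutive output pixels (rs == C for a unit-stride NHWC convolution).
 * The packed buffer holds one MR x kc micro-panel after another, with the
 * MR elements of each channel contiguous -- the layout BLIS-like
 * micro-kernels expect for the "A" operand. */
void pack_input_panels(int mc, int kc, const float *in, int rs, float *packed)
{
    for (int i = 0; i < mc; i += MR) {
        int mr = (mc - i < MR) ? (mc - i) : MR;
        for (int c = 0; c < kc; ++c) {
            for (int r = 0; r < mr; ++r)
                *packed++ = in[(size_t)(i + r) * rs + c];
            for (int r = mr; r < MR; ++r)   /* zero-pad the last partial panel */
                *packed++ = 0.0f;
        }
    }
}
```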
“…Following the work from Zhang et al [28], Barrachina et al [3] propose two new direct-convolution algorithms for the NHWC layout (batch 𝑁, height 𝐻, width 𝑊, and channels 𝐶) on ARM processors. Like SConv, they tile in the channel dimension and use a BLAS micro-kernel.…”
Section: SConv Reduces Cache Misses in All Levels of Cache
Citation type: mentioning (confidence: 99%)
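Both works delegate the innermost computation to a BLAS(BLIS)-style micro-kernel. The portable C routine below is only a stand-in for what such a kernel computes on packed panels, a C(MR x NR) += A(MR x kc) * B(kc x NR) update; production kernels for ARM are written with NEON or SVE intrinsics or in assembly, and the names and tile sizes here are assumptions.

```c
/* Portable stand-in for a BLIS-style micro-kernel: C(MR x NR) += A * B over
 * packed panels. A holds kc slices of MR contiguous values, B holds kc slices
 * of NR contiguous values; `ldc` is the row stride of the output tile. */
enum { MR = 4, NR = 4 };   /* illustrative micro-tile sizes */

void ukernel_mrxnr(int kc, const float *a, const float *b, float *c, int ldc)
{
    float acc[MR][NR] = {{0.0f}};
    for (int p = 0; p < kc; ++p)            /* one rank-1 update per packed "k" */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    for (int i = 0; i < MR; ++i)            /* accumulate into the output tile  */
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```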
“…In previous work, direct convolution outperforms the traditional Im2Col followed by GEMM approach under certain conditions [3,28]. This paper presents SConv: a direct-convolution algorithm that uses architectural information to improve convolution's cache utilization and ISA extensions to accelerate data packing and computation, suitable for SIMD architectures.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…These implementations may often have different data layout requirements, which means that data reshape routines are often required to perform data permutations between the layout requirements of consecutive layers. A classic example is the need to perform the im2col transformation, either implicitly or explicitly [8,9], in order to leverage high-performance matrix multiplication routines. These routines need to support different layouts, and the routines that transform between layouts further increase the size of the code base that needs to be supported.…”
Section: Background, 2.1 Expert ML Libraries
Citation type: mentioning (confidence: 99%)
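The data-reshape overhead this excerpt mentions can be made concrete with an explicit im2col transform for NHWC data. The sketch below is illustrative only (the function name, layout and stride-1/no-padding assumptions are mine); after the transform, the convolution reduces to a single GEMM of the (HO*WO) x (KH*KW*C) matrix with a (KH*KW*C) x K filter matrix.

```c
/* Minimal sketch of an explicit im2col transform for an NHWC input.
 * Stride 1, no padding; names and layout are illustrative assumptions. */
#include <stddef.h>

/* in : H x W x C (one image, NHWC without the batch dimension)
 * col: (HO*WO) x (KH*KW*C), row-major, written sequentially */
void im2col_nhwc(int H, int W, int C, int KH, int KW,
                 const float *in, float *col)
{
    int HO = H - KH + 1, WO = W - KW + 1;
    for (int ho = 0; ho < HO; ++ho)
        for (int wo = 0; wo < WO; ++wo)
            for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw)
                    for (int c = 0; c < C; ++c)
                        *col++ = in[((size_t)(ho + kh) * W + (wo + kw)) * C + c];
}
```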