2023
DOI: 10.1016/j.sysarc.2022.102806
Reformulating the direct convolution for high-performance deep learning inference on ARM processors

Cited by 17 publications (21 citation statements)
References 8 publications
“…In previous work [19], we combined the blocking strategy in [4] for the direct convolution algorithm with the packing schemes employed in the high-performance formulation of gemm [20]. The result was a new blocked version of the direct convolution, referred to as ConvDirect and illustrated by the algorithm in Listing 2, with the following properties:…”
Section: Blocked Algorithm for Direct Convolution
Citation type: mentioning (confidence: 99%)
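To make the structure described in this excerpt concrete, the following is a minimal sketch of a blocked direct convolution over the NHWC layout. It is not the ConvDirect code from [19]: the loop order, the blocking and micro-tile factors (KC, MR, NR), and all identifiers are illustrative assumptions, and the remainder loops are omitted.

```c
/* Minimal sketch of a blocked direct convolution over NHWC data.
 * NOT the ConvDirect code from [19]: blocking factors, loop order and
 * identifiers are illustrative. Stride 1, no padding, fp32, and the
 * remainder loops (when MR, NR or KC do not divide the dims) are omitted. */
#include <stddef.h>

enum { MR = 4,   /* micro-tile of output pixels   (assumed)               */
       NR = 4,   /* micro-tile of output channels (assumed)               */
       KC = 64   /* cache-blocking factor on the input channels (assumed) */ };

/* Tiny strided "micro-kernel": accumulates an MR x NR output tile for one
 * (kh, kw) filter offset over kc input channels. A real implementation is an
 * architecture-specific SIMD kernel operating on packed operands. */
static void micro_kernel(int kc, const float *in, int in_stride,
                         const float *flt, int flt_stride,
                         float *out, int out_stride)
{
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j) {
            float acc = out[i * out_stride + j];
            for (int c = 0; c < kc; ++c)
                acc += in[i * in_stride + c] * flt[c * flt_stride + j];
            out[i * out_stride + j] = acc;
        }
}

/* in : N x H  x W  x C   (NHWC)
 * flt: KH x KW x C x K   (filter)
 * out: N x HO x WO x K   (NHWC), assumed zero-initialised */
void conv_direct_blocked(int N, int H, int W, int C, int K, int KH, int KW,
                         const float *in, const float *flt, float *out)
{
    int HO = H - KH + 1, WO = W - KW + 1;
    for (int n = 0; n < N; ++n)
      for (int ho = 0; ho < HO; ++ho)
        for (int wo = 0; wo + MR <= WO; wo += MR)      /* tile of output pixels   */
          for (int k = 0; k + NR <= K; k += NR)        /* tile of output channels */
            for (int cb = 0; cb < C; cb += KC) {       /* channel blocking        */
              int kc = (C - cb < KC) ? (C - cb) : KC;
              for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw) {
                  const float *ip = in  + (((size_t)n * H + ho + kh) * W + wo + kw) * C + cb;
                  const float *fp = flt + (((size_t)kh * KW + kw) * C + cb) * K + k;
                  float       *op = out + (((size_t)n * HO + ho) * WO + wo) * K + k;
                  micro_kernel(kc, ip, C, fp, K, op, K);
                }
            }
}
```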
“…A significant key to attaining high performance in the blocked direct convolution lies in the utilisation of an architecture-specific micro-kernel. The decoupling of the micro-tile dimensions from the cache blocking parameters combined with the packing of the input tensor facilitates leveraging existing high-performance micro-kernels, specifically tuned for a concrete processor architecture [19]. The advantage of our approach is to directly handle the well-adopted NHWC data layout, avoiding the tensor transformation overhead of previous algorithm designs [4].…”
Section: Blocked Algorithm for Direct Convolution
Citation type: mentioning (confidence: 99%)
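The packing this excerpt refers to can be pictured with the sketch below, which copies an NHWC input block into MR-wide micro-panels so that a pre-existing, architecture-tuned micro-kernel can read contiguous data. The panel layout, the MR value, and the function name are assumptions made for illustration, not the paper's code.

```c
/* Sketch of packing an NHWC input block into MR-wide micro-panels so that a
 * pre-existing, architecture-tuned gemm micro-kernel can consume it directly.
 * Names and the exact panel layout are assumptions, not the paper's code. */
#include <stddef.h>

#define MR 4  /* micro-tile height used by the (hypothetical) micro-kernel */

/* Pack `mc` output pixels (along the W dimension) by `kc` input channels,
 * starting at base pointer `in` with a stride of `rs` floats between
 * consecutive output pixels (rs == C for a unit-stride NHWC convolution).
 * The packed buffer holds one MR x kc micro-panel after another, with the
 * MR elements of each channel contiguous -- the layout BLIS-like
 * micro-kernels expect for the "A" operand. */
void pack_input_panels(int mc, int kc, const float *in, int rs, float *packed)
{
    for (int i = 0; i < mc; i += MR) {
        int mr = (mc - i < MR) ? (mc - i) : MR;
        for (int c = 0; c < kc; ++c) {
            for (int r = 0; r < mr; ++r)
                *packed++ = in[(size_t)(i + r) * rs + c];
            for (int r = mr; r < MR; ++r)   /* zero-pad the last partial panel */
                *packed++ = 0.0f;
        }
    }
}
```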
“…Following the work from Zhang et al [28], Barrachina et al [3] propose two new direct-convolution algorithms for the NHWC layout (batch 𝑁, height 𝐻, width 𝑊, and channels 𝐶) on ARM processors. Like SConv, they tile in the channel dimension and use a BLAS micro-kernel.…”
Section: SConv Reduces Cache Misses in All Levels of Cache
Citation type: mentioning (confidence: 99%)
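Both works delegate the innermost computation to a BLAS(BLIS)-style micro-kernel. The portable C routine below is only a stand-in for what such a kernel computes on packed panels, a C(MR x NR) += A(MR x kc) * B(kc x NR) update; production kernels for ARM are written with NEON or SVE intrinsics or in assembly, and the names and tile sizes here are assumptions.

```c
/* Portable stand-in for a BLIS-style micro-kernel: C(MR x NR) += A * B over
 * packed panels. A holds kc slices of MR contiguous values, B holds kc slices
 * of NR contiguous values; `ldc` is the row stride of the output tile. */
enum { MR = 4, NR = 4 };   /* illustrative micro-tile sizes */

void ukernel_mrxnr(int kc, const float *a, const float *b, float *c, int ldc)
{
    float acc[MR][NR] = {{0.0f}};
    for (int p = 0; p < kc; ++p)            /* one rank-1 update per packed "k" */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    for (int i = 0; i < MR; ++i)            /* accumulate into the output tile  */
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```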
“…In previous work, direct convolution outperforms the traditional Im2Col followed by GEMM approach under certain conditions [3,28]. This paper presents SConv: a direct-convolution algorithm that uses architectural information to improve convolution's cache utilization and ISA extensions to accelerate data packing and computation, suitable for SIMD architectures.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…These implementations may often have different data layout requirements, which means that data reshape routines are often required to perform data permutations between the layout requirements of consecutive layers. A classic example is the need to perform the im2col transformation, either implicitly or explicitly [8,9], in order to leverage high-performance matrix multiplication routines. These routines need to support different layouts, and the routines that transform between layouts further increase the size of the code base that needs to be supported.…”
Section: Background, 2.1 Expert ML Libraries
Citation type: mentioning (confidence: 99%)
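The data-reshape overhead this excerpt mentions can be made concrete with an explicit im2col transform for NHWC data. The sketch below is illustrative only (the function name, layout and stride-1/no-padding assumptions are mine); after the transform, the convolution reduces to a single GEMM of the (HO*WO) x (KH*KW*C) matrix with a (KH*KW*C) x K filter matrix.

```c
/* Minimal sketch of an explicit im2col transform for an NHWC input.
 * Stride 1, no padding; names and layout are illustrative assumptions. */
#include <stddef.h>

/* in : H x W x C (one image, NHWC without the batch dimension)
 * col: (HO*WO) x (KH*KW*C), row-major, written sequentially */
void im2col_nhwc(int H, int W, int C, int KH, int KW,
                 const float *in, float *col)
{
    int HO = H - KH + 1, WO = W - KW + 1;
    for (int ho = 0; ho < HO; ++ho)
        for (int wo = 0; wo < WO; ++wo)
            for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw)
                    for (int c = 0; c < C; ++c)
                        *col++ = in[((size_t)(ho + kh) * W + (wo + kw)) * C + c];
}
```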