SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00069

Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures

Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computationally expensive nature of the convolution operation has led to a proliferation of implementations, including matrix-matrix multiplication formulations and direct convolutions primarily targeting GPUs. In this paper, we introduce direct convolution kernels fo…
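For orientation, the direct convolution that the abstract contrasts with the matrix-multiplication (GEMM) formulation can be written in the standard textbook form below. This is a sketch using conventional index names (N images, C input channels, K output channels, H x W input, R x S filters, F for the filter tensor), not notation taken from the paper itself:

    % Direct forward convolution, unit stride, no padding (standard form):
    % O is N x K x P x Q, I is N x C x H x W, F is K x C x R x S.
    O_{n,k,p,q} = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I_{n,c,p+r,q+s} \cdot F_{k,c,r,s}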

Cited by 98 publications (109 citation statements). References 15 publications.
“…Loop optimizations: unrolling [9,17,23,24,29,50,84,90], collapsing [4,6,7,13,20,21,44,54], splitting [22,28]. Blocking (tiling): in cache [14,15,18,20–22,27,39,44,52,54,69], in registers [68,69]. Compile-time optimizations: using pre-computed values [35,52], specifying array and loop bounds at compile time [6,54]. Compute-related optimizations: reusing intermediate variables [22,35], using the conflict-detection instruction of AVX-512 [52,85], performing redundant computation to avoid data communication or atomic operations [52,82]. Array transpose [6,79]…”
Section: Table 3, Optimization Strategies
confidence: 99%
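As a concrete illustration of one strategy from that table, blocking (tiling) in cache: the minimal C sketch below applies it to a plain square matrix multiplication. The function name, the TILE constant, and the row-major layout are all assumptions for illustration, not details from the cited papers.

    #include <stddef.h>

    #define TILE 64  /* hypothetical tile edge, tuned so a tile fits in cache */

    /* Cache-blocked matrix multiplication C += A * B, all n x n, row-major.
     * Iterating over TILE x TILE blocks keeps the working set resident in
     * cache across inner-loop reuse, which is the "blocking (tiling) in
     * cache" strategy listed above. Caller must zero C beforehand if a
     * plain product is wanted. */
    void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }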
“…Georganas et al 69 note that implementing convolution as GEMM leads to large memory footprint and memory bandwidth dependency. Hence, they implement convolution using direct convolution which avoids expensive memory accesses due to shuffle/scatter/gather operations.…”
Section: Machine Learningmentioning
confidence: 99%
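A rough sketch of why the GEMM (im2col) formulation inflates the memory footprint: im2col copies every R x S input patch into its own matrix column, so each input element is duplicated up to R*S times. The layer dimensions below are made up for illustration and do not come from the paper.

    #include <stdio.h>

    int main(void) {
        long C = 64, H = 56, W = 56;        /* input channels, spatial size */
        long R = 3, S = 3;                  /* filter height and width      */
        long P = H - R + 1, Q = W - S + 1;  /* output size, stride 1, no pad */

        long input_elems  = C * H * W;
        /* The im2col buffer is a (C*R*S) x (P*Q) matrix: */
        long im2col_elems = (C * R * S) * (P * Q);

        printf("input:  %ld elements\n", input_elems);
        printf("im2col: %ld elements (%.1fx larger)\n",
               im2col_elems, (double)im2col_elems / input_elems);
        return 0;
    }

For these sizes the im2col buffer is roughly 8x the input, and every element of it must be written and re-read, which is the bandwidth cost the citation refers to.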
“…where • denotes scalar multiplication, and O, I, and K are all 4-dimensional arrays of scalars. The resulting code is little more than 7 nested loops around a multiply-accumulate operation, but array layout, vectorization, parallelization and caching are extremely important for performance [5].…”
Section: New Ideas Often Require New Primitives
confidence: 99%
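The seven loops that quote describes look roughly like this in C. This is a sketch assuming NCHW layout, unit stride, no padding, and a zero-initialized output; the array names O, I, K follow the quote, while everything else (function name, dimension parameters) is an assumption for illustration.

    /* Direct convolution as 7 nested loops around a multiply-accumulate:
     * O[n][k][p][q] += I[n][c][p+r][q+s] * K[k][c][r][s]
     * N images, Cin/Cout channels, H x W input, R x S filters. */
    void conv_direct(int N, int Cout, int Cin, int H, int W, int R, int S,
                     const float *I, const float *K, float *O) {
        int P = H - R + 1, Q = W - S + 1;        /* output spatial size */
        for (int n = 0; n < N; n++)              /* 1: minibatch        */
          for (int k = 0; k < Cout; k++)         /* 2: output channels  */
            for (int p = 0; p < P; p++)          /* 3: output rows      */
              for (int q = 0; q < Q; q++)        /* 4: output columns   */
                for (int c = 0; c < Cin; c++)    /* 5: input channels   */
                  for (int r = 0; r < R; r++)    /* 6: filter rows      */
                    for (int s = 0; s < S; s++)  /* 7: filter columns   */
                      O[((n*Cout + k)*P + p)*Q + q] +=
                          I[((n*Cin + c)*H + (p + r))*W + (q + s)] *
                          K[((k*Cin + c)*R + r)*S + s];
    }

As the quoted passage stresses, this naive loop order is only a starting point: reordering and blocking the loops, choosing a SIMD-friendly data layout, and parallelizing across images and channels are what actually determine performance, which is the subject of the paper above.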