SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00069

Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures

Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computationally expensive nature of the convolution operation has led to a proliferation of implementations, including matrix-matrix multiplication formulations and direct convolutions primarily targeting GPUs. In this paper, we introduce direct convolution kernels fo…
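For orientation, the direct convolution that the abstract contrasts with the matrix-multiplication (GEMM) formulation can be written in the standard textbook form below. This is a sketch using conventional index names (N images, C input channels, K output channels, H x W input, R x S filters, F for the filter tensor), not notation taken from the paper itself:

    % Direct forward convolution, unit stride, no padding (standard form):
    % O is N x K x P x Q, I is N x C x H x W, F is K x C x R x S.
    O_{n,k,p,q} = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I_{n,c,p+r,q+s} \cdot F_{k,c,r,s}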

Cited by 98 publications (109 citation statements). References 15 publications.
“…Loop optimizations: unrolling [9,17,23,24,29,50,84,90], collapsing [4,6,7,13,20,21,44,54], splitting [22,28]. Blocking (tiling): in cache [14,15,18,20–22,27,39,44,52,54,69], in registers [68,69]. Compile-time optimizations: using pre-computed values [35,52], specifying array and loop bounds at compile time [6,54]. Compute-related optimizations: reusing intermediate variables [22,35], using the conflict-detection instruction of AVX-512 [52,85], performing redundant computation to avoid data communication or atomic operations [52,82]. Array transpose [6,79]…”
Section: Table 3, Optimization Strategies
confidence: 99%
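As a concrete illustration of one strategy from that table, blocking (tiling) in cache: the minimal C sketch below applies it to a plain square matrix multiplication. The function name, the TILE constant, and the row-major layout are all assumptions for illustration, not details from the cited papers.

    #include <stddef.h>

    #define TILE 64  /* hypothetical tile edge, tuned so a tile fits in cache */

    /* Cache-blocked matrix multiplication C += A * B, all n x n, row-major.
     * Iterating over TILE x TILE blocks keeps the working set resident in
     * cache across inner-loop reuse, which is the "blocking (tiling) in
     * cache" strategy listed above. Caller must zero C beforehand if a
     * plain product is wanted. */
    void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }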
“…Georganas et al 69 note that implementing convolution as GEMM leads to large memory footprint and memory bandwidth dependency. Hence, they implement convolution using direct convolution which avoids expensive memory accesses due to shuffle/scatter/gather operations.…”
Section: Machine Learningmentioning
confidence: 99%
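A rough sketch of why the GEMM (im2col) formulation inflates the memory footprint: im2col copies every R x S input patch into its own matrix column, so each input element is duplicated up to R*S times. The layer dimensions below are made up for illustration and do not come from the paper.

    #include <stdio.h>

    int main(void) {
        long C = 64, H = 56, W = 56;        /* input channels, spatial size */
        long R = 3, S = 3;                  /* filter height and width      */
        long P = H - R + 1, Q = W - S + 1;  /* output size, stride 1, no pad */

        long input_elems  = C * H * W;
        /* The im2col buffer is a (C*R*S) x (P*Q) matrix: */
        long im2col_elems = (C * R * S) * (P * Q);

        printf("input:  %ld elements\n", input_elems);
        printf("im2col: %ld elements (%.1fx larger)\n",
               im2col_elems, (double)im2col_elems / input_elems);
        return 0;
    }

For these sizes the im2col buffer is roughly 8x the input, and every element of it must be written and re-read, which is the bandwidth cost the citation refers to.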
“…where • denotes scalar multiplication, and O, I, and K are all 4-dimensional arrays of scalars. The resulting code is little more than 7 nested loops around a multiply-accumulate operation, but array layout, vectorization, parallelization and caching are extremely important for performance [5].…”
Section: New Ideas Often Require New Primitives
confidence: 99%
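The seven loops that quote describes look roughly like this in C. This is a sketch assuming NCHW layout, unit stride, no padding, and a zero-initialized output; the array names O, I, K follow the quote, while everything else (function name, dimension parameters) is an assumption for illustration.

    /* Direct convolution as 7 nested loops around a multiply-accumulate:
     * O[n][k][p][q] += I[n][c][p+r][q+s] * K[k][c][r][s]
     * N images, Cin/Cout channels, H x W input, R x S filters. */
    void conv_direct(int N, int Cout, int Cin, int H, int W, int R, int S,
                     const float *I, const float *K, float *O) {
        int P = H - R + 1, Q = W - S + 1;        /* output spatial size */
        for (int n = 0; n < N; n++)              /* 1: minibatch        */
          for (int k = 0; k < Cout; k++)         /* 2: output channels  */
            for (int p = 0; p < P; p++)          /* 3: output rows      */
              for (int q = 0; q < Q; q++)        /* 4: output columns   */
                for (int c = 0; c < Cin; c++)    /* 5: input channels   */
                  for (int r = 0; r < R; r++)    /* 6: filter rows      */
                    for (int s = 0; s < S; s++)  /* 7: filter columns   */
                      O[((n*Cout + k)*P + p)*Q + q] +=
                          I[((n*Cin + c)*H + (p + r))*W + (q + s)] *
                          K[((k*Cin + c)*R + r)*S + s];
    }

As the quoted passage stresses, this naive loop order is only a starting point: reordering and blocking the loops, choosing a SIMD-friendly data layout, and parallelizing across images and channels are what actually determine performance, which is the subject of the paper above.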