Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
DOI: 10.1109/cgo.2013.6494986
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Abstract: In this paper, we present an approach to estimate GPU applications' performance upper bound based on algorithm analysis and assembly code level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of Fermi (Kepler) instruction set and the limited issue throughput of the schedu…
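The abstract's style of analysis (scaling a device's theoretical peak by what the instruction mix and issue throughput allow) can be sketched as follows. All device parameters and instruction counts below are illustrative assumptions for a Fermi-like GPU, not figures taken from the paper:

```python
# Sketch of an upper-bound estimate in the spirit of the paper:
# theoretical peak FLOPS, scaled by the fraction of issue slots that
# FFMA (fused multiply-add) instructions can occupy in the inner loop.
# All numeric values here are illustrative assumptions.

def peak_sp_gflops(sm_count, cores_per_sm, shader_clock_ghz):
    """Theoretical single-precision peak: each CUDA core retires one
    FMA (2 floating-point ops) per shader-clock cycle."""
    return sm_count * cores_per_sm * shader_clock_ghz * 2.0

def sgemm_upper_bound(peak_gflops, ffma_insts, aux_insts):
    """Scale the peak by the share of issued instructions that are
    FFMAs; loads, address arithmetic, and branches take the rest."""
    return peak_gflops * ffma_insts / (ffma_insts + aux_insts)

# Assumed Fermi-like configuration: 16 SMs x 32 cores at 1.544 GHz.
peak = peak_sp_gflops(sm_count=16, cores_per_sm=32, shader_clock_ghz=1.544)

# Assume 512 FFMAs and 170 auxiliary instructions per inner-loop body.
bound = sgemm_upper_bound(peak, ffma_insts=512, aux_insts=170)

print(f"peak = {peak:.0f} GFLOPS, SGEMM bound = {bound:.0f} GFLOPS")
```

The paper's actual analysis works at the assembly (SASS) level and also accounts for scheduler issue-throughput limits; this sketch only captures the instruction-mix part of the argument.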

Cited by 37 publications (2 citation statements); references 16 publications.
“…From a scientific perspective, several works have previously published auto-tuning and optimization approaches for dense matrix-matrix multiplications [7,8,10,11,17,21]. In fact, the GEMM kernel in CLBlast is based on and evolved from the work by Matsumoto et al. [10].…”
Section: Related Work
confidence: 99%
“…The Larrabee [61] threading and vectorization model allowed SIMD rebundling to maintain task efficiency. Current GPUs offer a large number of hardware threads, yet relying solely on thread-level parallelism is insufficient [72], and taking advantage of ILP and MLP is critical for GPU assembly-optimized libraries [35,45].…”
Section: MLP Improvements
confidence: 99%