Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
DOI: 10.1109/cgo.2013.6494986
Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

Abstract: In this paper, we present an approach to estimate GPU applications' performance upper bound based on algorithm analysis and assembly code level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of Fermi (Kepler) instruction set and the limited issue throughput of the schedu…
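The abstract's style of analysis (scaling a device's theoretical peak by what the instruction mix and issue throughput allow) can be sketched as follows. All device parameters and instruction counts below are illustrative assumptions for a Fermi-like GPU, not figures taken from the paper:

```python
# Sketch of an upper-bound estimate in the spirit of the paper:
# theoretical peak FLOPS, scaled by the fraction of issue slots that
# FFMA (fused multiply-add) instructions can occupy in the inner loop.
# All numeric values here are illustrative assumptions.

def peak_sp_gflops(sm_count, cores_per_sm, shader_clock_ghz):
    """Theoretical single-precision peak: each CUDA core retires one
    FMA (2 floating-point ops) per shader-clock cycle."""
    return sm_count * cores_per_sm * shader_clock_ghz * 2.0

def sgemm_upper_bound(peak_gflops, ffma_insts, aux_insts):
    """Scale the peak by the share of issued instructions that are
    FFMAs; loads, address arithmetic, and branches take the rest."""
    return peak_gflops * ffma_insts / (ffma_insts + aux_insts)

# Assumed Fermi-like configuration: 16 SMs x 32 cores at 1.544 GHz.
peak = peak_sp_gflops(sm_count=16, cores_per_sm=32, shader_clock_ghz=1.544)

# Assume 512 FFMAs and 170 auxiliary instructions per inner-loop body.
bound = sgemm_upper_bound(peak, ffma_insts=512, aux_insts=170)

print(f"peak = {peak:.0f} GFLOPS, SGEMM bound = {bound:.0f} GFLOPS")
```

The paper's actual analysis works at the assembly (SASS) level and also accounts for scheduler issue-throughput limits; this sketch only captures the instruction-mix part of the argument.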

Cited by 37 publications (2 citation statements); references 16 publications.
“…From a scientific perspective, several works have previously published auto-tuning and optimization approaches for dense matrix-matrix multiplications [7,8,10,11,17,21]. In fact, the GEMM kernel in CLBlast is based on and evolved from the work by Matsumoto et al. [10].…”
Section: Related Work
confidence: 99%
“…The Larrabee [61] threading and vectorization model allowed SIMD rebundling to maintain task efficiency. Current GPUs offer a large number of hardware threads, yet relying solely on thread-level parallelism is insufficient [72], and taking advantage of ILP and MLP is critical for GPU assembly-optimized libraries [35,45].…”
Section: MLP Improvements
confidence: 99%