2016
DOI: 10.1007/s11227-015-1613-7
A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Abstract: Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix-matrix multiplication. A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are found by using search methods and empirical techniques. To overcome thi…

Cited by 18 publications (10 citation statements)
References 63 publications
“…We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size and are not designed for pointwise convolutions with small inputs.…”
Section: Optimizing Pointwise Convolution
confidence: 98%
“…However, loop interchange and blocking exploit data reuse and achieve much better performance than the basic and transposed methods, as shown in Figure 2(c) and Figure 2(d), respectively. MMM speedup has been the major goal of many studies [8], [11]–[15] and is still an active topic today. BLAS [13], [16] is a set of basic linear algebra subprograms that provides a standard blocking method for matrix multiplication.…”
Section: Related Work
confidence: 99%
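The blocking (loop tiling) technique mentioned in the excerpt above can be sketched as follows. This is a minimal illustration, not code from the paper; the matrix size N and tile size TILE are arbitrary illustrative values (real tuning depends on the cache hierarchy, which is the paper's subject):

```c
#include <assert.h>
#include <string.h>

#define N 64      /* matrix dimension (illustrative) */
#define TILE 16   /* tile size; must divide N in this simple sketch */

/* Naive triple loop: C = A * B. The full row of A and column of B
   are streamed for every output element, so reuse in cache is poor. */
static void mmm_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Tiled (blocked) version: iterate over TILE x TILE sub-blocks so the
   working set of the inner kernel (one tile from each matrix) can stay
   resident in cache while it is reused. */
static void mmm_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k]; /* scalar reused across j */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Both routines compute the same product; the tiled version only reorders the iteration space to improve temporal locality, which is why the excerpt groups it with loop interchange as a data-reuse optimization.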
“…Many researchers have worked on high-performance implementations of MMM [4], [6], [7]. Some implementations target CPU platforms [8], [9], while others target graphics processing unit (GPU) platforms. Several software optimization techniques apply to both CPU and GPU implementations, such as instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP) [8].…”
Section: Introduction
confidence: 99%
“…Many research works, as well as ATLAS [51] (one of the state-of-the-art high-performance libraries), apply loop tiling by taking into account only the cache size: the accumulated size of three rectangular tiles (one from each matrix) must be smaller than or equal to the cache size. However, the elements of these tiles are not stored in consecutive main memory locations (the elements of each tile sub-row lie in different main memory locations), so they do not occupy consecutive data cache locations; with a set-associative cache, the three tiles therefore cannot simultaneously fit in the data cache due to the cache modulo effect. Moreover, even if the tile elements are stored in consecutive main memory locations (a different data array layout), the three tiles still cannot simultaneously fit in the data cache if the cache is two-way associative or direct mapped [52], [53]. Thus, loop tiling is efficient only when cache size, cache associativity, and data array layouts are addressed together as one problem, not separately.…”
Section: Loop Tiling and Data Array Layouts
confidence: 99%
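The cache-size-only tile criterion this excerpt attributes to ATLAS-style tiling can be made concrete with a small sketch. The helper name and the 32 KiB cache size below are illustrative assumptions, not values from the paper; the point of the excerpt is precisely that this criterion alone is too optimistic:

```c
#include <assert.h>

/* Hypothetical helper (not from the paper): the largest square tile
   size T, in elements, such that three T x T double-precision tiles --
   one each from A, B, and C -- fit together in a cache of cache_bytes.
   It deliberately ignores cache associativity and data array layout,
   the factors the cited work argues must be modeled jointly. */
static int max_tile_cache_only(long cache_bytes) {
    int t = 0;
    while (3L * (t + 1) * (t + 1) * (long)sizeof(double) <= cache_bytes)
        t++;
    return t;
}
```

For a 32 KiB data cache this bound gives T = 36 (3 × 36² × 8 = 31104 bytes ≤ 32768). Per the excerpt, such a T may still thrash a set-associative or direct-mapped cache, because tile sub-rows map to scattered cache sets (the cache modulo effect).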