2020
DOI: 10.1145/3378176
Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Abstract: We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine learning-based algorithm selector are proposed to fully exploit the computing capability of the SW26010 and to cope with the sophisticated algorithmic characteristics of batched matrix multiplications. Experimental results show that the algorithm selector is able to adaptively choose the appropriate algorithm for various matrix sha…
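To make the operation being optimized concrete: a batched matrix multiplication computes an independent product C[i] = A[i] · B[i] for every matrix pair in a batch, which is why small-matrix workloads benefit from specialized kernels. The sketch below is a minimal plain-Python illustration of the semantics only; the function name and structure are illustrative and have no relation to the paper's SW26010 kernels.

```python
def batched_matmul(As, Bs):
    """Illustrative batched matrix multiply: C[i] = A[i] @ B[i].

    As, Bs: lists of row-major matrices (lists of lists) of matching
    inner dimensions per pair. Returns the list of products.
    """
    Cs = []
    for A, B in zip(As, Bs):
        k, n = len(B), len(B[0])  # inner dimension and output columns
        C = [[sum(A[r][t] * B[t][c] for t in range(k)) for c in range(n)]
             for r in range(len(A))]
        Cs.append(C)
    return Cs
```

On real hardware the point is that each product is too small to saturate the machine on its own, so an optimized kernel processes the whole batch cooperatively rather than looping as above.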

Cited by 10 publications (2 citation statements)
References 44 publications (71 reference statements)
“…We note that previous studies [31], [32], [33], [34], [35], [36], [37], [38] have exploited tiling and autotuning for convolution and GEMM operations. However, these prior methods are inadequate for pointwise convolutions on GPUs due to two main drawbacks: they do not consider SM utilization when choosing the optimal tile size and are not designed for pointwise convolutions with small inputs.…”
Section: Optimizing Pointwise Convolution
confidence: 97%
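The tiling the cited studies autotune can be sketched in a few lines. This is a generic loop-tiled GEMM, not any of the cited implementations; the tile size TS is a hypothetical tunable parameter that, per the citing authors' critique, should be chosen with SM utilization in mind on a GPU rather than by tile-local heuristics alone.

```python
def tiled_matmul(A, B, TS=2):
    """Loop-tiled GEMM sketch: C = A @ B with TS x TS x TS blocking.

    Tiling reorders the triple loop so each block of A, B, and C is
    reused while it is hot in fast memory; behavior is identical to
    the untiled product.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TS):
        for j0 in range(0, n, TS):
            for t0 in range(0, k, TS):
                for i in range(i0, min(i0 + TS, m)):
                    for j in range(j0, min(j0 + TS, n)):
                        for t in range(t0, min(t0 + TS, k)):
                            C[i][j] += A[i][t] * B[t][j]
    return C
```

Autotuners search over TS (and loop order); the critique quoted above is that for small pointwise-convolution inputs, the best tile by this local measure can still leave most SMs idle.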
“…This rank-3 formulation would open up interesting opportunities from a computational standpoint, since one can draw ideas from the advances on tensor algebra taking place within the deep learning community, where rank-3 tensors are at the core of formulations. Of particular interest are algorithmic developments for batched matrix multiplication kernels [37,1,25,20] and hardware innovations such as tensor cores [26] and tensor processing units [21]. Since this is outside the scope of this work, we omit a full discussion on it and reserve it for a future work.…”
Section: Rank-2 Formulation Analysis
confidence: 99%
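The computational opportunity the authors allude to is that a rank-3 formulation maps directly onto batched or flattened GEMM primitives. A minimal sketch of one such mapping, under the assumption of a shared right-hand operand W (all names here are illustrative, not from the cited work): the contraction C[b,i,j] = Σ_t T[b,i,t] · W[t,j] can be flattened into a single large 2-D product and reshaped back.

```python
def rank3_as_gemm(T, W):
    """Flatten a rank-3 contraction C[b,i,j] = sum_t T[b,i,t] * W[t,j]
    into one 2-D GEMM of shape (b*m, k) x (k, n), then reshape to (b, m, n).
    """
    flat = [row for mat in T for row in mat]  # stack batch rows: (b*m) x k
    k, n = len(W), len(W[0])
    out = [[sum(r[t] * W[t][c] for t in range(k)) for c in range(n)]
           for r in flat]
    m = len(T[0])  # rows per batch entry, used to reshape back
    return [out[b * m:(b + 1) * m] for b in range(len(T))]
```

This flattening is one reason rank-3 formulations can inherit the throughput of highly tuned GEMM libraries and of hardware such as tensor cores.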