2020
DOI: 10.1016/j.asej.2020.01.003

Optimizing matrix-matrix multiplication on Intel’s Advanced Vector Extensions multicore processor

Citations: cited by 9 publications (4 citation statements)
References: 12 publications
“…Robust performance benchmarking is critical for the evaluation of vector extensions. While there is extensive performance evaluation of matrix multiplication on vector extensions for Intel architectures [25, 26, 27], to the best of our knowledge, similar studies do not exist for the PowerPC or Arm platforms. Moreover, the introduction of matrix engines is recent in all platforms and therefore only simulated or theorized performance estimates exist for AMX, SVE, or MMA [28, 24].…”
Section: Related Work (mentioning)
confidence: 99%
“…They investigated various vectorization options and available compilers to investigate the impacts on the performances of different algorithms. The authors of [9] explored the optimization of matrix multiplication. The other general concept is sorting, a fast vectorized implementation presented in the paper [10].…”
Section: Introduction (mentioning)
confidence: 99%
“…With the explosive growth of data quantity and high-flux/high-performance computing requiring urgent, a single chip tends to integrate multiple processor cores [1–3]. There is a need for a large amount of data communication within the kernel; as is often the case, the kernel passes through the bus routes between mutual connections, but this bus connection mode needs to manage the clock synchronization [4].…”
Section: Introduction (mentioning)
confidence: 99%