2012
DOI: 10.1145/2381056.2381073

Optimizing matrix transposes using a POWER7 cache model and explicit prefetching

Abstract: We consider the problem of efficiently computing matrix transposes on the POWER7 architecture. We develop a matrix transpose algorithm that uses cache blocking, cache prefetching and data alignment. We model the POWER7 data cache and memory concurrency and use the model to predict the memory throughput of the proposed matrix transpose algorithm. The performance of our matrix transpose algorithm is up to five times higher than that of the dgetmo routine of the Engineering and Scientific Subroutine Library and i…
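
To make the terms "cache blocking" and "cache prefetching" in the abstract concrete, here is a minimal C sketch of a tiled out-of-place transpose with a software prefetch hint. It is not the paper's algorithm: the function name transpose_blocked, the tile edge TILE = 8, and the choice to prefetch the next source row of the current tile are all assumptions made for illustration, and nothing here is tuned to POWER7. The prefetch uses the GCC/Clang builtin __builtin_prefetch and is compiled out on other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge. A real implementation would derive this from the
 * cache line size and cache capacity of the target machine. */
enum { TILE = 8 };

/* Out-of-place transpose of a row-major matrix.
 * src has `rows` x `cols` elements; dst receives `cols` x `rows` elements.
 * The two buffers must not overlap. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    assert(src != NULL && dst != NULL);

    /* Walk the matrix tile by tile so that the rows read from src and the
     * rows written to dst both stay cache-resident while a tile is handled. */
    for (size_t ib = 0; ib < rows; ib += TILE) {
        size_t imax = (ib + TILE < rows) ? ib + TILE : rows;
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t jmax = (jb + TILE < cols) ? jb + TILE : cols;
            for (size_t i = ib; i < imax; ++i) {
#if defined(__GNUC__)
                /* Hint that the next source row of this tile will be read
                 * soon (assumed prefetch distance of one row). */
                if (i + 1 < imax)
                    __builtin_prefetch(&src[(i + 1) * cols + jb], 0, 3);
#endif
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
            }
        }
    }
}
```

Calling transpose_blocked(a, b, 2, 3) on a 2 x 3 matrix stored row-major in a fills b with the 3 x 2 transpose. Data alignment, the third technique the abstract names, is left out of this sketch; it would concern how the src and dst buffers are allocated rather than the loop structure.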

Cited by 11 publications (5 citation statements)
References 5 publications
“…Based on the model, we have designed a matrix transpose code whose memory bandwidth is higher than that of the dgetmo routine. The full paper can be found in [2].…”
Section: Discussion (mentioning)
confidence: 99%
“…Two-dimensional tensor transposition (i.e., matrix transposition) is a well studied operation, including optimizations for blocking, vectorization, unrolling, and software prefetching [3,6,11,13,14,25]. The same optimizations are investigated in the context three-dimensional out-of-place tensor transpositions on CPUs [10,22].…”
Section: Related Work (mentioning)
confidence: 99%
“…10 To assess the performance of TTC across a wide range of use cases, we report TTC's bandwidth on a synthetic tensor transpositions benchmark [14]. 11 The benchmark comprises a total of 57 transpositions ranging from 2D to 6D; each tensor of the benchmark is of size 200 MB.…”
Section: Performance Evaluation (mentioning)
confidence: 99%
“…10 Linux applies the first touch policy, meaning that data is allocated close to the thread which touches the data first, not the thread who allocates the data. 11 The complete benchmark is available at www.github.com/HPAC/TTC/tree/master/benchmark formance for HSW (Fig. 5b) and M840 (Fig.…”
Section: Performance Evaluation (mentioning)
confidence: 99%