Optimizing matrix transposes using a POWER7 cache model and explicit prefetching

Mateescu, Gabriel; Bauer, Gregory H.; Fiedler, Robert

doi:10.1145/2381056.2381073

Cited by 11 publications

(5 citation statements)

References 5 publications


“…Based on the model, we have designed a matrix transpose code whose memory bandwidth is higher than that of the dgetmo routine. The full paper can be found in [2].…”

Section: Discussion (mentioning)

confidence: 99%

Optimizing matrix transposes using a POWER7 cache model and explicit prefetching

Mateescu

Bauer

Fiedler

2011

Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Compu

Self Cite



“…Two-dimensional tensor transposition (i.e., matrix transposition) is a well studied operation, including optimizations for blocking, vectorization, unrolling, and software prefetching [3,6,11,13,14,25]. The same optimizations are investigated in the context three-dimensional out-of-place tensor transpositions on CPUs [10,22].…”

Section: Related Work (mentioning)

confidence: 99%

HPTT: a high-performance tensor transposition C++ library

Springer

Bientinesi

2017

Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming


Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multithreading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4 × 4 transpose). HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1×.
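
The "loops around a micro-kernel" structure mentioned in the abstract can be illustrated for the 2D case. The following is a minimal sketch only, not HPTT's actual code: the names `transpose_microkernel` and `blocked_transpose`, the flat row-major list layout, and the default block size of 4 are all assumptions made for illustration. A real implementation would replace the scalar inner loops with hand-vectorized instructions.

```python
def transpose_microkernel(src, dst, rows, cols, i0, j0, bi, bj):
    """Transpose one bi x bj tile of a row-major rows x cols matrix.

    Hypothetical helper: in a tuned library this tile would be handled
    by architecture-specific vector code.
    """
    for i in range(i0, i0 + bi):
        for j in range(j0, j0 + bj):
            # Element (i, j) of src becomes element (j, i) of dst,
            # where dst is stored as a row-major cols x rows matrix.
            dst[j * rows + i] = src[i * cols + j]


def blocked_transpose(src, rows, cols, block=4):
    """Out-of-place transpose built from loops around the micro-kernel."""
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, block):
        bi = min(block, rows - i0)  # partial tile at the bottom edge
        for j0 in range(0, cols, block):
            bj = min(block, cols - j0)  # partial tile at the right edge
            transpose_microkernel(src, dst, rows, cols, i0, j0, bi, bj)
    return dst
```

Because only the tile routine touches individual elements, porting such a design to another architecture would, in principle, mean swapping out that one function while the outer blocking loops stay unchanged.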


“…10 To assess the performance of TTC across a wide range of use cases, we report TTC's bandwidth on a synthetic tensor transpositions benchmark [14]. 11 The benchmark comprises a total of 57 transpositions ranging from 2D to 6D; each tensor of the benchmark is of size 200 MB.…”

Section: Performance Evaluation (mentioning)

confidence: 99%

“…10 Linux applies the first touch policy, meaning that data is allocated close to the thread which touches the data first, not the thread who allocates the data. 11 The complete benchmark is available at www.github.com/HPAC/TTC/tree/master/benchmark … formance for HSW (Fig. 5b) and M840 (Fig.…”

Section: Performance Evaluation (mentioning)

confidence: 99%

“…Due to the non-contiguous memory access patterns and the vast number of architecture-specific optimizations required by modern vector processors (e.g., vectorization, blocking for caches, nonuniform memory accesses (NUMA)), writing high-performance tensor transpositions is a challenging task. Until now, many research efforts focused on 2D [5,9,11,12,18] and 3D transpositions [8,15], while higher dimensional transpositions [10,19] are mostly still uncovered.…”

Section: Introduction (mentioning)

confidence: 99%


TTC: a tensor transposition compiler for multiple architectures

Springer

Sankaran

Bientinesi

2016

Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming


We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8× and 32×, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

