Cache‐oblivious matrix algorithms in the age of multicores and many cores

Heinecke, Alexander; Trinitis, Carsten

doi:10.1002/cpe.2974

Cited by 6 publications

(4 citation statements)

References 19 publications

“…Our algorithm FUR-Hilbert, as discussed in Section 5, is implemented in C++ and compiled with gcc version 4.9.2. We compare our algorithm to the algorithm "TifaMMy" for matrix multiplication based on the Peano curve introduced by Bader et al. [11], [12] (the source code was obtained by the authors and compiled with icc version 16.0.3). Furthermore, we compare our algorithm to the Intel MKL library (https://software.intel.com/en-us/intel-mkl) version 11.3 (operation: dgemm), which is hardware- and hand-optimized specifically for Intel processors.…”

Section: Matrix Multiplication (mentioning)

confidence: 99%

“…The cache hit rate for "perf" is calculated as: 1 − (cache-misses:u / cache-references:u). We use the maximum number of threads (12) for the variation in problem size, and matrices of size 10 000 are processed for the variation of threads. Figure 10 illustrates the cache hit rate for each cache level and the cache hit rate for the entire cache hierarchy.…”

Section: Cache Hierarchy On Matrix Multiplication (mentioning)

confidence: 99%
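The hit-rate formula in the statement above is plain arithmetic on two hardware counters: one minus the ratio of cache misses to cache references. A minimal Python sketch follows, assuming the two counts have already been read from `perf` output. The function name `cache_hit_rate` and the sample counts are hypothetical and do not come from the cited paper.

```python
def cache_hit_rate(cache_misses: int, cache_references: int) -> float:
    """Return 1 - misses / references, the hit rate described in the statement.

    Both arguments are raw event counts (for example the values reported
    for the cache-misses:u and cache-references:u events).
    """
    if cache_references <= 0:
        raise ValueError("cache_references must be positive")
    if not 0 <= cache_misses <= cache_references:
        raise ValueError("cache_misses must lie between 0 and cache_references")
    return 1.0 - cache_misses / cache_references


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(cache_hit_rate(cache_misses=250, cache_references=1000))
```

The range checks are a design choice of this sketch: they reject inconsistent counter readings instead of returning a hit rate outside the interval from 0 to 1.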

“…In [12], [19], Bader et al. present variants of their algorithms for matrix operations. As in [11], [13], the general algorithmic scheme is recursive partitioning according to the Peano curve.…”

Section: Optimized Techniques For Specific Tasks or Hardware (mentioning)

confidence: 99%

A Novel Hilbert Curve for Cache-Locality Preserving Loops

Böhm

Perdacher

Plant

2021

IEEE Trans. Big Data

Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core but small. Only if the data accesses have high locality can we avoid excessive data transfers between the levels of the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition, and the algorithm by Floyd and Warshall, which typically operate in two or three nested loops. We propose to traverse these loops, whenever possible, not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality over a wide range of granularities, which makes it possible to efficiently support not only a cache of a single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) We overcome the usual limitation to square-like grid sizes where the side length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst-case time complexity per loop iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition as well as the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage, and energy consumption demonstrate the benefit of our approach.
We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatically or by a user-guided analysis of the data dependencies.
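To make the idea of traversing a loop nest along a space-filling curve concrete, here is a minimal Python sketch. It uses the classic textbook Hilbert index-to-coordinate conversion, not the FUR-Hilbert curve of the abstract, so it carries exactly the limitations the abstract describes for previous methods: the grid side must be a power of 2, and each iteration costs O(log n) work. The names `hilbert_d2xy` and `matmul_hilbert` are hypothetical, and plain Python lists are used only to show the traversal order, not to measure cache behaviour.

```python
def hilbert_d2xy(n: int, d: int) -> tuple[int, int]:
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    Classic bit-manipulation conversion; n must be a power of 2.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate (and possibly flip) the quadrant built so far.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def matmul_hilbert(a, b):
    """Multiply two n x n matrices, visiting (i, j) in Hilbert order.

    Assumption of this sketch: n is a power of 2. Consecutive (i, j)
    pairs are grid neighbours, so rows of a and columns of b are reused
    soon after they were last touched.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("matrix side length must be a power of 2")
    c = [[0] * n for _ in range(n)]
    for d in range(n * n):
        i, j = hilbert_d2xy(n, d)
        c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
    return c


if __name__ == "__main__":
    # Print the visiting order of the outer two loops on a 4 x 4 grid.
    print([hilbert_d2xy(4, d) for d in range(16)])
```

Only the two outer loops follow the curve here, and the innermost `k` loop stays canonical. That is a simplification chosen for brevity, not a description of how the cited paper structures its loops.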

“…‘Cache‐oblivious Matrix Algorithms in the Age of Multi‐ and Many‐Cores’ by Alexander Heinecke and Carsten Trinitis highlights the issue of increasing vector unit width that goes along with increasing core counts on x86 processor architectures. To demonstrate this, a cache‐oblivious numerical code has been ported to and optimized on four contemporary x86 architectures representing vector unit widths from 128 to 512 bits.…”

Section: This Special Issue (mentioning)

confidence: 99%