Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476167
On the parallel I/O optimality of linear algebra kernels

Abstract: Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedul…

Cited by 12 publications (13 citation statements)
References 53 publications
“…X − S (Lemma 2 in Kwasniewski et al. [27]). The expression ρ = χ(X)/(X − S) is called the computational intensity.…”
Section: I/O Lower Bounds (mentioning)
confidence: 93%
“…To bound the sizes of rectangular subcomputations, we use two lemmas given by Kwasniewski et al. [27]: (Lemma in [27]) For statement St, given D, the size of subcomputation H (the number of vertices of S computed during H) is bounded by the sizes of the iteration variables' sets D_t, t = 1, …”
Section: Definitions (mentioning)
confidence: 99%
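The lemma quoted above bounds a subcomputation's size by the sizes of its per-variable iteration sets: if the t-th loop variable takes values only in D_t during H, then H fits inside the Cartesian product of the D_t. A small illustrative check (the function name and example points are hypothetical, not from the cited paper):

```python
# Illustration of the bound |H| <= |D_1| * ... * |D_k|: a subcomputation H,
# viewed as a set of executed iteration vectors, is contained in the Cartesian
# product of its per-variable projections D_t = { x[t] : x in H }.

def iteration_set_bound(points):
    """Upper-bound |H| by the product of the per-variable projection sizes."""
    dims = len(next(iter(points)))
    sizes = [len({p[t] for p in points}) for t in range(dims)]
    bound = 1
    for s in sizes:
        bound *= s
    return bound

H = {(0, 0), (0, 1), (1, 0)}            # three iteration points of a 2D loop nest
assert len(H) <= iteration_set_bound(H)  # 3 <= 2 * 2
```

The bound is tight exactly when H is a full rectangular block, which is why it is useful for the rectangular subcomputations the quote refers to.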
“…Independently, using explicit enumeration of data reuse, Kwasniewski et al. [8] obtain a corresponding lower bound for LU factorization: their proof is in a parallel context, but their arguments show that the minimum number of data transfers is lower bounded by 2 √… lower bound, which however makes the implicit assumption that there is no data reuse related to the symmetry of the matrix, as discussed in Section 1.…”
Section: Two-level Sequential Model (mentioning)
confidence: 99%
“…Regarding Cholesky, Ballard et al. [2] reviewed existing parallel distributed algorithms and, based on their lower bound on communication, proved that LAPACK and other block-recursive implementations are asymptotically optimal for a carefully selected block size. The work on lower bounds by Kwasniewski et al. [8] leads to the design of the parallel distributed 2.5D LU (COnfLUX) and Cholesky (COnfCHOX) algorithms. These algorithms perform a communication volume per node of N³/(P√M) + O(N²).…”
Section: Parallel Model (mentioning)
confidence: 99%
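The 2.5D idea behind these algorithms trades extra memory for less communication: replicating the data c times over P nodes shrinks per-node communication from the 2D cost O(N²/√P) to O(N²/√(cP)). A rough cost-model sketch using the generic 2.5D asymptotics from the literature, not the exact constants of COnfLUX or COnfCHOX (function names are illustrative):

```python
import math

# Generic 2.5D communication-cost model: with c replicated data copies across
# P nodes, the per-node communication volume of a dense N x N kernel drops by
# a factor of sqrt(c) relative to a 2D decomposition, for 1 <= c <= P**(1/3).

def comm_2d(N, P):
    """Per-node communication volume of a 2D decomposition: O(N^2 / sqrt(P))."""
    return N * N / math.sqrt(P)

def comm_25d(N, P, c):
    """Per-node communication volume of a 2.5D decomposition with c copies."""
    assert 1 <= c <= P ** (1 / 3) + 1e-9, "replication limited to c <= P^(1/3)"
    return N * N / math.sqrt(c * P)

N, P = 16384, 64
ratio = comm_2d(N, P) / comm_25d(N, P, c=4)  # replication by c buys a sqrt(c) reduction
```

With c = 4 the model predicts a 2x reduction in per-node communication, at the price of 4x the memory footprint per node.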