Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476167
On the parallel I/O optimality of linear algebra kernels

Abstract: Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedul…

Cited by 12 publications (13 citation statements)
References 53 publications
“…X − S (Lemma 2 in Kwasniewski et al. [27]). The expression ρ = χ(X)/(X − S) is called the computational intensity.…”
Section: I/O Lower Bounds (mentioning)
confidence: 93%
“…To bound the sizes of rectangular subcomputations, we use two lemmas given by Kwasniewski et al. [27]: (Lemma in [27]) For statement St, given D, the size of subcomputation H (the number of vertices of S computed during H) is bounded by the sizes of the iteration variables' sets D_t, t = 1, …”
Section: Definitions (mentioning)
confidence: 99%
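The lemma quoted above bounds a subcomputation's size by the sizes of its per-variable iteration sets: if the t-th loop variable takes values only in D_t during H, then H fits inside the Cartesian product of the D_t. A small illustrative check (the function name and example points are hypothetical, not from the cited paper):

```python
# Illustration of the bound |H| <= |D_1| * ... * |D_k|: a subcomputation H,
# viewed as a set of executed iteration vectors, is contained in the Cartesian
# product of its per-variable projections D_t = { x[t] : x in H }.

def iteration_set_bound(points):
    """Upper-bound |H| by the product of the per-variable projection sizes."""
    dims = len(next(iter(points)))
    sizes = [len({p[t] for p in points}) for t in range(dims)]
    bound = 1
    for s in sizes:
        bound *= s
    return bound

H = {(0, 0), (0, 1), (1, 0)}            # three iteration points of a 2D loop nest
assert len(H) <= iteration_set_bound(H)  # 3 <= 2 * 2
```

The bound is tight exactly when H is a full rectangular block, which is why it is useful for the rectangular subcomputations the quote refers to.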
“…Independently, using explicit enumeration of data reuse, Kwasniewski et al. [8] obtain a corresponding lower bound for LU factorization: their proof is in a parallel context, but their arguments show that the minimum number of data transfers is lower bounded by 2 √… lower bound, which however makes the implicit assumption that there is no data reuse related to the symmetry of the matrix, as discussed in Section 1.…”
Section: Two-level Sequential Model (mentioning)
confidence: 99%
“…Regarding Cholesky, Ballard et al. [2] reviewed existing parallel distributed algorithms and, based on their lower bound on communication, proved that LAPACK and other block-recursive implementations are asymptotically optimal for a carefully selected block size. The work on lower bounds by Kwasniewski et al. [8] leads to the design of the parallel distributed 2.5D LU (COnfLUX) and Cholesky (COnfCHOX) algorithms. These algorithms perform a communication volume per node of N³/(P√M) + O(N²).…”
Section: Parallel Model (mentioning)
confidence: 99%
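The 2.5D idea behind these algorithms trades extra memory for less communication: replicating the data c times over P nodes shrinks per-node communication from the 2D cost O(N²/√P) to O(N²/√(cP)). A rough cost-model sketch using the generic 2.5D asymptotics from the literature, not the exact constants of COnfLUX or COnfCHOX (function names are illustrative):

```python
import math

# Generic 2.5D communication-cost model: with c replicated data copies across
# P nodes, the per-node communication volume of a dense N x N kernel drops by
# a factor of sqrt(c) relative to a 2D decomposition, for 1 <= c <= P**(1/3).

def comm_2d(N, P):
    """Per-node communication volume of a 2D decomposition: O(N^2 / sqrt(P))."""
    return N * N / math.sqrt(P)

def comm_25d(N, P, c):
    """Per-node communication volume of a 2.5D decomposition with c copies."""
    assert 1 <= c <= P ** (1 / 3) + 1e-9, "replication limited to c <= P^(1/3)"
    return N * N / math.sqrt(c * P)

N, P = 16384, 64
ratio = comm_2d(N, P) / comm_25d(N, P, c=4)  # replication by c buys a sqrt(c) reduction
```

With c = 4 the model predicts a 2x reduction in per-node communication, at the price of 4x the memory footprint per node.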