2004
DOI: 10.1016/j.jpdc.2004.03.021
Communication lower bounds for distributed-memory matrix multiplication

Cited by 195 publications (307 citation statements)
References 24 publications
“…We also prove tight lower bounds on the communication costs of rectangular matrix multiplication in all cases. Some of these bounds have appeared previously in [22], and the new bounds use the same techniques (along with those of [2]). As illustrated in Figure 1, the communication costs naturally divide into three cases that we call one large dimension, two large dimensions, and three large dimensions.…”
Section: Contributions
Mentioning, confidence: 99%
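The one/two/three large dimensions classification quoted above can be illustrated with a small sketch. The function name, the exact case thresholds, and the dropped constants below are my own reconstruction from the standard surface-to-volume argument, not the cited paper's precise statement:

```python
from math import sqrt

def classify_and_bound(m, n, k, P):
    """Classify a rectangular matmul C = A*B (A is m x k, B is k x n,
    run on P processors) into the one/two/three large dimensions
    cases, and return an illustrative memory-independent bandwidth
    lower bound with constants dropped. Thresholds are a sketch of
    the surface-to-volume argument, not the paper's exact statement."""
    # Sort the dimensions so the same formulas apply regardless of
    # which of m, n, k is largest: d1 <= d2 <= d3.
    d1, d2, d3 = sorted((m, n, k))
    if P <= d3 / d2:
        # One large dimension: only the largest dimension is split,
        # so each processor computes a d1 x d2 x (d3/P) brick and the
        # replicated d1 x d2 face dominates communication.
        return "one large dimension", d1 * d2
    elif P <= d2 * d3 / d1 ** 2:
        # Two large dimensions: bricks of shape d1 x s x s with
        # s = sqrt(d2*d3 / P); the two d1 x s faces dominate.
        return "two large dimensions", d1 * sqrt(d2 * d3 / P)
    else:
        # Three large dimensions: near-cubic bricks of volume mnk/P,
        # whose faces have area (mnk/P)^(2/3).
        return "three large dimensions", (m * n * k / P) ** (2 / 3)
```

For example, multiplying a 100 x 10^6 by a 10^6 x 100 matrix on 10 processors falls in the one-large-dimension case, while a square 1000 x 1000 product on 8 processors falls in the three-large-dimensions case.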
“…Following [22], the classical rectangular matrix multiplication algorithm requires mnk scalar multiplications…”
[Algorithm 2 CARMA(A, B, C, m, k, n, P). Input: A is an m × k matrix and B is a k × n matrix. Output: …]
Section: B. Communication Cost Lower Bounds
Mentioning, confidence: 99%
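The CARMA algorithm named in this excerpt recursively splits the largest of the three dimensions m, k, n. Below is a sequential Python sketch of just that dimension-splitting structure; the real algorithm additionally distributes the recursive subproblems across P processors, and `matmul_rec` with its small base case is an illustrative choice, not the paper's code:

```python
def matmul_rec(A, B):
    """Sequential sketch of the 'split the largest dimension'
    recursion behind CARMA, computing C = A @ B with A given as an
    m x k and B as a k x n list of lists."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    if max(m, n, k) <= 2:
        # Base case: naive mnk triple loop on a tiny subproblem.
        return [[sum(A[i][l] * B[l][j] for l in range(k))
                 for j in range(n)] for i in range(m)]
    if m >= n and m >= k:
        # m is largest: split the rows of A (and of C).
        h = m // 2
        return matmul_rec(A[:h], B) + matmul_rec(A[h:], B)
    if n >= k:
        # n is largest: split the columns of B (and of C).
        h = n // 2
        left = matmul_rec(A, [row[:h] for row in B])
        right = matmul_rec(A, [row[h:] for row in B])
        return [l + r for l, r in zip(left, right)]
    # k is largest: split the inner dimension and add the two products.
    h = k // 2
    C1 = matmul_rec([row[:h] for row in A], B[:h])
    C2 = matmul_rec([row[h:] for row in A], B[h:])
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
```

Splitting whichever dimension is currently largest is what lets the recursion adapt to all three large-dimension regimes with a single rule.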
“…Our proofs of optimality for communication and synchronization given in this section and the one to follow all derive from lower bounds on the number of communication steps required in distributed algorithms and are direct applications of previous work, particularly of Hong and Kung [23], Aggarwal and Vitter [3], Savage [33,34] and Irony, Toledo and Tiskin [25].…”
Section: Work-limited Algorithms
Mentioning, confidence: 99%
“…Our lower bound results for straight-line programs we derive using the approach of Irony, Toledo and Tiskin [25] (and also of [23,33]), while the result for sorting uses an adversarial argument of Aggarwal and Vitter [3]. The bounds will be stated for Multi-BSP but the lower bound arguments for communication hold more generally, for all distributed algorithms with the same hierarchy of memory sizes and costs of communication, even if there is no bulk synchronization.…”
Section: Lower Bounds
Mentioning, confidence: 99%
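For context, the memory-dependent bound of Irony, Toledo and Tiskin ([25] in this excerpt) that these proofs invoke is commonly stated as follows, for multiplying an m × k matrix by a k × n matrix on P processors, each with M words of local memory; constants and the exact per-processor quantification vary by presentation:

```latex
W \;=\; \Omega\!\left(\frac{m\,n\,k}{P\sqrt{M}}\right),
\qquad\text{which in the square case } m = n = k
\text{ reduces to } \Omega\!\left(\frac{n^{3}}{P\sqrt{M}}\right).
```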