2012
DOI: 10.21236/ada561679

Communication Avoiding and Overlapping for Numerical Linear Algebra

Abstract: To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SU…
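The abstract's first matrix multiplication algorithm, Cannon's algorithm, works by skewing the block distributions of A and B and then cyclically shifting blocks between neighboring processes each step. The communication pattern can be sketched with a serial simulation on a q × q "process grid" (a sketch of the block-shift structure only, not a parallel implementation; the function name and interface are illustrative, not from the report):

```python
import numpy as np

def cannon_matmul(A, B, q):
    """Serial simulation of Cannon's algorithm on a q x q process grid.

    A and B are n x n with n divisible by q. Each (i, j) "process" holds
    one block of A and B. After an initial skew, blocks of A shift left
    along rows and blocks of B shift up along columns each step, and the
    local block products accumulate into C.
    """
    n = A.shape[0]
    b = n // q  # block size per "process"
    # Initial skew: process (i, j) holds A-block (i, (i+j) mod q)
    # and B-block ((i+j) mod q, j).
    Ab = [[A[i*b:(i+1)*b, ((i+j) % q)*b:((i+j) % q + 1)*b] for j in range(q)]
          for i in range(q)]
    Bb = [[B[((i+j) % q)*b:((i+j) % q + 1)*b, j*b:(j+1)*b] for j in range(q)]
          for i in range(q)]
    C = np.zeros_like(A)
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]
        # Shift A-blocks one step left along rows, B-blocks one step up.
        Ab = [[Ab[i][(j+1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i+1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

In the parallel algorithm each shift is a point-to-point message, which is exactly where the report's overlap technique applies: the next shift can be posted while the current block product is computed.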

Cited by 13 publications (16 citation statements)
References 12 publications
“…That is, the product of the number of words and the number of messages sent is Θ(n²); this trade-off is shown to be necessary in . We mention here only one of many speed-ups: up to 2.1× for 2.5D LU on 64K cores of an IBM BG/P machine compared to previous parallel LU factorization (Georganas et al 2012). A similar approach, applied to the direct N-body problem, leads to speed-ups of up to 11.8× on the 32K-core IBM BG/P, compared to similarly tuned 2D algorithms (Driscoll et al 2013).…”
Section: Parallel Case
confidence: 94%
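The Θ(n²) trade-off cited above can be made concrete with a rough cost model for 2.5D LU: with c replicated copies of the matrix across p processors, bandwidth cost falls as replication grows while latency cost rises, and their product is fixed at n². A minimal sketch, assuming the asymptotic forms with constants dropped (the function name and exact expressions are illustrative, not taken from the cited work):

```python
import math

def lu_25d_costs(n, p, c):
    """Rough per-processor communication costs for 2.5D LU on p
    processors with c replicated matrix copies (constants dropped).

    Assumed asymptotic model: more replication (larger c) lowers the
    number of words moved but raises the number of messages.
    """
    words = n**2 / math.sqrt(c * p)  # bandwidth cost
    messages = math.sqrt(c * p)      # latency cost
    return words, messages
```

Under this model, `words * messages == n**2` for every choice of p and c, illustrating why reducing data volume via replication necessarily increases message count: the two costs can be traded against each other but their product cannot be reduced.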
“…This algorithm can be applied to symmetric positive definite matrices, though it uses explicit triangular matrix inversion and multiplication (ignoring stability issues) and also ignores symmetry. Georganas et al (2012) extend the ideas of Solomonik and Demmel (2011) to the symmetric positive definite case, saving arithmetic by exploiting symmetry and maintaining stability by using triangular solves. Lipshitz (2013) provides a similar algorithm for Cholesky factorization, along with a recursive algorithm for triangular solve, that also maintains symmetry and stability.…”
Section: Parallel Case
confidence: 99%
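The stability point in the statement above, using triangular solves rather than explicit triangular inversion, is visible even in the sequential blocked algorithm that underlies the distributed variants. A minimal sketch of right-looking blocked Cholesky (illustrative code, not the parallel algorithm from the cited papers):

```python
import numpy as np

def blocked_cholesky(A, b):
    """Blocked right-looking Cholesky: returns lower-triangular L with
    L @ L.T == A, for symmetric positive definite A, block size b.

    The off-diagonal panel is formed with a triangular solve against the
    diagonal block's factor, rather than by forming its explicit inverse,
    which is the stability point noted above. Symmetry is exploited in
    the trailing update, which touches only L21 @ L21.T.
    """
    n = A.shape[0]
    A = A.astype(float).copy()
    L = np.zeros_like(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the diagonal block A11 = L11 @ L11.T.
        L[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Panel: solve L21 @ L11.T = A21 (a triangular solve in
            # practice; np.linalg.solve is used here for brevity).
            L[e:, k:e] = np.linalg.solve(L[k:e, k:e], A[e:, k:e].T).T
            # Symmetric rank-b update of the trailing matrix.
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
    return L
```

Each panel factorization, triangular solve, and trailing update maps to a communication phase in the distributed setting, which is where the 2.5D replication ideas apply.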
“…In this work, we broadly focus on the case of large-scale dense linear algebra. This domain has a rich literature of parallel communication-avoiding algorithms and existing high performance implementations [2,5,6,19].…”
Section: Linear Algebra Algorithms
confidence: 99%
“…Overlapping computation and communication has long been considered an avenue for optimizing parallel performance [33]. Benefits of the overlap have been explored for different types of algorithms [47] and on different architectures [102]. The co-processor mode of operation of Blue Gene/L [3] paired an application processor with another processor dedicated to handling its communication tasks.…”
Section: Lazy Evaluation and Its Use For Optimizing Parallel Performance
confidence: 99%
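The overlap idea described in this last statement, hiding communication latency behind useful work, can be sketched with a simple prefetch pipeline: post the next "receive" before computing on the current chunk, so the fetch proceeds in the background (generic `fetch`/`compute` callables stand in for real message passing; the names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def overlapped(chunks, fetch, compute):
    """Process chunks while overlapping fetch (communication) with compute.

    While the main thread computes on chunk i, a worker thread fetches
    chunk i+1, so fetch latency is hidden behind computation -- a
    software analogue of dedicating a processor to communication, as in
    Blue Gene/L's co-processor mode mentioned above.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch, chunks[0])          # start first "receive"
        for i in range(len(chunks)):
            data = nxt.result()                      # wait for in-flight fetch
            if i + 1 < len(chunks):
                nxt = pool.submit(fetch, chunks[i + 1])  # post next fetch early
            results.append(compute(data))            # compute overlaps the fetch
    return results
```

With blocking communication the total time would be the sum of fetch and compute times; with this pipeline it approaches the maximum of the two, which is the benefit the cited works measure.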