2018
DOI: 10.1007/978-3-319-93698-7_45
The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques

Abstract: As parallel computers approach the exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both hardware features and algorithms is an effective solution for achieving power efficiency and addressing the energy constraints of modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solver for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy ef…

Cited by 40 publications (34 citation statements)
References 20 publications
“…One can expect that a 4× speedup will bring at least a 4× energy improvement. Indeed, in our experiments [10] we measured both the power of the CPU (package+DRAM), using the Performance Application Programming Interface (PAPI) [16], and the power of the GPU, using the NVIDIA Management Library (NVML) [20], and we observed about a 5× energy-efficiency improvement.…”
Section: Discussion
confidence: 99%
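As a back-of-the-envelope check of the quote above: energy is average power times runtime, so a 4× speedup combined with even a modest drop in average power draw yields more than a 4× energy improvement. The wattages below are hypothetical, chosen only to illustrate how a 4× speedup can translate into the reported ~5× energy gain (real figures would come from PAPI and NVML as described):

```python
def energy_joules(avg_power_watts: float, runtime_s: float) -> float:
    """Energy consumed = average power draw x runtime."""
    return avg_power_watts * runtime_s

# Hypothetical measurements, for illustration only.
e_fp64 = energy_joules(avg_power_watts=300.0, runtime_s=8.0)  # baseline FP64 solver
e_fp16 = energy_joules(avg_power_watts=240.0, runtime_s=2.0)  # 4x faster, lower draw

speedup = 8.0 / 2.0                   # 4x runtime improvement
energy_improvement = e_fp64 / e_fp16  # 5x: exceeds the speedup alone
```

Because the faster run also draws less average power, the energy ratio exceeds the speedup, matching the qualitative relationship reported in the quote.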
“…We:
• develop a framework for exploiting GPU TCs in mixed-precision (FP16-FP32/FP64) iterative refinement solvers and describe the path to developing high-performance, Tensor Core-enabled dense linear algebra building-block kernels that can exploit the FP16-TC in HPC applications;
• present a study of algorithmic variants of IR techniques;
• include a performance-model study that allows users and developers to understand the effect of the number of iterations and to predict the performance gain;
• illustrate that a number of problems can be accelerated by up to 4× through the mixed-precision solvers using fast FP16-TC, 3× using the basic FP16 mixed-precision solver, or 2× using FP32 arithmetic;
• provide an analysis of the numerical behavior of the proposed mixed-precision, TC-accelerated solvers on different types of matrices;
• quantify, in practice, the performance and limitations of this approach on V100 GPUs using TCs; and
• provide experiments on dense and sparse matrices arising from real applications (see Table III).
The developments will be released through the open-source MAGMA library [15] to make these experiments independently reproducible and to allow the scientific community to build on this work. We also point readers interested in energy efficiency and power measurement to our work in [10], which investigates the energy gain that iterative refinement techniques can bring from a power-consumption point of view.…”
Section: Contributions
confidence: 99%
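The core idea behind the mixed-precision iterative refinement solvers discussed above is to do the expensive solve in low precision and cheaply refine the result using residuals computed in high precision. A minimal NumPy sketch, using float32 as a stand-in for the low precision (the papers use FP16/FP16-TC, which NumPy's dense solver does not support) and `np.linalg.solve` in place of a stored LU factorization:

```python
import numpy as np

def ir_solve(A, b, low=np.float32, max_iters=30, tol=1e-12):
    """Mixed-precision iterative refinement (sketch):
    solve in low precision, form residuals and corrections in float64."""
    A_low = A.astype(low)
    b64 = b.astype(np.float64)
    # Initial solve in low precision (stands in for a low-precision LU solve).
    x = np.linalg.solve(A_low, b.astype(low)).astype(np.float64)
    for _ in range(max_iters):
        r = b64 - A @ x  # residual in high precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction solve reuses the low-precision matrix.
        d = np.linalg.solve(A_low, r.astype(low)).astype(np.float64)
        x += d
    return x
```

For a well-conditioned system, each refinement step multiplies the error by a factor tied to the low-precision unit roundoff, so a handful of iterations recovers close to full float64 accuracy while the dominant O(n³) work stays in low precision.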
“…In the work on iterative refinement in [14], [15], [16], elements that overflow during conversion to fp16 are mapped to the nearest finite number, ±x_max. As we will show, for badly scaled real-life matrices this approach can lead to slow convergence, so a more sophisticated strategy is needed.…”
confidence: 99%
“…Here, fl_h denotes the operator that rounds to fp16, and sign is the function that maps positive real numbers to 1, negative real numbers to −1, and 0 to 0. Algorithm 2.1, with θ = 1, is the approach used in [14], [15], [16].…”
confidence: 99%
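The clamping conversion the two quotes above describe (Algorithm 2.1 with θ = 1) can be sketched in NumPy: entries whose magnitude exceeds the largest finite fp16 value, x_max = 65504, are mapped to sign(a)·x_max rather than overflowing to ±∞:

```python
import numpy as np

def to_fp16_clamped(A):
    """Round to fp16, mapping overflowing entries to the nearest finite
    number, +/- x_max, instead of +/- inf (a naive astype would overflow)."""
    xmax = np.finfo(np.float16).max  # 65504.0
    return np.clip(A, -xmax, xmax).astype(np.float16)
```

A plain `A.astype(np.float16)` would turn an entry like 1e6 into `inf`, which then poisons the low-precision factorization; clamping keeps every entry finite, at the cost of distorting badly scaled matrices, which is exactly the slow-convergence issue the quote raises.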
“…Haidar et al. [22], [23] show that by taking advantage of the tensor cores on an NVIDIA V100 GPU, GMRES-IR can bring a 4× speedup over an optimized double-precision solver and can provide an energy reduction of 80%. Moreover, GMRES-IR has been shown to perform up to three times faster than an optimized double-precision solver at scale on the Summit machine [24], which heads the November 2019 TOP500 list.…”
Section: Mixed Precision Algorithms
confidence: 99%