2018
DOI: 10.1007/978-3-319-93698-7_45
The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques

Abstract: As parallel computers approach the exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both hardware features and algorithms is an effective solution for achieving power efficiency and addressing the energy constraints of modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solver for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy ef…

Cited by 40 publications (34 citation statements)
References 20 publications
“…One can expect that a 4× speedup will bring at least a 4× energy improvement. Indeed, in our experiments [10] we measured both the power of the CPU (package+DRAM), using the Performance Application Programming Interface (PAPI) [16], and the power of the GPU, using the NVIDIA Management Library (NVML) [20], and we observed about a 5× energy-efficiency improvement.…”
Section: Discussion
confidence: 99%
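As a back-of-the-envelope check of the quote above: energy is average power times runtime, so a 4× speedup combined with even a modest drop in average power draw yields more than a 4× energy improvement. The wattages below are hypothetical, chosen only to illustrate how a 4× speedup can translate into the reported ~5× energy gain (real figures would come from PAPI and NVML as described):

```python
def energy_joules(avg_power_watts: float, runtime_s: float) -> float:
    """Energy consumed = average power draw x runtime."""
    return avg_power_watts * runtime_s

# Hypothetical measurements, for illustration only.
e_fp64 = energy_joules(avg_power_watts=300.0, runtime_s=8.0)  # baseline FP64 solver
e_fp16 = energy_joules(avg_power_watts=240.0, runtime_s=2.0)  # 4x faster, lower draw

speedup = 8.0 / 2.0                   # 4x runtime improvement
energy_improvement = e_fp64 / e_fp16  # 5x: exceeds the speedup alone
```

Because the faster run also draws less average power, the energy ratio exceeds the speedup, matching the qualitative relationship reported in the quote.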
“…We:
• develop a framework for exploiting GPU TCs in mixed-precision (FP16-FP32/FP64) iterative refinement solvers and describe the path to developing high-performance, Tensor Core-enabled dense linear algebra building-block kernels that can exploit the FP16-TC in HPC applications;
• present a study of algorithmic variants of IR techniques;
• include a performance-model study that allows users and developers to understand the effect of the number of iterations and to predict the performance gain;
• illustrate that a number of problems can be accelerated by up to 4× through the mixed-precision solvers using fast FP16-TC, 3× using the basic FP16 mixed-precision solver, or 2× using FP32 arithmetic;
• provide an analysis of the numerical behavior of the proposed mixed-precision, TC-accelerated solvers on different types of matrices;
• quantify, in practice, the performance and limitations of this approach on V100 GPUs using TCs; and
• provide experiments on dense and sparse matrices arising from real applications (see Table III).
The developments will be released through the open-source MAGMA library [15] to make these experiments independently reproducible and to allow the scientific community to build on this work. We also point readers interested in energy efficiency and power measurement to our work in [10], which investigates the energy gain that iterative refinement techniques can bring from a power-consumption point of view.…”
Section: Contributions
confidence: 99%
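The core idea behind the mixed-precision iterative refinement solvers discussed above is to do the expensive solve in low precision and cheaply refine the result using residuals computed in high precision. A minimal NumPy sketch, using float32 as a stand-in for the low precision (the papers use FP16/FP16-TC, which NumPy's dense solver does not support) and `np.linalg.solve` in place of a stored LU factorization:

```python
import numpy as np

def ir_solve(A, b, low=np.float32, max_iters=30, tol=1e-12):
    """Mixed-precision iterative refinement (sketch):
    solve in low precision, form residuals and corrections in float64."""
    A_low = A.astype(low)
    b64 = b.astype(np.float64)
    # Initial solve in low precision (stands in for a low-precision LU solve).
    x = np.linalg.solve(A_low, b.astype(low)).astype(np.float64)
    for _ in range(max_iters):
        r = b64 - A @ x  # residual in high precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction solve reuses the low-precision matrix.
        d = np.linalg.solve(A_low, r.astype(low)).astype(np.float64)
        x += d
    return x
```

For a well-conditioned system, each refinement step multiplies the error by a factor tied to the low-precision unit roundoff, so a handful of iterations recovers close to full float64 accuracy while the dominant O(n³) work stays in low precision.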
“…In the work on iterative refinement in [14], [15], [16], elements that overflow during conversion to fp16 are mapped to the nearest finite number, ±x_max. As we will show, for badly scaled real-life matrices this approach can lead to slow convergence, so a more sophisticated strategy is needed.…”
confidence: 99%
“…Here, fl_h denotes the operator that rounds to fp16, and sign is the function that maps positive real numbers to 1, negative real numbers to −1, and 0 to 0. Algorithm 2.1, with θ = 1, is the approach used in [14], [15], [16].…”
confidence: 99%
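The clamping conversion the two quotes above describe (Algorithm 2.1 with θ = 1) can be sketched in NumPy: entries whose magnitude exceeds the largest finite fp16 value, x_max = 65504, are mapped to sign(a)·x_max rather than overflowing to ±∞:

```python
import numpy as np

def to_fp16_clamped(A):
    """Round to fp16, mapping overflowing entries to the nearest finite
    number, +/- x_max, instead of +/- inf (a naive astype would overflow)."""
    xmax = np.finfo(np.float16).max  # 65504.0
    return np.clip(A, -xmax, xmax).astype(np.float16)
```

A plain `A.astype(np.float16)` would turn an entry like 1e6 into `inf`, which then poisons the low-precision factorization; clamping keeps every entry finite, at the cost of distorting badly scaled matrices, which is exactly the slow-convergence issue the quote raises.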
“…Haidar et al. [22], [23] show that by taking advantage of the tensor cores on an NVIDIA V100 GPU, GMRES-IR can bring a 4× speedup over an optimized double-precision solver and can provide an energy reduction of 80%. Moreover, GMRES-IR has been shown to perform up to three times faster than an optimized double-precision solver at scale on the Summit machine [24], which heads the November 2019 TOP500 list.…”
Section: Mixed Precision Algorithms
confidence: 99%