SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00050
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

Abstract: Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16→FP64) iterative refinement, and we…
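To make the loop the abstract refers to concrete, below is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement: factorize once in low precision, then alternate an FP64 residual with cheap correction solves against the low-precision factors. This is not the paper's GPU/Tensor Core implementation; FP32 stands in for FP16 because SciPy's LU does not run in half precision, and the function name and test matrix are illustrative only.

```python
# Hypothetical sketch of mixed-precision iterative refinement for Ax = b.
# The paper factorizes in FP16 on Tensor Cores; FP32 stands in for the
# low-precision factorization here because SciPy's LU has no FP16 path.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_ir(A, b, max_iters=50, tol=1e-12):
    """Solve Ax = b to FP64 accuracy using a low-precision LU factorization."""
    A64 = A.astype(np.float64)
    b64 = b.astype(np.float64)

    # One-time factorization in low precision (the expensive O(n^3) step).
    lu, piv = lu_factor(A64.astype(np.float32))

    # Initial solve in low precision, then refine in FP64.
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b64 - A64 @ x                                 # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))     # cheap correction solve
        x += d.astype(np.float64)                         # update in FP64
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_ir(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

The key design point is that the O(n^3) factorization happens once in low precision, while each refinement step costs only a matrix-vector product and two triangular solves.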

Cited by 164 publications (122 citation statements)
References 24 publications
“…In some cases, researchers have found that combining Iterative Refinement with an Iterative Solver like GMRES [10][11] is also beneficial, especially when the base precision is very low because the odds are that the matrix may have too high a condition number to work otherwise.…”
Section: Speeding Up One-sided Solvers With Low-precision Datatypes (mentioning)
confidence: 99%
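The combination this statement describes, iterative refinement with GMRES as the inner correction solver preconditioned by the low-precision factorization, can be sketched as follows. This is an illustrative CPU sketch, not code from the cited works; the helper name gmres_ir, the outer iteration count, and the tolerance are placeholders.

```python
# Illustrative sketch of GMRES-based iterative refinement: the low-precision LU
# factors serve as a preconditioner for GMRES on the correction equation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, outer_iters=10):
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    lu, piv = lu_factor(A64.astype(np.float32))      # low-precision factors

    # Apply the low-precision factors as a preconditioner inside GMRES.
    M = LinearOperator(
        A.shape,
        matvec=lambda v: lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64),
    )

    x = np.zeros_like(b64)
    for _ in range(outer_iters):
        r = b64 - A64 @ x                            # FP64 residual
        d, info = gmres(A64, r, M=M)                 # correction via preconditioned GMRES
        x += d
        if np.linalg.norm(b64 - A64 @ x) <= 1e-12 * np.linalg.norm(b64):
            break
    return x
```

As in the plain refinement sketch above, the low-precision factors are reused across all outer iterations; GMRES only adds matrix-vector products, which is what makes the approach attractive when the matrix is too ill-conditioned for plain refinement at very low base precision.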
“…The area of FP-FMA is dominated by the multiplier as it roughly grows squared with mantissa size (and therefore also consumes a …) The performance results in [5] show that the assumptions made here are correct. Similar Speed-Ups are also possible in iterative refinement scenarios [10]. Apart from having faster "FP32" on general purpose hardware such as CPUs and/or GPUs, it also means that deep learning optimized hardware, such as Google's TPU could be efficiently used for classic HPC which only requires FP32.…”
Section: Performance Ramifications (mentioning)
confidence: 99%
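A back-of-envelope check of the quoted area argument, under the stated assumption that multiplier area grows with the square of the significand width (implicit bit included), illustrates the headroom low precision offers:

```python
# Back-of-envelope check: FP fused multiply-add area dominated by a multiplier
# that grows roughly with the square of the significand width.
# Widths include the implicit leading bit (FP16: 11, FP32: 24, FP64: 53).
significand_bits = {"FP16": 11, "FP32": 24, "FP64": 53}
area = {fmt: bits ** 2 for fmt, bits in significand_bits.items()}
for fmt, a in area.items():
    print(f"{fmt}: relative multiplier area ~ {a / area['FP64']:.2f}x of FP64")
# Under this model an FP16 multiplier is roughly (53/11)^2 ≈ 23x smaller than an
# FP64 one, which is the kind of headroom the quoted passage attributes to low precision.
```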
“…Reducing the communication really makes sense, however. The so called HPL-AI benchmark used Mixed Precision [50] rather than Double Precision calculations. This enabled to achieve apparently nearly 3 times better performance gain, that (as correctly stated in the announcement) "Achieving a 445 petaflops mixed-precision result on HPL (equivalent to our 148.6 petaflops DP result)", i.e.…”
Section: The Contribution Of the Interconnection (mentioning)
confidence: 99%
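The "nearly 3 times" factor follows directly from the two figures quoted (445 PFLOP/s mixed precision versus 148.6 PFLOP/s double precision on the same system):

```python
# Quick arithmetic behind the quoted claim.
hpl_ai_pflops = 445.0      # mixed-precision HPL-AI result
hpl_fp64_pflops = 148.6    # double-precision HPL result
print(f"speedup ≈ {hpl_ai_pflops / hpl_fp64_pflops:.2f}x")   # ≈ 2.99x, i.e. "nearly 3 times"
```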
“…Mixed-precision iterative refinement approaches have been studied for solving dense linear system of equations [26] using single and double-precision arithmetics. A new mixed precision iterative refinement approach [27] has shown a significant improvement of the performance (speedup factor up to four) using multiple precisions, i.e., 16-bit, 32-bit, and 64-bit precision arithmetics for the dominant GEMM kernel, on NVIDIA V100 GPUs. These mixed-precision approaches use a unique precision arithmetic for the Cholesky factorization and subsequently, iterate using multiple precisions to refine the solution.…”
Section: Related Work (mentioning)
confidence: 99%
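The dominant GEMM kernel this statement refers to runs with FP16 inputs and FP32 accumulation on Tensor Cores. That numerical behavior can be imitated on the CPU; the sketch below is only a numerical simulation (no cuBLAS, no GPU), and the function name is made up for illustration.

```python
# Illustrative simulation of FP16-input, FP32-accumulate GEMM numerics.
import numpy as np

def tensor_core_gemm_sim(A, B):
    """Emulate FP16-input, FP32-accumulate matrix multiply on the CPU."""
    A16 = A.astype(np.float16)        # round inputs to half precision
    B16 = B.astype(np.float16)
    # Promote back to FP32 before multiplying so accumulation happens in FP32.
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
err = np.linalg.norm(tensor_core_gemm_sim(A, B) - ref) / np.linalg.norm(ref)
print(f"relative error of FP16-input GEMM vs FP64 reference: {err:.1e}")
```

The point of the simulation is that the rounding error comes almost entirely from casting the inputs to FP16, not from the accumulation, which is why the refinement loop can still recover an FP64-accurate solution.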