SC18: International Conference for High Performance Computing, Networking, Storage and Analysis 2018
DOI: 10.1109/sc.2018.00050
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

Abstract: Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16→FP64) iterative refinement, and we…
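To make the loop the abstract refers to concrete, below is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement: factorize once in low precision, then alternate an FP64 residual with cheap correction solves against the low-precision factors. This is not the paper's GPU/Tensor Core implementation; FP32 stands in for FP16 because SciPy's LU does not run in half precision, and the function name and test matrix are illustrative only.

```python
# Hypothetical sketch of mixed-precision iterative refinement for Ax = b.
# The paper factorizes in FP16 on Tensor Cores; FP32 stands in for the
# low-precision factorization here because SciPy's LU has no FP16 path.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_ir(A, b, max_iters=50, tol=1e-12):
    """Solve Ax = b to FP64 accuracy using a low-precision LU factorization."""
    A64 = A.astype(np.float64)
    b64 = b.astype(np.float64)

    # One-time factorization in low precision (the expensive O(n^3) step).
    lu, piv = lu_factor(A64.astype(np.float32))

    # Initial solve in low precision, then refine in FP64.
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b64 - A64 @ x                                 # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))     # cheap correction solve
        x += d.astype(np.float64)                         # update in FP64
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_ir(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

The key design point is that the O(n^3) factorization happens once in low precision, while each refinement step costs only a matrix-vector product and two triangular solves.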

Cited by 164 publications (122 citation statements)
References 24 publications
“…In some cases, researchers have found that combining Iterative Refinement with an Iterative Solver like GMRES [10][11] is also beneficial, especially when the base precision is very low because the odds are that the matrix may have too high a condition number to work otherwise.…”
Section: Speeding Up One-sided Solvers With Low-precision Datatypes (mentioning)
confidence: 99%
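The combination this statement describes, iterative refinement with GMRES as the inner correction solver preconditioned by the low-precision factorization, can be sketched as follows. This is an illustrative CPU sketch, not code from the cited works; the helper name gmres_ir, the outer iteration count, and the tolerance are placeholders.

```python
# Illustrative sketch of GMRES-based iterative refinement: the low-precision LU
# factors serve as a preconditioner for GMRES on the correction equation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, outer_iters=10):
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    lu, piv = lu_factor(A64.astype(np.float32))      # low-precision factors

    # Apply the low-precision factors as a preconditioner inside GMRES.
    M = LinearOperator(
        A.shape,
        matvec=lambda v: lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64),
    )

    x = np.zeros_like(b64)
    for _ in range(outer_iters):
        r = b64 - A64 @ x                            # FP64 residual
        d, info = gmres(A64, r, M=M)                 # correction via preconditioned GMRES
        x += d
        if np.linalg.norm(b64 - A64 @ x) <= 1e-12 * np.linalg.norm(b64):
            break
    return x
```

As in the plain refinement sketch above, the low-precision factors are reused across all outer iterations; GMRES only adds matrix-vector products, which is what makes the approach attractive when the matrix is too ill-conditioned for plain refinement at very low base precision.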
“…The area of FP-FMA is dominated by the multiplier as it roughly grows squared with mantissa size (and therefore also consumes a …) The performance results in [5] show that the assumptions made here are correct. Similar Speed-Ups are also possible in iterative refinement scenarios [10]. Apart from having faster "FP32" on general purpose hardware such as CPUs and/or GPUs, it also means that deep learning optimized hardware, such as Google's TPU could be efficiently used for classic HPC which only requires FP32.…”
Section: Performance Ramifications (mentioning)
confidence: 99%
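A back-of-envelope check of the quoted area argument, under the stated assumption that multiplier area grows with the square of the significand width (implicit bit included), illustrates the headroom low precision offers:

```python
# Back-of-envelope check: FP fused multiply-add area dominated by a multiplier
# that grows roughly with the square of the significand width.
# Widths include the implicit leading bit (FP16: 11, FP32: 24, FP64: 53).
significand_bits = {"FP16": 11, "FP32": 24, "FP64": 53}
area = {fmt: bits ** 2 for fmt, bits in significand_bits.items()}
for fmt, a in area.items():
    print(f"{fmt}: relative multiplier area ~ {a / area['FP64']:.2f}x of FP64")
# Under this model an FP16 multiplier is roughly (53/11)^2 ≈ 23x smaller than an
# FP64 one, which is the kind of headroom the quoted passage attributes to low precision.
```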
“…Reducing the communication really makes sense, however. The so called HPL-AI benchmark used Mixed Precision [50] rather than Double Precision calculations. This enabled to achieve apparently nearly 3 times better performance gain, that (as correctly stated in the announcement) "Achieving a 445 petaflops mixed-precision result on HPL (equivalent to our 148.6 petaflops DP result)", i.e.…”
Section: The Contribution Of the Interconnection (mentioning)
confidence: 99%
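The "nearly 3 times" factor follows directly from the two figures quoted (445 PFLOP/s mixed precision versus 148.6 PFLOP/s double precision on the same system):

```python
# Quick arithmetic behind the quoted claim.
hpl_ai_pflops = 445.0      # mixed-precision HPL-AI result
hpl_fp64_pflops = 148.6    # double-precision HPL result
print(f"speedup ≈ {hpl_ai_pflops / hpl_fp64_pflops:.2f}x")   # ≈ 2.99x, i.e. "nearly 3 times"
```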
“…Mixed-precision iterative refinement approaches have been studied for solving dense linear system of equations [26] using single and double-precision arithmetics. A new mixed precision iterative refinement approach [27] has shown a significant improvement of the performance (speedup factor up to four) using multiple precisions, i.e., 16-bit, 32-bit, and 64-bit precision arithmetics for the dominant GEMM kernel, on NVIDIA V100 GPUs. These mixed-precision approaches use a unique precision arithmetic for the Cholesky factorization and subsequently, iterate using multiple precisions to refine the solution.…”
Section: Related Work (mentioning)
confidence: 99%
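The dominant GEMM kernel this statement refers to runs with FP16 inputs and FP32 accumulation on Tensor Cores. That numerical behavior can be imitated on the CPU; the sketch below is only a numerical simulation (no cuBLAS, no GPU), and the function name is made up for illustration.

```python
# Illustrative simulation of FP16-input, FP32-accumulate GEMM numerics.
import numpy as np

def tensor_core_gemm_sim(A, B):
    """Emulate FP16-input, FP32-accumulate matrix multiply on the CPU."""
    A16 = A.astype(np.float16)        # round inputs to half precision
    B16 = B.astype(np.float16)
    # Promote back to FP32 before multiplying so accumulation happens in FP32.
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)
ref = A.astype(np.float64) @ B.astype(np.float64)
err = np.linalg.norm(tensor_core_gemm_sim(A, B) - ref) / np.linalg.norm(ref)
print(f"relative error of FP16-input GEMM vs FP64 reference: {err:.1e}")
```

The point of the simulation is that the rounding error comes almost entirely from casting the inputs to FP16, not from the accumulation, which is why the refinement loop can still recover an FP64-accurate solution.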