2021
DOI: 10.1002/spe.3041
Using Ginkgo's memory accessor for improving the accuracy of memory‐bound low precision BLAS

Abstract: The roofline model not only provides a powerful tool to relate an application's performance with the specific constraints imposed by the target hardware but also offers a graphic representation of the balance between memory access cost and compute throughput. In this work, we present a strategy to break up the tight coupling between the precision format used for arithmetic operations and the storage format employed for memory operations. (At a high level, this idea is equivalent to compressing/decompressing th…
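The decoupling strategy described in the abstract can be illustrated with a minimal, hypothetical sketch (this is not Ginkgo's actual accessor API; the class and function names below are illustrative): values are stored in 32-bit float to reduce memory traffic, but every read promotes to 64-bit double, so all arithmetic runs in high precision.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal "accessor": the storage format is 32-bit float
// (halving the memory volume versus double), while every read returns a
// 64-bit double so that all arithmetic happens in high precision.
struct ReducedStorageAccessor {
    std::vector<float> data;  // storage format: float

    explicit ReducedStorageAccessor(std::size_t n) : data(n, 0.0f) {}

    double read(std::size_t i) const { return static_cast<double>(data[i]); }
    void write(std::size_t i, double v) { data[i] = static_cast<float>(v); }
};

// Memory-bound BLAS-1 kernel (dot product): memory traffic is in float,
// but the accumulation is carried out entirely in double.
double dot(const ReducedStorageAccessor& x, const ReducedStorageAccessor& y) {
    double acc = 0.0;  // arithmetic format: double
    for (std::size_t i = 0; i < x.data.size(); ++i) {
        acc += x.read(i) * y.read(i);
    }
    return acc;
}
```

For a memory-bound kernel such as this dot product, the runtime is dominated by the float loads, so the double-precision accumulation comes essentially for free while reducing rounding error versus a float accumulator.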

Cited by 7 publications (7 citation statements)
References 31 publications
“…Anzt, Flegar, Grützmacher and Quintana-Ortí (2019b) propose this approach of decoupling the data storage format from the processing format, and they focus on storing the data at a lower precision than that at which the computations are performed. This approach is used in the papers mentioned at the end of Section 8.2 and for level 1 and level 2 BLAS by Grützmacher, Anzt and Quintana-Ortí (2021). Agullo et al (2020) propose a similar approach for flexible GMRES, using as compression either reduced precision or the lossy floating-point SZ compressor (Di and Cappello 2016).…”
Section: Decoupling Formats for Data Storage and Processing
confidence: 99%
“…Orthogonally to all previous communication optimization efforts, our optimized variant of the GMRES algorithm reduces communication in the access to the Krylov basis during the iteration loop body. In more detail, our GMRES algorithm leverages Ginkgo's memory accessor, introduced in Anzt et al (2021) and Grützmacher et al (2021), to decouple the memory storage format from the arithmetic precision so as to maintain the Krylov basis vectors in a compact "reduced precision" format. This radically diminishes the memory access volume during the orthogonalization, while not affecting the convergence rate of the solver, yielding notable performance improvements.…”
Section: Introduction
confidence: 99%
“…This radically diminishes the memory access volume during the orthogonalization, while not affecting the convergence rate of the solver, yielding notable performance improvements. Concretely, we make the following contributions in our article:
• We follow the ideas in Anzt et al (2021) and Grützmacher et al (2021) and use the “memory accessor” presented therein to decouple the memory storage format from the arithmetic precision, specifically applying this strategy to maintain the Krylov basis in reduced precision in memory while performing all arithmetic operations using full, hardware-supported IEEE 64-bit double-precision (DP).
• We analyze the benefits that result from casting the Krylov basis into different compact storage formats, including the natural IEEE 32-bit single-precision (SP) and 16-bit half-precision (HP), as well as some other non-IEEE fixed point-based alternatives enhanced with vector-wise normalization.
• We integrate the mixed-precision GMRES algorithm into the Ginkgo sparse linear algebra library (https://ginkgo-project.github.io).
• We provide strong practical evidence of the advantage of our approach by developing a high-performance realization of the solver for NVIDIA's modern V100 GPUs and testing it on a considerable number of large-scale problems from the SuiteSparse Matrix Collection (Davis and Hu, 2011) (https://sparse.tamu.edu/).…”
Section: Introduction
confidence: 99%
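The "fixed point-based alternatives enhanced with vector-wise normalization" mentioned in the contributions above can be sketched as follows. This is a hypothetical illustration, not the paper's or Ginkgo's actual format: each vector stores one double scale (its maximum magnitude) plus 16-bit integers, and reads reconstruct doubles on the fly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a fixed-point storage format with vector-wise
// normalization: the payload is 16-bit integers in [-32767, 32767],
// normalized by the vector's maximum magnitude so the full integer
// range is always exercised; reads reconstruct 64-bit doubles.
struct NormalizedFixedPointVector {
    double scale = 1.0;           // per-vector normalization factor
    std::vector<std::int16_t> q;  // fixed-point payload

    static NormalizedFixedPointVector compress(const std::vector<double>& v) {
        NormalizedFixedPointVector out;
        double amax = 0.0;
        for (double x : v) amax = std::max(amax, std::fabs(x));
        out.scale = (amax > 0.0) ? amax : 1.0;  // avoid division by zero
        out.q.reserve(v.size());
        for (double x : v) {
            out.q.push_back(static_cast<std::int16_t>(
                std::lround(x / out.scale * 32767.0)));
        }
        return out;
    }

    // Storage is 16-bit; arithmetic consumers see a 64-bit double.
    double read(std::size_t i) const {
        return static_cast<double>(q[i]) / 32767.0 * scale;
    }
};
```

The per-vector scale is what makes the fixed-point payload competitive with IEEE half-precision: every vector, regardless of its magnitude, maps onto the full 16-bit range, so the relative quantization error is bounded by the payload width rather than by a fixed exponent range.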
“…The fourth paper, titled “Using Ginkgo's Memory Accessor for Improving the Accuracy of Memory‐Bound Low Precision Basic Linear Algebra Subprograms (BLAS)” by Quintana‐Ortí et al. [4], demonstrates that memory‐bound applications operating on low precision data can increase their accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. In particular, the authors demonstrate that memory‐bound BLAS operations (including the sparse matrix‐vector product) can be re‐engineered with the memory accessor and that the resulting accessor‐enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low‐precision BLAS.…”
confidence: 99%