Advances in Mixed Precision Algorithms: 2021 Edition
2021
DOI: 10.2172/1814677
Cited by 3 publications (4 citation statements)
References: 0 publications
“…In 2020, the ECP community created a new multiprecision effort to design and develop new numerical algorithms that can exploit the speed provided by lower-precision hardware while maintaining the level of accuracy required by numerical modeling and simulation. Examples include: mixed precision iterative refinement for a dense LU factorization in SLATE and a sparse LU factorization in SuperLU achieved 1.8× and 1.5× speedups, respectively; mixed precision GMRES with iterative refinement in Trilinos achieved a 1.4× speedup; compressed basis (CB) GMRES in Ginkgo achieved a 1.4× speedup; and mixed precision sparse approximate inverse preconditioners achieved an average speedup of 1.2× [3]. These speedups from mixed precision algorithms are "here to stay" as they will carry over to future hardware architectures.…”
Section: Algorithms: Then and Now
mentioning
confidence: 99%
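
To make the technique behind these numbers concrete, the following is a minimal sketch of mixed precision iterative refinement: the matrix is factorized once in fp32, and the residual and correction are accumulated in fp64. It is an illustration in plain Eigen under assumed problem sizes and tolerances, not the SLATE, SuperLU, or Trilinos implementation.

    // Sketch of mixed precision iterative refinement (illustrative only):
    // factor A in fp32, then refine the solution in fp64.
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        const int n = 200;
        // Diagonally dominated random system so the fp32 factorization is stable.
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n) + n * Eigen::MatrixXd::Identity(n, n);
        Eigen::VectorXd b = Eigen::VectorXd::Random(n);

        // Low precision (fp32) LU factorization -- the expensive O(n^3) step.
        Eigen::MatrixXf Af = A.cast<float>();
        Eigen::PartialPivLU<Eigen::MatrixXf> lu(Af);

        // Initial fp32 solve, promoted to fp64.
        Eigen::VectorXf bf = b.cast<float>();
        Eigen::VectorXf xf = lu.solve(bf);
        Eigen::VectorXd x = xf.cast<double>();

        // Refinement loop: fp64 residual, correction solve reuses the fp32 factors.
        for (int it = 0; it < 20; ++it) {
            Eigen::VectorXd r = b - A * x;
            if (r.norm() <= 1e-12 * b.norm()) break;
            Eigen::VectorXf rf = r.cast<float>();
            Eigen::VectorXf df = lu.solve(rf);
            x += df.cast<double>();
        }
        std::cout << "relative residual: " << (b - A * x).norm() / b.norm() << "\n";
        return 0;
    }

In this pattern the O(n^3) factorization runs at fp32 speed while the cheap O(n^2) residual updates restore fp64-level accuracy, which is the effect the reported speedups exploit.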
“…For this purpose, we describe several software projects that are rooted in mathematical libraries and application space, and investigate their performance improvements and sustainability: the xSDK (Extreme-scale Scientific Software Development Kit) and its constituent libraries such as Ginkgo, SLATE, SuperLU, and the laser-plasma modeling application WarpX. We will describe critical facets of how software development methodologies and interdisciplinary teams have been transformed, leading to improvements in the software itself, and why these advances are essential for next-generation science.…”
mentioning
confidence: 99%
“…For comparison, we also compute the matrix product in fp32 and fp64 arithmetics in hardware by using the Eigen C++ library. For fp64 arithmetic, we use the default Eigen matrix multiplication implementation; for all other arithmetics we use the blocked FMA algorithm [6, Alg. 3.1] with a block FMA of dimension 1.…”
mentioning
confidence: 99%
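
As a point of reference for this kind of comparison, the following sketch computes the same product in fp64 (Eigen's default implementation) and in fp32, and measures the gap between them. It only illustrates the hardware fp32/fp64 baselines; the blocked FMA algorithm of [6, Alg. 3.1] is not reproduced here, and the matrix size is an assumption.

    // Compare an fp32 matrix product against an fp64 reference (illustrative only).
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        const int n = 512;
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);

        // fp64 reference: Eigen's default matrix multiplication.
        Eigen::MatrixXd C64 = A * B;

        // fp32 product: inputs rounded to float, accumulation in float.
        Eigen::MatrixXf Af = A.cast<float>();
        Eigen::MatrixXf Bf = B.cast<float>();
        Eigen::MatrixXf C32 = Af * Bf;

        // Normwise relative error of the low precision product.
        double err = (C32.cast<double>() - C64).norm() / C64.norm();
        std::cout << "fp32 vs fp64 relative error: " << err << "\n";
        return 0;
    }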
“…The upcoming NVIDIA Hopper microarchitecture [32] adds yet more formats to the tensor cores (quarter precision): fp8-E5M2 (5 exponent and 2 significand bits) and fp8-E4M3 (4 exponent and 3 significand bits). Tensor cores provide a significant performance boost compared with standard floating-point units, and have been used with great success to accelerate numerical linear algebra algorithms [1], [5], [13], [14], [25]; see [20] for a survey of these algorithms. Other vendors also incorporate matrix arithmetic in their devices: for example, the accelerators in the AMD MI200 series contain units that can perform vector and matrix operations faster than their scalar counterparts [2], [3], [4].…”
mentioning
confidence: 99%
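
The range and precision implied by those exponent/significand splits can be derived directly from the bit counts. The sketch below does this under IEEE-style assumptions; note that hardware fp8-E4M3 implementations typically reclaim some NaN encodings to extend the maximum finite value to 448 rather than the 240 computed here.

    // Derive unit roundoff and largest finite value of a binary floating-point
    // format from its exponent and stored-significand bit counts (IEEE-style).
    #include <cmath>
    #include <cstdio>

    void describe(const char* name, int exp_bits, int sig_bits) {
        int bias = (1 << (exp_bits - 1)) - 1;
        int emax = (1 << exp_bits) - 2 - bias;                    // largest exponent
        double u = std::ldexp(1.0, -(sig_bits + 1));              // unit roundoff
        double xmax = (2.0 - std::ldexp(1.0, -sig_bits)) * std::ldexp(1.0, emax);
        std::printf("%-9s unit roundoff %.4g, max finite %.6g\n", name, u, xmax);
    }

    int main() {
        describe("fp8-E5M2", 5, 2);   // u = 0.125,  max finite 57344
        describe("fp8-E4M3", 4, 3);   // u = 0.0625, max finite 240 (448 in hardware E4M3)
        describe("fp16",     5, 10);  // half precision, for comparison
        return 0;
    }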