2017 IEEE 24th Symposium on Computer Arithmetic (ARITH)
DOI: 10.1109/arith.2017.20
High-Precision Anchored Accumulators for Reproducible Floating-Point Summation


Cited by 9 publications (9 citation statements). References 4 publications.
“…For instance, the Kulisch long accumulator, which is the cornerstone algorithm of ExBLAS, is designed to handle severe (ill-conditioned) cases with very broad dynamic ranges, while in practice "100 bits suffice for many HPC applications", as noted by David Bailey at ARITH-21 [14]. This idea inspired the ARM team (Lutz, Burgess, et al.) to design a mini long accumulator with a limited range [15,16]. We therefore plan to explore this motivated-by-practice idea of moderately conditioned problems with moderate dynamic ranges in order to derive a lightweight algorithmic solution from ExBLAS.…”
Section: Introduction
confidence: 99%
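To illustrate why a Kulisch-style long accumulator makes summation reproducible, here is a minimal Python sketch (ours, not the paper's or ExBLAS's implementation): every binary64 value is mapped exactly to a wide fixed-point integer, so the sum is exact integer addition and is independent of summation order.

```python
from fractions import Fraction

# Sketch of a Kulisch-style long accumulator. Each finite double is
# converted exactly to a fixed-point integer scaled by 2**1074 (the
# weight of the smallest binary64 subnormal bit), so accumulation is
# exact and order-independent; rounding happens once, at the end.
SCALE = 1074  # smallest binary64 ulp is 2**-1074

def to_fixed(x: float) -> int:
    """Exact fixed-point image of a finite double."""
    f = Fraction(x) * (1 << SCALE)   # Fraction(float) is exact
    return f.numerator               # denominator is 1 by construction

def long_acc_sum(values) -> float:
    acc = 0                          # Python ints are arbitrary precision
    for v in values:
        acc += to_fixed(v)           # exact integer addition
    return float(Fraction(acc, 1 << SCALE))  # single final rounding

# The ill-conditioned sum [1e308, 1.0, -1e308] comes out exactly 1.0,
# while naive left-to-right float addition loses the 1.0 and returns 0.0.
```

The width (here 1074 + 1024 + a few carry bits, hidden inside Python's big integers) is exactly what the "100 bits suffice" observation trims down: a limited-range mini accumulator covers moderate dynamic ranges with far less state.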
“…It is also possible to improve the numerical properties of summation when it suffices to use floating-point numbers from a smaller range [8]. The GNU MPFR library [7] provides multiple-precision floating-point computations with correct rounding. Several papers show how to improve the performance of accurate floating-point summation using parallel processing.…”
Section: Related Work
confidence: 99%
“…The ExBLAS-based approach with its cornerstone Kulisch long accumulator Kulisch (2013) is robust but expensive, since it is designed to cover severe (ill-conditioned) cases with very broad dynamic ranges. Motivated by “100 bits suffice for many HPC applications”, as noted by David Bailey at ARITH-21 Bailey (2013), and by the mini accumulator from the ARM team Lutz and Hinds (2017); Burgess et al (2019), we derive a faster but less generic version using FPEs, the other core algorithmic component of ExBLAS, aiming to adjust the algorithm to the problem at hand. As a consequence, we also address the common issue of sparse iterative solvers—the accuracy while computing the residual—and propose to use solutions that offer reproducibility (and potentially correct rounding) only while computing the corresponding dot products. Hence, we derive two hybrid (MPI + OpenMP tasks), reproducible, and accurate dot products using ExBLAS and FPEs. Finally, we demonstrate the applicability and feasibility of this idea with the ExBLAS- and FPE-based approaches in a hybrid MPI + OpenMP implementation of PCG on a 3D Poisson’s equation with a 27-point stencil, as well as several test matrices from the SuiteSparse matrix collection. This extends our previous results with the pure MPI implementation of PCG Iakymchuk et al (2019a) to the more complex double-level dot products and reductions with dynamic scheduling of tasks.…”
Section: Introduction
confidence: 99%
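The floating-point expansion (FPE) mentioned above can be pictured as a small fixed-size array of non-overlapping doubles, updated by cascading each new addend through TwoSum. The sketch below is a minimal illustration under that description, not the ExBLAS implementation; the function names are ours.

```python
def two_sum(a: float, b: float):
    """Knuth's TwoSum: s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    bp = s - a
    return s, (a - (s - bp)) + (b - bp)

def fpe_add(expansion: list, x: float) -> None:
    """Cascade x through a fixed-size expansion in place.

    Each slot absorbs what it can; the rounding error flows to the
    next slot. In a full implementation, a nonzero residual left after
    the last slot would spill into a (long) accumulator fallback.
    """
    for i in range(len(expansion)):
        expansion[i], x = two_sum(expansion[i], x)
        if x == 0.0:
            break  # fully absorbed, no residual to propagate

# A 4-term FPE absorbing values whose exact sum is 3.0:
fpe = [0.0] * 4
for v in [1e16, 1.0, -1e16, 2.0]:
    fpe_add(fpe, v)
# The expansion terms sum exactly to 3.0; naive addition would give 2.0.
```

Because the expansion has a fixed, small size, this is the "faster but less generic" trade-off: it handles moderate dynamic ranges cheaply, and falls back to the long accumulator only for pathological inputs.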